Raster2Seq

Polygon Sequence Generation for Floorplan Reconstruction

Cornell University

SIGGRAPH 2026

Raster2Seq teaser image

We illustrate results on held-out CubiCasa5K test samples (left) and on real-world WAFFLE floorplan images (right).
*3D visualizations are constructed by extending the 2D boundaries vertically.

Outdoor Kitchen Living room Bedroom Bath Entry Storage Garage Undefined
Door Window

Our approach transforms rasterized floorplan images into a vectorized format, reconstructing both their structure and semantics.


TL;DR: We introduce Raster2Seq, a framework that reformulates raster-to-vector conversion as next-corner prediction, handling polygon sequences of arbitrary length.


Abstract

Reconstructing a structured vector-graphics representation from a rasterized floorplan image is typically an important prerequisite for computational tasks involving floorplans, such as automated understanding or CAD workflows. However, existing techniques struggle to faithfully generate the structure and semantics conveyed by complex floorplans that depict large indoor spaces with many rooms and varying numbers of polygon corners. To address this, we propose Raster2Seq, framing floorplan reconstruction as a sequence-to-sequence task in which floorplan elements, such as rooms, windows, and doors, are represented as labeled polygon sequences that jointly encode geometry and semantics. Our approach introduces an autoregressive decoder that learns to predict the next corner conditioned on image features and previously generated corners, using guidance from learnable anchors. These anchors represent spatial coordinates in image space, allowing the model to effectively direct the attention mechanism toward informative image regions. By embracing the autoregressive mechanism, our method offers flexibility in the output format, efficiently handling complex floorplans with numerous rooms and diverse polygon structures. Our method achieves state-of-the-art performance on standard benchmarks such as Structured3D, CubiCasa5K, and Raster2Graph, while also demonstrating strong generalization to more challenging datasets like WAFFLE, which contains diverse room structures and complex geometric variations.


Qualitative Results on WAFFLE

Qualitative comparison with RoomFormer on unseen WAFFLE floorplan images; both models are trained on CubiCasa5K. As illustrated below, our model generalizes better to the diverse structures of real-world Internet data.

Church of Saint James

Church of Saint James the Greater in Rovny input floorplan image
Input Image
RoomFormer reconstruction for Church of Saint James the Greater in Rovny
RoomFormer
Raster2Seq reconstruction for Church of Saint James the Greater in Rovny
Raster2Seq

Teltow Canal Power Station

Teltow Canal Power Station input floorplan image
Input Image
RoomFormer reconstruction for Teltow Canal Power Station
RoomFormer
Raster2Seq reconstruction for Teltow Canal Power Station
Raster2Seq

Church of Saint Nicholas

Church of Saint Nicholas input floorplan image
Input Image
RoomFormer reconstruction for Church of Saint Nicholas
RoomFormer
Raster2Seq reconstruction for Church of Saint Nicholas
Raster2Seq

Imkerhaus

Imkerhaus input floorplan image
Input Image
RoomFormer reconstruction for Imkerhaus
RoomFormer
Raster2Seq reconstruction for Imkerhaus
Raster2Seq

Palais du Louvre

Palais du Louvre input floorplan image
Input Image
RoomFormer reconstruction for Palais du Louvre
RoomFormer
Raster2Seq reconstruction for Palais du Louvre
Raster2Seq

Palmer Mansion

Palmer Mansion input floorplan image
Input Image
RoomFormer reconstruction for Palmer Mansion
RoomFormer
Raster2Seq reconstruction for Palmer Mansion
Raster2Seq

Quantitative Results

Results on Standard Benchmarks

Quantitative comparison on Structured3D, CubiCasa5K, and Raster2Graph datasets, evaluating F1 scores across geometric predictions (Room, Corner, Angle) and semantic predictions (Room Semantic, Window & Door).

We compare performance over the raster-to-vector conversion task across three datasets. Overall, our method achieves state-of-the-art performance on both structural metrics (Room and Corner) and semantic metrics (Room Semantic and Window & Door).

Note that not all models include semantic predictions, and Raster2Graph does not include Window & Door annotations. The Raster2Graph model can only be evaluated on its own dataset because it requires per-corner neighboring room-class annotations.
Structured3D-B

| Method     | Room | Corner | Angle | Room Semantic | Window & Door |
|------------|------|--------|-------|---------------|---------------|
| HEAT       | 94.7 | 84.5   | 79.6  | -             | -             |
| PolyRoom   | 98.9 | 96.0   | 91.9  | -             | -             |
| FRI-Net    | 96.5 | 85.4   | 83.3  | -             | -             |
| RoomFormer | 95.1 | 91.7   | 83.2  | 74.2          | 94.1          |
| Ours       | 99.6 | 98.3   | 92.7  | 76.9          | 98.5          |

CubiCasa5K

| Method     | Room | Corner | Angle | Room Semantic | Window & Door |
|------------|------|--------|-------|---------------|---------------|
| HEAT       | 78.2 | 53.7   | 32.3  | -             | -             |
| PolyRoom   | 54.1 | 37.1   | 23.0  | -             | -             |
| FRI-Net    | 77.1 | 50.8   | 38.0  | -             | -             |
| RoomFormer | 83.5 | 55.5   | 34.1  | 63.0          | 78.5          |
| Ours       | 88.7 | 59.4   | 37.4  | 63.8          | 77.8          |

Raster2Graph

| Method       | Room | Corner | Angle | Room Semantic | Window & Door |
|--------------|------|--------|-------|---------------|---------------|
| HEAT         | 95.9 | 79.7   | 50.9  | -             | -             |
| PolyRoom     | 56.9 | 42.4   | 23.8  | -             | -             |
| FRI-Net      | 91.5 | 72.3   | 52.8  | -             | -             |
| RoomFormer   | 91.9 | 74.5   | 51.1  | 79.5          | -             |
| Raster2Graph | 95.0 | 78.3   | 67.3  | 83.4          | -             |
| Ours         | 97.0 | 80.3   | 66.6  | 85.1          | -             |

Model Generalization

We perform a cross-evaluation experiment across different train-test dataset configurations. We evaluate performance using metrics reported previously, using RoomF1 for the CubiCasa5K and Raster2Graph datasets and IoU for WAFFLE. Cross-evaluation heatmaps show performance across evaluation datasets (rows) and training datasets (columns), with hotter colors denoting higher performance.


Why sequential prediction?


Performance vs. floorplan complexity, as approximated by the total number of polygons (left) and the total number of corners (right). As illustrated above for Structured3D-B (top) and CubiCasa5K (bottom), our approach yields larger gains as floorplan complexity increases.

Reconstructing a structured vector-graphics representation from a rasterized floorplan image is a fundamental prerequisite for computational tasks involving floorplans, such as automated understanding or CAD workflows. However, existing techniques struggle to faithfully generate the structure and semantics conveyed by complex floorplans that depict large indoor spaces with many rooms and varying numbers of polygon corners. One popular paradigm is to simultaneously predict all structural floorplan elements, as in RoomFormer and FRI-Net. While these models perform similarly on simpler cases, RoomFormer and FRI-Net exhibit a notable performance drop in complex scenes with more than 15 polygons or 150 corners. As shown in the figure above, our method remains more robust as floorplan complexity increases. In particular, RoomFormer relies on a fixed number of room queries (e.g., 2800); exceeding this capacity can trigger out-of-memory errors and increase computation due to quadratic attention costs. By contrast, our method formulates floorplan conversion as a sequence-to-sequence task, generating polygon coordinates autoregressively. This naturally handles variable-length polygons while allowing us to decompose floorplan reconstruction into interpretable, sequential predictions that mirror the natural CAD design workflow.


How does it work?


📜 Labeled corner sequence representation. Each polygon is represented as a sequence of labeled corners — spatial coordinates paired with semantic labels (rooms, windows, doors) — and polygons are sorted left-to-right across the floorplan. This representation naturally accommodates inputs and outputs of variable lengths.
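The representation above can be made concrete with a short sketch. Note that the exact token layout, the sort key, and the use of an end-of-sequence token are illustrative assumptions, not the paper's actual tokenizer:

```python
# Sketch: serializing labeled polygons into a flat labeled-corner sequence.
# The token layout and special tokens below are assumptions for illustration.

SEP = "<SEP>"   # delimiter between polygons (assumed special token)
EOS = "<EOS>"   # end-of-sequence token (assumed)

def serialize(polygons):
    """polygons: list of (label, [(x, y), ...]) tuples.

    Polygons are sorted left-to-right (here, by minimum x coordinate),
    mirroring the ordering convention described above; each corner
    becomes an (x, y, label) triple in the output sequence.
    """
    tokens = []
    for label, corners in sorted(polygons, key=lambda p: min(x for x, _ in p[1])):
        for x, y in corners:
            tokens.append((x, y, label))
        tokens.append(SEP)
    if tokens and tokens[-1] == SEP:
        tokens[-1] = EOS  # replace the trailing separator with end-of-sequence
    return tokens
```

For example, a two-room floorplan serializes into one flat sequence with the leftmost room first and a single separator between the two polygons.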


🔗 Anchor-based autoregressive decoder. The core of our framework predicts the next labeled corner by fusing image features and previously generated corners, guided by learnable anchors that steer attention toward informative image regions for efficient handling of complex floorplans.
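A conceptual sketch of the anchor guidance, assuming a simplified single-head attention with a distance-based spatial bias; the actual decoder is a full transformer, and all shapes, names, and the bias formulation here are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def anchor_guided_attention(query, img_feats, img_coords, anchor_xy, tau=0.1):
    """query:      (d,)     decoder state for the next corner
       img_feats:  (HW, d)  flattened image features
       img_coords: (HW, 2)  normalized (x, y) of each feature location
       anchor_xy:  (2,)     anchor coordinate guiding this step
    Returns the attended feature and the attention weights. The anchor adds
    a distance penalty so attention concentrates on nearby image regions."""
    d = query.shape[0]
    content = img_feats @ query / np.sqrt(d)          # dot-product content score
    dist2 = ((img_coords - anchor_xy) ** 2).sum(-1)   # squared distance to anchor
    attn = softmax(content - dist2 / tau)             # spatial bias focuses attention
    return attn @ img_feats, attn
```

With uniform content scores, the attention mass collapses onto the feature locations closest to the anchor, which is the intended steering effect.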


🏷️ Token-level semantic supervision. A per-corner semantic classification loss applied to individual corner embeddings preserves semantic fidelity throughout autoregressive generation.
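As a minimal illustration, such per-corner supervision can be sketched as a token-wise cross-entropy over each corner embedding's class logits; the names and shapes below are assumptions, not the paper's training code:

```python
import numpy as np

def per_corner_semantic_loss(logits, labels):
    """logits: (T, C) semantic class scores, one row per generated corner
       labels: (T,)   ground-truth semantic class index per corner
    Returns the mean negative log-likelihood over all corner tokens."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

As a sanity check, uniform logits over C classes yield a loss of log(C), the expected entropy of an uninformed classifier.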


Raster2Seq System

Given a rasterized floorplan image (left), Raster2Seq converts it into a vectorized representation as a labeled polygon sequence, with polygons delimited by special <SEP> tokens. The core component is an anchor-based autoregressive decoder that predicts the next token from image features (\(f_\text{img}\)), learnable anchors (\(v_\text{anc}\)), and previously generated tokens. Above, we visualize the first two predicted labeled polygons (in orange and pink, respectively).


Downstream Applications

While our method outperforms existing works across various metrics, it does not directly enforce geometric constraints, which can cause predicted outputs to exhibit artifacts on noisy datasets such as CubiCasa5K. To address this, we introduce a VLM-based vectorization refinement procedure that naturally builds on our polygon sequence representation and further improves reconstruction accuracy, highlighting the flexibility of our representation for integrating higher-level reasoning modules.


VLM-based Refinement

Given an input JSON specifying the vectorized floorplan predicted by our method, a VLM refines it using the rasterized floorplan, the vectorized overlay, the vectorized floorplan alone, and the adjacency graph as additional inputs. Users can specify geometric constraints in the refinement prompt; the VLM then outputs the refined JSON.

Input JSON Structure

{
  "room_count": 7,
  "room_area": 25517.0,
  "spaces": [
    {
      "id": "Undefined|0",
      "room_type": "Undefined",
      "floor_polygon": [{ "x": 49.0, "y": 12.0 }, …, { "x": 145.0, "y": 12.0 }],
      "area": 10561.0,
      "graph": ["Kitchen|1", "Bed Room|2", "Bath|3", "Kitchen|4", "Bed Room|6"]
    },
    …
  ]
}
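As a sanity check on entries like the above, the shoelace formula recovers each space's 'area' (in the JSON's coordinate units) and exposes vertex orientation. This helper is illustrative, not part of the paper's tooling:

```python
# Signed shoelace area for a 'floor_polygon' entry. Positive sign means
# counter-clockwise order in a mathematical y-up frame; in the y-down
# raster frame used by the coordinate system above, the sign flips.

def signed_area(floor_polygon):
    """floor_polygon: list of {'x': ..., 'y': ...} dicts, as in 'spaces'."""
    pts = [(p["x"], p["y"]) for p in floor_polygon]
    s = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:] + pts[:1]):
        s += x0 * y1 - x1 * y0   # cross product of consecutive edge vectors
    return s / 2.0
```

For an axis-aligned rectangle, the absolute value equals width × height, so the stored 'area' field can be cross-checked directly.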

Refinement Prompt

You are a specialized Architectural Geometry AI. Your expertise lies in topological refinement: transforming JSON specifications and visual raster data, including bubble diagrams and vectorized drafts, into precise, non-overlapping floorplans by generating optimized xy coordinates.

Goal: Produce an optimal arrangement of floorplan elements that maximizes area utilization. The algorithm must prioritize the spatial logic of the Floorplan Raster while using the Draft JSON only as a topological and proportional guide.

Inputs:
- JSON Specification: Contains preliminary room dimensions, labels, and connectivity requirements. Note: These numerical values (area, height, width) are derived from a rough draft and serve only as a proportional guide. They should be refined to match the visual scale and alignment of the Original Floorplan Raster.
- Original Floorplan Raster (Image A): The architectural blueprint for alignment and scale.
- Vectorized floorplan rendering (Image B): Shows the spatial arrangement and room IDs where each floorplan object is colored with type|id labels.
- Vectorized floorplan rendering overlaid (Image C): Shows the spatial arrangement and room IDs overlaid on top of original floorplan raster.
- Adjacency Graph (Image D): Defines the topological connections.

Output: JSON file containing refined polygons.
The JSON object must contain 'output' key storing these attributes:
- 'room_count': the total number of room entries
- 'spaces': a list of refined rooms. Each room entry must include:
  - 'id': formatted as <room_type>|<unique_index> (e.g. "bedroom|2" or "interior_door|0")
  - 'room_type': the room type (e.g. "living_room", "kitchen", etc.)
  - 'area' in square meters (all positive numbers)
  - 'floor_polygon': an ordered list of {x, y} vertices defining a polygon after refinement
  - 'graph': store a list of adjacent space object 'id'

Spatial Reference System:
- Coordinate Space: All vertex calculations must be performed within a fixed [0, 256] coordinate system.
- Origin: (0,0) represents the top-left corner of the Original Floorplan Raster.
- Polygons in JSON are ordered by counter-clockwise direction.

Refinement Constraints:
- Contextual Overlaps: While polygons should generally avoid unwarranted intersections, minor overlaps are permitted if supported by the Original Floorplan Raster.
- Watertight Adjacency: Rooms sharing a boundary must use the exact same coordinate values for the shared edge.
- Identity Preservation: Every id must be preserved and accurately repositioned.
- Scale Fidelity: The final area should approximate width × height to match the Original Raster's proportions.
- Truth Hierarchy: In conflicts between Draft JSON and Floorplan Raster, the Raster is the primary source of truth.
- Manhattan Style: All edges must be axis-aligned unless the Raster explicitly shows a non-90° angle.

Procedure (Mandatory):
1. Problem analysis — identify geometric failures (overlaps, disconnected rooms, scale mismatches).
2. Reasoning Plan — outline coordinate adjustments needed.
3. Step-by-Step Execution — refine room-by-room with explicit coordinate traces and area checks.
4. Final Answer — output the refined JSON inside a \boxed{} block.
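The geometric constraints in the prompt above lend themselves to automated verification. A minimal sketch, assuming the refined JSON follows the schema described earlier; the helper names are hypothetical, not part of the paper's pipeline:

```python
# Checks a post-processor might run on refined polygons against two of the
# constraints above: Manhattan (axis-aligned) edges and watertight shared
# edges. Hypothetical helpers for illustration only.

def is_manhattan(polygon, tol=1e-6):
    """polygon: list of {'x', 'y'} vertices. True iff every edge is axis-aligned."""
    pts = [(p["x"], p["y"]) for p in polygon]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:] + pts[:1]):
        if abs(x0 - x1) > tol and abs(y0 - y1) > tol:
            return False
    return True

def shares_exact_edge(poly_a, poly_b):
    """Watertight adjacency: the two rooms reuse an identical edge
    (same coordinate values, in either vertex order)."""
    def edges(poly):
        pts = [(p["x"], p["y"]) for p in poly]
        return {frozenset(e) for e in zip(pts, pts[1:] + pts[:1])}
    return bool(edges(poly_a) & edges(poly_b))
```

Two rooms that abut along a wall pass `shares_exact_edge` only if their shared edge uses identical coordinates, which is exactly the watertightness the prompt demands.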

BibTeX

@inproceedings{phung2026raster2seq,
  title     = {Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction},
  author    = {Phung, Hao and Averbuch-Elor, Hadar},
  booktitle = {Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
  year      = {2026},
}