Raster2Seq

Polygon Sequence Generation for Floorplan Reconstruction

Hao Phung Hadar Averbuch-Elor

Cornell University

SIGGRAPH 2026

We illustrate results on held-out CubiCasa5K test samples (left) and on real-world WAFFLE floorplan images (right).
*3D visualizations are constructed by extending the 2D boundaries vertically.

Outdoor Kitchen Living room Bedroom Bath Entry Storage Garage Undefined

Door Window

Our approach transforms a rasterized floorplan image into a vectorized format, reconstructing both its structure and its semantics.

TL;DR: We introduce Raster2Seq, a framework reformulating Raster2Vector conversion as next-corner prediction, handling floorplans of arbitrary length.

Abstract

Reconstructing a structured vector-graphics representation from a rasterized floorplan image is typically an important prerequisite for computational tasks involving floorplans, such as automated understanding or CAD workflows. However, existing techniques struggle to faithfully generate the structure and semantics conveyed by complex floorplans that depict large indoor spaces with many rooms and varying numbers of polygon corners. To this end, we propose Raster2Seq, framing floorplan reconstruction as a sequence-to-sequence task in which floorplan elements (such as rooms, windows, and doors) are represented as labeled polygon sequences that jointly encode geometry and semantics. Our approach introduces an autoregressive decoder that learns to predict the next corner conditioned on image features and previously generated corners, using guidance from learnable anchors. These anchors represent spatial coordinates in image space, allowing them to direct the attention mechanism toward informative image regions. By embracing the autoregressive mechanism, our method offers flexibility in the output format, enabling it to efficiently handle complex floorplans with numerous rooms and diverse polygon structures. Our method achieves state-of-the-art performance on standard benchmarks such as Structured3D, CubiCasa5K, and Raster2Graph, while also demonstrating strong generalization to more challenging datasets like WAFFLE, which contain diverse room structures and complex geometric variations.

Qualitative Results on WAFFLE

Qualitative comparison with RoomFormer on unseen WAFFLE floorplan images; both models are trained on CubiCasa5K. As illustrated below, our model exhibits stronger generalization capabilities over the structures of real-world Internet data.

Church of Saint James

Raster2Seq reconstruction for Church of Saint James the Greater in Rovny — Raster2Seq

Teltow Canal Power Station

Raster2Seq reconstruction for Teltow Canal Power Station — Raster2Seq

Church of Saint Nicholas

Raster2Seq reconstruction for Church of Saint Nicholas — Raster2Seq

Imkerhaus

Raster2Seq reconstruction for Imkerhaus — Raster2Seq

Palais du Louvre

Raster2Seq reconstruction for Palais du Louvre — Raster2Seq

Palmer Mansion

Raster2Seq reconstruction for Palmer Mansion — Raster2Seq

More Qualitative Results

Each example shows a pair of images: the input image and the output reconstruction generated by Raster2Seq. Qualitative results are shown for the Structured3D-B, CubiCasa5K, and Raster2Graph datasets.
*Structured3D-B denotes our binary raster version of Structured3D, constructed from ground-truth annotations to resemble standard floorplan drawings rather than the density-map inputs used in the original dataset.

Living Room Kitchen Bedroom Bathroom Balcony Corridor Dining room Study Studio Store room Garden Laundry room Office Basement Garage Misc

Door Window

Structured3D sample 3250 ground-truth image — Input Image

Structured3D sample 3250 predicted floorplan — Output Reconstruction

Structured3D sample 3253 ground-truth image — Input Image

Structured3D sample 3253 predicted floorplan — Output Reconstruction

Structured3D sample 3268 ground-truth image — Input Image

Structured3D sample 3268 predicted floorplan — Output Reconstruction

Structured3D sample 3274 ground-truth image — Input Image

Structured3D sample 3274 predicted floorplan — Output Reconstruction

Structured3D sample 3277 ground-truth image — Input Image

Structured3D sample 3277 predicted floorplan — Output Reconstruction

Structured3D sample 3301 ground-truth image — Input Image

Structured3D sample 3301 predicted floorplan — Output Reconstruction

Outdoor Kitchen Living room Bedroom Bath Entry Storage Garage Undefined

Door Window

CubiCasa5K sample 6028 ground-truth image — Input Image

CubiCasa5K sample 6028 predicted floorplan — Output Reconstruction

CubiCasa5K sample 6170 ground-truth image — Input Image

CubiCasa5K sample 6170 predicted floorplan — Output Reconstruction

CubiCasa5K sample 6197 ground-truth image — Input Image

CubiCasa5K sample 6197 predicted floorplan — Output Reconstruction

CubiCasa5K sample 6251 ground-truth image — Input Image

CubiCasa5K sample 6251 predicted floorplan — Output Reconstruction

CubiCasa5K sample 6261 ground-truth image — Input Image

CubiCasa5K sample 6261 predicted floorplan — Output Reconstruction

CubiCasa5K sample 6265 ground-truth image — Input Image

CubiCasa5K sample 6265 predicted floorplan — Output Reconstruction

Unknown Living room Kitchen Bedroom Bathroom Restroom Balcony Closet Corridor Washing room PS Outside

Raster2Graph sample 010332 ground-truth image — Input Image

Raster2Graph sample 010332 predicted floorplan — Output Reconstruction

Raster2Graph sample 010335 ground-truth image — Input Image

Raster2Graph sample 010335 predicted floorplan — Output Reconstruction

Raster2Graph sample 010338 ground-truth image — Input Image

Raster2Graph sample 010338 predicted floorplan — Output Reconstruction

Raster2Graph sample 010339 ground-truth image — Input Image

Raster2Graph sample 010339 predicted floorplan — Output Reconstruction

Raster2Graph sample 010340 ground-truth image — Input Image

Raster2Graph sample 010340 predicted floorplan — Output Reconstruction

Raster2Graph sample 010341 ground-truth image — Input Image

Raster2Graph sample 010341 predicted floorplan — Output Reconstruction

Quantitative Results

Results on Standard Benchmarks

Quantitative comparison on Structured3D, CubiCasa5K, and Raster2Graph datasets, evaluating F1 scores across geometric predictions (Room, Corner, Angle) and semantic predictions (Room Semantic, Window & Door).

We compare performance over the raster-to-vector conversion task across three datasets. Overall, our method achieves state-of-the-art performance on both structural metrics (Room and Corner) and semantic metrics (Room Semantic and Window & Door).
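For reference, an F1 score is the harmonic mean of precision and recall. The snippet below is a minimal sketch, assuming predictions have already been matched to ground-truth elements; the matching criteria follow each benchmark's protocol and are not reproduced here. The function name and its count arguments are illustrative, not taken from the authors' evaluation code.

```python
def f1_score(num_matched: int, num_predicted: int, num_ground_truth: int) -> float:
    """Return F1 in percent from match counts (matching is assumed done upstream)."""
    if num_predicted == 0 or num_ground_truth == 0:
        return 0.0
    precision = num_matched / num_predicted
    recall = num_matched / num_ground_truth
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)


# Hypothetical example: 9 of 10 predicted rooms are matched, out of 12 ground-truth rooms.
print(round(f1_score(9, 10, 12), 1))  # 81.8
```

Because F1 penalizes both missed and spurious elements, a model that predicts too few rooms on a large floorplan loses recall even if every predicted room is correct.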

Note that not all models include semantic predictions, and Raster2Graph does not include Window & Door annotations. The Raster2Graph model can only be evaluated on its own dataset because it requires per-corner neighboring room-class annotations.
Method	Room	Corner	Angle	Room Semantic	Window & Door
Structured3D-B
HEAT	94.7	84.5	79.6	-	-
PolyRoom	98.9	96.0	91.9	-	-
FRI-Net	96.5	85.4	83.3	-	-
RoomFormer	95.1	91.7	83.2	74.2	94.1
Ours	99.6	98.3	92.7	76.9	98.5
CubiCasa5K
HEAT	78.2	53.7	32.3	-	-
PolyRoom	54.1	37.1	23.0	-	-
FRI-Net	77.1	50.8	38.0	-	-
RoomFormer	83.5	55.5	34.1	63.0	78.5
Ours	88.7	59.4	37.4	63.8	77.8
Raster2Graph
HEAT	95.9	79.7	50.9	-	-
PolyRoom	56.9	42.4	23.8	-	-
FRI-Net	91.5	72.3	52.8	-	-
RoomFormer	91.9	74.5	51.1	79.5	-
Raster2Graph	95.0	78.3	67.3	83.4	-
Ours	97.0	80.3	66.6	85.1	-

We conduct a comparison on the standard Structured3D benchmark, providing our model with density map inputs for both training and testing. Each density map is generated from a top-view projection of the 3D point cloud. The bottom rows report performance after applying PolyDiffuse (PD), a recent refinement method. As shown, our method demonstrates competitive performance on this benchmark and is compatible with existing refinement methods, which enable further performance gains.

Notably, when semantic room types are included, RoomFormer exhibits a significant performance drop of roughly 3–5 points (see the parenthesized deltas in the table). By contrast, our model effectively captures both spatial and semantic attributes without compromising performance.
Method	Room	Corner	Angle
MonteFloor	95.0	82.5	80.5
HEAT	95.4	82.5	78.3
PolyRoom	98.3	90.2	85.2
FRI-Net	99.1	87.8	86.9
RoomFormer	97.5	87.4	81.4
RoomFormer (w/ semantic)	94.4 (-3.1)	83.7 (-3.7)	76.2 (-5.2)
Ours	98.7	89.4	82.5
Ours (w/ semantic)	98.8	90.0	84.2
FRI-Net + PD	99.1	91.1	89.2
RoomFormer + PD	98.4	91.0	89.1
Ours + PD	99.2	91.2	89.0

Model Generalization

We perform a cross-evaluation experiment across different train-test dataset configurations. We evaluate performance using the metrics reported previously: Room F1 for the CubiCasa5K and Raster2Graph datasets and IoU for WAFFLE. Cross-evaluation heatmaps show performance across evaluation datasets (rows) and training datasets (columns), with hotter colors denoting higher performance.
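As a rough illustration of the IoU metric, the sketch below computes intersection-over-union between two binary masks of equal size. How the masks are produced for WAFFLE (for example, by rasterizing predicted polygons, or per class) is not specified here, so the mask representation and the function name are assumptions.

```python
from typing import List

Mask = List[List[int]]  # 1 = pixel inside the region, 0 = background


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection-over-union of two same-sized binary masks."""
    inter = 0
    union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention chosen for this sketch: two empty masks count as a perfect match.
    return inter / union if union else 1.0


pred = [[1, 1, 0],
        [1, 1, 0]]
gt = [[0, 1, 1],
      [0, 1, 1]]
print(round(mask_iou(pred, gt), 3))  # 0.333 (2 shared pixels out of 6 covered)
```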

Why sequential prediction?

Performance vs. floorplan complexity—as approximated by the total number of polygons (left) and the total number of corners (right). As illustrated above over Structured3D-B (top) and CubiCasa5K (bottom), our approach yields larger gains as the floorplan complexity increases.

Reconstructing a structured vector-graphics representation from a rasterized floorplan image is a fundamental prerequisite for computational tasks involving floorplans, such as automated understanding or CAD workflows. However, existing techniques struggle to faithfully generate the structure and semantics conveyed by complex floorplans that depict large indoor spaces with many rooms and varying numbers of polygon corners. One popular paradigm is to simultaneously predict all structural floorplan elements, as in RoomFormer and FRI-Net. While these models perform similarly to ours on simpler cases, they exhibit a notable performance drop in complex scenes with more than 15 polygons or 150 corners. As shown in the figure above, our method remains more robust as floorplan complexity increases. In particular, RoomFormer relies on a fixed number of room queries (e.g., 2800); exceeding this capacity can trigger out-of-memory errors and increase computation due to quadratic attention costs. By contrast, our method formulates floorplan conversion as a sequence-to-sequence task, generating polygon coordinates autoregressively. This naturally handles variable-length polygons while allowing us to decompose floorplan reconstruction into interpretable, sequential predictions that mirror the natural CAD design workflow.

How does it work?

📜 Labeled corner sequence representation. Each polygon is represented as a sequence of labeled corners — spatial coordinates paired with semantic labels (rooms, windows, doors) — and polygons are sorted left-to-right across the floorplan. This representation naturally accommodates inputs and outputs of variable lengths.

🔗 Anchor-based autoregressive decoder. The core of our framework predicts the next labeled corner by fusing image features and previously generated corners, guided by learnable anchors that steer attention toward informative image regions for efficient handling of complex floorplans.

🏷️ Token-level semantic supervision. A per-corner semantic classification loss applied to individual corner embeddings preserves semantic fidelity throughout autoregressive generation.
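To make the labeled corner sequence concrete, here is a minimal sketch of how a set of labeled polygons might be flattened into one token sequence. It assumes polygons are ordered by their leftmost x coordinate (ties broken by the topmost y), separated by a `<SEP>` delimiter, and terminated by a hypothetical `<EOS>` token; the `Polygon` class, the sort key, and `<EOS>` are illustrative choices, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

SEP = "<SEP>"  # delimiter between polygons
EOS = "<EOS>"  # hypothetical end-of-sequence marker (assumption)

Corner = Tuple[float, float, str]  # (x, y, semantic label)
Token = Union[str, Corner]


@dataclass
class Polygon:
    label: str                          # e.g. "Kitchen", "Door", "Window"
    corners: List[Tuple[float, float]]  # ordered (x, y) vertices


def serialize(polygons: List[Polygon]) -> List[Token]:
    """Flatten polygons into a single labeled-corner sequence."""
    # Assumed ordering: left-to-right by the polygon's minimum x, then minimum y.
    ordered = sorted(
        polygons,
        key=lambda p: (min(x for x, _ in p.corners), min(y for _, y in p.corners)),
    )
    tokens: List[Token] = []
    for i, poly in enumerate(ordered):
        if i > 0:
            tokens.append(SEP)
        tokens.extend((x, y, poly.label) for x, y in poly.corners)
    tokens.append(EOS)
    return tokens


rooms = [
    Polygon("Kitchen", [(100, 10), (100, 60), (160, 60), (160, 10)]),
    Polygon("Bedroom", [(10, 10), (10, 60), (90, 60), (90, 10)]),
]
for token in serialize(rooms):
    print(token)
```

Since the sequence simply grows with the number of polygons and corners, no fixed budget of rooms or corners per room has to be chosen in advance.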

Given a rasterized floorplan image (left), Raster2Seq converts it into a vectorized representation as a labeled polygon sequence, with polygons delimited by special <SEP> tokens. The core component is an anchor-based autoregressive decoder that predicts the next token from image features (\(f_\text{img}\)), learnable anchors (\(v_\text{anc}\)), and previously generated tokens. Above, we visualize the first two predicted labeled polygons (in orange and pink, respectively).
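The generation loop described above can be sketched as follows. The callable `next_token` stands in for the anchor-based decoder (which in the real model is conditioned on image features, learnable anchors, and the prefix); here it is replaced by a scripted stub so the example runs on its own. The `<EOS>` stop token, the `max_tokens` cap, and all function names are assumptions made for illustration.

```python
from typing import Callable, List, Tuple, Union

SEP, EOS = "<SEP>", "<EOS>"        # <EOS> is a hypothetical stop token
Corner = Tuple[float, float, str]  # (x, y, semantic label)
Token = Union[str, Corner]


def decode(next_token: Callable[[List[Token]], Token],
           max_tokens: int = 1000) -> List[List[Corner]]:
    """Generate tokens one at a time and group corners into polygons."""
    generated: List[Token] = []
    polygons: List[List[Corner]] = [[]]
    for _ in range(max_tokens):
        tok = next_token(generated)  # placeholder for the decoder's prediction
        if tok == EOS:
            break
        generated.append(tok)
        if tok == SEP:
            polygons.append([])      # start a new polygon
        else:
            polygons[-1].append(tok)
    return [p for p in polygons if p]  # drop empty groups


# Scripted stand-in for a trained model: returns the next token of a fixed sequence.
script: List[Token] = [
    (0, 0, "Kitchen"), (4, 0, "Kitchen"), (4, 3, "Kitchen"),
    SEP,
    (5, 0, "Door"), (6, 0, "Door"),
    EOS,
]


def scripted(prefix: List[Token]) -> Token:
    return script[len(prefix)]


print(decode(scripted))
```

The loop stops whenever the stop token appears, so the number of polygons and the number of corners per polygon are both decided by the model at inference time.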

Downstream Applications

While our method outperforms existing works across various metrics, it does not directly enforce geometric constraints, which can cause predicted outputs to exhibit artifacts on noisy datasets such as CubiCasa5K (see results below). To address this, we introduce a VLM-based vectorization refinement procedure that naturally builds on our polygon sequence representation and further improves reconstruction accuracy, highlighting the flexibility of our representation for integrating higher-level reasoning modules.

Given an input JSON specifying the vectorized floorplan predicted by our method, a VLM refines it using the rasterized floorplan, the vectorized overlay, the vectorized floorplan alone, and the adjacency graph as additional inputs. Users can specify geometric constraints in the refinement prompt; the VLM then outputs the refined JSON.

{
  "room_count": 7,
  "room_area": 25517.0,
  "spaces": [
    {
      "id": "Undefined|0",
      "room_type": "Undefined",
      "floor_polygon": [{ "x": 49.0, "y": 12.0 }, …, { "x": 145.0, "y": 12.0 }],
      "area": 10561.0,
      "graph": ["Kitchen|1", "Bed Room|2", "Bath|3", "Kitchen|4", "Bed Room|6"]
    },
    …
  ]
}
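Each space in the JSON carries an `area` alongside its `floor_polygon`. One standard way to derive a polygon's area from its vertices is the shoelace formula, sketched below. Whether the authors compute `area` this way, and in which units, is not stated, so treat this as an assumption; the rectangle used in the example is made up and is not the elided polygon above.

```python
from typing import Dict, List


def polygon_area(floor_polygon: List[Dict[str, float]]) -> float:
    """Unsigned area of a simple polygon given as ordered {x, y} vertices."""
    twice_area = 0.0
    n = len(floor_polygon)
    for i in range(n):
        a = floor_polygon[i]
        b = floor_polygon[(i + 1) % n]  # wrap around to close the polygon
        twice_area += a["x"] * b["y"] - b["x"] * a["y"]
    return abs(twice_area) / 2.0


# Hypothetical 20 x 10 rectangle in image coordinates.
rect = [
    {"x": 0.0, "y": 0.0},
    {"x": 0.0, "y": 10.0},
    {"x": 20.0, "y": 10.0},
    {"x": 20.0, "y": 0.0},
]
print(polygon_area(rect))  # 200.0
```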

You are a specialized Architectural Geometry AI. Your expertise lies in topological refinement: transforming JSON specifications and visual raster data-including bubble diagrams and vectorized drafts-into precise, non-overlapping floorplans by generating optimized xy coordinates.

Goal: Produce an optimal arrangement of floorplan elements that maximizes area utilization. The algorithm must prioritize the spatial logic of the Floorplan Raster while using the Draft JSON only as a topological and proportional guide.

Inputs:
- JSON Specification: Contains preliminary room dimensions, labels, and connectivity requirements. Note: These numerical values (area, height, width) are derived from a rough draft and serve only as a proportional guide. They should be refined to match the visual scale and alignment of the Original Floorplan Raster.
- Original Floorplan Raster (Image A): The architectural blueprint for alignment and scale.
- Vectorized floorplan rendering (Image B): Shows the spatial arrangement and room IDs where each floorplan object is colored with type|id labels.
- Vectorized floorplan rendering overlaid (Image C): Shows the spatial arrangement and room IDs overlaid on top of original floorplan raster.
- Adjacency Graph (Image D): Defines the topological connections.

Output: JSON file containing refined polygons.
The JSON object must contain 'output' key storing these attributes:
- 'room_count': the total number of room entries
- 'spaces': a list of refined rooms. Each room entry must include:
  - 'id': formatted as <room_type>|<unique_index> (e.g. "bedroom|2" or "interior_door|0")
  - 'room_type': the room type (e.g. "living_room", "kitchen", etc.)
  - 'area' in square meters (all positive numbers)
  - 'floor_polygon': an ordered list of {x, y} vertices defining a polygon after refinement
  - 'graph': store a list of adjacent space object 'id'

Spatial Reference System:
- Coordinate Space: All vertex calculations must be performed within a fixed [0, 256] coordinate system.
- Origin: (0,0) represents the top-left corner of the Original Floorplan Raster.
- Polygons in JSON are ordered by counter-clockwise direction.

Refinement Constraints:
- Contextual Overlaps: While polygons should generally avoid unwarranted intersections, minor overlaps are permitted if supported by the Original Floorplan Raster.
- Watertight Adjacency: Rooms sharing a boundary must use exact same coordinate values for the shared edge.
- Identity Preservation: Every id must be preserved and accurately repositioned.
- Scale Fidelity: The final area should approximate width × height to match the Original Raster's proportions.
- Truth Hierarchy: In conflicts between Draft JSON and Floorplan Raster, the Raster is the primary source of truth.
- Manhattan Style: All edges must be axis-aligned unless the Raster explicitly shows a non-90° angle.

Procedure (Mandatory):
1. Problem analysis — identify geometric failures (overlaps, disconnected rooms, scale mismatches).
2. Reasoning Plan — outline coordinate adjustments needed.
3. Step-by-Step Execution — refine room-by-room with explicit coordinate traces and area checks.
4. Final Answer — output the refined JSON inside a \boxed{} block.
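Several of the constraints in the prompt can be checked mechanically once the VLM returns its JSON. The sketch below is one possible post-hoc check, not part of the authors' pipeline: it verifies that ids are preserved, that `room_count` matches the number of spaces, that vertices stay inside the [0, 256] coordinate space, and it flags non-axis-aligned edges for review (the prompt permits them when the raster shows them). The function name and the message strings are hypothetical, and the function expects the object stored under the 'output' key.

```python
from typing import Dict, List


def check_refined(draft: Dict, refined: Dict,
                  lo: float = 0.0, hi: float = 256.0) -> List[str]:
    """Return human-readable findings; an empty list means no issue was flagged."""
    findings: List[str] = []

    # Identity Preservation: the same set of ids must appear before and after.
    draft_ids = {s["id"] for s in draft["spaces"]}
    refined_ids = {s["id"] for s in refined["spaces"]}
    if draft_ids != refined_ids:
        findings.append(f"id mismatch: {sorted(draft_ids ^ refined_ids)}")

    if refined.get("room_count") != len(refined["spaces"]):
        findings.append("room_count does not match number of spaces")

    for space in refined["spaces"]:
        poly = space["floor_polygon"]
        # Coordinate Space: every vertex must lie in [lo, hi] on both axes.
        if any(not (lo <= v["x"] <= hi and lo <= v["y"] <= hi) for v in poly):
            findings.append(f"{space['id']}: vertex outside [{lo}, {hi}]")
        # Manhattan Style: flag edges that are neither horizontal nor vertical.
        for a, b in zip(poly, poly[1:] + poly[:1]):
            if a["x"] != b["x"] and a["y"] != b["y"]:
                findings.append(f"{space['id']}: non-axis-aligned edge (review)")
                break
    return findings


def _square(x0: float, y0: float, size: float) -> List[Dict[str, float]]:
    return [{"x": x0, "y": y0}, {"x": x0, "y": y0 + size},
            {"x": x0 + size, "y": y0 + size}, {"x": x0 + size, "y": y0}]


draft = {"room_count": 1, "spaces": [{"id": "kitchen|0", "floor_polygon": _square(0, 0, 50)}]}
good = {"room_count": 1, "spaces": [{"id": "kitchen|0", "floor_polygon": _square(10, 10, 40)}]}
bad = {"room_count": 2, "spaces": [{"id": "bath|1", "floor_polygon": _square(250, 0, 20)}]}

print(check_refined(draft, good))  # []
print(check_refined(draft, bad))
```

A check like this does not replace the VLM's reasoning; it only catches outputs that violate the stated format so they can be rejected or re-prompted.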

Results

BibTeX


@inproceedings{phung2026raster2seq,
  title = {Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction},
  author = {Phung, Hao and Averbuch-Elor, Hadar},
  booktitle = {Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
  year = {2026},
}