Composing People Together

Iterative Pose-Image Generation for Multi-Person Interaction Scenes

Wenxuan Peng, Bharath Hariharan, and Hadar Averbuch-Elor

Cornell University

Accepted to SIGGRAPH 2026

Overview teaser for Composing People Together

PeopleComposer jointly generates pose and image, using pose as an evolving scene state to compose multi-person interaction scenes step by step.

Abstract

Despite recent progress, text-to-image models still struggle to generate semantically diverse and compositionally accurate multi-person interaction scenes, often collapsing to repetitive layouts, stereotypical poses, and poorly grounded interactions. In this work, we bridge this gap by introducing a dual pose-image representation that brings person-centric structural priors into pretrained diffusion transformers. Our model jointly predicts a 2D pose visualization image and its corresponding RGB image, enabling structure and appearance to co-evolve during learning. At its core, a cross-modal alignment scheme binds text, pose, and image representations, ensuring consistent grounding across modalities. Furthermore, we design an iterative scene construction scheme, progressively generating complex multi-human interactions while effectively decomposing the overall generation complexity. Extensive experiments demonstrate that our method substantially improves prompt alignment and scene diversity in multi-person image generation.

Method

The method combines a structural pose stream with role-aware cross-modal alignment, then uses pose as the state for iterative scene construction.

Dual pose-image diffusion transformer overview
Representation

Dual Pose-Image Generation

PeopleComposer encodes pose visualizations and RGB images as paired latent streams inside a diffusion transformer. Pose is predicted jointly with the image, giving the model an explicit, appearance-independent structure for human body configuration. Role-aware rotary positional encoding binds each person's text, bounding box region, pose tokens, and image tokens through a shared role index.
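To make the shared-role-index idea concrete, here is a minimal sketch of a rotary rotation that uses a person's role index as the position. It is not the paper's formulation, which this page does not spell out: it applies the standard rotary construction along a hypothetical role axis, and the names `rotate_by_role`, `dot`, and the `base` frequency are assumptions for illustration. The property it shows is that tokens carrying the same role index keep their original query-key similarity regardless of modality, while tokens with different role indices are rotated relative to each other.

```python
import math

def rotate_by_role(vec, role_index, base=10000.0):
    """Rotate consecutive feature pairs by an angle proportional to role_index.

    Standard rotary rotation, with the role index standing in for the
    position (an assumption made for this sketch).
    """
    if len(vec) % 2 != 0:
        raise ValueError("feature dimension must be even")
    half = len(vec) // 2
    out = []
    for i in range(half):
        theta = role_index * base ** (-i / half)  # one frequency per pair
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical features: a text-token query and a pose-token key.
text_query = [1.0, 0.0, 0.5, -0.5]
pose_key = [0.3, 0.7, -0.2, 0.9]

# Same person (role 2 on both sides): the rotations cancel in the dot product.
same_role = dot(rotate_by_role(text_query, 2), rotate_by_role(pose_key, 2))

# Different people (roles 1 and 3): the logit depends on the role offset.
cross_role = dot(rotate_by_role(text_query, 1), rotate_by_role(pose_key, 3))
```

Because the logit depends only on the difference between role indices, a person's text, box, pose, and image tokens can all be tagged with one index. How the role axis is combined with spatial positions in the actual model is not covered by this sketch.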

Iterative pose-image generation process
Generation

Iterative Scene Construction

Instead of generating all people at once, the model constructs the scene one person at a time. At each stage, the previously generated pose image represents the current scene state; the next person's description and layout are added; and the model predicts an updated pose state and image. The final-stage image is used as the generated result.
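The stage-by-stage procedure above can be sketched as a simple loop over people. This is only an outline of the control flow under stated assumptions: `PersonSpec`, `compose_scene`, and the `step_fn` signature are hypothetical names, the box format is assumed, and `fake_step` is a stand-in for the actual model call, which is not shown on this page.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1) layout box

@dataclass
class PersonSpec:
    description: str  # text describing this person
    box: Box          # where this person should appear

# Hypothetical step signature: (previous pose state, people so far) -> (pose, image)
StepFn = Callable[[Optional[Any], Sequence[PersonSpec]], Tuple[Any, Any]]

def compose_scene(people: Sequence[PersonSpec], step_fn: StepFn) -> Any:
    """Add one person per stage, carrying the pose image forward as scene state."""
    if not people:
        raise ValueError("at least one person is required")
    pose_state = None  # empty scene before the first stage
    image = None
    for stage in range(1, len(people) + 1):
        # Condition on the previous pose state plus every person added so far.
        pose_state, image = step_fn(pose_state, people[:stage])
    return image  # the final-stage image is the generated result

# Stand-in for the model, used only to exercise the loop.
seen_states: List[Optional[str]] = []

def fake_step(prev_pose, people_so_far):
    seen_states.append(prev_pose)
    n = len(people_so_far)
    return f"pose[{n}]", f"image[{n}]"

people = [
    PersonSpec("a uniformed escort holding an umbrella", (0.05, 0.2, 0.30, 0.9)),
    PersonSpec("a younger man in a dark suit", (0.30, 0.2, 0.50, 0.9)),
    PersonSpec("an older man in a dark suit", (0.50, 0.2, 0.70, 0.9)),
]
final_image = compose_scene(people, fake_step)
```

Intermediate images are discarded in this sketch; only the pose state is passed between stages, matching the description of pose as the evolving scene state.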

DrawWaldoWorlds Benchmark

DrawWaldoWorlds evaluates whether models can ground interaction roles and relations, not just place multiple people in a scene.

1. Start From Interaction Images

We build on the Waldo and Wenda test set, which contains real images annotated with human-human interactions for vision-language reasoning.

2. Create Tiered Prompts

A multimodal LLM reads each image and its interaction annotation, then writes prompts at three levels: short interaction-focused, moderately detailed, and fine-grained image-specific.

3. Evaluate With VQA Checks

Generated images are judged with yes/no questions matched to each tier. The reference image is used to construct the benchmark item, not as an input to the text-to-image model.

Reference image ID 012551 from Waldo and Wenda. This image anchors prompt and question construction.

Tier A

Prompt: "Two uniformed men escort two suited men through an honor cordon as snow falls."

Question: "Does this figure show {Prompt}?"

Tier B

Prompt: "Two uniformed service members escort two suited men, each service member holding an umbrella for one of the men as they walk on the wet stone steps."

Question: "Does this figure show {Prompt}?"

Tier C

Prompt: "In the center, two men in dark suits and ties walk side by side on wet stone steps, snow falling lightly around them. The man on the left walks with an easy stride, while the older man beside him moves in step, his expression calm and composed. To their left, a uniformed escort holds an umbrella angled over the younger man, walking slightly ahead to keep pace. Another escort follows behind on the right, holding an umbrella over the older man. The four move together in a measured, formal rhythm, their umbrellas dusted with snow as they ascend."

Decomposed questions:

a. "Does the image show two men in dark suits and ties walking side by side in the center, with one uniformed escort on each side accompanying them as they walk, in a snowy setting on stone steps?"

b. "Among the four people, is there a younger man in formal attire walking in the center-left, with a uniformed escort to the left holding an umbrella over him while walking slightly ahead?"

c. "Among the four people, is there an older man walking in the center-right, with a uniformed escort following slightly behind on the right while holding an umbrella over him?"
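The tier-to-question mapping in the example above can be sketched as follows. Tiers A and B fill the single template shown in the example, and Tier C uses its decomposed questions. The function names are hypothetical, the literal substitution of the prompt into the template is an assumption, and the aggregation in `yes_rate` (fraction of "yes" answers) is also an assumption: this page does not state how per-question answers are combined into a score.

```python
from typing import List, Optional, Sequence

QUESTION_TEMPLATE = "Does this figure show {Prompt}?"

def build_questions(tier: str, prompt: str,
                    decomposed: Optional[Sequence[str]] = None) -> List[str]:
    """Return the yes/no questions used to check one generated image."""
    if tier in ("A", "B"):
        # Single question: the prompt is substituted into the template verbatim.
        return [QUESTION_TEMPLATE.replace("{Prompt}", prompt)]
    if tier == "C":
        if not decomposed:
            raise ValueError("Tier C needs its decomposed questions")
        return list(decomposed)
    raise ValueError(f"unknown tier: {tier!r}")

def yes_rate(answers: Sequence[str]) -> float:
    """Fraction of answers that begin with 'yes' (assumed aggregation rule)."""
    if not answers:
        raise ValueError("no answers to score")
    hits = sum(1 for a in answers if a.strip().lower().startswith("yes"))
    return hits / len(answers)

tier_a = build_questions(
    "A",
    "Two uniformed men escort two suited men through an honor cordon as snow falls.",
)
tier_c = build_questions(
    "C",
    "(full Tier C prompt)",
    decomposed=["question a", "question b", "question c"],  # placeholders
)
```

The answers themselves would come from a VQA model run on the generated image; that call is left out because the model used is not named here.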

Results Gallery

Representative Tier C examples show PeopleComposer following fine-grained interaction prompts while revealing the iterative pose-image construction process.

Interactive Visualization

We provide an interactive visualization with 50 randomly selected DrawWaldoWorlds samples, including tiered prompts, generated examples, comparisons to other models, and VQA score annotations.


Citation

@inproceedings{peng2026composing,
  title     = {Composing People Together: Iterative Pose-Image Generation for Multi-Person Interaction Scenes},
  author    = {Peng, Wenxuan and Hariharan, Bharath and Averbuch-Elor, Hadar},
  booktitle = {SIGGRAPH Conference Papers},
  year      = {2026},
  doi       = {10.1145/3799902.3811129}
}