Composing People Together

Abstract

Despite recent progress, text-to-image models still struggle to generate semantically diverse and compositionally accurate multi-person interaction scenes, often collapsing to repetitive layouts, stereotypical poses, and poorly grounded interactions. In this work, we bridge this gap by introducing a dual pose-image representation that brings person-centric structural priors into pretrained diffusion transformers. Our model jointly predicts a 2D pose visualization image and its corresponding RGB image, enabling structure and appearance to co-evolve during learning. At its core, a cross-modal alignment scheme binds text, pose, and image representations, ensuring consistent grounding across modalities. Furthermore, we design an iterative scene construction scheme, progressively generating complex multi-human interactions while effectively decomposing the overall generation complexity. Extensive experiments demonstrate that our method substantially improves prompt alignment and scene diversity in multi-person image generation.

Method

The method combines a structural pose stream with role-aware cross-modal alignment, then uses pose as the state for iterative scene construction.

[Figure: Dual pose-image diffusion transformer overview]

Representation

Dual Pose-Image Generation

PeopleComposer encodes pose visualizations and RGB images as paired latent streams inside a diffusion transformer. Pose is predicted jointly with the image, giving the model an explicit, appearance-independent structure for human body configuration. Role-aware rotary positional encoding binds each person's text, bounding box region, pose tokens, and image tokens through a shared role index.
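
To make the idea of a shared role index concrete, here is a minimal sketch of a rotary rotation keyed on a role index. It is an illustration under assumptions, not the PeopleComposer implementation: the function name `rotate_by_role`, the frequency base, and the use of a single 1D rotary axis are all hypothetical, and the page does not say how role channels are combined with spatial positions. The sketch only shows the general rotary property that tokens carrying the same index keep their original attention score, while the score between different indices depends on the index offset.

```python
import math

def rotate_by_role(vec, role_index, base=10000.0):
    """Rotate consecutive channel pairs by an angle proportional to role_index.

    Hypothetical helper: a standard rotary rotation where the "position"
    is the person's role index rather than a spatial coordinate.
    """
    assert len(vec) % 2 == 0, "channel count must be even"
    half = len(vec) // 2
    out = []
    for i in range(half):
        freq = base ** (-i / half)      # per-pair rotation frequency
        angle = role_index * freq
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query (e.g. a text token) and key (e.g. a pose token).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

unrotated = dot(q, k)
same_role = dot(rotate_by_role(q, 2), rotate_by_role(k, 2))   # both belong to person 2
diff_role = dot(rotate_by_role(q, 1), rotate_by_role(k, 2))   # persons 1 and 2
shifted = dot(rotate_by_role(q, 3), rotate_by_role(k, 4))     # persons 3 and 4
```

In this sketch, `same_role` matches `unrotated` up to floating-point error, and `diff_role` matches `shifted` because only the index offset matters. That relative behavior is what lets one index tie together tokens from different modalities that describe the same person.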

Generation

Iterative Scene Construction

Instead of generating all people at once, the model constructs the scene one person at a time. At each stage, the previously generated pose image represents the current scene state; the next person's description and layout are added; and the model predicts an updated pose state and image. The final-stage image is used as the generated result.
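
The loop described above can be sketched as follows. This is a schematic under stated assumptions: `PersonSpec`, `construct_scene`, and the stage-function signature are hypothetical names introduced for illustration, and `toy_stage` is a stand-in that only tracks layout boxes so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class PersonSpec:
    description: str                          # text describing this person
    box: Tuple[float, float, float, float]    # hypothetical (x0, y0, x1, y1) layout

# Hypothetical stage function: (pose_state, person) -> (updated_pose_state, image)
StageFn = Callable[[Optional[object], PersonSpec], Tuple[object, object]]

def construct_scene(people: List[PersonSpec], stage_fn: StageFn):
    """Add one person per stage, carrying the pose state forward."""
    pose_state = None   # empty scene before the first person
    image = None
    for person in people:
        # The previous pose state is the scene so far; the stage returns
        # an updated pose state and an image that includes the new person.
        pose_state, image = stage_fn(pose_state, person)
    # Only the final-stage image is kept as the generated result.
    return pose_state, image

def toy_stage(pose_state, person):
    """Stand-in for the model: records boxes instead of predicting poses."""
    poses = list(pose_state or []) + [person.box]
    return poses, f"image with {len(poses)} people"

people = [
    PersonSpec("a uniformed escort holding an umbrella", (0.1, 0.2, 0.4, 0.9)),
    PersonSpec("a man in a dark suit", (0.4, 0.2, 0.7, 0.9)),
]
final_pose, final_image = construct_scene(people, toy_stage)
print(final_image)
```

Passing the pose state between stages, instead of the RGB image, matches the description above, where the previously generated pose image represents the current scene state.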

DrawWaldoWorlds Benchmark

DrawWaldoWorlds evaluates whether models can ground interaction roles and relations, not just place multiple people in a scene.

Benchmark prompt/question JSON

1. Start From Interaction Images

We build on the Waldo and Wenda test set, which contains real images annotated with human-human interactions for vision-language reasoning.

2. Create Tiered Prompts

A multimodal LLM reads each image and its interaction annotation, then writes prompts at three levels: short interaction-focused, moderately detailed, and fine-grained image-specific.

3. Evaluate With VQA Checks

Generated images are judged with yes/no questions matched to each tier. The reference image is used to construct the benchmark item, not as an input to the text-to-image model.
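
One plausible way to turn yes/no judgments into per-tier scores is sketched below. The aggregation rule (fraction of "yes" answers per tier) and the function name `tier_scores` are assumptions for illustration. The page does not specify how VQA answers are combined, and the answers would come from a VQA model that is not shown here.

```python
from collections import defaultdict

def tier_scores(records):
    """Return the fraction of 'yes' answers for each tier.

    records: iterable of (tier, answer) pairs, where answer is the
    yes/no string returned by a VQA judge (hypothetical format).
    """
    totals = defaultdict(int)
    yes_counts = defaultdict(int)
    for tier, answer in records:
        totals[tier] += 1
        if answer.strip().lower() == "yes":
            yes_counts[tier] += 1
    return {tier: yes_counts[tier] / totals[tier] for tier in totals}

# Made-up answers, used only to exercise the function.
example = [("A", "yes"), ("A", "no"), ("C", "Yes"), ("C", "yes")]
scores = tier_scores(example)
print(scores)
```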

Reference image ID 012551 from Waldo and Wenda. This image anchors prompt and question construction.

Tier A

Prompt: "Two uniformed men escort two suited men through an honor cordon as snow falls."

Question: "Does this figure show {Prompt}?"

Tier B

Prompt: "Two uniformed service members escort two suited men, each service member holding an umbrella for one of the men as they walk on the wet stone steps."

Question: "Does this figure show {Prompt}?"

Tier C

Prompt: "In the center, two men in dark suits and ties walk side by side on wet stone steps, snow falling lightly around them. The man on the left walks with an easy stride, while the older man beside him moves in step, his expression calm and composed. To their left, a uniformed escort holds an umbrella angled over the younger man, walking slightly ahead to keep pace. Another escort follows behind on the right, holding an umbrella over the older man. The four move together in a measured, formal rhythm, their umbrellas dusted with snow as they ascend."

Decomposed questions:

a. "Does the image show two men in dark suits and ties walking side by side in the center, with one uniformed escort on each side accompanying them as they walk, in a snowy setting on stone steps?"

b. "Among the four people, is there a younger man in formal attire walking in the center-left, with a uniformed escort to the left holding an umbrella over him while walking slightly ahead?"

c. "Among the four people, is there an older man walking in the center-right, with a uniformed escort following slightly behind on the right while holding an umbrella over him?"
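
The example above could be stored as one record per reference image and expanded into concrete questions as sketched here. The field names (`reference_image_id`, `tiers`, `prompt`, `question_templates`) and the helper `expand_questions` are hypothetical. Only the image ID, prompts, and question template are taken from the example. The substitution is literal, so the prompt's own punctuation is kept as-is, since the page does not say how it is handled.

```python
import json

# Hypothetical record layout for one benchmark item (Tiers A and B only).
item = {
    "reference_image_id": "012551",
    "tiers": {
        "A": {
            "prompt": "Two uniformed men escort two suited men through "
                      "an honor cordon as snow falls.",
            "question_templates": ["Does this figure show {Prompt}?"],
        },
        "B": {
            "prompt": "Two uniformed service members escort two suited men, "
                      "each service member holding an umbrella for one of "
                      "the men as they walk on the wet stone steps.",
            "question_templates": ["Does this figure show {Prompt}?"],
        },
    },
}

def expand_questions(tier):
    """Fill the {Prompt} placeholder; templates without it pass through unchanged."""
    return [t.replace("{Prompt}", tier["prompt"]) for t in tier["question_templates"]]

questions = {name: expand_questions(tier) for name, tier in item["tiers"].items()}
print(json.dumps(questions, indent=2))
```

Tier C would fit the same layout, with the three decomposed questions listed as templates that contain no placeholder.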

Results Gallery

Representative Tier C examples show PeopleComposer following fine-grained interaction prompts while revealing the iterative pose-image construction process.

Interactive Visualization

We provide an interactive visualization with 50 randomly selected DrawWaldoWorlds samples, including tiered prompts, generated examples, comparisons to other models, and VQA score annotations.


Citation

@inproceedings{peng2026composing,
  title     = {Composing People Together: Iterative Pose-Image Generation for Multi-Person Interaction Scenes},
  author    = {Peng, Wenxuan and Hariharan, Bharath and Averbuch-Elor, Hadar},
  booktitle = {SIGGRAPH Conference Papers},
  year      = {2026},
  doi       = {10.1145/3799902.3811129}
}