Controllable video generation (CVG) has advanced rapidly, yet current systems falter when more than one actor must move, interact, and exchange positions under noisy control signals. We address this gap with DanceTogether, the first end-to-end diffusion framework that turns a single reference image plus independent pose-mask streams into long, photorealistic videos while strictly preserving every identity. A novel MaskPoseAdapter binds “who” and “how” at every denoising step by fusing robust tracking masks with semantically rich but noisy pose heat maps, eliminating the identity drift and appearance bleeding that plague frame-wise pipelines. To train and evaluate at scale, we introduce (i) PairFS-4K, 26 hours of dual-skater footage with more than 7,000 distinct IDs, (ii) HumanRob-300, a one-hour humanoid-robot interaction set for rapid cross-domain transfer, and (iii) TogetherVideoBench, a three-track benchmark centred on the DanceTogEval-100 test suite covering dance, boxing, wrestling, yoga, and figure skating. On TogetherVideoBench, DanceTogether outperforms prior art by a significant margin. Moreover, we show that a one-hour fine-tune yields convincing human-robot videos, underscoring broad generalisation to embodied-AI and human-robot interaction (HRI) tasks. Extensive ablations confirm that persistent identity-action binding is critical to these gains. Together, our model, datasets, and benchmark lift CVG from single-subject choreography to compositionally controllable, multi-actor interaction, opening new avenues for digital production, simulation, and embodied intelligence. Our code, benchmark, and dataset will be released.
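To make the identity-action binding concrete, here is a minimal PyTorch sketch of a MaskPoseAdapter-style fusion block. The shapes, layer choices, and mask-driven gating are our illustrative assumptions, not the released implementation; the sketch only shows the idea of projecting per-actor tracking masks ("who") and noisy pose heat maps ("how") into a shared conditioning space, with the masks gating the pose evidence at each denoising step.

```python
# Minimal sketch of a MaskPoseAdapter-style fusion block.
# All shapes, layer choices, and names are illustrative assumptions,
# not the authors' released implementation.
import torch
import torch.nn as nn

class MaskPoseAdapter(nn.Module):
    """Fuses per-actor tracking masks with pose heat maps into one
    conditioning feature, injected at every denoising step."""

    def __init__(self, num_actors: int, pose_channels: int, cond_dim: int = 320):
        super().__init__()
        # Masks carry identity: one binary channel per tracked actor.
        self.mask_proj = nn.Conv2d(num_actors, cond_dim, kernel_size=3, padding=1)
        # Pose heat maps carry motion semantics but are noisy.
        self.pose_proj = nn.Conv2d(pose_channels, cond_dim, kernel_size=3, padding=1)
        # Mask-derived gate suppresses pose evidence outside the tracked
        # actor regions, binding "who" to "how".
        self.gate = nn.Sequential(
            nn.Conv2d(cond_dim, cond_dim, kernel_size=1), nn.Sigmoid()
        )
        self.fuse = nn.Conv2d(2 * cond_dim, cond_dim, kernel_size=1)

    def forward(self, masks: torch.Tensor, pose_maps: torch.Tensor) -> torch.Tensor:
        # masks:     (B, num_actors, H, W) binary per-actor tracking masks
        # pose_maps: (B, pose_channels, H, W) noisy pose heat maps
        m = self.mask_proj(masks)
        p = self.pose_proj(pose_maps) * self.gate(m)  # gate pose by identity
        return self.fuse(torch.cat([m, p], dim=1))    # conditioning feature

# Toy usage: two actors, 17 pose channels, 64x64 latent resolution.
cond = MaskPoseAdapter(num_actors=2, pose_channels=17)(
    torch.rand(1, 2, 64, 64), torch.rand(1, 17, 64, 64)
)
```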
The leftmost video is the driving video; the reference images are shown above it.
Motion transfer case 1
Motion transfer case 2
Motion transfer case 3
Due to file size limitations, videos are compressed and quality may be affected.
The static image in the top-left corner is the reference image. Poses estimated from the ground-truth (GT) video are used as inputs to each baseline.
case01
case02
case03
case04
case05
case06
case07
case08
case09
case10
case11
case12
case13
case14
case15
case16
case17
Due to file size limitations, videos are compressed and quality may be affected.
Given either real video input or SMPL motion input, our pipeline extracts an individual tracked pose sequence and mask sequence for each person, as sketched below.
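The following sketch illustrates the per-person bookkeeping that produces those independent control streams. The helper callables (`detect_and_track`, `estimate_pose`, `segment_person`) are hypothetical placeholders for whatever off-the-shelf tracker, pose estimator, and segmenter one plugs in; nothing here is the released pipeline.

```python
# Illustrative sketch of per-person control-stream extraction.
# The three callables are hypothetical stand-ins, injected by the caller.
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List, Tuple

def extract_control_streams(
    frames: Iterable[Any],
    detect_and_track: Callable[[Any], List[Tuple[int, Any]]],  # frame -> [(id, box)]
    estimate_pose: Callable[[Any, Any], Any],                  # (frame, box) -> pose map
    segment_person: Callable[[Any, Any], Any],                 # (frame, box) -> mask
) -> Dict[int, Dict[str, list]]:
    """Collect one independent pose stream and one mask stream per tracked ID.

    For SMPL motion input, the same structure can instead be filled by
    rendering each person's SMPL mesh into a pose map and a silhouette mask.
    """
    streams: Dict[int, Dict[str, list]] = defaultdict(
        lambda: {"poses": [], "masks": []}
    )
    for frame in frames:
        # Consistent person IDs across frames keep each stream tied to one actor.
        for person_id, box in detect_and_track(frame):
            streams[person_id]["poses"].append(estimate_pose(frame, box))
            streams[person_id]["masks"].append(segment_person(frame, box))
    return dict(streams)
```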
@article{chen2025dancetogether,
  title={DanceTogether! Identity-Preserving Multi-Person Interactive Video Generation},
  author={Junhao Chen and Mingjin Chen and Jianjin Xu and Xiang Li and Junting Dong and Mingze Sun and Puhua Jiang and Hongxiang Li and Yuhang Yang and Hao Zhao and Xiaoxiao Long and Ruqi Huang},
  year={2025},
  eprint={2505.18078},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.18078},
}