EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer

ICLR 2026

Yuxiao Yang¹,²     Hualian Sheng²     Sijia Cai²,*     Jing Lin³
Jiahao Wang⁴     Bing Deng²     Junzhe Lu¹     Haoqian Wang¹,†     Jieping Ye²,†

¹Tsinghua University     ²Alibaba Group     ³Nanyang Technological University     ⁴Xi'an Jiaotong University
*Project Lead    †Corresponding Author

This is the official GitHub page of EchoMotion. The code has been released at D2I-ai.

💡 Abstract

Video generation models have advanced significantly, yet they still struggle to synthesize complex human movements due to the high degrees of freedom in human articulation. This limitation stems from the intrinsic constraints of pixel-only training objectives, which inherently bias models toward appearance fidelity at the expense of learning underlying kinematic principles. To address this, we introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion, thereby improving the quality of complex human action video generation. EchoMotion extends the DiT (Diffusion Transformer) framework with a dual-branch architecture that jointly processes tokens concatenated from different modalities. Furthermore, we propose MVS-RoPE (Motion-Video Synchronized RoPE), which offers unified 3D positional encoding for both video and motion tokens. By providing a synchronized coordinate system for the dual-modal latent sequence, MVS-RoPE establishes an inductive bias that fosters temporal alignment between the two modalities. We also propose a Motion-Video Two-Stage Training Strategy. This strategy enables the model to perform both the joint generation of complex human action videos and their corresponding motion sequences, as well as versatile cross-modal conditional generation tasks. To facilitate the training of a model with these capabilities, we construct HuMoVe, a large-scale dataset of approximately 80,000 high-quality, human-centric video-motion pairs. Our findings reveal that explicitly representing human motion is complementary to appearance, significantly boosting the coherence and plausibility of human-centric video generation.

Overview of EchoMotion. (a) The dual-modality DiT block for joint video-motion modeling. (b) MVS-RoPE, which serves as a synchronized coordinate system for the dual-modal token sequence.

✨ Key Features

Joint Video & Motion Modeling: Rather than modeling pixels alone, EchoMotion learns the relationship between appearance and the underlying human motion, leading to more physically plausible results.
Novel Architecture: Introduces a Dual-Branch Diffusion Transformer with MVS-RoPE for synchronized positional encoding, effectively aligning video and motion modalities.
Versatile Generation Tasks: A single unified framework supports multiple tasks:
- Text to Joint Video-and-Motion Generation
- Motion-to-Video Generation
- Video-to-Motion Prediction
New Large-Scale Dataset: We introduce HuMoVe, a high-quality dataset of ~80,000 video-motion pairs to facilitate research in this area.
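
To make the idea of a synchronized positional encoding concrete, here is a minimal sketch in plain Python. It is not the released implementation: the token layout (one motion token per latent frame, placed just outside the video grid), the equal three-way channel split, and every function name below are assumptions made for illustration. The sketch only shows the general property that MVS-RoPE relies on, namely that when video and motion tokens share a temporal index, a rotary encoding makes their attention score depend on the relative frame offset rather than on absolute position.

```python
import math
from typing import List, Tuple

Position = Tuple[int, int, int]  # (frame, row, column)


def video_positions(frames: int, height: int, width: int) -> List[Position]:
    """One position per video latent token, in frame-major order."""
    return [(t, h, w) for t in range(frames)
            for h in range(height) for w in range(width)]


def motion_positions(frames: int, height: int, width: int) -> List[Position]:
    """Assumption: one motion token per frame. It shares the frame index t with
    the video tokens and sits just outside the video grid at (height, width)."""
    return [(t, height, width) for t in range(frames)]


def rope_1d(vec: List[float], pos: int, base: float = 10000.0) -> List[float]:
    """Standard rotary embedding: rotate each channel pair by pos * base^(-i/d)."""
    d = len(vec)
    out: List[float] = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out


def rope_3d(vec: List[float], position: Position) -> List[float]:
    """Split channels into three equal chunks (an assumption) and rotate each
    chunk by one axis of the (frame, row, column) position."""
    if len(vec) % 6 != 0:
        raise ValueError("channel count must be divisible by 6")
    chunk = len(vec) // 3
    out: List[float] = []
    for axis, pos in enumerate(position):
        out += rope_1d(vec[axis * chunk:(axis + 1) * chunk], pos)
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Toy example: 4 latent frames, a 2x2 video grid, 12 channels per token.
vid = video_positions(frames=4, height=2, width=2)
mot = motion_positions(frames=4, height=2, width=2)
q = [0.1 * (i + 1) for i in range(12)]   # toy query from a motion token
k = [0.05 * (i - 3) for i in range(12)]  # toy key from a video token

# Motion token of frame 1 vs. first video token of frame 1,
# then the same pairing at frame 3: the relative offset is identical.
score_a = dot(rope_3d(q, mot[1]), rope_3d(k, vid[4]))
score_b = dot(rope_3d(q, mot[3]), rope_3d(k, vid[12]))
```

In this toy setup `score_a` and `score_b` agree up to floating-point error, because both pairs have the same relative offset on all three axes. That relative-offset behavior is what a shared coordinate system gives the attention layers for free; how the actual model places motion tokens in that coordinate system is described in the paper and code, not here.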

🚀 Getting Started

📢 Code has been released at D2I-ai. See README.md for more details.

📊 HuMoVe Dataset

Training a model like EchoMotion requires a large-scale, high-quality dataset of paired video and motion data. We introduce HuMoVe, containing approximately 80,000 video-motion pairs.

Wide Category Coverage: Spans a diverse range of human activities.
High-Quality Annotations: Detailed text descriptions and precise SMPL motion sequences.
High-Fidelity Videos: High-resolution, clean video clips.

Due to legal compliance restrictions, we are unable to release the complete video materials. Instead, we provide the full motion processing pipeline in extract_motion.py. Text annotations can be generated using Qwen-VL-Narrator.
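
For readers assembling their own pairs, the following is a rough sketch of what a single video-motion record might look like and how to sanity-check it. It does not reflect the output format of extract_motion.py, which is not described here. The class name, field names, and file path are hypothetical. The only fixed facts used are the standard SMPL dimensions: 24 joints with 3 axis-angle values each, and the commonly used 10 shape coefficients.

```python
from dataclasses import dataclass
from typing import List

SMPL_POSE_DIM = 72   # 24 joints x 3 axis-angle values (root orientation included)
SMPL_SHAPE_DIM = 10  # commonly used number of SMPL shape coefficients (betas)


@dataclass
class VideoMotionPair:
    """Hypothetical record for one clip; not the format used by the authors."""
    video_path: str
    caption: str
    num_video_frames: int
    poses: List[List[float]]  # one SMPL pose vector per video frame
    betas: List[float]        # one shape vector for the whole clip

    def validate(self) -> None:
        """Check that the motion sequence lines up with the video frame by frame."""
        if len(self.poses) != self.num_video_frames:
            raise ValueError(
                f"{len(self.poses)} motion frames for "
                f"{self.num_video_frames} video frames"
            )
        if any(len(p) != SMPL_POSE_DIM for p in self.poses):
            raise ValueError(f"each pose must have {SMPL_POSE_DIM} values")
        if len(self.betas) != SMPL_SHAPE_DIM:
            raise ValueError(f"betas must have {SMPL_SHAPE_DIM} values")


# Usage with placeholder values (a rest pose repeated over 8 frames).
pair = VideoMotionPair(
    video_path="clips/example.mp4",  # hypothetical path
    caption="A person waves with the right hand.",
    num_video_frames=8,
    poses=[[0.0] * SMPL_POSE_DIM for _ in range(8)],
    betas=[0.0] * SMPL_SHAPE_DIM,
)
pair.validate()  # raises ValueError if motion and video are misaligned
```

Checking frame alignment up front matters because a model that learns temporal correspondence between the two modalities will absorb any systematic offset between them.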

📝 Citation

If you find our work useful for your research, please consider citing our paper:

@article{yang2025echomotion,
  title={EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer},
  author={Yang, Yuxiao and Sheng, Hualian and Cai, Sijia and Lin, Jing and Wang, Jiahao and Deng, Bing and Lu, Junzhe and Wang, Haoqian and Ye, Jieping},
  journal={arXiv preprint arXiv:2512.18814},
  year={2025}
}

🙏 Acknowledgements

We would like to express our gratitude to the following projects and teams, which were instrumental in the development of our work:

Qwen-VL-Narrator: For their excellent tool, which was used for the textual annotation of our HuMoVe dataset.
CameraHMR: For providing the robust framework used for the SMPL annotations in our dataset.
The Wan Team: For their valuable open-source models that contributed to our research.

📜 License

This project is licensed under the CC BY-NC 4.0 License. See the LICENSE file for more details.
