- [2025.12]: Initial release of the codebase.
- [2025.07]: Vidar paper released on arXiv.
Vidar is a unified Embodied Video Diffusion Model that leverages internet-scale video priors and cross-platform robot trajectory data to address the core issues of data scarcity and platform adaptation in robot manipulation.
Vidar adopts a two-stage "Video Generation + Action Decoding" strategy that integrates two core components: the Embodied Video Diffusion Model and the Masked Inverse Dynamics Model (MIDM). Through a Test-Time Scaling strategy with physics-aware re-ranking, it also generalizes robustly to unseen tasks, backgrounds, and camera layouts.
Furthermore, Vidar aligns heterogeneous cross-platform data through a unified observation space (integrating multi-view images, robot types, camera layouts, and task instructions) and employs a three-stage training pipeline ("General Pre-training -> Embodied Domain Pre-training -> Target Domain Fine-tuning"). This allows it to capture physical consistency and temporal coherence from massive amounts of unlabeled video, ultimately achieving low-shot adaptation on new robot platforms with only about 20 minutes of human demonstration data.
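The snippet below is a minimal sketch of this two-stage design using placeholder modules; the class names, frame shapes, 7-dim action space, and `generate`/`forward` signatures are illustrative assumptions, not the actual Vidar interfaces:

```python
import torch
import torch.nn as nn

class EmbodiedVideoDiffusion(nn.Module):
    """Stage 1 placeholder: denoises a short future video conditioned on the
    current observation and the language instruction (random frames here)."""
    def __init__(self, frame_shape=(3, 64, 64), horizon=8):
        super().__init__()
        self.frame_shape, self.horizon = frame_shape, horizon

    @torch.no_grad()
    def generate(self, obs_frame, instruction):
        # Stand-in for the diffusion sampler: return `horizon` future frames.
        return torch.rand(self.horizon, *self.frame_shape)

class MaskedInverseDynamics(nn.Module):
    """Stage 2 placeholder: recovers the action connecting two consecutive
    frames; the real MIDM attends only to task-relevant (unmasked) regions."""
    def __init__(self, frame_shape=(3, 64, 64), action_dim=7):
        super().__init__()
        in_dim = 2 * frame_shape[0] * frame_shape[1] * frame_shape[2]
        self.head = nn.Linear(in_dim, action_dim)

    def forward(self, frame_t, frame_tp1):
        return self.head(torch.cat([frame_t.flatten(), frame_tp1.flatten()]))

# Stage 1: imagine the future video; Stage 2: decode actions between frames.
video_model, midm = EmbodiedVideoDiffusion(), MaskedInverseDynamics()
obs = torch.rand(3, 64, 64)
future = video_model.generate(obs, "put the cup on the plate")
frames = torch.cat([obs.unsqueeze(0), future], dim=0)
actions = [midm(frames[t], frames[t + 1]) for t in range(frames.shape[0] - 1)]
```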
Vidarc is an autoregressive Embodied Video Diffusion Model designed specifically for closed-loop robot control, targeting the two core pain points of high latency and insufficient grounding in robot manipulation under data scarcity.
By fusing autoregressive video generation with the Masked Inverse Dynamics Model, it integrates real-time environmental feedback into the inference process, achieving low-latency, high-precision closed-loop control while maintaining strong generalization and error-correction capabilities on unseen robot platforms and in dynamic environments.
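A rough sketch of this closed-loop pattern is shown below; `env`, `video_model`, and `midm` are assumed to expose hypothetical `reset`/`step`, `predict_next_frame`, and `decode_action` methods, which do not correspond to the real Vidarc API:

```python
def closed_loop_control(env, video_model, midm, instruction, max_steps=50):
    """Hypothetical closed-loop rollout: each control step feeds the latest
    real observation back into the autoregressive video model before acting."""
    obs = env.reset()
    for _ in range(max_steps):
        # Autoregressive step: predict only the next frame(s), conditioned on
        # the newest *real* observation, so feedback enters inference.
        next_frame = video_model.predict_next_frame(obs, instruction)
        # Masked inverse dynamics: decode the action realizing that transition.
        action = midm.decode_action(obs, next_frame)
        obs, done = env.step(action)  # execute on the robot, read back the result
        if done:
            break
    return obs
```

Unlike the open-loop pipeline sketched above, each prediction here is re-conditioned on the most recent real observation, which is how real-time environmental feedback enters the inference process.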
Run the following commands:
```bash
conda env create --file vidar.yaml
conda activate vidar
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```

Download the pre-trained model weights:
- Wan2.2, and place it in `Wan2.2-TI2V-5B` (a download sketch follows this list).
- Vidar/Vidarc, and place them in `vidar_ckpts`.
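If the base model is pulled from the Hugging Face Hub (an assumption; the repo ID and local paths below should be adjusted to wherever the weights are actually hosted), `huggingface_hub` can fetch it, for example:

```python
from huggingface_hub import snapshot_download

# Assumed Hub repo ID for the Wan2.2 TI2V-5B base model; adjust as needed.
snapshot_download(repo_id="Wan-AI/Wan2.2-TI2V-5B", local_dir="./Wan2.2-TI2V-5B")
# The Vidar/Vidarc checkpoints (vidar.pt / vidarc.pt) go into ./vidar_ckpts,
# downloaded from wherever this project releases them.
```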
```bash
# Inference with Vidarc
output_dir="output/test"
python generate_causal.py \
    --task ti2v-5B \
    --size "640*736" \
    --ckpt_dir ./Wan2.2-TI2V-5B \
    --convert_model_dtype \
    --pt_dir vidar_ckpts/vidarc.pt \
    --dataset_json examples/robotwin_example.json \
    --output_dir "$output_dir"
```
```bash
# Inference with Vidar
python generate.py \
    --task ti2v-5B \
    --size "640*736" \
    --ckpt_dir ./Wan2.2-TI2V-5B \
    --convert_model_dtype \
    --pt_dir vidar_ckpts/vidar.pt \
    --dataset_json examples/robotwin_example.json \
    --output_dir "$output_dir"
```

See the evaluation code and set up the related environment:
```bash
# Clone the evaluation code
git clone https://github.com/thu-ml/vidar-robotwin.git
# Follow the README in the vidar-robotwin directory.
```

If you find this project helpful for your research, please cite our papers:
```bibtex
@misc{feng2025vidarembodiedvideodiffusion,
      title={Vidar: Embodied Video Diffusion Model for Generalist Manipulation},
      author={Yao Feng and Hengkai Tan and Xinyi Mao and Chendong Xiang and Guodong Liu and Shuhe Huang and Hang Su and Jun Zhu},
      year={2025},
      eprint={2507.12898},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.12898},
}

@misc{feng2025vidarcembodiedvideodiffusion,
      title={Vidarc: Embodied Video Diffusion Model for Closed-loop Control},
      author={Yao Feng and Chendong Xiang and Xinyi Mao and Hengkai Tan and Zuyue Zhang and Shuhe Huang and Kaiwen Zheng and Haitian Liu and Hang Su and Jun Zhu},
      year={2025},
      eprint={2512.17661},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.17661},
}
```

This project references the following open-source projects; we would like to express our gratitude:
