Official repository for Lumina-Video, a preliminary tryout of the Lumina series for Video Generation
- [2025-02-10] Technical Report is released!
- [2025-02-09] Lumina-Video is released!
See INSTALL.md for detailed instructions.
T2V models
| resolution | fps | max frames | Huggingface |
| --- | --- | --- | --- |
| 960 | 24 | 96 | Alpha-VLLM/Lumina-Video-f24R960 |
Download the checkpoints before continuing. You can use the following command to download the checkpoints to the ./ckpts
directory:
huggingface-cli download --resume-download Alpha-VLLM/Lumina-Video-f24R960 --local-dir ./ckpts/f24R960
You can quickly run video generation using the command below:
# Example for generating a video with 4s duration, fps=24, resolution=1248x704
python -u generate.py \
--ckpt ./ckpts/f24R960 \
--resolution 1248x704 \
--fps 24 \
--frames 96 \
--prompt "your prompt here" \
--neg_prompt "" \
--sample_config f24F96R960 # set to "f24F96R960-MultiScale" for efficient multi-scale inference
Q1: Why use the 1248x704 resolution?
A1: The resolution was originally intended to be 1280x720. However, to ensure compatibility with the largest patch size (smallest scale), both the width and height must be divisible by 32, so the resolution is adjusted to 1248x704.
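As a quick sanity check (plain Python, nothing repo-specific): 1280 is divisible by 32 but 720 is not, whereas both adjusted dimensions are.

# Python
for w, h in [(1280, 720), (1248, 704)]:
    print(w, h, w % 32 == 0 and h % 32 == 0)  # -> 1280 720 False, then 1248 704 True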
Q2: Does the model support flexible aspect ratios?
A2: Yes. You can use the following code to check all usable resolutions:
# Python
from imgproc import generate_crop_size_list
target_size = 960
patch_size = 32
max_num_patches = (target_size // patch_size) ** 2
crop_size_list = generate_crop_size_list(max_num_patches, patch_size)
print(crop_size_list)
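Any width x height pair printed by this snippet can be passed directly to generate.py via the --resolution flag, as in the example command above.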
Before starting the training process, two preparation steps are required to optimize training efficiency and enable motion conditioning:
- Pre-extract and cache VAE latents for video data: This significantly enhances training speed.
- Compute motion scores for videos: These are used for micro-conditioning input during training.
The code for pre-extracting and caching VAE latents can be found in the ./tools/pre_extract directory. For an example of how to run this, refer to the run.sh script.
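For intuition only, a minimal sketch of the caching step is shown below. It assumes a diffusers-style image VAE (AutoencoderKL) purely as a stand-in for the VAE actually used by Lumina-Video; the checkpoint path, tensor layout, and output format are placeholders, and the real pipeline is the one in ./tools/pre_extract.

# Python -- illustrative sketch only; see ./tools/pre_extract/run.sh for the real pipeline
import torch
from diffusers import AutoencoderKL  # stand-in for the VAE actually used by Lumina-Video

vae = AutoencoderKL.from_pretrained("path/to/vae").eval().cuda()  # placeholder checkpoint path

@torch.no_grad()
def cache_latents(video_frames: torch.Tensor, out_path: str):
    # video_frames: (T, C, H, W) float tensor in [-1, 1], one video's frames
    latents = vae.encode(video_frames.cuda()).latent_dist.sample()
    torch.save(latents.cpu(), out_path)  # training later loads latents instead of raw pixels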
We use UniMatch to estimate optical flow, with the average optical flow serving as the motion score. This code is primarily derived from Open-Sora, and we'd like to thank them for their excellent work!
The code for computing motion scores is available in the ./tools/unimatch directory. To see how to run it, refer to the run.sh script.
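As a rough illustration of "average optical flow as motion score", the sketch below substitutes OpenCV's Farneback flow for UniMatch; the function name and frame format are assumptions, and the actual implementation is the one in ./tools/unimatch.

# Python -- OpenCV Farneback used here only as a stand-in for UniMatch
import cv2
import numpy as np

def motion_score(frames):
    # frames: list of grayscale uint8 arrays (H, W) in temporal order
    scores = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        scores.append(np.linalg.norm(flow, axis=-1).mean())  # mean flow magnitude for this pair
    return float(np.mean(scores))  # average over all consecutive frame pairs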
Once the data has been prepared, you're ready to start training! For an example, you can refer to the training directory, which demonstrates how to train with:
- FPS: 8
- Duration: 4 seconds
- Resolution: width × height ≈ 256×256
- Training Techniques: Image-text joint training and multi-scale training applied together.
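To preview the multi-scale resolution buckets at this training scale, the helper from the FAQ above can be reused; this assumes the same patch size of 32 also applies at 256 resolution.

# Python
from imgproc import generate_crop_size_list

target_size = 256
patch_size = 32
crop_size_list = generate_crop_size_list((target_size // patch_size) ** 2, patch_size)
print(crop_size_list)  # candidate (width, height) buckets for multi-scale training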
- Inference code
- Training code
@misc{luminavideo,
title={Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT},
author={Dongyang Liu and Shicheng Li and Yutong Liu and Zhen Li and Kai Wang and Xinyue Li and Qi Qin and Yufei Liu and Yi Xin and Zhongyu Li and Bin Fu and Chenyang Si and Yuewen Cao and Conghui He and Ziwei Liu and Yu Qiao and Qibin Hou and Hongsheng Li and Peng Gao},
year={2025},
eprint={2502.06782},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.06782},
}