
VeOmni is a versatile framework for both single- and multi-modal pre-training and post-training. It empowers users to seamlessly scale models of any modality across various accelerators, offering both flexibility and user-friendliness.
Our guiding principles when building VeOmni are:
- Flexibility and modularity: VeOmni is built with a modular design, allowing users to decouple most components and replace them with their own implementations as needed.
- Trainer-free: VeOmni avoids rigid, structured trainer classes (e.g., PyTorch Lightning or the HuggingFace Trainer). Instead, VeOmni keeps training scripts linear, exposing the entire training logic to users for maximum transparency and control (see the sketch after this list).
- Omni-model native: VeOmni enables users to effortlessly scale any omni-model across devices and accelerators.
- Torch native: VeOmni is designed to leverage PyTorch's native functions to the fullest extent, ensuring maximum compatibility and performance.
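To make the trainer-free principle concrete, here is a minimal sketch of a linear training loop in plain PyTorch (an illustration of the idea only, not VeOmni's actual task script): every step, from forward pass to checkpointing, is visible in the script itself.

```python
# Minimal sketch of a "trainer-free", linear training script in plain PyTorch.
# This illustrates the principle only; it is not VeOmni's actual task code.
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 128).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-7)

for step in range(20):
    batch = torch.randn(4, 128, device=device)  # stand-in for a real dataloader
    loss = model(batch).pow(2).mean()           # stand-in for a real loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if (step + 1) % 10 == 0:                    # checkpointing stays inline and editable
        torch.save(model.state_dict(), f"global_step_{step + 1}.pt")
```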
- [2025/04/03] We release VeOmni.
- VeOmni: Scaling any Modality Model Training to any Accelerators with PyTorch native Training Framework
- Overview
- Table of Contents
- Key Features
- Getting Started
- Training Examples
- Supported Models
- Performance
- Acknowledgement
- Awesome work using VeOmni
- Contributing
- License
- Citation
- About ByteDance Seed Team
- Parallelism (a minimal sketch of the parallel state follows this list)
  - Parallel state by DeviceMesh
  - Torch FSDP1/2
  - Expert parallelism (experimental)
  - Easy to add new parallelism plans
  - Sequence parallelism
    - Ulysses
    - Async-Ulysses
  - Activation offloading
  - Activation checkpointing
- Kernels
  - GroupGEMM ops for MoE
  - Liger-Kernel integrations
- Model
  - Any transformers models
  - Multi-modal
    - Qwen2.5-VL
    - Qwen2-VL
    - Seed-Omni
- Data IO
  - Dynamic batching strategy
  - Omni-data processing
- Distributed checkpointing
  - ByteCheckpoint (recommended)
  - Torch distributed checkpointing
  - DCP merge tools
- Other tools
  - Profiling tools
  - Easy YAML configuration and argument parsing
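As a rough illustration of how a DeviceMesh-based parallel state composes with FSDP2, here is a minimal sketch. It assumes torch >= 2.6 and an 8-rank torchrun launch, and the mesh dimension names are illustrative rather than VeOmni's actual parallel-state API.

```python
# Minimal sketch of a DeviceMesh parallel state composed with FSDP2.
# Assumes torch >= 2.6 and `torchrun --nproc_per_node=8 this_script.py`;
# mesh names are illustrative, not VeOmni's actual parallel-state API.
from torch import nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # FSDP2 entry point

# Arrange 8 ranks as 4-way data parallel x 2-way sequence (Ulysses) parallel.
mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("dp", "sp"))

model = nn.TransformerEncoderLayer(d_model=1024, nhead=16).cuda()
# Shard parameters only across the data-parallel dimension; the "sp" sub-mesh
# would be handed to the sequence-parallel attention implementation instead.
fully_shard(model, mesh=mesh["dp"])
```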
Upcoming features:
- veScale FSDP
- Torch native parallelism
- torch.compile
- Flux: fine-grained computation-communication overlapping GPU kernel integrations
- Better offloading strategies
- More supported models
- Torch native pipeline parallelism
Read the VeOmni Best Practice for more details.
Install from PyPI:
```bash
pip3 install veomni
```
Install from source:
```bash
pip3 install -e .
```
Install veScale (not available yet):
```bash
git clone https://github.com/volcengine/veScale.git
cd veScale
pip3 install .
```
Users can quickly start training like this:
```bash
bash train.sh $TRAIN_SCRIPT $CONFIG.yaml
```
You can also override arguments in the YAML file by passing them on the command line:
```bash
bash train.sh $TRAIN_SCRIPT $CONFIG.yaml \
  --model.model_path PATH/TO/MODEL \
  --data.train_path PATH/TO/DATA \
  --train.global_batch_size GLOBAL_BATCH_SIZE
```
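Under the hood, this pattern amounts to merging a base YAML file with dotted command-line overrides. Here is a minimal sketch of that idea, assuming the OmegaConf library; it is not VeOmni's actual parser, whose flags and precedence rules may differ.

```python
# Minimal sketch of YAML-plus-CLI-override config parsing using OmegaConf.
# An assumption-laden illustration, not VeOmni's actual argument parser.
import sys
from omegaconf import OmegaConf

def load_config(yaml_path, argv):
    base = OmegaConf.load(yaml_path)  # e.g. configs/pretrain/qwen2_5.yaml
    # Turn "--train.lr 5e-7" flag/value pairs into dotlist entries "train.lr=5e-7".
    dotlist = [f"{argv[i].lstrip('-')}={argv[i + 1]}" for i in range(0, len(argv) - 1, 2)]
    return OmegaConf.merge(base, OmegaConf.from_dotlist(dotlist))  # CLI wins over YAML

if __name__ == "__main__":
    config = load_config(sys.argv[1], sys.argv[2:])
    print(OmegaConf.to_yaml(config))
```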
Here is an end-to-end workflow for preparing a subset of the fineweb dataset, continually training a Qwen2.5 model with sequence parallel size 2 for 20 steps, and then merging the global_step_10 distributed checkpoint into HuggingFace weights with ByteCheckpoint.
- Download the fineweb dataset:
```bash
python3 scripts/download_hf_data.py \
  --repo_id HuggingFaceFW/fineweb \
  --local_dir ./fineweb/ \
  --allow_patterns sample/10BT/*
```
- Download the Qwen2.5 model:
```bash
python3 scripts/download_hf_model.py \
  --repo_id Qwen/Qwen2.5-7B \
  --local_dir .
```
- Train:
```bash
bash train.sh tasks/train_torch.py configs/pretrain/qwen2_5.yaml \
  --model.model_path ./Qwen2.5-7B \
  --data.train_path ./fineweb/sample/10BT/ \
  --train.global_batch_size 512 \
  --train.lr 5e-7 \
  --train.ulysses_parallel_size 2 \
  --train.save_steps 10 \
  --train.max_steps 20 \
  --train.output_dir Qwen2.5-7B_CT
```
- Merge checkpoints (the paths below match the output_dir used above):
```bash
python3 scripts/mereg_dcp_to_hf.py \
  --load-dir Qwen2.5-7B_CT/checkpoints/global_step_10 \
  --model_assets_dir Qwen2.5-7B_CT/model_assets \
  --save-dir Qwen2.5-7B_CT/checkpoints/global_step_10/hf_ckpt
```
- Inference:
```bash
python3 tasks/infer.py \
  --infer.model_path Qwen2.5-7B_CT/checkpoints/global_step_10/hf_ckpt
```
We use ByteCheckpoint to save checkpoints in torch.distributed.checkpoint (DCP) format. You can merge the DCP files into HuggingFace-format weights with this command:
```bash
python3 scripts/mereg_dcp_to_hf.py \
  --load-dir PATH/TO/CHECKPOINTS \
  --model_assets_dir PATH/TO/MODEL_ASSETS \
  --save-dir PATH/TO/SAVE_HF_WEIGHT
```
For example, if your output_dir is seed_omni and you want to merge the global_step_100 checkpoint into HuggingFace-format weights:
```bash
python3 scripts/mereg_dcp_to_hf.py \
  --load-dir seed_omni/checkpoints/global_step_100 \
  --model_assets_dir seed_omni/model_assets \
  --save-dir seed_omni/hf_ckpt
```
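If you only need a single consolidated torch.save file rather than HuggingFace weights, PyTorch itself ships a DCP converter. A minimal sketch (torch >= 2.2; the paths are illustrative):

```python
# Minimal sketch: consolidate a DCP checkpoint directory into one torch.save
# file using PyTorch's built-in converter (torch >= 2.2). Paths are
# illustrative; use mereg_dcp_to_hf.py above for HuggingFace-format weights.
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

dcp_to_torch_save(
    "seed_omni/checkpoints/global_step_100",                  # DCP directory
    "seed_omni/checkpoints/global_step_100/consolidated.pt",  # single-file output
)
```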
Alternatively, you can use the provided Docker environment:
```bash
cd docker/
docker compose up -d
docker compose exec VeOmni bash
```
PyTorch FSDP2 + Qwen2-VL:
```bash
bash train.sh tasks/multimodal/omni/train_qwen2_vl.py configs/multimodal/qwen2_vl/qwen2_vl.yaml
```
PyTorch FSDP2 + Qwen2.5:
```bash
bash train.sh tasks/train_torch.py configs/pretrain/qwen2_5.yaml
```
PyTorch FSDP2 + Llama3-8B-Instruct:
```bash
bash train.sh tasks/train_torch.py configs/pretrain/llama3.yaml
```
| Model | Model size | Example config file |
|---|---|---|
| DeepSeek 2.5/3/R1 | 236B/671B | deepseek.yaml |
| Llama 3-3.3 | 1B/3B/8B/70B | llama3.yaml |
| Qwen 2-2.5 | 0.5B/1.5B/3B/7B/14B/32B/72B | qwen2_5.yaml |
| Qwen2-VL/Qwen2.5-VL/QVQ | 2B/3B/7B/32B/72B | qwen2_vl.yaml |
| Seed-Omni | any foundation model with any omni encoder & decoder | seed_omni.yaml |
VeOmni supports all transformers models if you do not need sequence parallelism, expert parallelism, or VeOmni's other parallelism and CUDA kernel optimizations. We designed a model registry mechanism: when a model is registered in VeOmni, the model and optimizer are loaded from VeOmni's implementation automatically; otherwise, VeOmni falls back to the modeling file in transformers.
If you want to add a new model, register it in the model registry. See the Support custom model docs, and the sketch below for the general pattern.
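The registry-with-fallback behavior described above can be pictured with this minimal sketch; the decorator and function names are illustrative, not VeOmni's actual registry API.

```python
# Minimal sketch of a model registry with a transformers fallback, as described
# above. Names (register_model, load_model) are illustrative, not VeOmni's API.
from transformers import AutoModelForCausalLM

_MODEL_REGISTRY = {}  # maps architecture name -> optimized modeling class

def register_model(name):
    def wrap(cls):
        _MODEL_REGISTRY[name] = cls
        return cls
    return wrap

def load_model(architecture, model_path):
    if architecture in _MODEL_REGISTRY:
        # Registered models get the parallelism/kernel-aware implementation.
        return _MODEL_REGISTRY[architecture].from_pretrained(model_path)
    # Unregistered models fall back to the stock transformers modeling file.
    return AutoModelForCausalLM.from_pretrained(model_path)
```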
Performance numbers are coming soon with the tech report.
Thanks to the following projects for their excellent work:
Contributions from the community are welcome! Please check out CONTRIBUTING.md and our project roadmap (to be updated).
This project is licensed under Apache License 2.0. See the LICENSE file for details.
If you find VeOmni useful for your research and applications, feel free to give us a star ⭐ or cite us using:
```bibtex
@software{VeOmni,
  title={VeOmni: Scaling any Modality Model Training to any Accelerators with PyTorch native Training Framework},
  author={Qianli Ma and Yaowei Zheng and Zhelun Shi and Zhongkai Zhao and Bin Jia and Ziyue Huang and Zhi Zhang},
  year={2025},
  howpublished={GitHub repository},
  publisher={ByteDance Seed},
  url={https://github.com/ByteDance-Seed/VeOmni},
}
```
About ByteDance Seed Team
Founded in 2023, ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society.
You can get to know us better through the following channels.