Paper: 📖 Seg-Zero
HuggingFace Daily: 🤗 Seg-Zero
Data: 🤗 RefCOCOg-2K
Model: 🤗 Seg-Zero-7B
Overview of Seg-Zero:
Seg-Zero demonstrates the following features:
- Seg-Zero exhibits emergent test-time reasoning ability. It generates a reasoning chain before producing the final segmentation mask.
- Seg-Zero is trained exclusively using reinforcement learning, without any explicit supervised reasoning data.
- Compared to supervised fine-tuning, Seg-Zero achieves superior performance on both in-domain and out-of-domain data.
Highlight Code Features:
- This code is built on EasyR1 and veRL, which support model splitting during sampling and are more GPU-memory friendly.
- Supports both the Qwen2-VL and Qwen2.5-VL series of models.
- Implements commonly used rewards for object detection and object segmentation, including the IoU reward and the L1 reward.
[March 11th, 2025] 🔥 Paper is coming!
[March 8th, 2025] 🔥 Seg-Zero is coming! We have released the code and training data.
Seg-Zero employs a decoupled architecture consisting of a reasoning model and a segmentation model. We manually design a sophisticated reward mechanism that integrates both format and accuracy rewards.
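To make the reward design concrete, here is a minimal sketch of how a format reward and the accuracy rewards (IoU and L1) could be combined. The `<think>`/`<answer>` template and all function names here are illustrative assumptions, not the repo's actual implementation:

```python
# Illustrative sketch of a format + accuracy reward (hypothetical helpers,
# not the Seg-Zero codebase).
import re

def format_reward(response: str) -> float:
    # 1.0 if the response follows an assumed <think>...</think><answer>...</answer> template.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), re.DOTALL) else 0.0

def iou_reward(pred_box, gt_box) -> float:
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    x1 = max(pred_box[0], gt_box[0])
    y1 = max(pred_box[1], gt_box[1])
    x2 = min(pred_box[2], gt_box[2])
    y2 = min(pred_box[3], gt_box[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_p + area_g - inter
    return inter / union if union > 0 else 0.0

def l1_reward(pred_point, gt_point, threshold: float = 10.0) -> float:
    # 1.0 when the predicted point lies within an L1 distance threshold of the ground truth.
    dist = abs(pred_point[0] - gt_point[0]) + abs(pred_point[1] - gt_point[1])
    return 1.0 if dist < threshold else 0.0
```

In practice the individual terms would be weighted and summed into a single scalar reward per sampled response.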
git clone https://github.com/dvlab-research/Seg-Zero.git
cd Seg-Zero
conda create -n seg_zero python=3.11
conda activate seg_zero
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1
pip install -e .
pip install sam2
pip install matplotlib
python inference_scripts/infer.py
The default question is
"the unusal object in the image."
You will get the thinking process in the command line, like:
"The image shows a bicycle with wheels that have been replaced with large, round objects resembling watermelon slices. The unusual aspect of the image is the substitution of the bicycle wheels with these watermelon-like objects, which is not a typical feature of a bicycle. The rest of the bicycle appears to be a standard design, but the wheels are the focal point of the image."
And the mask will be saved in the inference_scripts folder.
You can also provide your own image_path and text by:
python inference_scripts/infer.py --image_path "your_image_path" --text "your question text"
bash training_scripts/run_qwen2_5_3b_refCOCOg.sh
You can try changing the following hyper-parameters if you have large GPU memory:
worker.actor.micro_batch_size_per_device_for_update=4 or 8 or 16 \
worker.actor.micro_batch_size_per_device_for_experience=4 or 8 or 16 \
If your GPU has less memory, you can change the following config values. The appropriate numbers depend on your GPU memory:
worker.rollout.tensor_parallel_size=[your number between 1-8]
worker.rollout.gpu_memory_utilization=[your number between 0-1]
worker.rollout.n=[your number between 4-32]
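For example, on a smaller GPU you might launch training with overrides like the following. The specific values are illustrative, and this assumes the launch script forwards extra key=value arguments to the trainer, as EasyR1-style scripts typically do:

```shell
# Illustrative low-memory launch (tune values for your hardware).
bash training_scripts/run_qwen2_5_3b_refCOCOg.sh \
    worker.rollout.tensor_parallel_size=2 \
    worker.rollout.gpu_memory_utilization=0.5 \
    worker.rollout.n=8
```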
python3 training_scripts/model_merger.py --local_dir [path_to_your_actor_checkpoint]
Tip
If you encounter issues connecting to Hugging Face, consider using `export HF_ENDPOINT=https://hf-mirror.com`.
Seg-Zero generates several samples, calculates their rewards, and then optimizes towards the samples that achieve higher rewards.
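The sample-then-reward loop above is the core of GRPO: each sample's advantage is its reward normalized against the mean and standard deviation of its group. A minimal sketch of that normalization (illustrative, not the repo's implementation):

```python
# Sketch of GRPO-style group-relative advantages (illustrative only).
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    # Normalize each sampled response's reward by the group mean and std,
    # so above-average samples get positive advantages and are reinforced.
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]
```

These advantages then weight the policy-gradient update in place of a learned value function.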
Tip
To learn more about the GRPO algorithm, you can refer to Hugging Face's blog.
@article{liu2025segzero,
title = {Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement},
author = {Liu, Yuqi and Peng, Bohao and Zhong, Zhisheng and Yue, Zihao and Lu, Fanbin and Yu, Bei and Jia, Jiaya},
journal = {arXiv preprint arXiv:2503.06520},
year = {2025}
}
We would like to thank the following repos for their great work:
- This work is built upon EasyR1 and veRL.
- This work utilizes models from Qwen2-VL, Qwen2.5-VL and SAM2.