Reproducible recipes for rejection sampling in LLM/VLM data synthesis.
- Easy to run: Pick a recipe and run it; no extra scaffolding.
- Ready-to-use recipes: Text + multimodal flows with answer parsing, a solid judge prompt, and safe image-resize fallbacks.
- Scales as data grows: a Ray Data based pipeline provides streaming-style processing, batching, concurrency, and checkpoint/resume out of the box.
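For readers new to the technique, rejection sampling here means drawing several candidate responses per prompt and keeping only those a verifier accepts. The sketch below is a minimal illustration, not this repository's implementation: `rejection_sample`, `generate`, and `accept` are hypothetical names, and the toy generator stands in for a real model call.

```python
import random
from typing import Callable, List


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical: one model call per candidate
    accept: Callable[[str, str], bool],  # hypothetical: judge or answer check
    num_samples: int = 8,
    max_keep: int = 1,
) -> List[str]:
    """Draw up to num_samples candidates and keep at most max_keep accepted ones."""
    kept: List[str] = []
    for _ in range(num_samples):
        candidate = generate(prompt)
        if accept(prompt, candidate):
            kept.append(candidate)
            if len(kept) >= max_keep:
                break  # enough accepted samples, stop spending model calls
    return kept


# Toy stand-ins so the sketch runs without a model server.
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    return f"answer: {rng.choice(['3', '4', '5'])}"


def toy_accept(prompt: str, response: str) -> bool:
    return response.endswith("4")


samples = rejection_sample("What is 2 + 2?", toy_generate, toy_accept, num_samples=16)
print(samples)
```

Stopping early once `max_keep` responses are accepted is one possible design choice; a recipe that wants diverse accepted answers could raise `max_keep` instead.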
| Concept | Description |
|---|---|
| Stage | A single processing step (e.g., sampling, verification, formatting) |
| Recipe | A sequence of stages that defines a complete data processing workflow |
| Pipeline | The execution engine that runs recipes with batching, error handling, and checkpoint/resume |
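The relationship between the three concepts can be sketched in a few lines. This is an illustrative toy, not the code in src/base.py or src/pipeline.py: the names `Stage`, `Recipe`, and `Pipeline` follow the table, but every method, field, and signature below is an assumption, and checkpoint/resume is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Item = Dict[str, object]


@dataclass
class Stage:
    """One processing step; returns a new item, or None to drop (reject) it."""
    name: str
    fn: Callable[[Item], Optional[Item]]


@dataclass
class Recipe:
    """An ordered sequence of stages."""
    stages: List[Stage]


class Pipeline:
    """Runs a recipe over items in batches, skipping items that raise."""

    def __init__(self, recipe: Recipe, batch_size: int = 2) -> None:
        self.recipe = recipe
        self.batch_size = batch_size
        self.errors: List[str] = []

    def _run_item(self, item: Item) -> Optional[Item]:
        current: Optional[Item] = item
        for stage in self.recipe.stages:
            if current is None:  # an earlier stage rejected the item
                return None
            current = stage.fn(current)
        return current

    def run(self, items: Iterable[Item]) -> Iterator[List[Item]]:
        batch: List[Item] = []
        for item in items:
            try:
                result = self._run_item(item)
            except Exception as exc:  # record the failure and keep going
                self.errors.append(f"{item.get('id')}: {exc}")
                continue
            if result is not None:
                batch.append(result)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:
            yield batch


# Usage: sample -> verify -> format, mirroring the stage examples in the table.
recipe = Recipe([
    Stage("sample", lambda it: {**it, "response": str(it["x"] * 2)}),
    Stage("verify", lambda it: it if it["response"] == it["gold"] else None),
    Stage("format", lambda it: {"id": it["id"], "text": it["response"]}),
])
pipeline = Pipeline(recipe, batch_size=2)
data = [
    {"id": 1, "x": 2, "gold": "4"},  # passes verification
    {"id": 2, "x": 3, "gold": "7"},  # rejected by the verify stage
    {"id": 3, "gold": "0"},          # raises KeyError in the sample stage
]
batches = list(pipeline.run(data))
```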
rejection-sampling-recipes/
├── src/ # Core framework
│ ├── base.py # Stage and BaseRecipe base classes
│ ├── pipeline.py # Pipeline execution engine
│ └── utils/ # Data I/O utilities
├── recipes/ # Recipe implementations
│ ├── text_sft_simple/ # Text-only recipe
│ └── vl_cot_sft_plus_parse/ # Text + image recipe (with answer parsing)
├── scripts/ # Utility scripts
└── tests/ # Test files and mock data
- Python 3.10 or higher
- (Optional) uv for faster dependency management
# With uv
uv sync
# Or with pip
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
# Install with development dependencies
uv sync --extra dev
# Or with pip
pip install -e ".[dev]"
All recipes read JSONL. Each item should have messages in OpenAI format.
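Before running a recipe it can help to check that every input line parses and carries a well-formed `messages` list. The following is a minimal sketch of such a check, not a tool shipped with this repository: `validate_item`, `validate_jsonl_lines`, and the accepted role names are assumptions, and the only format rule relied on is that OpenAI-style `content` is either a string or a list of typed parts.

```python
import json
from typing import Any, Dict, List

# Assumed set of roles; adjust to whatever the recipes actually accept.
VALID_ROLES = ("system", "user", "assistant", "tool")


def validate_item(item: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one record (empty list means OK)."""
    problems: List[str] = []
    messages = item.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"messages[{i}] is not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"messages[{i}] has unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        # OpenAI-style content is a plain string or a list of typed parts.
        if isinstance(content, str):
            continue
        if isinstance(content, list) and all(
            isinstance(part, dict) and "type" in part for part in content
        ):
            continue
        problems.append(f"messages[{i}] has malformed content")
    return problems


def validate_jsonl_lines(lines: List[str]) -> Dict[int, List[str]]:
    """Map 1-based line numbers to problems, only for lines that fail."""
    report: Dict[int, List[str]] = {}
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            item = json.loads(line)
        except json.JSONDecodeError as exc:
            report[lineno] = [f"invalid JSON: {exc.msg}"]
            continue
        if not isinstance(item, dict):
            report[lineno] = ["record is not a JSON object"]
            continue
        problems = validate_item(item)
        if problems:
            report[lineno] = problems
    return report
```

To check a file, pass its lines, for example `validate_jsonl_lines(open(path, encoding="utf-8").read().splitlines())`; an empty report means every record passed.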
Text-only (answer inside assistant response):
{"id": 1, "messages": [{"role": "user", "content": "Q?"}, {"role": "assistant", "content": "A"}]}
Multimodal (with images):
{
"id": 1,
"messages": [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "images/foo.jpg", "image_wh": [640, 480]}},
{"type": "text", "text": "What is shown?"}
]
},
{"role": "assistant", "content": "A short answer."}
],
"abs_path": "/abs/path/to/image/base"
}
A simple text-only recipe for basic SFT data processing.
bash recipes/text_sft_simple/entrypoint/run.sh
A multimodal recipe with Chain-of-Thought and answer parsing.
# Step 1: Add absolute image paths ("abs_path")
python scripts/preprocess_images.py \
--input tests/mock/text-pic.jsonl \
--image-base-path /abs/path/to/images \
--abs-image-path-field abs_path
# Step 2: Run the recipe
bash recipes/vl_cot_sft_plus_parse/entrypoint/run-30b/run-30b.sh
See scripts/launch_serve/README.md for model service setup and launch steps.
Note: The default launch scripts include `ray stop`. If you need to run multiple scripts, use separate machines or remove `ray stop` from the scripts.
| Environment Variable | Default | Description |
|---|---|---|
| `LOG_DIR` | `logs/` | Log directory (falls back to `/tmp/rejection-sampling-recipes-logs/`) |
| `LOG_MAX_BYTES` | `10MB` | Maximum log file size |
| `LOG_BACKUP_COUNT` | `5` | Number of backup log files |
| `LOG_FILE_LEVEL` | `DEBUG` | File logging level |
| `LOG_CONSOLE_LEVEL` | `INFO` | Console logging level |
Log files:
- `pipeline.log` - Driver logs
- `pipeline_worker_<pid>.log` - Ray worker logs
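How these variables might translate into a logger can be sketched with the standard library alone. This is an assumption-laden illustration, not the project's logging module: `parse_size`, `resolve_log_dir`, `setup_logging`, and the logger name are hypothetical, and the `10MB`-style size syntax is assumed from the default shown in the table.

```python
import logging
import os
from logging.handlers import RotatingFileHandler
from pathlib import Path


def parse_size(text: str) -> int:
    """Parse sizes such as '10MB' or '512KB' into bytes (plain integers pass through)."""
    units = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}
    text = text.strip().upper()
    for suffix, factor in units.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)


def resolve_log_dir() -> Path:
    """Use LOG_DIR if it can be created, else fall back to the /tmp location."""
    preferred = Path(os.environ.get("LOG_DIR", "logs/"))
    fallback = Path("/tmp/rejection-sampling-recipes-logs/")
    for candidate in (preferred, fallback):
        try:
            candidate.mkdir(parents=True, exist_ok=True)
            return candidate
        except OSError:
            continue
    raise RuntimeError("no writable log directory")


def setup_logging(filename: str = "pipeline.log") -> logging.Logger:
    """Attach a rotating file handler and a console handler with separate levels."""
    logger = logging.getLogger("pipeline_sketch")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering
    logger.handlers.clear()

    file_handler = RotatingFileHandler(
        resolve_log_dir() / filename,
        maxBytes=parse_size(os.environ.get("LOG_MAX_BYTES", "10MB")),
        backupCount=int(os.environ.get("LOG_BACKUP_COUNT", "5")),
    )
    file_handler.setLevel(os.environ.get("LOG_FILE_LEVEL", "DEBUG"))

    console_handler = logging.StreamHandler()
    console_handler.setLevel(os.environ.get("LOG_CONSOLE_LEVEL", "INFO"))

    logger.addHandler(file_handler)
    logger.addHandler(console_handler)
    return logger
```

Under this sketch, a Ray worker would call `setup_logging(f"pipeline_worker_{os.getpid()}.log")` so that each process writes to its own file.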
We welcome contributions! Please see our Contributing Guide for details.
# Install dev dependencies
uv sync --extra dev
# Run linting
uvx ruff check .
# Run formatting
uvx ruff format .
# Run tests
pytest
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this project in your research, please consider citing:
@software{rejection_sampling_recipes,
title = {Rejection Sampling Recipes},
author = {guox18},
year = {2024},
url = {https://github.com/guox18/rejection-sampling-recipes}
}