Rejection Sampling Recipes

Reproducible recipes for rejection sampling in LLM/VLM data synthesis.

Why This Repo

Easy to run: Pick a recipe and run it; no extra scaffolding.
Ready-to-use recipes: Text + multimodal flows with answer parsing, a solid judge prompt, and safe image-resize fallbacks.
Scales as data grows: The pipeline is built on Ray Data, which provides streaming-style processing, batching, concurrency, and checkpoint/resume out of the box.

Core Concepts

Concept	Description
Stage	A single processing step (e.g., sampling, verification, formatting)
Recipe	A sequence of stages that defines a complete data processing workflow
Pipeline	The execution engine that runs recipes with batching, error handling, and checkpoint/resume
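
To make the relationship between these three concepts concrete, here is a minimal, self-contained sketch of a rejection sampling flow (sampling, verification, formatting). It is an illustration only: the `Stage` dataclass, `run_recipe`, and the field names `candidates` and `reference` are hypothetical and do not reflect the actual classes in `src/base.py` or the Ray Data based engine in `src/pipeline.py`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

Item = dict  # one record, e.g. one parsed JSONL line


@dataclass
class Stage:
    """Hypothetical stage: a named function from one item to zero or more items."""
    name: str
    fn: Callable[[Item], Iterable[Item]]


def run_recipe(stages: list[Stage], items: Iterable[Item]) -> list[Item]:
    """Toy stand-in for a pipeline: apply each stage, in order, to every item."""
    current = list(items)
    for stage in stages:
        current = [out for item in current for out in stage.fn(item)]
    return current


def sample(item: Item) -> Iterable[Item]:
    # Stand-in for a model call: emit one item per pre-computed candidate.
    for candidate in item["candidates"]:
        yield {**item, "response": candidate}


def verify(item: Item) -> Iterable[Item]:
    # The rejection step: keep a sample only if it matches the reference answer.
    if item["response"].strip() == item["reference"]:
        yield item


def format_sft(item: Item) -> Iterable[Item]:
    # Convert an accepted sample into an OpenAI-style messages record.
    yield {
        "messages": [
            {"role": "user", "content": item["question"]},
            {"role": "assistant", "content": item["response"]},
        ]
    }


recipe = [
    Stage("sampling", sample),
    Stage("verification", verify),
    Stage("formatting", format_sft),
]
data = [{"question": "2+2?", "reference": "4", "candidates": ["5", "4", " 4 "]}]
result = run_recipe(recipe, data)
print(len(result), "accepted samples")
```

In this toy run the candidate `"5"` is rejected by the verification stage and the other two are kept. A real recipe would replace `sample` with calls to a model service and `verify` with answer parsing or a judge prompt.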

Project Structure

rejection-sampling-recipes/
├── src/                          # Core framework
│   ├── base.py                   # Stage and BaseRecipe base classes
│   ├── pipeline.py               # Pipeline execution engine
│   └── utils/                    # Data I/O utilities
├── recipes/                      # Recipe implementations
│   ├── text_sft_simple/          # Text-only recipe
│   └── vl_cot_sft_plus_parse/    # Text + image recipe (with answer parsing)
├── scripts/                      # Utility scripts
└── tests/                        # Test files and mock data

Installation

Prerequisites

Python 3.10 or higher
(Optional) uv for faster dependency management

Option A: Using uv (Recommended)

uv sync

Option B: Using pip

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Development Installation

# Install with development dependencies
uv sync --extra dev

# Or with pip
pip install -e ".[dev]"

Quick Start

All recipes read JSONL input: one JSON object per line, each with a `messages` list in OpenAI chat format.

Input Format Examples

Text-only (answer inside assistant response):

{"id": 1, "messages": [{"role": "user", "content": "Q?"}, {"role": "assistant", "content": "A"}]}

Multimodal (with images; pretty-printed here for readability, but each record must occupy a single line in the JSONL file):

{
  "id": 1,
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "images/foo.jpg", "image_wh": [640, 480]}},
        {"type": "text", "text": "What is shown?"}
      ]
    },
    {"role": "assistant", "content": "A short answer."}
  ],
  "abs_path": "/abs/path/to/image/base"
}
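
Before running a recipe, it can help to sanity-check that every record matches the shapes shown above. The following is a rough sketch of such a check, not a validator shipped with this repo: `validate_record` and the set of allowed roles are assumptions, and a recipe may accept or require more fields than this checks.

```python
import json

# Assumed set of roles; adjust if your data uses others.
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one input record (empty if none)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"messages[{i}]: must be an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"messages[{i}]: unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if isinstance(content, str):
            continue  # text-only message
        if not isinstance(content, list):
            problems.append(f"messages[{i}]: content must be a string or a list")
            continue
        for j, part in enumerate(content):
            kind = part.get("type") if isinstance(part, dict) else None
            if kind == "text" and isinstance(part.get("text"), str):
                continue
            image = part.get("image_url") if kind == "image_url" else None
            if isinstance(image, dict) and isinstance(image.get("url"), str):
                continue
            problems.append(f"messages[{i}].content[{j}]: unsupported part")
    return problems


text_line = (
    '{"id": 1, "messages": [{"role": "user", "content": "Q?"}, '
    '{"role": "assistant", "content": "A"}]}'
)
bad_line = '{"id": 2, "messages": [{"role": "user", "content": 42}]}'

for line in (text_line, bad_line):
    print(validate_record(json.loads(line)))
```

To check a whole file, the same function can be applied to `json.loads(line)` for each non-empty line and the line numbers of failing records reported.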

Available Recipes

1. `text_sft_simple` (Text-only)

A simple text-only recipe for basic SFT data processing.

bash recipes/text_sft_simple/entrypoint/run.sh

2. `vl_cot_sft_plus_parse` (Text + Image + Answer Parsing)

A multimodal recipe with Chain-of-Thought and answer parsing.

# Step 1: Add absolute image paths ("abs_path")
python scripts/preprocess_images.py \
  --input tests/mock/text-pic.jsonl \
  --image-base-path /abs/path/to/images \
  --abs-image-path-field abs_path

# Step 2: Run the recipe
bash recipes/vl_cot_sft_plus_parse/entrypoint/run-30b/run-30b.sh
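
Step 1 adds an `abs_path` field next to the relative image `url` values. One plausible way a consumer could combine the two is sketched below. This is an assumption about how the fields relate, inferred from the input format example, and is not taken from `scripts/preprocess_images.py`; the helper name `resolve_image_paths` is hypothetical.

```python
from pathlib import PurePosixPath


def resolve_image_paths(record: dict) -> list[str]:
    """Join each relative image URL in a record onto the record's abs_path."""
    base = PurePosixPath(record["abs_path"])
    paths = []
    for msg in record["messages"]:
        content = msg["content"]
        if isinstance(content, str):
            continue  # text-only message, no images to resolve
        for part in content:
            if part.get("type") == "image_url":
                # Assumed semantics: url is relative to abs_path.
                paths.append(str(base / part["image_url"]["url"]))
    return paths


record = {
    "id": 1,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "images/foo.jpg"}},
                {"type": "text", "text": "What is shown?"},
            ],
        },
        {"role": "assistant", "content": "A short answer."},
    ],
    "abs_path": "/abs/path/to/image/base",
}
print(resolve_image_paths(record))
```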

Model Service Setup

See scripts/launch_serve/README.md for model service setup and launch steps.

Note: The default launch scripts call `ray stop`, which shuts down the Ray processes already running on that machine. To run several scripts at once, either use separate machines or remove the `ray stop` line from the scripts.

Logging

Environment Variable	Default	Description
`LOG_DIR`	`logs/`	Log directory (falls back to `/tmp/rejection-sampling-recipes-logs/`)
`LOG_MAX_BYTES`	`10MB`	Maximum log file size
`LOG_BACKUP_COUNT`	`5`	Number of backup log files
`LOG_FILE_LEVEL`	`DEBUG`	File logging level
`LOG_CONSOLE_LEVEL`	`INFO`	Console logging level

Log files:

pipeline.log - Driver logs
pipeline_worker_<pid>.log - Ray worker logs
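
As an illustration of how the variables in the table could map onto Python's standard `logging` module, here is a minimal sketch using `RotatingFileHandler`. It is not the repo's actual logging setup: `parse_size` and `build_logger` are hypothetical helpers, the interpretation of `10MB` as 10 × 1024 × 1024 bytes is an assumption, and the `/tmp` fallback directory is not implemented here.

```python
import logging
import os
import re
from logging.handlers import RotatingFileHandler

_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_size(text: str) -> int:
    """Convert a string such as '10MB' or '512KB' to a byte count."""
    match = re.fullmatch(r"\s*(\d+)\s*(B|KB|MB|GB)?\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    unit = (match.group(2) or "B").upper()
    return int(match.group(1)) * _UNITS[unit]


def build_logger(name: str = "pipeline") -> logging.Logger:
    """Create a logger with a rotating file handler and a console handler."""
    log_dir = os.environ.get("LOG_DIR", "logs/")
    os.makedirs(log_dir, exist_ok=True)

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering

    file_handler = RotatingFileHandler(
        os.path.join(log_dir, f"{name}.log"),
        maxBytes=parse_size(os.environ.get("LOG_MAX_BYTES", "10MB")),
        backupCount=int(os.environ.get("LOG_BACKUP_COUNT", "5")),
    )
    file_handler.setLevel(os.environ.get("LOG_FILE_LEVEL", "DEBUG"))

    console_handler = logging.StreamHandler()
    console_handler.setLevel(os.environ.get("LOG_CONSOLE_LEVEL", "INFO"))

    logger.addHandler(file_handler)
    logger.addHandler(console_handler)
    return logger
```

With this sketch, setting `LOG_CONSOLE_LEVEL=WARNING` before calling `build_logger()` would quiet the console while the file handler still records `DEBUG` messages.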

Contributing

We welcome contributions! Please see the Contributing Guide (CONTRIBUTING.md) for details.

Development Workflow

# Install dev dependencies
uv sync --extra dev

# Run linting
uvx ruff check .

# Run formatting
uvx ruff format .

# Run tests
pytest

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this project in your research, please consider citing:

@software{rejection_sampling_recipes,
  title = {Rejection Sampling Recipes},
  author = {guox18},
  year = {2024},
  url = {https://github.com/guox18/rejection-sampling-recipes}
}
