guox18/rejection-sampling-recipes

Rejection Sampling Recipes

Reproducible recipes for rejection sampling in LLM/VLM data synthesis.

Why This Repo

  • Easy to run: Pick a recipe and run it; no extra scaffolding.
  • Ready-to-use recipes: Text + multimodal flows with answer parsing, a solid judge prompt, and safe image-resize fallbacks.
  • Scales when data grows: The pipeline is built on Ray Data, which provides streaming-style processing, batching, concurrency, and checkpoint/resume out of the box.

Core Concepts

| Concept  | Description |
|----------|-------------|
| Stage    | A single processing step (e.g., sampling, verification, formatting) |
| Recipe   | A sequence of stages that defines a complete data processing workflow |
| Pipeline | The execution engine that runs recipes with batching, error handling, and checkpoint/resume |
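
To make the three concepts concrete, here is a minimal sketch of how stages, a recipe, and a pipeline could fit together for rejection sampling. It is not the code in src/base.py or src/pipeline.py: the names Stage, Recipe, run_pipeline, and the toy sample/verify/format_sft functions are assumptions made for illustration, and the real pipeline runs on Ray Data rather than a plain loop.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

Item = dict  # one JSONL record


@dataclass
class Stage:
    """A single processing step. Returning None rejects the item."""
    name: str
    fn: Callable[[Item], Optional[Item]]


@dataclass
class Recipe:
    """An ordered sequence of stages."""
    stages: list


def run_pipeline(recipe: Recipe, items: Iterable[Item]) -> list:
    """Run each item through every stage; drop rejected or failing items."""
    kept = []
    for item in items:
        current: Optional[Item] = item
        try:
            for stage in recipe.stages:
                current = stage.fn(current)
                if current is None:  # rejected by this stage
                    break
        except Exception:
            current = None  # simplistic error handling: skip the item
        if current is not None:
            kept.append(current)
    return kept


def sample(item: Item) -> Item:
    # Stand-in for model calls; a real stage would query a served model.
    return {**item, "candidates": ["5", "4", "4"]}


def verify(item: Item) -> Optional[Item]:
    # Keep only candidates that match the reference answer.
    accepted = [c for c in item["candidates"] if c.strip() == item["reference"]]
    return {**item, "accepted": accepted} if accepted else None


def format_sft(item: Item) -> Item:
    # Emit the first accepted candidate as an OpenAI-style message pair.
    return {
        "id": item["id"],
        "messages": [
            {"role": "user", "content": item["question"]},
            {"role": "assistant", "content": item["accepted"][0]},
        ],
    }


recipe = Recipe([Stage("sample", sample), Stage("verify", verify), Stage("format", format_sft)])
rows = [
    {"id": 1, "question": "What is 2 + 2?", "reference": "4"},
    {"id": 2, "question": "What is 3 + 4?", "reference": "7"},
]
result = run_pipeline(recipe, rows)
```

In this toy run the second row is dropped because none of its sampled candidates matches the reference, which is the "rejection" step in miniature.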

Project Structure

rejection-sampling-recipes/
├── src/                          # Core framework
│   ├── base.py                   # Stage and BaseRecipe base classes
│   ├── pipeline.py               # Pipeline execution engine
│   └── utils/                    # Data I/O utilities
├── recipes/                      # Recipe implementations
│   ├── text_sft_simple/          # Text-only recipe
│   └── vl_cot_sft_plus_parse/    # Text + image recipe (with answer parsing)
├── scripts/                      # Utility scripts
└── tests/                        # Test files and mock data

Installation

Prerequisites

  • Python 3.10 or higher
  • (Optional) uv for faster dependency management

Option A: Using uv (Recommended)

uv sync

Option B: Using pip

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Development Installation

# Install with development dependencies
uv sync --extra dev

# Or with pip
pip install -e ".[dev]"

Quick Start

All recipes read JSONL input: one JSON object per line, each with a messages field in the OpenAI chat format.

Input Format Examples

Text-only (answer inside assistant response):

{"id": 1, "messages": [{"role": "user", "content": "Q?"}, {"role": "assistant", "content": "A"}]}

Multimodal (with images):

{
  "id": 1,
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "images/foo.jpg", "image_wh": [640, 480]}},
        {"type": "text", "text": "What is shown?"}
      ]
    },
    {"role": "assistant", "content": "A short answer."}
  ],
  "abs_path": "/abs/path/to/image/base"
}
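
Before launching a long run, it can help to check that every input line matches one of the two shapes above. The following is a rough sketch of such a check, not a utility shipped with this repo; validate_item and iter_jsonl are hypothetical names, and the rules encode only what the examples show (string content, or a list of text and image_url parts).

```python
import json


def validate_item(item: dict) -> list:
    """Return a list of problems; an empty list means the item looks well-formed."""
    messages = item.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict) or "role" not in msg or "content" not in msg:
            problems.append(f"messages[{i}] needs 'role' and 'content'")
            continue
        content = msg["content"]
        if isinstance(content, str):
            continue  # text-only message
        if not isinstance(content, list):
            problems.append(f"messages[{i}].content must be a string or a list of parts")
            continue
        for j, part in enumerate(content):
            kind = part.get("type") if isinstance(part, dict) else None
            if kind == "text" and isinstance(part.get("text"), str):
                continue
            if (
                kind == "image_url"
                and isinstance(part.get("image_url"), dict)
                and "url" in part["image_url"]
            ):
                continue
            problems.append(f"messages[{i}].content[{j}] is not a valid text or image_url part")
    return problems


def iter_jsonl(lines):
    """Yield (line_number, parsed_object) for each non-empty JSONL line."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if line:
            yield line_no, json.loads(line)


text_line = '{"id": 1, "messages": [{"role": "user", "content": "Q?"}, {"role": "assistant", "content": "A"}]}'
for line_no, obj in iter_jsonl([text_line, ""]):
    print(line_no, validate_item(obj))
```

To check a file, pass an open file handle to iter_jsonl instead of the in-memory list used here.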

Available Recipes

1. text_sft_simple (Text-only)

A simple text-only recipe for basic SFT data processing.

bash recipes/text_sft_simple/entrypoint/run.sh

2. vl_cot_sft_plus_parse (Text + Image + Answer Parsing)

A multimodal recipe with Chain-of-Thought and answer parsing.

# Step 1: Add absolute image paths ("abs_path")
python scripts/preprocess_images.py \
  --input tests/mock/text-pic.jsonl \
  --image-base-path /abs/path/to/images \
  --abs-image-path-field abs_path

# Step 2: Run the recipe
bash recipes/vl_cot_sft_plus_parse/entrypoint/run-30b/run-30b.sh
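
As a rough illustration of what Step 1 is for, the sketch below attaches a base directory to a record and then resolves relative image URLs against it. This is an assumption about the intent, inferred from the input format shown earlier; it is not the code in scripts/preprocess_images.py, and add_abs_path and resolve_image_paths are hypothetical helper names.

```python
import os


def add_abs_path(item: dict, image_base_path: str, field: str = "abs_path") -> dict:
    """Return a copy of the record with the absolute image base directory attached."""
    return {**item, field: os.path.abspath(image_base_path)}


def resolve_image_paths(item: dict, field: str = "abs_path") -> list:
    """Join each relative image URL in the messages with the record's base directory."""
    base = item[field]
    paths = []
    for msg in item["messages"]:
        content = msg["content"]
        if isinstance(content, str):
            continue  # text-only message, no images
        for part in content:
            if part.get("type") == "image_url":
                url = part["image_url"]["url"]
                # Absolute paths are kept as-is; relative ones are joined to the base.
                paths.append(url if os.path.isabs(url) else os.path.join(base, url))
    return paths


record = {
    "id": 1,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "images/foo.jpg"}},
                {"type": "text", "text": "What is shown?"},
            ],
        },
        {"role": "assistant", "content": "A short answer."},
    ],
}
record = add_abs_path(record, "/abs/path/to/images")
image_paths = resolve_image_paths(record)
```

Storing the base directory once per record, rather than rewriting every URL, keeps the original relative paths intact in the dataset.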

Model Service Setup

See scripts/launch_serve/README.md for model service setup and launch steps.

Note: The default launch scripts call ray stop, which shuts down any Ray processes already running on the machine. To run multiple scripts at the same time, use separate machines or remove the ray stop call.

Logging

| Environment Variable | Default | Description |
|----------------------|---------|-------------|
| LOG_DIR              | logs/   | Log directory (falls back to /tmp/rejection-sampling-recipes-logs/) |
| LOG_MAX_BYTES        | 10MB    | Maximum log file size |
| LOG_BACKUP_COUNT     | 5       | Number of backup log files |
| LOG_FILE_LEVEL       | DEBUG   | File logging level |
| LOG_CONSOLE_LEVEL    | INFO    | Console logging level |

Log files:

  • pipeline.log - Driver logs
  • pipeline_worker_<pid>.log - Ray worker logs
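
The settings above map naturally onto Python's standard rotating file handler. The sketch below shows one way they could be wired up; it is not the repo's logging module. The helper names (parse_size, pick_log_dir, build_logger), the accepted size suffixes, and the choice of 1 MB = 1024 * 1024 bytes are all assumptions.

```python
import logging
import os
import sys
from logging.handlers import RotatingFileHandler

FALLBACK_LOG_DIR = "/tmp/rejection-sampling-recipes-logs/"


def parse_size(text: str) -> int:
    """Parse '10MB', '512KB', or a plain byte count into bytes (binary units assumed)."""
    text = text.strip().upper()
    for suffix, factor in (("KB", 1024), ("MB", 1024**2), ("GB", 1024**3)):
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)


def pick_log_dir(env) -> str:
    """Use LOG_DIR if it can be created, otherwise fall back to the /tmp directory."""
    for candidate in (env.get("LOG_DIR", "logs/"), FALLBACK_LOG_DIR):
        try:
            os.makedirs(candidate, exist_ok=True)
            return candidate
        except OSError:
            continue
    raise OSError("no usable log directory")


def build_logger(name: str = "pipeline", env=None) -> logging.Logger:
    """Create a logger with a rotating file handler and a console handler."""
    env = os.environ if env is None else env
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # handlers do the actual filtering
    logger.handlers.clear()

    file_handler = RotatingFileHandler(
        os.path.join(pick_log_dir(env), f"{name}.log"),
        maxBytes=parse_size(env.get("LOG_MAX_BYTES", "10MB")),
        backupCount=int(env.get("LOG_BACKUP_COUNT", "5")),
    )
    file_handler.setLevel(env.get("LOG_FILE_LEVEL", "DEBUG"))

    console_handler = logging.StreamHandler(sys.stderr)
    console_handler.setLevel(env.get("LOG_CONSOLE_LEVEL", "INFO"))

    logger.addHandler(file_handler)
    logger.addHandler(console_handler)
    return logger
```

Under the same assumptions, a worker process could call build_logger(f"pipeline_worker_{os.getpid()}") to get its own file, matching the per-worker file naming listed above.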

Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Workflow

# Install dev dependencies
uv sync --extra dev

# Run linting
uvx ruff check .

# Run formatting
uvx ruff format .

# Run tests
pytest

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this project in your research, please consider citing:

@software{rejection_sampling_recipes,
  title = {Rejection Sampling Recipes},
  author = {guox18},
  year = {2024},
  url = {https://github.com/guox18/rejection-sampling-recipes}
}
