2 changes: 2 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,8 @@ For a comprehensive quick start guide covering environment setup, data preparati

We also provide examples for some use cases not covered in the quick start guide; please check [examples](examples/).

Python 3.10+ is required (uses `zip(..., strict=True)` in core utilities).

## Projects Built upon slime

slime has powered several novel research projects and production systems. Here are some notable examples:
Expand Down
13 changes: 13 additions & 0 deletions examples/tau-bench/.gitignore
@@ -0,0 +1,13 @@
/outputs/

# Local secrets (template is tracked)
/tau2/.env

# Python caches
**/__pycache__/
*.pyc

# Logs / experiment trackers
wandb/
weave/
*.log
75 changes: 13 additions & 62 deletions examples/tau-bench/README.md
@@ -1,68 +1,19 @@
# Tau bench
This example shows slime training in an agentic multi-turn tool use environment.
# Tau-Bench: Multi-Turn Tool-Use Training

This folder provides two benchmark entrypoints with parallel conventions. The canonical documentation lives in `examples/tau-bench/training_cookbook.md`; other docs link into it without duplication.

## Environment Setup
Use the `zhuzilin/slime:latest` image and initialize the environment required for tau-bench:
| Benchmark | Repo | Domains | Dual-control | Primary metric | Folder |
|----------|------|---------|--------------|----------------|--------|
| Tau1 | https://github.com/sierra-research/tau-bench | airline, retail | no | pass@1 | `examples/tau-bench/tau1/` |
| Tau2 | https://github.com/sierra-research/tau2-bench | airline, retail, telecom | yes (telecom user-only tools) | pass@4 + pass@1 | `examples/tau-bench/tau2/` |
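
The pass@k numbers named in the table can be estimated with the standard unbiased estimator (a sketch; the harnesses above may compute the metric differently):

```python
from math import comb

# Unbiased pass@k estimator (Chen et al., 2021): probability that at
# least one of k rollouts drawn (without replacement) from n samples
# with c successes passes.
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)

p1 = pass_at_k(4, 1, 1)  # one passing rollout out of four -> 0.25
p4 = pass_at_k(4, 1, 4)  # the passing rollout is always included -> 1.0
```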

```bash
cd /root/
git clone https://github.com/THUDM/slime.git
cd slime
pip install -e .
# for tau bench
cd /root/
git clone https://github.com/JD-ETH/tau-bench.git
cd tau-bench
git checkout feature/litellm-retry
pip install -e .
```
### Quick Links
- Training cookbook: `examples/tau-bench/training_cookbook.md`.
- Tau1 README: `examples/tau-bench/tau1/README.md`.
- Tau2 implementation: `examples/tau-bench/tau2/README.md`.

Use the following script to generate mock data for slime training.
Note: Tau1 includes a small offline stub for debug/CI without external API keys.

```bash
cd /root/slime/examples/tau-bench
python tau1_mock.py --local_dir /root/tau-bench/
```
### Outputs

Initialize the Qwen3-4B-Instruct model needed for tool use:

```bash
# hf checkpoint
huggingface-cli download Qwen/Qwen3-4B-Instruct-2507 --local-dir /root/Qwen3-4B-Instruct-2507

# mcore checkpoint
cd /root/slime
source scripts/models/qwen3-4B-Instruct-2507.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
${MODEL_ARGS[@]} \
--hf-checkpoint /root/Qwen3-4B-Instruct-2507 \
--save /root/Qwen3-4B-Instruct-2507_torch_dist
```

## Running the Script

You need to configure your litellm API in `generate_with_tau.py` for user simulation:

```python
TAU_CONFIGS = {
    "env": "retail",  # Select between ["retail", "airline"]
    "agent": "tool-calling",  # Select between ["tool-calling", "act", "react", "few-shot"]; only tool-calling implemented for now
    "user_model": "gemini-2.0-flash-lite",  # Cheap model for user simulator
    "user_model_provider": "gemini",
    "task_split": "train",  # Select between ["train", "test", "dev"] for retail, ["test"] for airline
    "user_strategy": "llm",  # Select between ["llm", "react", "verify", "reflection"]
    "model_provider": "auto_router",  # Unused, required
    "model": "qwen3-4b",  # Unused, required
}
# Replace with your actual API key for user sim
GEMINI_API_KEY = "YOUR KEY"
```

And run:

```bash
cd /root/slime
bash examples/tau-bench/run_qwen3_4B.sh
```
All generated artifacts are written under `TAU_BENCH_OUT_DIR` (default: `examples/tau-bench/outputs`) and are gitignored. The cookbook assumes the `slimerl/slime:latest` container baseline.
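
For example, to redirect artifacts (assuming the scripts read `TAU_BENCH_OUT_DIR` from the environment, per the note above):

```shell
# Override the output root before launching; falls back to the default.
TAU_BENCH_OUT_DIR="${TAU_BENCH_OUT_DIR:-examples/tau-bench/outputs}"
mkdir -p "$TAU_BENCH_OUT_DIR"
echo "artifacts -> $TAU_BENCH_OUT_DIR"
```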
Binary file added examples/tau-bench/public/performance-chart.jpeg
31 changes: 0 additions & 31 deletions examples/tau-bench/sglang_tool_parser.py

This file was deleted.

74 changes: 74 additions & 0 deletions examples/tau-bench/tau1/README.md
@@ -0,0 +1,74 @@
# Tau1 Bench (tau-bench)

This example shows slime training in an agentic multi-turn tool use environment.

## Environment Setup
Use the `slimerl/slime:latest` image and initialize the environment required for Tau1:

```bash
cd /root/
git clone https://github.com/THUDM/slime.git
cd slime
pip install -e .
# for tau bench
cd /root/
git clone https://github.com/JD-ETH/tau-bench.git
cd tau-bench
git checkout feature/litellm-retry
pip install -e .
```

Use the following script to generate mock data for slime training.

```bash
cd /root/slime/examples/tau-bench/tau1
python tau1_mock.py --local_dir /root/tau-bench/
```

Initialize the Qwen3-4B-Instruct model needed for tool use:

```bash
# hf checkpoint
huggingface-cli download Qwen/Qwen3-4B-Instruct-2507 --local-dir /root/Qwen3-4B-Instruct-2507

# mcore checkpoint
cd /root/slime
source scripts/models/qwen3-4B-Instruct-2507.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
${MODEL_ARGS[@]} \
--hf-checkpoint /root/Qwen3-4B-Instruct-2507 \
--save /root/Qwen3-4B-Instruct-2507_torch_dist
```

## Running the Script

You need to configure your litellm API in `generate_with_tau.py` for user simulation:

```python
TAU_CONFIGS = {
    "env": "retail",  # Select between ["retail", "airline"]
    "agent": "tool-calling",  # Select between ["tool-calling", "act", "react", "few-shot"]; only tool-calling implemented for now
    "user_model": "gemini-2.0-flash-lite",  # Cheap model for user simulator
    "user_model_provider": "gemini",
    "task_split": "train",  # Select between ["train", "test", "dev"] for retail, ["test"] for airline
    "user_strategy": "llm",  # Select between ["llm", "react", "verify", "reflection"]
    "model_provider": "auto_router",  # Unused, required
    "model": "qwen3-4b",  # Unused, required
}
# Replace with your actual API key for user sim
GEMINI_API_KEY = "YOUR KEY"
```

And run:

```bash
cd /root/slime
bash examples/tau-bench/tau1/run_qwen3_4B.sh
```

## Known gotchas
- If you use an OpenAI-compatible server (e.g., sglang), set `OPENAI_API_BASE` and run tau-bench with `--model-provider openai` (not `openai_like`).
- For tau-bench CLI runs, use a slashless `--model` name (e.g., `Qwen3-4B-Instruct-2507`) to avoid log path errors.
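
A hedged sketch of applying both notes above against a local sglang server (URL, port, and entrypoint are placeholders; verify against your tau-bench checkout):

```shell
# Point tau-bench's OpenAI-compatible client at a local sglang server.
export OPENAI_API_BASE="http://127.0.0.1:30000/v1"
export OPENAI_API_KEY="dummy-key"  # sglang does not validate the key by default
# Then, from the tau-bench checkout, with a slashless model name:
# python run.py --env retail --model Qwen3-4B-Instruct-2507 --model-provider openai
```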

## Debugging and CI notes
- For offline or CPU-only debugging, you can set `user_model_provider="stub"` in `generate_with_tau.py` to bypass external API calls while preserving episode logging.
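
A hypothetical sketch of what such a stub can look like: it replays canned user turns so episodes run offline. All names here (`StubUserModel`, `next_turn`) are illustrative, not the actual tau-bench API; `###STOP###` mirrors tau-bench's end-of-dialogue convention.

```python
from dataclasses import dataclass, field


# Illustrative stub user simulator: replays a fixed script instead of
# calling an external LLM, so episodes can run without API keys.
@dataclass
class StubUserModel:
    script: list[str] = field(default_factory=lambda: [
        "Hi, I want to return an item from my last order.",
        "###STOP###",  # end-of-dialogue marker
    ])
    _i: int = 0

    def next_turn(self, assistant_message: str) -> str:
        # Return the next scripted turn; repeat the last one if exhausted.
        reply = self.script[min(self._i, len(self.script) - 1)]
        self._i += 1
        return reply


user = StubUserModel()
first = user.next_turn("How can I help you today?")
done = user.next_turn("Sure, which item?") == "###STOP###"
```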
60 changes: 60 additions & 0 deletions examples/tau-bench/tau1/episode_logger.py
@@ -0,0 +1,60 @@
from __future__ import annotations

import hashlib
import json
import os
import time
from dataclasses import dataclass
from typing import Any


def _truncate(value: str | None, max_chars: int = 8000) -> str | None:
if value is None:
return None
if len(value) <= max_chars:
return value
return value[:max_chars] + f"\n...[truncated {len(value) - max_chars} chars]"


def _sha256_text(value: str) -> str:
return hashlib.sha256(value.encode("utf-8", errors="ignore")).hexdigest()


@dataclass
class EpisodeLogger:
log_dir: str
run_meta: dict[str, Any]

def __post_init__(self) -> None:
os.makedirs(self.log_dir, exist_ok=True)
self._jsonl_path = os.path.join(self.log_dir, "episode.jsonl")
self._summary_path = os.path.join(self.log_dir, "summary.json")
self._run_meta_path = os.path.join(self.log_dir, "run_meta.json")

with open(self._run_meta_path, "w", encoding="utf-8") as f:
json.dump(self.run_meta, f, ensure_ascii=False, indent=2)

def log_step(self, record: dict[str, Any]) -> None:
# Control field size: truncate long strings and append hashes.
for key in [
"assistant_raw",
"assistant",
"user_text",
"observation",
"tool_result",
"env_state",
"normal_text",
"tool_parse_error",
"error",
]:
if key in record and isinstance(record[key], str):
record[f"{key}_hash"] = _sha256_text(record[key])
record[key] = _truncate(record[key])

record["ts"] = time.time()
with open(self._jsonl_path, "a", encoding="utf-8") as f:
f.write(json.dumps(record, ensure_ascii=False) + "\n")

def finalize(self, summary: dict[str, Any]) -> None:
with open(self._summary_path, "w", encoding="utf-8") as f:
json.dump(summary, f, ensure_ascii=False, indent=2)
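
The truncate-and-hash pattern in `log_step` can be exercised standalone; this is a self-contained rehearsal of what `_truncate`/`_sha256_text` above do, not part of the module:

```python
import hashlib
import json
import time


# Mirror of the logger's field-size control: clip long strings but keep
# a digest so the full value stays verifiable against raw transcripts.
def truncate(value: str, max_chars: int = 8000) -> str:
    if len(value) <= max_chars:
        return value
    return value[:max_chars] + f"\n...[truncated {len(value) - max_chars} chars]"


record = {"assistant": "x" * 9000}
record["assistant_hash"] = hashlib.sha256(record["assistant"].encode("utf-8")).hexdigest()
record["assistant"] = truncate(record["assistant"])
record["ts"] = time.time()
line = json.dumps(record, ensure_ascii=False)  # one JSONL row
```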