2 changes: 2 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,8 @@ For a comprehensive quick start guide covering environment setup, data preparati

We also provide examples for some use cases not covered in the quick start guide; please check [examples](examples/).

Python 3.10+ is required (uses `zip(..., strict=True)` in core utilities).

## Projects Built upon slime

slime has powered several novel research projects and production systems. Here are some notable examples:
Expand Down
13 changes: 13 additions & 0 deletions examples/tau-bench/.gitignore
@@ -0,0 +1,13 @@
/outputs/

# Local secrets (template is tracked)
/tau2/.env

# Python caches
**/__pycache__/
*.pyc

# Logs / experiment trackers
wandb/
weave/
*.log
75 changes: 13 additions & 62 deletions examples/tau-bench/README.md
@@ -1,68 +1,19 @@
# Tau bench
This example shows slime training in an agentic multi-turn tool use environment.
# Tau-Bench: Multi-Turn Tool-Use Training

This folder provides two benchmark entrypoints with parallel conventions. The canonical documentation lives in `examples/tau-bench/training_cookbook.md`; other docs link into it without duplication.

## Environment Setup
Use the `zhuzilin/slime:latest` image and initialize the environment required for tau-bench:
| Benchmark | Repo | Domains | Dual-control | Primary metric | Folder |
|----------|------|---------|--------------|----------------|--------|
| Tau1 | https://github.com/sierra-research/tau-bench | airline, retail | no | pass@1 | `examples/tau-bench/tau1/` |
| Tau2 | https://github.com/sierra-research/tau2-bench | airline, retail, telecom | yes (telecom user-only tools) | pass@4 + pass@1 | `examples/tau-bench/tau2/` |
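
The pass@k numbers named in the table can be estimated with the standard unbiased estimator (a sketch; the harnesses above may compute the metric differently):

```python
from math import comb

# Unbiased pass@k estimator (Chen et al., 2021): probability that at
# least one of k rollouts drawn (without replacement) from n samples
# with c successes passes.
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)

p1 = pass_at_k(4, 1, 1)  # one passing rollout out of four -> 0.25
p4 = pass_at_k(4, 1, 4)  # the passing rollout is always included -> 1.0
```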

```bash
cd /root/
git clone https://github.com/THUDM/slime.git
cd slime
pip install -e .
# for tau bench
cd /root/
git clone https://github.com/JD-ETH/tau-bench.git
cd tau-bench
git checkout feature/litellm-retry
pip install -e .
```
### Quick Links
- Training cookbook: `examples/tau-bench/training_cookbook.md`.
- Tau1 README: `examples/tau-bench/tau1/README.md`.
- Tau2 implementation: `examples/tau-bench/tau2/README.md`.

Use the following script to generate mock data for slime training.
Note: Tau1 includes a small offline stub for debug/CI without external API keys.

```bash
cd /root/slime/examples/tau-bench
python tau1_mock.py --local_dir /root/tau-bench/
```
### Outputs

Initialize the Qwen3-4B-Instruct model needed for tool use:

```bash
# hf checkpoint
huggingface-cli download Qwen/Qwen3-4B-Instruct-2507 --local-dir /root/Qwen3-4B-Instruct-2507

# mcore checkpoint
cd /root/slime
source scripts/models/qwen3-4B-Instruct-2507.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
${MODEL_ARGS[@]} \
--hf-checkpoint /root/Qwen3-4B-Instruct-2507 \
--save /root/Qwen3-4B-Instruct-2507_torch_dist
```

## Running the Script

You need to configure your litellm API in `generate_with_tau.py` for user simulation:

```python
TAU_CONFIGS = {
    "env": "retail",  # Select between ["retail", "airline"]
    "agent": "tool-calling",  # Select between ["tool-calling", "act", "react", "few-shot"]; only tool-calling implemented for now
    "user_model": "gemini-2.0-flash-lite",  # Cheap model for user simulator
    "user_model_provider": "gemini",
    "task_split": "train",  # Select between ["train", "test", "dev"] for retail, ["test"] for airline
    "user_strategy": "llm",  # Select between ["llm", "react", "verify", "reflection"]
    "model_provider": "auto_router",  # Unused, required
    "model": "qwen3-4b",  # Unused, required
}
# Replace with your actual API key for user sim
GEMINI_API_KEY = "YOUR KEY"
```

And run:

```bash
cd /root/slime
bash examples/tau-bench/run_qwen3_4B.sh
```
All generated artifacts are written under `TAU_BENCH_OUT_DIR` (default: `examples/tau-bench/outputs`) and are gitignored. The cookbook assumes the `slimerl/slime:latest` container baseline.
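
For example, to redirect artifacts (assuming the scripts read `TAU_BENCH_OUT_DIR` from the environment, per the note above):

```shell
# Override the output root before launching; falls back to the default.
TAU_BENCH_OUT_DIR="${TAU_BENCH_OUT_DIR:-examples/tau-bench/outputs}"
mkdir -p "$TAU_BENCH_OUT_DIR"
echo "artifacts -> $TAU_BENCH_OUT_DIR"
```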
Binary file added examples/tau-bench/public/performance-chart.jpeg
31 changes: 0 additions & 31 deletions examples/tau-bench/sglang_tool_parser.py

This file was deleted.

74 changes: 74 additions & 0 deletions examples/tau-bench/tau1/README.md
@@ -0,0 +1,74 @@
# Tau1 Bench (tau-bench)

This example shows slime training in an agentic multi-turn tool use environment.

## Environment Setup
Use the `slimerl/slime:latest` image and initialize the environment required for Tau1:

```bash
cd /root/
git clone https://github.com/THUDM/slime.git
cd slime
pip install -e .
# for tau bench
cd /root/
git clone https://github.com/JD-ETH/tau-bench.git
cd tau-bench
git checkout feature/litellm-retry
pip install -e .
```

Use the following script to generate mock data for slime training.

```bash
cd /root/slime/examples/tau-bench/tau1
python tau1_mock.py --local_dir /root/tau-bench/
```

Initialize the Qwen3-4B-Instruct model needed for tool use:

```bash
# hf checkpoint
huggingface-cli download Qwen/Qwen3-4B-Instruct-2507 --local-dir /root/Qwen3-4B-Instruct-2507

# mcore checkpoint
cd /root/slime
source scripts/models/qwen3-4B-Instruct-2507.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
${MODEL_ARGS[@]} \
--hf-checkpoint /root/Qwen3-4B-Instruct-2507 \
--save /root/Qwen3-4B-Instruct-2507_torch_dist
```

## Running the Script

You need to configure your litellm API in `generate_with_tau.py` for user simulation:

```python
TAU_CONFIGS = {
    "env": "retail",  # Select between ["retail", "airline"]
    "agent": "tool-calling",  # Select between ["tool-calling", "act", "react", "few-shot"]; only tool-calling implemented for now
    "user_model": "gemini-2.0-flash-lite",  # Cheap model for user simulator
    "user_model_provider": "gemini",
    "task_split": "train",  # Select between ["train", "test", "dev"] for retail, ["test"] for airline
    "user_strategy": "llm",  # Select between ["llm", "react", "verify", "reflection"]
    "model_provider": "auto_router",  # Unused, required
    "model": "qwen3-4b",  # Unused, required
}
# Replace with your actual API key for user sim
GEMINI_API_KEY = "YOUR KEY"
```

And run:

```bash
cd /root/slime
bash examples/tau-bench/tau1/run_qwen3_4B.sh
```

## Known gotchas
- If you use an OpenAI-compatible server (e.g., sglang), set `OPENAI_API_BASE` and run tau-bench with `--model-provider openai` (not `openai_like`).
- For tau-bench CLI runs, use a slashless `--model` name (e.g., `Qwen3-4B-Instruct-2507`) to avoid log path errors.
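
A hedged sketch of applying both notes above against a local sglang server (URL, port, and entrypoint are placeholders; verify against your tau-bench checkout):

```shell
# Point tau-bench's OpenAI-compatible client at a local sglang server.
export OPENAI_API_BASE="http://127.0.0.1:30000/v1"
export OPENAI_API_KEY="dummy-key"  # sglang does not validate the key by default
# Then, from the tau-bench checkout, with a slashless model name:
# python run.py --env retail --model Qwen3-4B-Instruct-2507 --model-provider openai
```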

## Debugging and CI notes
- For offline or CPU-only debugging, you can set `user_model_provider="stub"` in `generate_with_tau.py` to bypass external API calls while preserving episode logging.
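
A hypothetical sketch of what such a stub can look like: it replays canned user turns so episodes run offline. All names here (`StubUserModel`, `next_turn`) are illustrative, not the actual tau-bench API; `###STOP###` mirrors tau-bench's end-of-dialogue convention.

```python
from dataclasses import dataclass, field


# Illustrative stub user simulator: replays a fixed script instead of
# calling an external LLM, so episodes can run without API keys.
@dataclass
class StubUserModel:
    script: list[str] = field(default_factory=lambda: [
        "Hi, I want to return an item from my last order.",
        "###STOP###",  # end-of-dialogue marker
    ])
    _i: int = 0

    def next_turn(self, assistant_message: str) -> str:
        # Return the next scripted turn; repeat the last one if exhausted.
        reply = self.script[min(self._i, len(self.script) - 1)]
        self._i += 1
        return reply


user = StubUserModel()
first = user.next_turn("How can I help you today?")
done = user.next_turn("Sure, which item?") == "###STOP###"
```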
60 changes: 60 additions & 0 deletions examples/tau-bench/tau1/episode_logger.py
@@ -0,0 +1,60 @@
from __future__ import annotations

import hashlib
import json
import os
import time
from dataclasses import dataclass
from typing import Any


def _truncate(value: str | None, max_chars: int = 8000) -> str | None:
if value is None:
return None
if len(value) <= max_chars:
return value
return value[:max_chars] + f"\n...[truncated {len(value) - max_chars} chars]"


def _sha256_text(value: str) -> str:
return hashlib.sha256(value.encode("utf-8", errors="ignore")).hexdigest()


@dataclass
class EpisodeLogger:
log_dir: str
run_meta: dict[str, Any]

def __post_init__(self) -> None:
os.makedirs(self.log_dir, exist_ok=True)
self._jsonl_path = os.path.join(self.log_dir, "episode.jsonl")
self._summary_path = os.path.join(self.log_dir, "summary.json")
self._run_meta_path = os.path.join(self.log_dir, "run_meta.json")

with open(self._run_meta_path, "w", encoding="utf-8") as f:
json.dump(self.run_meta, f, ensure_ascii=False, indent=2)

def log_step(self, record: dict[str, Any]) -> None:
# Control field size: truncate long strings and append hashes.
for key in [
"assistant_raw",
"assistant",
"user_text",
"observation",
"tool_result",
"env_state",
"normal_text",
"tool_parse_error",
"error",
]:
if key in record and isinstance(record[key], str):
record[f"{key}_hash"] = _sha256_text(record[key])
record[key] = _truncate(record[key])

record["ts"] = time.time()
with open(self._jsonl_path, "a", encoding="utf-8") as f:
f.write(json.dumps(record, ensure_ascii=False) + "\n")

def finalize(self, summary: dict[str, Any]) -> None:
with open(self._summary_path, "w", encoding="utf-8") as f:
json.dump(summary, f, ensure_ascii=False, indent=2)
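
The truncate-and-hash pattern in `log_step` can be exercised standalone; this is a self-contained rehearsal of what `_truncate`/`_sha256_text` above do, not part of the module:

```python
import hashlib
import json
import time


# Mirror of the logger's field-size control: clip long strings but keep
# a digest so the full value stays verifiable against raw transcripts.
def truncate(value: str, max_chars: int = 8000) -> str:
    if len(value) <= max_chars:
        return value
    return value[:max_chars] + f"\n...[truncated {len(value) - max_chars} chars]"


record = {"assistant": "x" * 9000}
record["assistant_hash"] = hashlib.sha256(record["assistant"].encode("utf-8")).hexdigest()
record["assistant"] = truncate(record["assistant"])
record["ts"] = time.time()
line = json.dumps(record, ensure_ascii=False)  # one JSONL row
```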