Add tau2-bench training cookbook and implementation #1156
jbarnes850 wants to merge 17 commits into THUDM:main from
Conversation
This PR adds a complete implementation and training cookbook for tau2-bench, demonstrating how to train multi-turn tool-use agents using the SFT → RFT → GRPO pipeline.

Key additions:
- Complete tau2-bench training cookbook with methodology deep-dive
- Tau2 implementation (rollout, eval, reward shaping, actions)
- Training scripts for SFT and GRPO with shaped rewards
- Unified eval.py supporting Pass@1 (greedy) and Pass@K (multi-sampling)
- Performance results: 57.1% Pass@4 (4× baseline improvement)

Resources:
- Training data: tau2-sft-seed-v3 (~3K filtered trajectories)
- Checkpoints: Qwen3-4B-tau2-sft1, Qwen3-4B-tau2-grpo-v1
- WandB logs: full training metrics and sample outputs

The cookbook explains credit assignment in multi-turn tasks, progressive training benefits, and turn-level reward shaping with research citations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
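To make the turn-level reward shaping described above concrete, the snippet below is a minimal sketch of blending the sparse end-of-task reward with dense turn-level partial scores under a domain-adaptive weight. The function name, the default `alpha`, and the convex-combination rule are illustrative assumptions, not the PR's actual reward.py logic.

```python
from typing import Sequence


def shaped_reward(task_success: float,
                  partial_scores: Sequence[float],
                  alpha: float = 0.3) -> float:
    """Blend the sparse end-of-episode reward with dense turn-level signal.

    task_success   -- 1.0 if the tau2-bench task check passes, else 0.0
    partial_scores -- per-turn partial credit in [0, 1], e.g. the fraction of
                      required tool calls completed so far
    alpha          -- weight on the dense component; domain-adaptive in spirit,
                      fixed here only for illustration
    """
    if not partial_scores:
        return task_success
    dense = sum(partial_scores) / len(partial_scores)
    # Convex combination keeps the shaped reward bounded in [0, 1].
    return (1.0 - alpha) * task_success + alpha * dense
```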
- Standardize on Qwen3 native function calling format only
- Remove legacy action format support
- Simplify action parsing and observation formatting
- Clean up reward shaping and prompting utilities
- Reduce code complexity across tau2 modules

This commit reduces code by 1,369 lines while preserving all functionality.
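As a rough illustration of the Qwen3-style native function-calling output this commit standardizes on, here is a minimal parser for `<tool_call>` blocks. The tag name and JSON payload shape follow the common Qwen chat-template convention; this is a hedged sketch, not the PR's actions.py implementation.

```python
import json
import re

# Non-greedy match of everything between <tool_call> tags (assumed Qwen-style template).
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(completion: str) -> list[dict]:
    """Extract {"name": ..., "arguments": {...}} dicts from a model completion.

    Malformed JSON blocks are skipped rather than raised on, so a single bad
    call does not abort the whole rollout during RL exploration.
    """
    calls = []
    for match in TOOL_CALL_RE.finditer(completion):
        try:
            call = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # tolerate malformed output from the policy
        if isinstance(call, dict) and "name" in call:
            calls.append(call)
    return calls
```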
Force-pushed from f56c41a to 10f5405
Pull request overview
This PR adds a comprehensive implementation and training cookbook for tau2-bench, demonstrating how to train multi-turn tool-use agents using a progressive SFT → RFT → GRPO pipeline. The implementation achieves 57.1% Pass@4 on tau2-bench (4× baseline improvement) with a 4B parameter model.
Key Changes:
- Complete tau2-bench training pipeline with SFT, rejection sampling (RFT), and GRPO stages
- Unified evaluation harness supporting both Pass@1 (greedy) and Pass@K (multi-sampling) metrics (a Pass@K estimator sketch follows this list)
- Dense reward shaping using turn-level partial scores with domain-adaptive weighting
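The Pass@K numbers referenced in this review are commonly computed with the unbiased estimator of Chen et al. (2021); whether the PR's eval.py uses this exact estimator or the simpler "any of K samples passed" rule is an assumption here. A minimal sketch:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n rollouts sampled for a task, c of them solved it."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each entry is (num_samples, num_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```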
Reviewed changes
Copilot reviewed 16 out of 23 changed files in this pull request and generated 13 comments.
Summary per file:
| File | Description |
|---|---|
| examples/tau-bench/training_cookbook.md | Comprehensive training guide with methodology, performance results, and reproduction instructions |
| examples/tau-bench/tau2/tasks.py | Task preprocessing script to generate JSONL index files for training |
| examples/tau-bench/tau2/rollout.py | Custom rollout function for GRPO with tau2-bench environment integration |
| examples/tau-bench/tau2/reward.py | Reward shaping implementation with domain-adaptive alpha and curriculum learning |
| examples/tau-bench/tau2/prompting.py | Compressed system prompts for reduced KV cache pressure during RL training |
| examples/tau-bench/tau2/actions.py | Action parsing supporting both native FC and legacy formats with robust error handling |
| examples/tau-bench/tau2/env.py | Environment wrapper utilities for tau2-bench with partial score computation |
| examples/tau-bench/tau2/eval.py | Unified evaluation script supporting Pass@K sampling with WandB/Weave integration |
| examples/tau-bench/tau2/run_sft.sh | SFT training script using filtered trajectories from rejection sampling |
| examples/tau-bench/tau2/run_grpo.sh | GRPO training script with shaped rewards and curriculum learning |
| examples/tau-bench/tau2/start_user_sim_server.sh | User simulator server startup script for multi-turn RL rollouts |
| examples/tau-bench/tau2/.env.template | Environment variable template for API keys and configuration |
| examples/tau-bench/tau2/README.md | Component overview and usage instructions |
| examples/tau-bench/README.md | Updated main README with tau1/tau2 benchmark comparison |
| examples/tau-bench/.gitignore | Ignore patterns for outputs and local files |
| examples/tau-bench/tau1/* | Legacy tau1 implementation files (context) |
Comments suppressed due to low confidence (4)
examples/tau-bench/tau2/eval.py:438 - The default user model is set to "gpt-4.1-mini" which does not exist. OpenAI's model naming convention is "gpt-4o-mini" or "gpt-4-turbo". The version "4.1" is not a valid OpenAI model identifier.
examples/tau-bench/tau2/env.py:192 - The variable name "denom" is abbreviated and unclear. Consider using "total_weight" or "weight_sum" for better code readability.
examples/tau-bench/tau2/reward.py:324 - The WandB logging code accesses _curriculum_tracker._lock directly (line 302), which breaks encapsulation and could lead to maintenance issues. The underscore prefix indicates this is a private attribute that shouldn't be accessed outside the class. Consider adding a public method to the _TaskCurriculumTracker class that provides the needed statistics in a thread-safe manner (a sketch of such an accessor follows this list).
examples/tau-bench/tau2/reward.py:327 - 'except' clause does nothing but pass and there is no explanatory comment.
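Responding to the reward.py:324 comment above, here is a minimal sketch of the kind of public, thread-safe accessor it suggests. The class name mirrors the one mentioned in the review, but the tracked fields and method names are assumptions, not the PR's actual implementation.

```python
import threading


class _TaskCurriculumTracker:
    """Tracks per-task attempt/solve counts for curriculum weighting (illustrative)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._attempts: dict[str, int] = {}
        self._solves: dict[str, int] = {}

    def record(self, task_id: str, solved: bool) -> None:
        with self._lock:
            self._attempts[task_id] = self._attempts.get(task_id, 0) + 1
            if solved:
                self._solves[task_id] = self._solves.get(task_id, 0) + 1

    def snapshot(self) -> dict[str, float]:
        """Public, thread-safe stats for logging; callers never touch self._lock."""
        with self._lock:
            attempts = sum(self._attempts.values())
            solves = sum(self._solves.values())
            tasks_seen = len(self._attempts)
        return {
            "tasks_seen": float(tasks_seen),
            "attempts": float(attempts),
            "solve_rate": solves / attempts if attempts else 0.0,
        }
```

The WandB logging code would then call `tracker.snapshot()` instead of reaching into `tracker._lock`.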
Updated the README to improve clarity and formatting.
This is awesome — basically a Tau2 mega-pack 🚀

@Fengzdadi Ah, thank you! Very much appreciate the patch!
Great work on this! I had a couple of quick questions, if you don’t mind:
- Fix tau2 SFT dataset defaults and document exact file selection
- Make GRPO WandB logging optional to match docs
- Clarify GPU allocation, eval temperature, and TAU2_DATA_DIR in cookbook
- Add tau1 episode logging and resilient tool parsing for offline runs

- Bind Ray dashboard to localhost by default
- Clarify public HF checkpoints in cookbook
- Note tau1 stub provider for offline debugging
@maocheng23 My apologies! I thought they were public:
Hi @jbarnes850 , I followed the instructions in your cookbook and trained the model from scratch, here's the result I got:
I'm using the
And you may find the training logs here. I have a few questions if you don't mind:
Thank you so much!
Hi @zijiexia, thanks for running this and for the detailed table + logs! Awesome work and apologies in advance for the confusion here. Quick clarifications to your questions below:
Thanks for the clarification!
I've rerun the evaluation; here are the results I got.

Settings:
--num-samples 1
--temperature 1.0
--top-p 1.0
Thanks so much for the feedback and re-run here! A couple clarifications that should make the comparisons apples-to-apples:
I just re-ran the full eval on an A100 with gpt-4.1-mini and the cookbook pass@4 settings. I get pass@1 = 0.27 and pass@4 = 0.55 on 100 tasks. W&B run: https://wandb.ai/jbarnes850-near-protocol/slime-tau2-eval/runs/2d534fuo Also addressing the other review feedback from the thread:
Happy to talk through this further if any other revisions are needed!
Thank you for the PR, but I'm afraid I need to close it, as it does not serve as a good example for slime; examples need to be simple and show specific features. This PR could be a really nice isolated repo, similar to Alibaba-NLP/qqr. We'd love to add a reference to this implementation in the readme.