Add tau2-bench training cookbook and implementation #1156

Closed

jbarnes850 wants to merge 17 commits into THUDM:main from jbarnes850:tau2-sft-rl-pipeline

Conversation

@jbarnes850 commented Dec 19, 2025

This PR adds a complete implementation and training cookbook for tau2-bench, demonstrating how to train multi-turn tool-use agents using the SFT → RFT → GRPO pipeline.

(Figure: performance chart)

Key additions

  • Complete tau2-bench training cookbook with methodology deep-dive
  • Tau2 implementation (rollout, eval, reward shaping, actions)
  • Training scripts for SFT and GRPO with shaped rewards
  • Unified eval.py supporting Pass@1 (greedy) and Pass@K (multi-sampling)
  • Performance results: 57.1% Pass@4 (4× baseline improvement)

(Figure: slime pipeline for tau2)

Resources

  • Training data: tau2-sft-seed-v3 (~3K filtered trajectories)
  • Checkpoints: Qwen3-4B-tau2-sft1, Qwen3-4B-tau2-grpo-v1
  • WandB logs: Full training metrics and sample outputs

Highlights

The cookbook explains credit assignment in multi-turn tasks, progressive training benefits, and turn-level reward shaping.
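
For intuition, here is a minimal sketch of what turn-level reward shaping combined with the final task outcome can look like. The function, weights, and per-turn scores below are illustrative only, not the PR's actual reward.py:

```python
def shaped_reward(final_success: bool, turn_scores: list[float], alpha: float = 0.4) -> float:
    """Blend the sparse end-of-episode outcome with dense per-turn credit.

    final_success: whether the episode passes the tau2-bench checks.
    turn_scores:   hypothetical per-turn partial scores in [0, 1]
                   (e.g. valid tool call, correct arguments, task progress).
    alpha:         weight on the dense term; the cookbook describes this
                   weighting as domain-adaptive (fixed here for simplicity).
    """
    sparse = 1.0 if final_success else 0.0
    dense = sum(turn_scores) / len(turn_scores) if turn_scores else 0.0
    return (1.0 - alpha) * sparse + alpha * dense

# A failed episode that still made partial progress gets nonzero credit,
# which eases credit assignment over long multi-turn rollouts.
print(shaped_reward(False, [1.0, 0.5, 0.0]))  # 0.2
```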

Copilot AI review requested due to automatic review settings December 19, 2025 19:50
jbarnes850 and others added 2 commits December 19, 2025 14:53
This PR adds a complete implementation and training cookbook for tau2-bench, demonstrating how to train multi-turn tool-use agents using the SFT → RFT → GRPO pipeline.

Key additions:
- Complete tau2-bench training cookbook with methodology deep-dive
- Tau2 implementation (rollout, eval, reward shaping, actions)
- Training scripts for SFT and GRPO with shaped rewards
- Unified eval.py supporting Pass@1 (greedy) and Pass@K (multi-sampling)
- Performance results: 57.1% Pass@4 (4× baseline improvement)

Resources:
- Training data: tau2-sft-seed-v3 (~3K filtered trajectories)
- Checkpoints: Qwen3-4B-tau2-sft1, Qwen3-4B-tau2-grpo-v1
- WandB logs: Full training metrics and sample outputs

The cookbook explains credit assignment in multi-turn tasks, progressive training benefits, and turn-level reward shaping with research citations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Standardize on Qwen3 native function calling format only
- Remove legacy action format support
- Simplify action parsing and observation formatting
- Clean up reward shaping and prompting utilities
- Reduce code complexity across tau2 modules

This commit reduces code by 1,369 lines while preserving all functionality.
Copilot AI left a comment


Pull request overview

This PR adds a comprehensive implementation and training cookbook for tau2-bench, demonstrating how to train multi-turn tool-use agents using a progressive SFT → RFT → GRPO pipeline. The implementation achieves 57.1% Pass@4 on tau2-bench (4× baseline improvement) with a 4B parameter model.

Key Changes:

  • Complete tau2-bench training pipeline with SFT, rejection sampling (RFT), and GRPO stages
  • Unified evaluation harness supporting both Pass@1 (greedy) and Pass@K (multi-sampling) metrics
  • Dense reward shaping using turn-level partial scores with domain-adaptive weighting

Reviewed changes

Copilot reviewed 16 out of 23 changed files in this pull request and generated 13 comments.

Summary per file:

| File | Description |
| --- | --- |
| examples/tau-bench/training_cookbook.md | Comprehensive training guide with methodology, performance results, and reproduction instructions |
| examples/tau-bench/tau2/tasks.py | Task preprocessing script to generate JSONL index files for training |
| examples/tau-bench/tau2/rollout.py | Custom rollout function for GRPO with tau2-bench environment integration |
| examples/tau-bench/tau2/reward.py | Reward shaping implementation with domain-adaptive alpha and curriculum learning |
| examples/tau-bench/tau2/prompting.py | Compressed system prompts for reduced KV cache pressure during RL training |
| examples/tau-bench/tau2/actions.py | Action parsing supporting both native FC and legacy formats with robust error handling |
| examples/tau-bench/tau2/env.py | Environment wrapper utilities for tau2-bench with partial score computation |
| examples/tau-bench/tau2/eval.py | Unified evaluation script supporting Pass@K sampling with WandB/Weave integration |
| examples/tau-bench/tau2/run_sft.sh | SFT training script using filtered trajectories from rejection sampling |
| examples/tau-bench/tau2/run_grpo.sh | GRPO training script with shaped rewards and curriculum learning |
| examples/tau-bench/tau2/start_user_sim_server.sh | User simulator server startup script for multi-turn RL rollouts |
| examples/tau-bench/tau2/.env.template | Environment variable template for API keys and configuration |
| examples/tau-bench/tau2/README.md | Component overview and usage instructions |
| examples/tau-bench/README.md | Updated main README with tau1/tau2 benchmark comparison |
| examples/tau-bench/.gitignore | Ignore patterns for outputs and local files |
| examples/tau-bench/tau1/* | Legacy tau1 implementation files (context) |

Comments suppressed due to low confidence (4)

  • examples/tau-bench/tau2/eval.py:438: The default user model is set to "gpt-4.1-mini" which does not exist. OpenAI's model naming convention is "gpt-4o-mini" or "gpt-4-turbo". The version "4.1" is not a valid OpenAI model identifier.
  • examples/tau-bench/tau2/env.py:192: The variable name "denom" is abbreviated and unclear. Consider using "total_weight" or "weight_sum" for better code readability.
  • examples/tau-bench/tau2/reward.py:324: The WandB logging code accesses _curriculum_tracker._lock directly (line 302), which breaks encapsulation and could lead to maintenance issues. The underscore prefix indicates this is a private attribute that shouldn't be accessed outside the class. Consider adding a public method to the _TaskCurriculumTracker class that provides the needed statistics in a thread-safe manner.
  • examples/tau-bench/tau2/reward.py:327: The 'except' clause does nothing but pass and there is no explanatory comment.
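
To illustrate the encapsulation fix suggested for reward.py:324, a minimal sketch is below. The class name comes from the review comment; the internal fields and the snapshot() method are hypothetical, not the PR's actual code:

```python
import threading


class _TaskCurriculumTracker:
    """Tracks per-task attempt/success counts for curriculum weighting."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._attempts: dict[str, int] = {}
        self._successes: dict[str, int] = {}

    def record(self, task_id: str, success: bool) -> None:
        with self._lock:
            self._attempts[task_id] = self._attempts.get(task_id, 0) + 1
            if success:
                self._successes[task_id] = self._successes.get(task_id, 0) + 1

    def snapshot(self) -> dict[str, float]:
        """Thread-safe public accessor, so logging code never touches _lock directly."""
        with self._lock:
            attempts = sum(self._attempts.values())
            solved = sum(self._successes.values())
            seen = len(self._attempts)
        return {
            "tasks_seen": float(seen),
            "attempts": float(attempts),
            "success_rate": solved / attempts if attempts else 0.0,
        }
```

WandB logging would then call tracker.snapshot() instead of acquiring the private lock itself.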


@Fengzdadi

This is awesome — basically a Tau2 mega-pack 🚀
I also added a small QoL patch in #1158 (offline stub user + tool parser fallback) to make tau-bench runnable without external API keys for debugging/logging. Might be a useful reference.

@jbarnes850
Author

> This is awesome — basically a Tau2 mega-pack 🚀 I also added a small QoL patch in #1158 (offline stub user + tool parser fallback) to make tau-bench runnable without external API keys for debugging/logging. Might be a useful reference.

@Fengzdadi Ah, thank you! Very much appreciate the patch!

@maocheng23
Contributor

Great work on this! I had a couple of quick questions, if you don’t mind:

  1. I noticed that the training WandB logs are not public yet. Would it be possible to grant access? I'd be very interested in taking a look at the training curves.
  2. Would you be open to sharing the Stage 1 SFT training dataset as well?

- Fix tau2 SFT dataset defaults and document exact file selection
- Make GRPO WandB logging optional to match docs
- Clarify GPU allocation, eval temperature, and TAU2_DATA_DIR in cookbook
- Add tau1 episode logging and resilient tool parsing for offline runs
- Bind Ray dashboard to localhost by default
- Clarify public HF checkpoints in cookbook
- Note tau1 stub provider for offline debugging

@jbarnes850
Author

> Great work on this! I had a couple of quick questions, if you don't mind:
>
>   1. I noticed that the training WandB logs are not public yet. Would it be possible to grant access? I'd be very interested in taking a look at the training curves.
>   2. Would you be open to sharing the Stage 1 SFT training dataset as well?

@maocheng23 My apologies! I thought they were public:

  • The W&B project is now public and contains the SFT + GRPO v1 runs: https://wandb.ai/jbarnes850-near-protocol/tau2-cookbook.

  • The Stage 1 SFT dataset is also public here: https://huggingface.co/datasets/Jarrodbarnes/tau2-sft-seed-v3

  • I re-ran the eval to ensure full reproducibility. Here is the exact config that produces the 57% pass@4 result (also listed in the cookbook now):

    • HF checkpoint: Jarrodbarnes/Qwen3-4B-tau2-grpo-v1
    • tau2-bench commit: 337326e62d8e0ca74c353b004a9c5d748e0ba914

    Eval command hyperparameters

    • --domains airline,retail,telecom
    • --task-split test
    • --num-samples 4
    • --temperature 0.8
    • --top-p 1.0
    • --top-k 20

    Env variables

    • TAU2_USE_COMPRESSED_PROMPTS=0
    • TAU2_USER_MODEL=gpt-4.1-mini
    • TAU2_USER_TEMPERATURE=0.7

    Policy server

    • --model-path Jarrodbarnes/Qwen3-4B-tau2-grpo-v1
    • --tp 1
    • --mem-fraction-static 0.70
    • --port 30000

@zijiexia
Contributor

Hi @jbarnes850 , I followed the instructions in your cookbook and trained the model from scratch, here's the result I got:

| Stage | Overall | Airline | Retail | Telecom |
| --- | --- | --- | --- | --- |
| Baseline (Qwen3-4B-Instruct, pass@1) | 4% | 5% | 5% | 2.5% |
| Baseline (Qwen3-4B-Instruct, pass@4) | 13% | 25% | 12.5% | 7.5% |
| SFT1 (pass@1) | 28% | 10% | 40% | 25% |
| SFT1 (pass@4) | 55% | 30% | 72.5% | 50% |
| GRPO (pass@1) | 36% | 10% | 60% | 25% |
| GRPO (pass@4) | 60% | 35% | 80% | 52.5% |

I'm using the eval.py from this PR for the evaluation, with the same args and env variables you mention in the cookbook:

  • --domains airline,retail,telecom
  • --task-split test
  • --num-samples 4
  • --temperature 0.8
  • --top-p 1.0
  • --top-k 20
  • TAU2_USE_COMPRESSED_PROMPTS=0
  • TAU2_USER_MODEL=gpt-4.1-mini
  • TAU2_USER_TEMPERATURE=0.7

And you may find the training logs here

I have a few questions if you don't mind:

  1. The SFT results differ quite a bit from what you present in the cookbook; may I ask how you evaluated the SFT and the baseline model?
  2. I'm using the tau2_sft_merged_v3_rft.jsonl dataset you shared in the cookbook for the SFT; is it the same dataset you used for SFT?
  3. Also, I noticed that the tau2-bench leaderboard uses pass^k instead of pass@k; is there any reason you chose pass@k here?

Thank you so much!

@jbarnes850
Author


Hi @zijiexia, thanks for running this and for the detailed table + logs! Awesome work and apologies in advance for the confusion here. Quick clarifications to your questions below:

  1. SFT vs SFT1 + metric:
     • In the cookbook, the “SFT” row refers to the seed SFT stage (pre‑rejection sampling) and was evaluated pass@1 (single‑sample, greedy). Your SFT numbers are higher because you trained on the rejection‑sampling merged dataset and evaluated pass@4, which corresponds to SFT1 (post‑rejection sampling) in the table. So we’re comparing different stages and different metrics.
     • If you want a direct apples‑to‑apples comparison for the “SFT” row, evaluate seed SFT with: --num-samples 1 --temperature 0.0 (greedy, pass@1).
  2. Which dataset is used for SFT?
     • tau2_sft_merged_v3_rft.jsonl is the post‑rejection‑sampling SFT1 dataset.
     • The original seed SFT uses seed_sft_v3.jsonl. So if you used the merged file, you’re effectively running SFT1 (post‑rejection sampling), not the earlier SFT baseline.
  3. pass^k vs pass@k
     • Tau2‑bench’s leaderboard uses pass^k (combinatorial estimate from multiple trials). Our eval script reports pass@k = “any success among k attempts” for simplicity in RL evaluation. If you want leaderboard‑comparable numbers, use tau2‑bench’s pass^k computation (src/tau2/metrics/agent_metrics.py, module tau2.metrics.agent_metrics) or you can run the eval within the Tau2 codebase directly (see the sketch below).
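
For reference, a minimal sketch of the two metrics. This mirrors the standard combinatorial estimators; the actual pass^k implementation in tau2.metrics.agent_metrics may differ in details:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k: probability that at least one of k sampled attempts succeeds,
    given c successes observed over n trials per task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """pass^k (tau2-bench leaderboard style): probability that all k sampled
    attempts succeed, given c successes over n trials per task."""
    return comb(c, k) / comb(n, k)


# With 4 trials and 3 successes on a task:
print(pass_at_k(4, 3, 4))   # 1.0  -> at least one of the 4 attempts passed
print(pass_hat_k(4, 3, 4))  # 0.0  -> not all 4 attempts passed
print(pass_hat_k(4, 3, 2))  # 0.5  -> chance two random attempts both pass
```

Averaging these per-task values over all tasks gives the benchmark-level numbers; pass^k is strictly harder than pass@k for the same trials.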

@zijiexia
Contributor


Thanks for the clarification!

@zijiexia
Contributor

zijiexia commented Jan 6, 2026

I've rerun the evaluation; here are the results I got:

Settings:

--num-samples 1
--temperature 1.0
--top-p 1.0

| Stage | Overall | Airline | Retail | Telecom |
| --- | --- | --- | --- | --- |
| Baseline | 4% | 5% | 5% | 2.5% |
| SFT | 26% | 10% | 50% | 10% |
| SFT + RFT | 34% | 35% | 55% | 12.5% |
| SFT + RFT (Jarrodbarnes/Qwen3-4B-tau2-sft1) | 31% | 40% | 50% | 7.5% |
| GRPO | 36% | 10% | 60% | 25% |
| GRPO (Jarrodbarnes/Qwen3-4B-tau2-grpo-v1) | 34% | 30% | 55% | 15% |

@jbarnes850
Author

Thanks so much for the feedback and re-run here! A couple clarifications that should make the comparisons apples-to-apples:

  1. For pass@1 baselines in the cookbook I used greedy decoding (--temperature 0.0) to measure deterministic capability, and for pass@4 I used --num-samples 4 --temperature 0.8 --top-p 1.0 --top-k 20 to evaluate robustness under sampling and solution diversity. Sampling at k=1 will typically lower pass@1 and add variance, which can shift the stage ordering you’re seeing.
  2. eval.py reports pass@k (any success among k), while the official tau2-bench leaderboard uses pass^k. I’ve added a note in the cookbook so it’s explicit (also happy to change it if needed).
  3. The reproduction table uses gpt-4.1-mini as the user simulator; I changed the default in eval.py to gpt-4.1-2025-04-14. If your run used a different user simulator, that can also shift the numbers.

I just re-ran the full eval on an A100 with gpt-4.1-mini and the cookbook pass@4 settings. I get pass@1 = 0.27 and pass@4 = 0.55 on 100 tasks. W&B run: https://wandb.ai/jbarnes850-near-protocol/slime-tau2-eval/runs/2d534fuo

Also addressing the other review feedback from the thread:

  • Hardened tool-call parsing: multiple <tool_call> blocks now error explicitly (prevents partial/incorrect execution).
  • Fixed reward normalization under dropped samples to preserve per-prompt grouping (no silent collapse).
  • Guarded --num-samples >= 1.
  • Clarified docs for GPT-4.1-mini credentials and pass@k vs pass^k.
  • Safer defaults: user-sim server binds to localhost by default; Ray dashboard host defaults to localhost.
  • Added minimal unit tests for multi-tool-call parsing, reward normalization mask, and eval arg guard.

Happy to talk through this further if any other revisions are needed!
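
For the hardened tool-call parsing mentioned above, a minimal sketch of the "multiple <tool_call> blocks error explicitly" behaviour. The regex, error type, and function name are illustrative, not the PR's actual actions.py:

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_single_tool_call(completion: str) -> dict:
    """Parse exactly one Qwen3-style <tool_call> block; refuse ambiguous outputs."""
    blocks = TOOL_CALL_RE.findall(completion)
    if not blocks:
        raise ValueError("no <tool_call> block found")
    if len(blocks) > 1:
        # Erroring here prevents silently executing only one of several calls.
        raise ValueError(f"expected exactly one <tool_call>, got {len(blocks)}")
    call = json.loads(blocks[0])  # e.g. {"name": "...", "arguments": {...}}
    if "name" not in call:
        raise ValueError("tool call missing 'name'")
    return call
```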

@zhuzilin
Contributor

Thank you for the PR, but I'm afraid I need to close it, as this does not serve as a good example for slime; examples need to be simple and show some specific features. This PR could be a really nice isolated repo, similar to Alibaba-NLP/qqr. We'd love to add a reference to this implementation in the README.
