Add tau2-bench training cookbook and implementation #1156

Closed

jbarnes850 wants to merge 17 commits into THUDM:main from jbarnes850:tau2-sft-rl-pipeline

Conversation

@jbarnes850 commented Dec 19, 2025

This PR adds a complete implementation and training cookbook for tau2-bench, demonstrating how to train multi-turn tool-use agents using the SFT → RFT → GRPO pipeline.

(Figure: performance chart)

Key additions

  • Complete tau2-bench training cookbook with methodology deep-dive
  • Tau2 implementation (rollout, eval, reward shaping, actions)
  • Training scripts for SFT and GRPO with shaped rewards
  • Unified eval.py supporting Pass@1 (greedy) and Pass@K (multi-sampling)
  • Performance results: 57.1% Pass@4 (4× baseline improvement)

(Figure: slime pipeline for tau2)

Resources

  • Training data: tau2-sft-seed-v3 (~3K filtered trajectories)
  • Checkpoints: Qwen3-4B-tau2-sft1, Qwen3-4B-tau2-grpo-v1
  • WandB logs: Full training metrics and sample outputs

Highlights

The cookbook explains credit assignment in multi-turn tasks, progressive training benefits, and turn-level reward shaping.
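
For intuition, here is a minimal sketch of what turn-level reward shaping combined with the final task outcome can look like. The function, weights, and per-turn scores below are illustrative only, not the PR's actual reward.py:

```python
def shaped_reward(final_success: bool, turn_scores: list[float], alpha: float = 0.4) -> float:
    """Blend the sparse end-of-episode outcome with dense per-turn credit.

    final_success: whether the episode passes the tau2-bench checks.
    turn_scores:   hypothetical per-turn partial scores in [0, 1]
                   (e.g. valid tool call, correct arguments, task progress).
    alpha:         weight on the dense term; the cookbook describes this
                   weighting as domain-adaptive (fixed here for simplicity).
    """
    sparse = 1.0 if final_success else 0.0
    dense = sum(turn_scores) / len(turn_scores) if turn_scores else 0.0
    return (1.0 - alpha) * sparse + alpha * dense

# A failed episode that still made partial progress gets nonzero credit,
# which eases credit assignment over long multi-turn rollouts.
print(shaped_reward(False, [1.0, 0.5, 0.0]))  # 0.2
```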

Copilot AI review requested due to automatic review settings December 19, 2025 19:50
jbarnes850 and others added 2 commits December 19, 2025 14:53
This PR adds a complete implementation and training cookbook for tau2-bench, demonstrating how to train multi-turn tool-use agents using the SFT → RFT → GRPO pipeline.

Key additions:
- Complete tau2-bench training cookbook with methodology deep-dive
- Tau2 implementation (rollout, eval, reward shaping, actions)
- Training scripts for SFT and GRPO with shaped rewards
- Unified eval.py supporting Pass@1 (greedy) and Pass@K (multi-sampling)
- Performance results: 57.1% Pass@4 (4× baseline improvement)

Resources:
- Training data: tau2-sft-seed-v3 (~3K filtered trajectories)
- Checkpoints: Qwen3-4B-tau2-sft1, Qwen3-4B-tau2-grpo-v1
- WandB logs: Full training metrics and sample outputs

The cookbook explains credit assignment in multi-turn tasks, progressive training benefits, and turn-level reward shaping with research citations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Standardize on Qwen3 native function calling format only
- Remove legacy action format support
- Simplify action parsing and observation formatting
- Clean up reward shaping and prompting utilities
- Reduce code complexity across tau2 modules

This commit reduces code by 1,369 lines while preserving all functionality.
Copilot AI left a comment


Pull request overview

This PR adds a comprehensive implementation and training cookbook for tau2-bench, demonstrating how to train multi-turn tool-use agents using a progressive SFT → RFT → GRPO pipeline. The implementation achieves 57.1% Pass@4 on tau2-bench (4× baseline improvement) with a 4B parameter model.

Key Changes:

  • Complete tau2-bench training pipeline with SFT, rejection sampling (RFT), and GRPO stages
  • Unified evaluation harness supporting both Pass@1 (greedy) and Pass@K (multi-sampling) metrics
  • Dense reward shaping using turn-level partial scores with domain-adaptive weighting

Reviewed changes

Copilot reviewed 16 out of 23 changed files in this pull request and generated 13 comments.

Summary per file:

| File | Description |
| --- | --- |
| examples/tau-bench/training_cookbook.md | Comprehensive training guide with methodology, performance results, and reproduction instructions |
| examples/tau-bench/tau2/tasks.py | Task preprocessing script to generate JSONL index files for training |
| examples/tau-bench/tau2/rollout.py | Custom rollout function for GRPO with tau2-bench environment integration |
| examples/tau-bench/tau2/reward.py | Reward shaping implementation with domain-adaptive alpha and curriculum learning |
| examples/tau-bench/tau2/prompting.py | Compressed system prompts for reduced KV cache pressure during RL training |
| examples/tau-bench/tau2/actions.py | Action parsing supporting both native FC and legacy formats with robust error handling |
| examples/tau-bench/tau2/env.py | Environment wrapper utilities for tau2-bench with partial score computation |
| examples/tau-bench/tau2/eval.py | Unified evaluation script supporting Pass@K sampling with WandB/Weave integration |
| examples/tau-bench/tau2/run_sft.sh | SFT training script using filtered trajectories from rejection sampling |
| examples/tau-bench/tau2/run_grpo.sh | GRPO training script with shaped rewards and curriculum learning |
| examples/tau-bench/tau2/start_user_sim_server.sh | User simulator server startup script for multi-turn RL rollouts |
| examples/tau-bench/tau2/.env.template | Environment variable template for API keys and configuration |
| examples/tau-bench/tau2/README.md | Component overview and usage instructions |
| examples/tau-bench/README.md | Updated main README with tau1/tau2 benchmark comparison |
| examples/tau-bench/.gitignore | Ignore patterns for outputs and local files |
| examples/tau-bench/tau1/* | Legacy tau1 implementation files (context) |

Comments suppressed due to low confidence (4)

  • examples/tau-bench/tau2/eval.py:438: The default user model is set to "gpt-4.1-mini" which does not exist. OpenAI's model naming convention is "gpt-4o-mini" or "gpt-4-turbo". The version "4.1" is not a valid OpenAI model identifier.
  • examples/tau-bench/tau2/env.py:192: The variable name "denom" is abbreviated and unclear. Consider using "total_weight" or "weight_sum" for better code readability.
  • examples/tau-bench/tau2/reward.py:324: The WandB logging code accesses _curriculum_tracker._lock directly (line 302), which breaks encapsulation and could lead to maintenance issues. The underscore prefix indicates this is a private attribute that shouldn't be accessed outside the class. Consider adding a public method to the _TaskCurriculumTracker class that provides the needed statistics in a thread-safe manner.
  • examples/tau-bench/tau2/reward.py:327: The 'except' clause does nothing but pass and there is no explanatory comment.
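
To illustrate the encapsulation fix suggested for reward.py:324, a minimal sketch is below. The class name comes from the review comment; the internal fields and the snapshot() method are hypothetical, not the PR's actual code:

```python
import threading


class _TaskCurriculumTracker:
    """Tracks per-task attempt/success counts for curriculum weighting."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._attempts: dict[str, int] = {}
        self._successes: dict[str, int] = {}

    def record(self, task_id: str, success: bool) -> None:
        with self._lock:
            self._attempts[task_id] = self._attempts.get(task_id, 0) + 1
            if success:
                self._successes[task_id] = self._successes.get(task_id, 0) + 1

    def snapshot(self) -> dict[str, float]:
        """Thread-safe public accessor, so logging code never touches _lock directly."""
        with self._lock:
            attempts = sum(self._attempts.values())
            solved = sum(self._successes.values())
            seen = len(self._attempts)
        return {
            "tasks_seen": float(seen),
            "attempts": float(attempts),
            "success_rate": solved / attempts if attempts else 0.0,
        }
```

WandB logging would then call tracker.snapshot() instead of acquiring the private lock itself.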


@Fengzdadi

This is awesome — basically a Tau2 mega-pack 🚀
I also added a small QoL patch in #1158 (offline stub user + tool parser fallback) to make tau-bench runnable without external API keys for debugging/logging. Might be a useful reference.

@jbarnes850
Author

> This is awesome — basically a Tau2 mega-pack 🚀 I also added a small QoL patch in #1158 (offline stub user + tool parser fallback) to make tau-bench runnable without external API keys for debugging/logging. Might be a useful reference.

@Fengzdadi Ah, thank you! Very much appreciate the patch!

@maocheng23
Contributor

Great work on this! I had a couple of quick questions, if you don’t mind:

  1. I noticed that the training WandB logs are not public yet. Would it be possible to grant access? I'd be very interested in taking a look at the training curves.
  2. Would you be open to sharing the Stage 1 SFT training dataset as well?

- Fix tau2 SFT dataset defaults and document exact file selection
- Make GRPO WandB logging optional to match docs
- Clarify GPU allocation, eval temperature, and TAU2_DATA_DIR in cookbook
- Add tau1 episode logging and resilient tool parsing for offline runs
- Bind Ray dashboard to localhost by default
- Clarify public HF checkpoints in cookbook
- Note tau1 stub provider for offline debugging

@jbarnes850
Author

> Great work on this! I had a couple of quick questions, if you don't mind:
>
>   1. I noticed that the training WandB logs are not public yet. Would it be possible to grant access? I'd be very interested in taking a look at the training curves.
>   2. Would you be open to sharing the Stage 1 SFT training dataset as well?

@maocheng23 My apologies! I thought they were public:

  • The W&B project is now public and contains the SFT + GRPO v1 runs: https://wandb.ai/jbarnes850-near-protocol/tau2-cookbook.

  • The Stage 1 SFT dataset is also public here: https://huggingface.co/datasets/Jarrodbarnes/tau2-sft-seed-v3

  • I re-ran the eval to ensure full reproducibility. Here is the exact config that produces the 57% pass@4 result (also listed in the cookbook now):

    • HF checkpoint: Jarrodbarnes/Qwen3-4B-tau2-grpo-v1
    • tau2-bench commit: 337326e62d8e0ca74c353b004a9c5d748e0ba914

    Eval command hyperparameters

    • --domains airline,retail,telecom
    • --task-split test
    • --num-samples 4
    • --temperature 0.8
    • --top-p 1.0
    • --top-k 20

    Env variables

    • TAU2_USE_COMPRESSED_PROMPTS=0
    • TAU2_USER_MODEL=gpt-4.1-mini
    • TAU2_USER_TEMPERATURE=0.7

    Policy server

    • --model-path Jarrodbarnes/Qwen3-4B-tau2-grpo-v1
    • --tp 1
    • --mem-fraction-static 0.70
    • --port 30000

@zijiexia
Contributor

Hi @jbarnes850 , I followed the instructions in your cookbook and trained the model from scratch, here's the result I got:

| Stage | Overall | Airline | Retail | Telecom |
| --- | --- | --- | --- | --- |
| Baseline (Qwen3-4B-Instruct, pass@1) | 4% | 5% | 5% | 2.5% |
| Baseline (Qwen3-4B-Instruct, pass@4) | 13% | 25% | 12.5% | 7.5% |
| SFT1 (pass@1) | 28% | 10% | 40% | 25% |
| SFT1 (pass@4) | 55% | 30% | 72.5% | 50% |
| GRPO (pass@1) | 36% | 10% | 60% | 25% |
| GRPO (pass@4) | 60% | 35% | 80% | 52.5% |

I'm using the eval.py from this PR for the evaluation, with the same args and env variables you mention in the cookbook:

  • --domains airline,retail,telecom
  • --task-split test
  • --num-samples 4
  • --temperature 0.8
  • --top-p 1.0
  • --top-k 20
  • TAU2_USE_COMPRESSED_PROMPTS=0
  • TAU2_USER_MODEL=gpt-4.1-mini
  • TAU2_USER_TEMPERATURE=0.7

And you may find the training logs here

I have a few questions if you don't mind:

  1. The SFT results differ quite a bit from what you present in the cookbook; may I ask how you evaluated the SFT and the baseline model?
  2. I'm using the tau2_sft_merged_v3_rft.jsonl dataset you shared in the cookbook for the SFT; is it the same dataset you used for SFT?
  3. Also, I noticed that the tau2-bench leaderboard uses pass^k instead of pass@k; is there any reason you chose pass@k here?

Thank you so much!

@jbarnes850
Author


Hi @zijiexia, thanks for running this and for the detailed table + logs! Awesome work and apologies in advance for the confusion here. Quick clarifications to your questions below:

  1. SFT vs SFT1 + metric:
     • In the cookbook, the “SFT” row refers to the seed SFT stage (pre‑rejection sampling) and was evaluated pass@1 (single‑sample, greedy). Your SFT numbers are higher because you trained on the rejection‑sampling merged dataset and evaluated pass@4, which corresponds to SFT1 (post‑rejection sampling) in the table. So we’re comparing different stages and different metrics.
     • If you want a direct apples‑to‑apples comparison for the “SFT” row, evaluate seed SFT with: --num-samples 1 --temperature 0.0 (greedy, pass@1).
  2. Which dataset is used for SFT?
     • tau2_sft_merged_v3_rft.jsonl is the post‑rejection‑sampling SFT1 dataset.
     • The original seed SFT uses seed_sft_v3.jsonl. So if you used the merged file, you’re effectively running SFT1 (post‑rejection sampling), not the earlier SFT baseline.
  3. pass^k vs pass@k
     • Tau2‑bench’s leaderboard uses pass^k (combinatorial estimate from multiple trials). Our eval script reports pass@k = “any success among k attempts” for simplicity in RL evaluation. If you want leaderboard‑comparable numbers, use tau2‑bench’s pass^k computation (src/tau2/metrics/agent_metrics.py, module tau2.metrics.agent_metrics) or you can run the eval within the Tau2 codebase directly (see the sketch below).
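
For reference, a minimal sketch of the two metrics. This mirrors the standard combinatorial estimators; the actual pass^k implementation in tau2.metrics.agent_metrics may differ in details:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k: probability that at least one of k sampled attempts succeeds,
    given c successes observed over n trials per task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """pass^k (tau2-bench leaderboard style): probability that all k sampled
    attempts succeed, given c successes over n trials per task."""
    return comb(c, k) / comb(n, k)


# With 4 trials and 3 successes on a task:
print(pass_at_k(4, 3, 4))   # 1.0  -> at least one of the 4 attempts passed
print(pass_hat_k(4, 3, 4))  # 0.0  -> not all 4 attempts passed
print(pass_hat_k(4, 3, 2))  # 0.5  -> chance two random attempts both pass
```

Averaging these per-task values over all tasks gives the benchmark-level numbers; pass^k is strictly harder than pass@k for the same trials.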

@zijiexia
Contributor


Thanks for the clarification!

@zijiexia
Contributor

zijiexia commented Jan 6, 2026

I've rerun the evaluation; here are the results I got:

Settings:

--num-samples 1
--temperature 1.0
--top-p 1.0

| Stage | Overall | Airline | Retail | Telecom |
| --- | --- | --- | --- | --- |
| Baseline | 4% | 5% | 5% | 2.5% |
| SFT | 26% | 10% | 50% | 10% |
| SFT + RFT | 34% | 35% | 55% | 12.5% |
| SFT + RFT (Jarrodbarnes/Qwen3-4B-tau2-sft1) | 31% | 40% | 50% | 7.5% |
| GRPO | 36% | 10% | 60% | 25% |
| GRPO (Jarrodbarnes/Qwen3-4B-tau2-grpo-v1) | 34% | 30% | 55% | 15% |

@jbarnes850
Author

Thanks so much for the feedback and re-run here! A couple clarifications that should make the comparisons apples-to-apples:

  1. For pass@1 baselines in the cookbook I used greedy decoding (--temperature 0.0) to measure deterministic capability, and for pass@4 I used --num-samples 4 --temperature 0.8 --top-p 1.0 --top-k 20 to evaluate robustness under sampling and solution diversity. Sampling at k=1 will typically lower pass@1 and add variance, which can shift the stage ordering you’re seeing.
  2. eval.py reports pass@k (any success among k), while the official tau2-bench leaderboard uses pass^k. I’ve added a note in the cookbook so it’s explicit (also happy to change it if needed).
  3. The reproduction table uses gpt-4.1-mini as the user simulator; I changed the default in eval.py to gpt-4.1-2025-04-14. If your run used a different user simulator, that can also shift the numbers.

I just re-ran the full eval on an A100 with gpt-4.1-mini and the cookbook pass@4 settings. I get pass@1 = 0.27 and pass@4 = 0.55 on 100 tasks. W&B run: https://wandb.ai/jbarnes850-near-protocol/slime-tau2-eval/runs/2d534fuo

Also addressing the other review feedback from the thread:

  • Hardened tool-call parsing: multiple <tool_call> blocks now error explicitly (prevents partial/incorrect execution).
  • Fixed reward normalization under dropped samples to preserve per-prompt grouping (no silent collapse).
  • Guarded --num-samples >= 1.
  • Clarified docs for GPT-4.1-mini credentials and pass@k vs pass^k.
  • Safer defaults: user-sim server binds to localhost by default; Ray dashboard host defaults to localhost.
  • Added minimal unit tests for multi-tool-call parsing, reward normalization mask, and eval arg guard.

Happy to talk through this further if any other revisions are needed!
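
For the hardened tool-call parsing mentioned above, a minimal sketch of the "multiple <tool_call> blocks error explicitly" behaviour. The regex, error type, and function name are illustrative, not the PR's actual actions.py:

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_single_tool_call(completion: str) -> dict:
    """Parse exactly one Qwen3-style <tool_call> block; refuse ambiguous outputs."""
    blocks = TOOL_CALL_RE.findall(completion)
    if not blocks:
        raise ValueError("no <tool_call> block found")
    if len(blocks) > 1:
        # Erroring here prevents silently executing only one of several calls.
        raise ValueError(f"expected exactly one <tool_call>, got {len(blocks)}")
    call = json.loads(blocks[0])  # e.g. {"name": "...", "arguments": {...}}
    if "name" not in call:
        raise ValueError("tool call missing 'name'")
    return call
```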

@zhuzilin
Contributor

Thank you for the PR, but I'm afraid I need to close it, as this does not serve as a good example for slime; examples need to be simple and show some specific features. This PR could be a really nice isolated repo, similar to Alibaba-NLP/qqr. We'd love to add a reference to this implementation in the README.
