Integrate Terminal Bench Evaluation #1154
Conversation
Add a model download command and checkpoint conversion step here.
delegate:
  - name: terminal_bench
    # type: examples.eval.terminal_bench.tb_config.build_terminal_bench_config
    url: http://172.17.0.1:9052
Add a comment noting that this port should match the TB server port on the host machine.
timeout_secs: 86400  # 24 hours
max_retries: 1       # HTTP request retries from Slime to the TB server
model_name: qwen3-8b
api_base: http://127.0.1.1:30005/v1
Add a comment noting that this port should match the sglang router port.
max_retries: 1  # HTTP request retries from Slime to the TB server
model_name: qwen3-8b
api_base: http://127.0.1.1:30005/v1
dataset_path: /mnt/data/xinyu/program/slime-tb/terminal-bench/tasks
Add a comment: this is the dataset path on the host machine.
Added this in the quick-start README.
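Putting the review notes above together, the delegate config would carry all three clarifications as inline comments. This is a hedged sketch only: the field names come from the PR diff snippets, while the placeholder paths/ports are illustrative, not the actual values from the repository:

```yaml
# Sketch of the Terminal Bench delegate config discussed above (illustrative).
delegate:
  - name: terminal_bench
    # This port must match the TB server running on the host machine.
    url: http://172.17.0.1:9052
    timeout_secs: 86400  # 24 hours
    max_retries: 1       # HTTP request retries from Slime to the TB server
    model_name: qwen3-8b
    # This port must match the sglang router port.
    api_base: http://127.0.1.1:30005/v1
    # Dataset path on the host machine.
    dataset_path: /path/to/terminal-bench/tasks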
ray start --head --node-ip-address ${MASTER_ADDR} --port 6380 --num-gpus 2 \
  --disable-usage-stats \
  --dashboard-host=0.0.0.0 \
  --dashboard-port=8266 \
Add a comment here about potential port conflicts.
Added this in the quick-start README.
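One way to sidestep the port conflicts mentioned above is to probe for a free port before launching Ray. This is a hypothetical sketch, not part of the PR; the `ss` probe and the starting port `8266` are assumptions:

```shell
# Pick a free dashboard port before launching Ray (illustrative only).
port=8266
# Walk upward until the port is not in a listening state.
while ss -ltn 2>/dev/null | grep -q ":${port} "; do
  port=$((port + 1))
done
echo "using dashboard port ${port}"
# Then pass it along, e.g.:
# ray start --head ... --dashboard-port="${port}" ...
```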
@@ -0,0 +1,12 @@
# Minimal Terminal Bench delegate config for running on the host (no containers).
Do we need to keep this?
--ulimit stack=67108864 \
--ulimit nofile=65536:65536 \
-v ~/.cache:/root/.cache \
-v $(pwd)/slime:/opt/slime \
There is an error when mounting the /opt folder in the slime Docker image; change it to another path such as /shared.
Switched the mount to /shared to avoid /opt issues. Thanks for pointing this out.
@classmethod
def parse(cls, args, raw_env_config: Mapping[str, Any], defaults: Mapping[str, Any]) -> TerminalBenchConfig:
Is there a better way to implement this?
Thanks for the suggestion. I refactored the implementation to reduce repetition by iterating over a field-to-cast mapping. Please let me know if this looks reasonable.
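For readers curious what such a field-to-cast mapping loop might look like, here is a minimal, hypothetical sketch. The field names, types, and defaults are illustrative only and do not reflect the actual `TerminalBenchConfig` in the PR:

```python
from dataclasses import dataclass
from typing import Any, Mapping

# Hypothetical mapping of config field name -> casting callable.
# Adding a field here is all that's needed to parse it; no per-field code.
_FIELD_CASTS = {"url": str, "timeout_secs": int, "max_retries": int}


@dataclass
class TerminalBenchConfig:
    url: str = ""
    timeout_secs: int = 86400
    max_retries: int = 1

    @classmethod
    def parse(
        cls,
        raw_env_config: Mapping[str, Any],
        defaults: Mapping[str, Any],
    ) -> "TerminalBenchConfig":
        # For each known field, prefer the raw env value, fall back to the
        # supplied defaults, and cast to the declared type in one pass.
        kwargs: dict[str, Any] = {}
        for name, cast in _FIELD_CASTS.items():
            if name in raw_env_config:
                kwargs[name] = cast(raw_env_config[name])
            elif name in defaults:
                kwargs[name] = cast(defaults[name])
        return cls(**kwargs)
```

Fields absent from both mappings simply keep their dataclass defaults, which keeps the parse logic in one loop instead of one branch per field.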
LGTM. Good job.
@zhuzilin Hi Zilin, I think this PR generally looks good, with minimal invasiveness, and we've tested its functionality on different machines. Do you have other suggestions?
- Integrates **Terminal Bench** as an eval delegate for **Slime**, enabling evaluation via an external TB server.
- Adds a minimal **smoke eval config** and an example **Qwen3-8B** launch script for quick end-to-end testing.
- Provides client/server support for submitting eval jobs, polling status, and collecting metrics from Terminal Bench.

Co-authored-by: Zhiyao Jiang <jessicajiang324@gmail.com>
Co-authored-by: Xinyu Jiang <xinyuj2@andrew.cmu.edu>
Force-pushed from 1fd519d to 98facc6.
I'm not sure this is a suitable PR for slime, because it seems mainly an introduction to using Terminal Bench for evaluation and does not show any special capability of slime. The goal of slime is not to support the evaluation of all mainstream benchmarks or to recommend a certain evaluation pipeline. I'll close this for the same reason as #1025.
📝 PR Description: Integrate Terminal Bench into Slime
📝 Summary
This PR fully integrates Terminal Bench (TB) into the Slime framework, enabling end-to-end agent evaluation capabilities within the system.
All code lives under `examples/eval/terminal_bench` and hooks into `eval_delegate`, ensuring metrics are correctly parsed and reported to W&B.

✅ Checklist

- `tb_server.py` implemented (host-side).
- `tb_client.py` implemented (container-side).
- `eval_delegate.py`.

⏳ To-do
- The current integration targets TB v1.0 via the `tb run` CLI; the workflow will be extended to support TB v2.0 based on `harbor run`.
- The TB server currently hard-codes the `terminus-2` agent; agent selection will be made configurable to support additional agents.
- The server currently uses the default `terminal-bench-core` dataset; a `-d / --dataset` argument will be added to enable evaluation on other registered datasets.
- End-to-end validation has been performed on Qwen3-8B and Qwen3-32B; evaluations will be extended to additional models.
🤝 Collaborators