
Integrate Terminal Bench Evaluation #1154

Closed
XinyuJiangCMU wants to merge 6 commits into THUDM:main from XinyuJiangCMU:feat/tb-eval-integration

Conversation


@XinyuJiangCMU XinyuJiangCMU commented Dec 19, 2025

📝 PR Description: Integrate Terminal Bench into Slime

📝 Summary

This PR fully integrates Terminal Bench (TB) into the Slime framework, enabling end-to-end agent evaluation capabilities within the system.

  • Structure: Implemented the TB eval server/client and configuration templates under examples/eval/terminal_bench (a sketch of the client-side submit/poll flow follows this list).
  • Pipeline: Successfully hooked TB into eval_delegate, ensuring metrics are correctly parsed and reported to W&B.
  • Docs: Provided comprehensive English and Chinese quickstart guides and example configuration files.
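
As a rough illustration of how these pieces fit together, the sketch below shows a client-side flow for submitting an eval job to the TB server and polling for metrics. The endpoint paths and payload keys are hypothetical assumptions; the actual routes live in examples/eval/terminal_bench/tb_server.py and tb_client.py.

```python
# Hedged sketch of the submit/poll/collect flow. The /jobs endpoints and the
# payload keys below are illustrative assumptions, not the real TB server API.
import time
import requests

TB_SERVER_URL = "http://172.17.0.1:9052"  # must match the host-side TB server port

def run_eval(model_name: str, api_base: str, poll_interval: float = 30.0) -> dict:
    # Submit the eval job (hypothetical POST /jobs route).
    resp = requests.post(
        f"{TB_SERVER_URL}/jobs",
        json={"model_name": model_name, "api_base": api_base},
        timeout=60,
    )
    resp.raise_for_status()
    job_id = resp.json()["job_id"]

    # Poll until the job finishes, then return whatever metrics it reports.
    while True:
        status = requests.get(f"{TB_SERVER_URL}/jobs/{job_id}", timeout=60).json()
        if status.get("state") in ("finished", "failed"):
            return status.get("metrics", {})
        time.sleep(poll_interval)
```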

✅ Checklist

  • Server: tb_server.py implemented (Host-side).
  • Client: tb_client.py implemented (Container-side).
  • Networking: Verified Slime-to-Host network connectivity.
  • Integration: Logic wired into eval_delegate.py.
  • Documentation: Updated guide to reflect the Container-Host architecture.
  • Test: End-to-end tests passed.

⏳ To-do

  1. Support Terminal-Bench 2.0
    The current integration targets TB v1.0 via the tb run CLI; the workflow will be extended to support TB v2.0 based on harbor run.
  2. Support configurable agents
    The TB server currently hard-codes the terminus-2 agent; agent selection will be made configurable to support additional agents.
  3. Add dataset selection support to TB server
    The server currently uses the default terminal-bench-core dataset; a -d / --dataset argument will be added to enable evaluation on other registered datasets (see the sketch after this list).
  4. Expand evaluation coverage to more models
    End-to-end validation has been performed on Qwen3-8B and Qwen3-32B; evaluations will be extended to additional models.
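
For item 3, a minimal sketch of how the planned -d / --dataset flag could be wired into tb_server.py is shown below. The surrounding server code is not part of this PR excerpt, so the parser wiring here is an assumption; only the default dataset name comes from the current integration.

```python
# Hypothetical sketch of the planned -d/--dataset flag for the TB server.
import argparse

def build_arg_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Terminal Bench eval server")
    parser.add_argument(
        "-d", "--dataset",
        default="terminal-bench-core",  # current hard-coded default
        help="Registered Terminal Bench dataset to evaluate against",
    )
    return parser

if __name__ == "__main__":
    args = build_arg_parser().parse_args()
    print(f"Selected dataset: {args.dataset}")
```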

🤝 Collaborators

@XinyuJiangCMU XinyuJiangCMU changed the title from "Add TerminalBench eval delegate + quickstart" to "[WIP] Add TerminalBench eval delegate + quickstart" on Dec 19, 2025
Contributor

Add a model download command and ckpt conversion here.


Done.

delegate:
- name: terminal_bench
# type: examples.eval.terminal_bench.tb_config.build_terminal_bench_config
url: http://172.17.0.1:9052
Contributor

Add a comment that this port should match the TB server port on the host machine.


Done.
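
For reference, a quick way to confirm that the container can actually reach the host-side TB server on this port (172.17.0.1:9052 in the config above) is a plain TCP check; this only verifies reachability and does not assume anything about the server's HTTP routes.

```python
# Minimal connectivity check from the Slime container to the host-side TB server.
import socket

def tb_server_reachable(host: str = "172.17.0.1", port: int = 9052, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("TB server reachable:", tb_server_reachable())
```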

timeout_secs: 86400 # 24 hours
max_retries: 1 # HTTP request retries from Slime to the TB server
model_name: qwen3-8b
api_base: http://127.0.1.1:30005/v1
Contributor

Add a comment that this port should match the sglang router port.


Done.

max_retries: 1 # HTTP request retries from Slime to the TB server
model_name: qwen3-8b
api_base: http://127.0.1.1:30005/v1
dataset_path: /mnt/data/xinyu/program/slime-tb/terminal-bench/tasks
Contributor
@guapisolo guapisolo Dec 30, 2025

Comment: This is the dataset path on the host machine.


Added this in the quick-start README.

ray start --head --node-ip-address ${MASTER_ADDR} --port 6380 --num-gpus 2 \
--disable-usage-stats \
--dashboard-host=0.0.0.0 \
--dashboard-port=8266 \
Contributor

Add a comment here about potential port conflicts.


Added this in the quick-start README.

@@ -0,0 +1,12 @@
# Minimal Terminal Bench delegate config for running on the host (no containers).
Contributor

Do we need to keep this?


Not used anywhere, removed.

--ulimit stack=67108864 \
--ulimit nofile=65536:65536 \
-v ~/.cache:/root/.cache \
-v $(pwd)/slime:/opt/slime \
Contributor

There is an error when mounting the /opt folder in the slime Docker container; change it to another path like /shared.

Author
@XinyuJiangCMU XinyuJiangCMU Jan 1, 2026

Switched the mount to /shared to avoid /opt issues. Thanks for pointing this out.



@classmethod
def parse(cls, args, raw_env_config: Mapping[str, Any], defaults: Mapping[str, Any]) -> TerminalBenchConfig:
Contributor

Is there a better way to implement this?


Thanks for the suggestion. I refactored the implementation to reduce repetition by using a field-to-caster mapping and a loop (sketched below). Please let me know if this looks reasonable.
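
For context, the field-to-caster mapping pattern described above generally looks like the sketch below; the field names and casters are illustrative (taken from the example config keys in this PR), not the exact TerminalBenchConfig implementation.

```python
# Illustrative sketch: parse config fields via a (field -> caster) mapping,
# so parsing becomes one loop instead of one hand-written assignment per field.
from typing import Any, Callable, Mapping

_FIELD_CASTERS: dict[str, Callable[[Any], Any]] = {
    "url": str,
    "timeout_secs": int,
    "max_retries": int,
    "model_name": str,
    "api_base": str,
}

def parse_fields(raw_env_config: Mapping[str, Any], defaults: Mapping[str, Any]) -> dict[str, Any]:
    parsed: dict[str, Any] = {}
    for field, caster in _FIELD_CASTERS.items():
        value = raw_env_config.get(field, defaults.get(field))
        parsed[field] = caster(value) if value is not None else None
    return parsed
```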

@XinyuJiangCMU XinyuJiangCMU changed the title from "[WIP] Add TerminalBench eval delegate + quickstart" to "Integrate Terminal Bench Evaluation" on Jan 1, 2026
@XinyuJiangCMU XinyuJiangCMU marked this pull request as ready for review January 3, 2026 03:07
@guapisolo
Contributor

LGTM. Good job.

@guapisolo
Contributor

@zhuzilin Hi Zilin, I think this PR generally looks good with minimal invasiveness, and we've tested its functionality on different machines. Do you have other suggestions?

Xinyu Jiang and others added 6 commits January 14, 2026 01:17
- Integrates **Terminal Bench** as an eval delegate for **Slime**, enabling evaluation via an external TB server.
- Adds a minimal **smoke eval config** and an example **Qwen3-8B** launch script for quick end-to-end testing.
- Provides client/server support for submitting eval jobs, polling status, and collecting metrics from Terminal Bench.

Co-authored-by: Zhiyao Jiang <jessicajiang324@gmail.com>
Co-authored-by: Xinyu Jiang <xinyuj2@andrew.cmu.edu>
@JessicaJiang-123 JessicaJiang-123 force-pushed the feat/tb-eval-integration branch from 1fd519d to 98facc6 on January 14, 2026 01:26

Contributor

zhuzilin commented Jan 16, 2026

I'm not sure this is a suitable PR for slime. It seems to be mainly an introduction to using Terminal Bench for evaluation, and it does not showcase any capability specific to slime.

The goal of slime is not to support evaluation on every mainstream benchmark or to recommend a particular evaluation pipeline.

I'll close this for the same reason as #1025.

@zhuzilin zhuzilin closed this Jan 16, 2026
