chore: benchmark routing + tool-call validation #98

Open
saschabuehrle wants to merge 10 commits into main from feat/benchmark-truthfulqa
@saschabuehrle
Collaborator

Summary

  • unify benchmark model overrides + provider/cost resolution
  • enrich benchmark routing metrics + add real-world tool-call benchmark
  • fix full benchmark dataset handling and baseline cost calculations
  • add MMLU benchmark concurrency and fix BFCL parallel tool-call evaluation
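
For reviewers less familiar with parallel tool-call scoring, here is a minimal sketch of one common approach: comparing the predicted and expected calls as multisets, so order does not matter but duplicates do. This is an illustration only, not the code in this PR; the function names and the `name`/`arguments` call shape are assumptions.

```python
import json
from collections import Counter


def _canonical(call: dict) -> tuple[str, str]:
    """Reduce one tool call to a hashable (name, sorted-JSON-arguments) pair."""
    # Assumed call shape: {"name": str, "arguments": dict}; hypothetical, not from the PR.
    return call["name"], json.dumps(call.get("arguments", {}), sort_keys=True)


def parallel_calls_match(predicted: list[dict], expected: list[dict]) -> bool:
    """Return True if both lists hold the same calls, ignoring order but not duplicates."""
    return Counter(map(_canonical, predicted)) == Counter(map(_canonical, expected))


expected = [
    {"name": "get_weather", "arguments": {"city": "Berlin", "unit": "C"}},
    {"name": "get_time", "arguments": {"tz": "Europe/Berlin"}},
]
# Same calls, emitted in a different order and with argument keys reordered.
predicted = [
    {"name": "get_time", "arguments": {"tz": "Europe/Berlin"}},
    {"name": "get_weather", "arguments": {"unit": "C", "city": "Berlin"}},
]

print(parallel_calls_match(predicted, expected))  # True
```

A multiset comparison is stricter than a set comparison: a model that repeats a call, or drops one of two identical calls, is still marked as a mismatch.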

Testing

  • python3 -m pytest
  • benchmarks: longbench_full, gsm8k_full, mmlu_full, mtbench_full, bfcl_full, ruler_full, truthfulqa, tool_calls(+agentic/realworld), bfcl agentic, basic_usage (py/ts)

@github-actions github-actions bot added the documentation, lang: python, tests, core, and size/xl labels on Feb 6, 2026