chore: benchmark routing + tool-call validation by saschabuehrle · Pull Request #98 · lemony-ai/cascadeflow

saschabuehrle · 2026-02-06T09:21:31Z

Summary

python3 -m pytest
benchmarks: longbench_full, gsm8k_full, mmlu_full, mtbench_full, bfcl_full, ruler_full, truthfulqa, tool_calls (plus agentic and realworld variants), bfcl agentic, basic_usage (py/ts)

saschabuehrle added 9 commits February 4, 2026 19:04

fix: improve direct routing cost tracking

a59d730

chore: enrich benchmark routing metrics

e938a6c

feat: add real-world tool calls benchmark

19be317

fix: download full HumanEval dataset

b1c1691

chore: unify benchmark model overrides

46af724

chore: show resolved models in gsm8k full benchmark

849bae5

fix: correct baseline cost estimation in full benchmarks

fd58a41

chore: add concurrency to mmlu full benchmark

93c3758

fix: handle parallel tool calls in bfcl benchmark

1abc780

github-actions bot added the documentation, lang: python, tests, core, and size/xl labels on Feb 6, 2026

style(benchmarks): apply black formatting for truthfulqa PR

8a96eb9