feat: Add extract_eval_stats.py script for eval statistics extraction #332
Conversation
Extract detailed statistics from openbench/Inspect AI evaluation logs (.eval ZIP or .json files).
- Total time and working time per sample
- LLM HTTP round-trip time (from model call events)
- Provider-specific timing for Groq (queue_time, prompt_time, completion_time)
- Approximate tool time (working_time - llm_http_time)
- LLM time fraction of total working time
- Input, output, and total tokens per sample and aggregate
- Per-model breakdown from model_usage
- OpenAI: prompt_tokens and cached_tokens with cache hit rate
- Anthropic: cache_read_input_tokens and cache_creation_input_tokens
- Human-readable summary to stdout
- JSON output with --output flag
- Per-sample statistics with --per-sample flag
Usage:
```bash
python scripts/extract_eval_stats.py logs/my_eval.eval
python scripts/extract_eval_stats.py logs/ -o stats.json --per-sample -v
```
Example output:
```
Model: groq/openai/gpt-oss-120b
Task: progressivemcpbench-copilot
Samples: 130
Score: mean=0.523 (min=0.000, max=1.000)
CALLS:
Model calls: 794
Tool calls: 590
TIME BREAKDOWN:
Total eval time: 1h 46m
LLM time: 7m 49s (7.3%)
Tool time (approx): 1h 38m
LLM CALL PERFORMANCE (avg per call):
HTTP round-trip: 623ms
Time to first token (queue): 12ms
Prompt processing: 55ms
Completion generation: 307ms
TOKENS:
Input: 5.41M
Output: 124.4K
Total: 5.54M
THROUGHPUT:
Input processing: 74.2K tok/s
Output generation: 475 tok/s
CACHING (OpenAI-compatible):
Prompt tokens: 5.41M
Cached tokens: 3.94M
Cache hit rate: 72.8%
```
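The derived lines in this summary reduce to a few simple ratios. Below is a minimal sketch of that arithmetic only; the variable names are illustrative and not the script's actual internals.

```python
# Illustrative arithmetic behind the summary lines above; inputs are assumed
# to be pre-aggregated from the eval log (names are hypothetical).
def derived_stats(
    working_time_s: float,     # total working time across samples (seconds)
    llm_http_time_s: float,    # summed HTTP round-trip time of model calls
    input_tokens: int,
    output_tokens: int,
    prompt_time_s: float,      # provider-reported prompt processing time (Groq)
    completion_time_s: float,  # provider-reported completion time (Groq)
    prompt_tokens: int,
    cached_tokens: int,
) -> dict:
    return {
        # Tool time approximated as everything not spent inside LLM HTTP calls.
        "tool_time_approx_s": working_time_s - llm_http_time_s,
        "llm_time_fraction": llm_http_time_s / working_time_s if working_time_s else None,
        # Throughput from provider-reported timing, where available.
        "input_tok_per_s": input_tokens / prompt_time_s if prompt_time_s else None,
        "output_tok_per_s": output_tokens / completion_time_s if completion_time_s else None,
        # Cache hit rate: share of prompt tokens served from cache.
        "cache_hit_rate": cached_tokens / prompt_tokens if prompt_tokens else None,
    }
```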
- Excludes scorer model calls (only counts primary model usage)
- All data is extracted from existing eval logs - no runtime changes required
- Cache statistics depend on provider support (OpenAI, Anthropic formats)
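Because a .eval log is a ZIP of JSON documents, the extraction can stay entirely post-hoc. Here is a minimal reading sketch, assuming the archive holds a header.json plus per-sample JSON files under samples/; those member names are my assumption about Inspect's layout, not something this PR confirms.

```python
import json
import zipfile
from pathlib import Path


def load_eval_log(path: Path) -> tuple[dict, list[dict]]:
    """Read an Inspect .eval archive (a ZIP of JSON members) without re-running the eval."""
    with zipfile.ZipFile(path) as zf:
        # Assumed layout: a header document plus one JSON file per sample.
        header = json.loads(zf.read("header.json"))
        samples = [
            json.loads(zf.read(name))
            for name in zf.namelist()
            if name.startswith("samples/") and name.endswith(".json")
        ]
    return header, samples
```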
Amp-Thread-ID: https://ampcode.com/threads/T-2f80b096-55b1-4353-930c-76e9f6368f71
Co-authored-by: Amp <[email protected]>
Co-authored-by: gmaddern <[email protected]>
@cursor please fix both suggestions.
Provider timing averages now weight every individual call rather than averaging per-sample means:

```python
all_http_times.extend(s.llm_http_times)
if s.n_model_calls > 0:
    total_queue_time_sum += s.llm_queue_time_total
    total_prompt_time_sum += s.llm_prompt_time_total
    total_completion_time_sum += s.llm_completion_time_total
    total_provider_time_sum += s.llm_provider_time_total
    total_provider_time_calls += s.n_model_calls
...
"llm_queue_time_mean": (
    total_queue_time_sum / total_provider_time_calls
    if total_provider_time_calls > 0
    else None
),
```

The P95 calculation now uses an interpolated percentile index:

```python
total_time_p95 = None
if len(total_times) >= 20:
    sorted_times = sorted(total_times)
    percentile_position = 0.95 * (len(sorted_times) - 1)
    lower_index = int(percentile_position)
    upper_index = min(lower_index + 1, len(sorted_times) - 1)
    fraction = percentile_position - lower_index
    total_time_p95 = (
        sorted_times[lower_index] * (1 - fraction)
        + sorted_times[upper_index] * fraction
    )
...
"total_time_p95": total_time_p95,
```

Let me know if you’d like this scripted into tests or covered elsewhere.
Co-authored-by: gmaddern <[email protected]>
Summary
I found myself wanting to understand the performance and caching behavior of a benchmark run a bit better. So I had this script made. Right now it's just a standalone script, but it could be integrated into the main app if people found it useful.
What are you adding?
Changes Made
- New scripts/extract_eval_stats.py standalone script that pulls info from the log file.

Testing
- Tests pass (pytest)
- Pre-commit hooks pass (pre-commit run --all-files)
Note
Introduce a standalone script to parse .eval/.json logs and output aggregated/per-sample eval timing, token, throughput, and cache statistics.
New script (scripts/extract_eval_stats.py):
- Parses .eval (ZIP) and .json logs; auto-discovers files in a path.
- Reports provider timing (queue_time, prompt_time, completion_time) and throughput (input/output tok/s).
- Reports caching statistics for OpenAI (prompt_tokens, cached_tokens, hit rate) and Anthropic (cache_read_input_tokens, cache_creation_input_tokens).
- JSON output via --output, with --per-sample support and a verbose mode.

Written by Cursor Bugbot for commit ce075e0. This will update automatically on new commits.
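The "auto-discovers files in a path" step could plausibly be as simple as globbing for the two supported extensions; a sketch under that assumption (the script's actual logic may differ):

```python
from pathlib import Path


def discover_logs(path: Path) -> list[Path]:
    """Return eval log files under path, or the file itself if a file was given."""
    if path.is_file():
        return [path]
    # Assumed behaviour: pick up both supported log formats recursively.
    return sorted(p for p in path.rglob("*") if p.suffix in {".eval", ".json"})
```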