
Conversation

@geelen
Contributor

@geelen geelen commented Dec 1, 2025

Summary

I found myself wanting to understand the performance and caching behavior of a benchmark run a bit better, so I had this script made. Right now it's just a standalone script, but it could be integrated into the main app if people find it useful.

What are you adding?

  • Bug fix (non-breaking change which fixes an issue)
  • New benchmark/evaluation
  • New model provider
  • CLI enhancement
  • Performance improvement
  • Documentation update
  • API/SDK feature
  • Integration (CI/CD, tools)
  • Export/import functionality
  • Code refactoring
  • Breaking change
  • Other

Changes Made

New standalone script, scripts/extract_eval_stats.py, that pulls statistics from eval log files.

Testing

  • I have run the existing test suite (pytest)
  • I have added tests for my changes
  • I have tested with multiple model providers (if applicable)
  • I have run pre-commit hooks (pre-commit run --all-files)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (if applicable)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Extract detailed statistics from openbench/Inspect AI evaluation logs (.eval ZIP or .json files).

  • Total time and working time per sample

  • LLM HTTP round-trip time (from model call events)

  • Provider-specific timing for Groq (queue_time, prompt_time, completion_time)

  • Approximate tool time (working_time - llm_http_time)

  • LLM time fraction of total working time

  • Input, output, and total tokens per sample and aggregate

  • Per-model breakdown from model_usage

  • OpenAI: prompt_tokens and cached_tokens with cache hit rate

  • Anthropic: cache_read_input_tokens and cache_creation_input_tokens

  • Human-readable summary to stdout

  • JSON output with --output flag

  • Per-sample statistics with --per-sample flag

Usage:

```bash
python scripts/extract_eval_stats.py logs/my_eval.eval

python scripts/extract_eval_stats.py logs/ -o stats.json --per-sample -v
```

Example output:

```
  Model: groq/openai/gpt-oss-120b
  Task: progressivemcpbench-copilot

  Samples: 130
  Score: mean=0.523 (min=0.000, max=1.000)

  CALLS:
    Model calls: 794
    Tool calls: 590

  TIME BREAKDOWN:
    Total eval time: 1h 46m
    LLM time: 7m 49s (7.3%)
    Tool time (approx): 1h 38m

  LLM CALL PERFORMANCE (avg per call):
    HTTP round-trip: 623ms
    Time to first token (queue): 12ms
    Prompt processing: 55ms
    Completion generation: 307ms

  TOKENS:
    Input: 5.41M
    Output: 124.4K
    Total: 5.54M

  THROUGHPUT:
    Input processing: 74.2K tok/s
    Output generation: 475 tok/s

  CACHING (OpenAI-compatible):
    Prompt tokens: 5.41M
    Cached tokens: 3.94M
    Cache hit rate: 72.8%
```

Notes:

  • Excludes scorer model calls (only counts primary model usage)
  • All data is extracted from existing eval logs - no runtime changes required
  • Cache statistics depend on provider support (OpenAI, Anthropic formats)
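
To make the derived numbers concrete, here is a small illustrative sketch (not code from the PR). It assumes only what the description above states: an .eval log is a ZIP containing JSON members, tool time is approximated as working time minus LLM HTTP time, and the cache hit rate is cached tokens over prompt tokens. The helper names and the hard-coded figures are for illustration only.

```python
# Illustrative only -- not the PR's implementation. Helper names are made up;
# the formulas mirror the description above.
import zipfile
from pathlib import Path


def list_eval_json_members(path: Path) -> list[str]:
    """An .eval log is a ZIP archive; list its JSON members."""
    with zipfile.ZipFile(path) as zf:
        return [name for name in zf.namelist() if name.endswith(".json")]


def tool_time_approx(working_time_s: float, llm_http_time_s: float) -> float:
    """Approximate tool time as working time minus LLM HTTP round-trip time."""
    return max(working_time_s - llm_http_time_s, 0.0)


def cache_hit_rate(prompt_tokens: int, cached_tokens: int) -> float | None:
    """OpenAI-style cache hit rate: cached_tokens / prompt_tokens."""
    return cached_tokens / prompt_tokens if prompt_tokens else None


if __name__ == "__main__":
    # Figures from the example run above: 1h 46m total, 7m 49s of LLM time.
    print(tool_time_approx(6360.0, 469.0))        # ~5891 s, i.e. roughly the 1h 38m shown
    print(cache_hit_rate(5_410_000, 3_940_000))   # ~0.728, i.e. the 72.8% shown
```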

Amp-Thread-ID: https://ampcode.com/threads/T-2f80b096-55b1-4353-930c-76e9f6368f71


Note

Introduce a standalone script to parse .eval/.json logs and output aggregated/per-sample eval timing, token, throughput, and cache statistics.

  • CLI (scripts/extract_eval_stats.py):
    • Parse .eval (ZIP) and .json logs; auto-discover files in a path.
    • Extract per-sample stats (times, token usage), excluding scorer model calls via span hierarchy.
    • Aggregate eval metrics: total/working time, LLM HTTP time, tool-time approximation, call counts.
    • Compute provider timings (Groq: queue_time, prompt_time, completion_time) and throughput (input/output tok/s).
    • Collect cache metrics: OpenAI (prompt_tokens, cached_tokens, hit rate) and Anthropic (cache_read_input_tokens, cache_creation_input_tokens).
    • Output human-readable summary to stdout; optional JSON via --output with --per-sample support; verbose mode.

Written by Cursor Bugbot for commit ce075e0. This will update automatically on new commits.
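
For consumers of the JSON written by --output (rather than the stdout summary), a hypothetical reading sketch follows. The aggregate key names are borrowed from the code excerpts later in this thread (llm_http_time_mean, llm_queue_time_mean, total_time_p95); the script's actual output schema may differ.

```python
# Hypothetical reader for the file produced by --output; key names are
# assumptions based on aggregates mentioned in this PR, not a documented schema.
import json
from pathlib import Path

stats = json.loads(Path("stats.json").read_text())

# Print a few aggregate timing fields if they are present.
for key in ("llm_http_time_mean", "llm_queue_time_mean", "total_time_p95"):
    value = stats.get(key)
    if value is not None:
        print(f"{key}: {value:.3f}s")
```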

@cursor

cursor bot commented Dec 1, 2025

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.
Learn more about Cursor Agents

@geelen
Contributor Author

geelen commented Dec 1, 2025

@cursor please fix both suggestions.

@cursor

cursor bot commented Dec 1, 2025

Provider timing averages now weight every individual call rather than averaging per-sample means, so llm_queue/prompt/completion/provider_time_mean mirror the llm_http_time_mean behavior and avoid bias toward short samples.

```python
        all_http_times.extend(s.llm_http_times)
        if s.n_model_calls > 0:
            total_queue_time_sum += s.llm_queue_time_total
            total_prompt_time_sum += s.llm_prompt_time_total
            total_completion_time_sum += s.llm_completion_time_total
            total_provider_time_sum += s.llm_provider_time_total
            total_provider_time_calls += s.n_model_calls
...
            "llm_queue_time_mean": (
                total_queue_time_sum / total_provider_time_calls
                if total_provider_time_calls > 0
                else None
            ),
```
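
A toy illustration of the bias this removes (not from the PR): with one sample containing a single fast call and another containing ten slower calls, the two aggregation strategies diverge noticeably.

```python
# Averaging per-sample means vs. weighting every individual call.
samples = [
    [0.100],        # short sample: one 100 ms call
    [0.700] * 10,   # long sample: ten 700 ms calls
]

mean_of_means = sum(sum(s) / len(s) for s in samples) / len(samples)
call_weighted = sum(sum(s) for s in samples) / sum(len(s) for s in samples)

print(f"mean of per-sample means: {mean_of_means:.3f}s")  # 0.400s
print(f"call-weighted mean:       {call_weighted:.3f}s")  # 0.645s
```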

The P95 calculation now uses an interpolated percentile index (0.95 * (n - 1)), so the 95th percentile isn't biased high for small samples.

```python
        total_time_p95 = None
        if len(total_times) >= 20:
            sorted_times = sorted(total_times)
            percentile_position = 0.95 * (len(sorted_times) - 1)
            lower_index = int(percentile_position)
            upper_index = min(lower_index + 1, len(sorted_times) - 1)
            fraction = percentile_position - lower_index
            total_time_p95 = (
                sorted_times[lower_index] * (1 - fraction)
                + sorted_times[upper_index] * fraction
            )
...
            "total_time_p95": total_time_p95,
```

Let me know if you’d like this scripted into tests or covered elsewhere.



Learn more about Cursor Agents
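
A quick way to sanity-check the interpolation in the comment above (not part of the PR): the 0.95 * (n - 1) index with linear interpolation should agree with the standard library's "inclusive" quantile method.

```python
# Compare the manual interpolated P95 with statistics.quantiles(method="inclusive").
import math
import random
import statistics

random.seed(0)
total_times = [random.expovariate(1 / 30) for _ in range(130)]  # fake per-sample times, seconds

# Manual interpolation, mirroring the snippet above.
sorted_times = sorted(total_times)
position = 0.95 * (len(sorted_times) - 1)
lower = int(position)
upper = min(lower + 1, len(sorted_times) - 1)
fraction = position - lower
manual_p95 = sorted_times[lower] * (1 - fraction) + sorted_times[upper] * fraction

# statistics.quantiles with n=100 returns 99 cut points; index 94 is the 95th percentile.
stdlib_p95 = statistics.quantiles(total_times, n=100, method="inclusive")[94]
assert math.isclose(manual_p95, stdlib_p95, rel_tol=1e-9)
print(f"p95 ≈ {manual_p95:.1f}s")
```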

