
Conversation

@geelen
Contributor

@geelen geelen commented Dec 1, 2025

Summary

I found myself wanting to understand the performance and caching behavior of a benchmark run a bit better, so I had this script made. Right now it's just a standalone script, but it could be integrated into the main app if people find it useful.

What are you adding?

  • Bug fix (non-breaking change which fixes an issue)
  • New benchmark/evaluation
  • New model provider
  • CLI enhancement
  • Performance improvement
  • Documentation update
  • API/SDK feature
  • Integration (CI/CD, tools)
  • Export/import functionality
  • Code refactoring
  • Breaking change
  • Other

Changes Made

New standalone script, scripts/extract_eval_stats.py, that pulls statistics from eval log files.

Testing

  • I have run the existing test suite (pytest)
  • I have added tests for my changes
  • I have tested with multiple model providers (if applicable)
  • I have run pre-commit hooks (pre-commit run --all-files)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (if applicable)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Extract detailed statistics from openbench/Inspect AI evaluation logs (.eval ZIP or .json files).

  • Total time and working time per sample

  • LLM HTTP round-trip time (from model call events)

  • Provider-specific timing for Groq (queue_time, prompt_time, completion_time)

  • Approximate tool time (working_time - llm_http_time)

  • LLM time fraction of total working time

  • Input, output, and total tokens per sample and aggregate

  • Per-model breakdown from model_usage

  • OpenAI: prompt_tokens and cached_tokens with cache hit rate

  • Anthropic: cache_read_input_tokens and cache_creation_input_tokens

  • Human-readable summary to stdout

  • JSON output with --output flag

  • Per-sample statistics with --per-sample flag

Usage:

```bash
python scripts/extract_eval_stats.py logs/my_eval.eval

python scripts/extract_eval_stats.py logs/ -o stats.json --per-sample -v
```

Example output:

```
  Model: groq/openai/gpt-oss-120b
  Task: progressivemcpbench-copilot

  Samples: 130
  Score: mean=0.523 (min=0.000, max=1.000)

  CALLS:
    Model calls: 794
    Tool calls: 590

  TIME BREAKDOWN:
    Total eval time: 1h 46m
    LLM time: 7m 49s (7.3%)
    Tool time (approx): 1h 38m

  LLM CALL PERFORMANCE (avg per call):
    HTTP round-trip: 623ms
    Time to first token (queue): 12ms
    Prompt processing: 55ms
    Completion generation: 307ms

  TOKENS:
    Input: 5.41M
    Output: 124.4K
    Total: 5.54M

  THROUGHPUT:
    Input processing: 74.2K tok/s
    Output generation: 475 tok/s

  CACHING (OpenAI-compatible):
    Prompt tokens: 5.41M
    Cached tokens: 3.94M
    Cache hit rate: 72.8%
```

Notes:

  • Excludes scorer model calls (only counts primary model usage)
  • All data is extracted from existing eval logs - no runtime changes required
  • Cache statistics depend on provider support (OpenAI, Anthropic formats)
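
To make the derived numbers concrete, here is a small illustrative sketch (not code from the PR). It assumes only what the description above states: an .eval log is a ZIP containing JSON members, tool time is approximated as working time minus LLM HTTP time, and the cache hit rate is cached tokens over prompt tokens. The helper names and the hard-coded figures are for illustration only.

```python
# Illustrative only -- not the PR's implementation. Helper names are made up;
# the formulas mirror the description above.
import zipfile
from pathlib import Path


def list_eval_json_members(path: Path) -> list[str]:
    """An .eval log is a ZIP archive; list its JSON members."""
    with zipfile.ZipFile(path) as zf:
        return [name for name in zf.namelist() if name.endswith(".json")]


def tool_time_approx(working_time_s: float, llm_http_time_s: float) -> float:
    """Approximate tool time as working time minus LLM HTTP round-trip time."""
    return max(working_time_s - llm_http_time_s, 0.0)


def cache_hit_rate(prompt_tokens: int, cached_tokens: int) -> float | None:
    """OpenAI-style cache hit rate: cached_tokens / prompt_tokens."""
    return cached_tokens / prompt_tokens if prompt_tokens else None


if __name__ == "__main__":
    # Figures from the example run above: 1h 46m total, 7m 49s of LLM time.
    print(tool_time_approx(6360.0, 469.0))        # ~5891 s, i.e. roughly the 1h 38m shown
    print(cache_hit_rate(5_410_000, 3_940_000))   # ~0.728, i.e. the 72.8% shown
```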

Amp-Thread-ID: https://ampcode.com/threads/T-2f80b096-55b1-4353-930c-76e9f6368f71


Note

Introduce a standalone script to parse .eval/.json logs and output aggregated/per-sample eval timing, token, throughput, and cache statistics.

  • CLI (scripts/extract_eval_stats.py):
    • Parse .eval (ZIP) and .json logs; auto-discover files in a path.
    • Extract per-sample stats (times, token usage), excluding scorer model calls via span hierarchy.
    • Aggregate eval metrics: total/working time, LLM HTTP time, tool-time approximation, call counts.
    • Compute provider timings (Groq: queue_time, prompt_time, completion_time) and throughput (input/output tok/s).
    • Collect cache metrics: OpenAI (prompt_tokens, cached_tokens, hit rate) and Anthropic (cache_read_input_tokens, cache_creation_input_tokens).
    • Output human-readable summary to stdout; optional JSON via --output with --per-sample support; verbose mode.

Written by Cursor Bugbot for commit ce075e0. This will update automatically on new commits.
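
For consumers of the JSON written by --output (rather than the stdout summary), a hypothetical reading sketch follows. The aggregate key names are borrowed from the code excerpts later in this thread (llm_http_time_mean, llm_queue_time_mean, total_time_p95); the script's actual output schema may differ.

```python
# Hypothetical reader for the file produced by --output; key names are
# assumptions based on aggregates mentioned in this PR, not a documented schema.
import json
from pathlib import Path

stats = json.loads(Path("stats.json").read_text())

# Print a few aggregate timing fields if they are present.
for key in ("llm_http_time_mean", "llm_queue_time_mean", "total_time_p95"):
    value = stats.get(key)
    if value is not None:
        print(f"{key}: {value:.3f}s")
```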

@cursor

cursor bot commented Dec 1, 2025

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.
Learn more about Cursor Agents

@geelen
Contributor Author

geelen commented Dec 1, 2025

@cursor please fix both suggestions.

@cursor

cursor bot commented Dec 1, 2025

Provider timing averages now weight every individual call rather than averaging per-sample means, so llm_queue/prompt/completion/provider_time_mean mirror the llm_http_time_mean behavior and avoid bias toward short samples.

```python
        all_http_times.extend(s.llm_http_times)
        if s.n_model_calls > 0:
            total_queue_time_sum += s.llm_queue_time_total
            total_prompt_time_sum += s.llm_prompt_time_total
            total_completion_time_sum += s.llm_completion_time_total
            total_provider_time_sum += s.llm_provider_time_total
            total_provider_time_calls += s.n_model_calls
...
            "llm_queue_time_mean": (
                total_queue_time_sum / total_provider_time_calls
                if total_provider_time_calls > 0
                else None
            ),
```
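
A toy illustration of the bias this removes (not from the PR): with one sample containing a single fast call and another containing ten slower calls, the two aggregation strategies diverge noticeably.

```python
# Averaging per-sample means vs. weighting every individual call.
samples = [
    [0.100],        # short sample: one 100 ms call
    [0.700] * 10,   # long sample: ten 700 ms calls
]

mean_of_means = sum(sum(s) / len(s) for s in samples) / len(samples)
call_weighted = sum(sum(s) for s in samples) / sum(len(s) for s in samples)

print(f"mean of per-sample means: {mean_of_means:.3f}s")  # 0.400s
print(f"call-weighted mean:       {call_weighted:.3f}s")  # 0.645s
```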

The P95 calculation now uses an interpolated percentile index (0.95 * (n - 1)), so the 95th percentile isn't biased high for small samples.

```python
        total_time_p95 = None
        if len(total_times) >= 20:
            sorted_times = sorted(total_times)
            percentile_position = 0.95 * (len(sorted_times) - 1)
            lower_index = int(percentile_position)
            upper_index = min(lower_index + 1, len(sorted_times) - 1)
            fraction = percentile_position - lower_index
            total_time_p95 = (
                sorted_times[lower_index] * (1 - fraction)
                + sorted_times[upper_index] * fraction
            )
...
            "total_time_p95": total_time_p95,
```

Let me know if you’d like this scripted into tests or covered elsewhere.



Learn more about Cursor Agents
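
A quick way to sanity-check the interpolation in the comment above (not part of the PR): the 0.95 * (n - 1) index with linear interpolation should agree with the standard library's "inclusive" quantile method.

```python
# Compare the manual interpolated P95 with statistics.quantiles(method="inclusive").
import math
import random
import statistics

random.seed(0)
total_times = [random.expovariate(1 / 30) for _ in range(130)]  # fake per-sample times, seconds

# Manual interpolation, mirroring the snippet above.
sorted_times = sorted(total_times)
position = 0.95 * (len(sorted_times) - 1)
lower = int(position)
upper = min(lower + 1, len(sorted_times) - 1)
fraction = position - lower
manual_p95 = sorted_times[lower] * (1 - fraction) + sorted_times[upper] * fraction

# statistics.quantiles with n=100 returns 99 cut points; index 94 is the 95th percentile.
stdlib_p95 = statistics.quantiles(total_times, n=100, method="inclusive")[94]
assert math.isclose(manual_p95, stdlib_p95, rel_tol=1e-9)
print(f"p95 ≈ {manual_p95:.1f}s")
```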

