
feat: low-memory mode and live progress-bar stats #55

@acere

Summary

For large-scale test runs, keeping all InvocationResponse objects in memory can be expensive. The bulk of the cost comes from the response_text, input_payload, and input_prompt fields.

This issue proposes three related features:

1. Low-memory mode (low_memory=True)

A new low_memory parameter on Runner/run() that writes responses to disk as they arrive but does not accumulate them in the in-memory _responses list. Stats are computed incrementally. This mode requires output_path to be set, because the on-disk copy is the only record of the responses. Responses can be loaded on demand via result.load_responses().

result = await runner.run(output_path="/tmp/my_run", low_memory=True)
result.stats          # works (computed incrementally)
result.responses      # [] (empty — not in memory)
result.load_responses()  # loads from disk when needed
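
The write-through behaviour described above can be illustrated with a minimal standalone sketch. It is not the library's implementation: the helper names (record_response, load_responses as a free function), the JSON Lines file layout, and the dict-shaped responses are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def record_response(response: dict, fh, totals: dict, kept: list, low_memory: bool) -> None:
    """Persist one response and update running totals (hypothetical helper)."""
    fh.write(json.dumps(response) + "\n")  # always written to disk as it arrives
    totals["count"] += 1
    totals["failed"] += 1 if response.get("error") else 0
    if not low_memory:
        kept.append(response)  # only accumulate in memory when low_memory is off


def load_responses(path: Path) -> list:
    """Read responses back from the JSON Lines file on demand."""
    with path.open() as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Assumed layout: one JSON object per line in a single file.
out_file = Path(tempfile.mkdtemp()) / "responses.jsonl"
totals = {"count": 0, "failed": 0}
kept: list = []

with out_file.open("w") as fh:
    for i in range(5):
        resp = {"id": i, "response_text": "x" * 1000, "error": "timeout" if i == 3 else None}
        record_response(resp, fh, totals, kept, low_memory=True)

print(totals)                          # running totals are available without the responses
print(len(kept))                       # nothing was kept in memory
print(len(load_responses(out_file)))   # responses are recoverable from disk
```

In this sketch the large text fields only ever live in memory for one response at a time, which is the saving the proposal is after.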

2. RunningStats accumulator

A new RunningStats class that tracks metrics incrementally (counts, sums, sorted value lists for percentile computation). This replaces the _builtin_stats @cached_property on Result. Stats are now always computed during the run and stored as _preloaded_stats, which eliminates redundant recomputation.
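
As a rough illustration of the accumulator idea, here is a minimal sketch that keeps counts, sums, and sorted value lists. The class name RunningStatsSketch, its method names, and the nearest-rank percentile rule are assumptions for this example, not the proposed RunningStats API.

```python
import bisect
import math


class RunningStatsSketch:
    """Hypothetical incremental accumulator: counts, sums, sorted values."""

    def __init__(self) -> None:
        self.count = 0
        self.failed = 0
        self._sums: dict = {}
        self._sorted: dict = {}

    def update(self, metrics: dict, failed: bool = False) -> None:
        """Fold one response's numeric metrics into the running state."""
        self.count += 1
        if failed:
            self.failed += 1
        for name, value in metrics.items():
            if value is None:
                continue  # metric missing for this response
            self._sums[name] = self._sums.get(name, 0.0) + value
            # insort keeps the list ordered, so percentiles need no re-sort
            bisect.insort(self._sorted.setdefault(name, []), value)

    def percentile(self, name: str, q: float) -> float:
        """Nearest-rank percentile for q in (0, 100] (one of several conventions)."""
        values = self._sorted[name]
        rank = max(1, math.ceil(q / 100 * len(values)))
        return values[rank - 1]

    def mean(self, name: str) -> float:
        return self._sums[name] / len(self._sorted[name])


stats = RunningStatsSketch()
for ttft in [0.5, 0.1, 0.3, 0.2, 0.4]:
    stats.update({"time_to_first_token": ttft})
stats.update({"time_to_first_token": None}, failed=True)

p50 = stats.percentile("time_to_first_token", 50)
p90 = stats.percentile("time_to_first_token", 90)
print(stats.count, stats.failed, p50, p90)
```

Only the numeric metric values are retained here, so memory grows with the number of responses but not with the size of their text fields.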

3. Live progress-bar stats

RunningStats.snapshot() formats a configurable subset of live stats for tqdm display during the run: p50/p90 TTFT, p50/p90 TTLT, median output tokens/s, total input/output tokens, and failure count. Configurable via the progress_bar_stats parameter.

result = await runner.run(
    progress_bar_stats={
        "p99_ttlt": ("time_to_last_token", "p99"),
        "tps": ("time_per_output_token", "p50", "inv"),
        "fail": "failed",
    },
)
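
One possible reading of the progress_bar_stats mapping is sketched below. The interpretation is an assumption: a plain string is treated as the name of a counter, a tuple as (metric, percentile) with an optional "inv" flag meaning "display the reciprocal" (so time per output token becomes tokens per second). The function format_snapshot and its inputs are hypothetical and do not reflect the actual RunningStats.snapshot() signature.

```python
import math


def percentile(values: list, q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def format_snapshot(specs: dict, samples: dict, counters: dict) -> dict:
    """Turn a label -> spec mapping into label -> display string (hypothetical)."""
    out = {}
    for label, spec in specs.items():
        if isinstance(spec, str):
            # Assumed: a bare string names a simple counter such as "failed".
            out[label] = str(counters.get(spec, 0))
            continue
        metric, agg, *flags = spec
        value = percentile(samples[metric], float(agg.lstrip("p")))
        if "inv" in flags:
            value = 1.0 / value  # assumed meaning of "inv": show the reciprocal
        out[label] = f"{value:.2f}"
    return out


samples = {
    "time_to_last_token": [1.0, 2.0, 3.0, 4.0],
    "time_per_output_token": [0.02, 0.04, 0.05],
}
counters = {"failed": 2}
specs = {
    "p99_ttlt": ("time_to_last_token", "p99"),
    "tps": ("time_per_output_token", "p50", "inv"),
    "fail": "failed",
}

snapshot = format_snapshot(specs, samples, counters)
print(snapshot)
```

A dict of short strings like this could be handed to tqdm's set_postfix to refresh the bar during the run.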
