Summary
For large-scale test runs, keeping every InvocationResponse object in memory can be expensive; most of the cost comes from the response_text, input_payload, and input_prompt fields.
This issue proposes three related features:
1. Low-memory mode (low_memory=True)
A new low_memory parameter on Runner/run() that writes responses to disk as they arrive but does not accumulate them in the in-memory _responses list. Stats are computed incrementally. Requires output_path to be set. Responses can be loaded on demand via result.load_responses().
result = await runner.run(output_path="/tmp/my_run", low_memory=True)
result.stats # works (computed incrementally)
result.responses # [] (empty — not in memory)
result.load_responses() # loads from disk when needed
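To make the intended behavior concrete, here is a minimal sketch of the write-through idea, not the proposed implementation. The LowMemoryRecorder class, the responses.jsonl file name, and the dict-shaped responses are all hypothetical stand-ins; only low_memory, output_path, and load_responses() come from the proposal above.

```python
import json
import tempfile
from pathlib import Path


class LowMemoryRecorder:
    """Hypothetical helper: appends each response to a JSONL file on arrival."""

    def __init__(self, output_path, low_memory=False):
        # The file name is an assumption made for this sketch.
        self.file = Path(output_path) / "responses.jsonl"
        self.file.parent.mkdir(parents=True, exist_ok=True)
        self.file.write_text("")  # start each run with an empty file
        self.low_memory = low_memory
        self.responses = []  # stays empty in low-memory mode
        self.count = 0       # stand-in for incrementally computed stats

    def record(self, response: dict) -> None:
        # Always persist to disk first, so nothing is lost in either mode.
        with self.file.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(response) + "\n")
        self.count += 1
        if not self.low_memory:
            self.responses.append(response)

    def load_responses(self) -> list:
        # Read everything back from disk only when the caller asks for it.
        with self.file.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    recorder = LowMemoryRecorder(tmp, low_memory=True)
    for i in range(3):
        recorder.record({"id": i, "response_text": "x" * 10})
    in_memory = list(recorder.responses)  # nothing was kept in memory
    loaded = recorder.load_responses()    # all three come back from disk
```

In this sketch the in-memory list is skipped only when low_memory is set, so the default behavior is unchanged.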
2. RunningStats accumulator
A new RunningStats class that tracks metrics incrementally (counts, sums, sorted value lists for percentile computation). This replaces the _builtin_stats @cached_property on Result — stats are now always computed during the run and stored as _preloaded_stats, eliminating redundant recomputation.
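As a rough illustration of the accumulator described above, the sketch below keeps a count, per-metric sums, and per-metric sorted lists, and answers percentile queries from the sorted lists. The class name RunningStatsSketch, its method names, and the nearest-rank percentile rule are assumptions made for this example; the actual RunningStats interface may differ.

```python
import bisect
import math


class RunningStatsSketch:
    """Hypothetical incremental accumulator: counts, sums, sorted values."""

    def __init__(self):
        self.count = 0
        self.failed = 0
        self._sums = {}    # metric name -> running sum
        self._sorted = {}  # metric name -> values kept in sorted order

    def update(self, metrics: dict, failed: bool = False) -> None:
        self.count += 1
        if failed:
            self.failed += 1
        for name, value in metrics.items():
            self._sums[name] = self._sums.get(name, 0.0) + value
            # insort keeps the list sorted, so percentiles need no re-sort.
            bisect.insort(self._sorted.setdefault(name, []), value)

    def mean(self, name: str) -> float:
        return self._sums[name] / len(self._sorted[name])

    def percentile(self, name: str, q: float) -> float:
        """Nearest-rank percentile for q in (0, 100]."""
        values = self._sorted[name]
        rank = max(1, math.ceil(q / 100 * len(values)))
        return values[rank - 1]


stats = RunningStatsSketch()
for ttft in [0.30, 0.10, 0.20, 0.50, 0.40]:
    stats.update({"time_to_first_token": ttft})
stats.update({}, failed=True)  # a failed request contributes no timings

p50 = stats.percentile("time_to_first_token", 50)
p90 = stats.percentile("time_to_first_token", 90)
```

Only the small numeric values are retained here, not the response text or payloads, which is what would let stats stay available when responses are not held in memory.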
3. Live progress-bar stats
RunningStats.snapshot() formats a configurable subset of live stats for tqdm display during the run: p50/p90 TTFT, p50/p90 TTLT, median output tokens/s, total input/output tokens, and failure count. Configurable via the progress_bar_stats parameter.
result = await runner.run(
    progress_bar_stats={
        "p99_ttlt": ("time_to_last_token", "p99"),
        "tps": ("time_per_output_token", "p50", "inv"),
        "fail": "failed",
    },
)
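One possible reading of the spec format above, sketched as a standalone function. The interpretation is an assumption: a tuple is read as (metric, percentile, optional flags), "inv" is taken to mean the reciprocal of the aggregated value (so time per output token becomes tokens per second), and a bare string is taken to name a counter. format_snapshot and its arguments are hypothetical and not part of the proposed API.

```python
import math


def percentile(sorted_values, q):
    """Nearest-rank percentile over an already sorted list."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def format_snapshot(spec, sorted_metrics, counters):
    """Hypothetical formatter turning a progress_bar_stats spec into one line."""
    parts = []
    for label, entry in spec.items():
        if isinstance(entry, str):
            # Assumption: a bare string names a simple counter.
            parts.append(f"{label}={counters.get(entry, 0)}")
            continue
        metric, agg, *flags = entry
        values = sorted_metrics.get(metric)
        if not values:
            parts.append(f"{label}=n/a")  # no data for this metric yet
            continue
        value = percentile(values, float(agg[1:]))  # "p99" -> 99.0
        if "inv" in flags:
            # Assumption: "inv" means the reciprocal of the aggregate.
            value = 1.0 / value if value else float("inf")
        parts.append(f"{label}={value:.3g}")
    return ", ".join(parts)


spec = {
    "p99_ttlt": ("time_to_last_token", "p99"),
    "tps": ("time_per_output_token", "p50", "inv"),
    "fail": "failed",
}
sorted_metrics = {
    "time_to_last_token": [1.0, 2.0, 4.0],
    "time_per_output_token": [0.01, 0.02, 0.05],
}
counters = {"failed": 2}
line = format_snapshot(spec, sorted_metrics, counters)
```

A string like this could be passed to tqdm's set_postfix_str() on each update; how the proposal wires it into the progress bar is not specified here.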