fix: eliminate floating-point error in response stats averages#783
Conversation
Replace the running-average computation (which applied `round(..., 2)` at each step, compounding error over thousands of requests) with exact sum-based tracking. Averages are now computed as sum/count at save time.

Changes:
- Track exact sums (`sum_prompt_tokens`, `sum_completion_tokens`, etc.) internally alongside running averages
- Compute `avg_*` from sum/count instead of an incremental running average
- Round `avg_*` to 2 dp only at file-save time (not during accumulation)
- Add `total_completion_tokens`, `total_prompt_tokens`, `total_total_tokens` to the `eval_factory_metrics.json` output for exact totals
- Backward compatible: loads old cached stats by back-computing sums from `avg * successful_count`

Impact: for large evaluations (>10K requests), the old running average could drift by 60+ tokens/request from the true mean. For example, on a 12K-request MMLU-Pro evaluation, `avg_completion_tokens` was 7776.98 vs the exact value of 7711.3, a 0.85% error that grows with request count. The new `total_*` fields let consumers compute exact averages without relying on the rounded `avg_*` values.

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
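The sum-based approach can be sketched as follows. This is a minimal illustration of the technique, not the actual `ResponseStatsInterceptor` code; the class and method names are assumptions, though the field names mirror those in the commit message:

```python
# Minimal sketch: accumulate exact integer sums, divide and round only at
# save time. Not the real interceptor; names are illustrative.
class TokenStats:
    def __init__(self):
        self.successful_count = 0
        self.sum_completion_tokens = 0

    def record(self, completion_tokens: int) -> None:
        # Exact integer accumulation; no rounding here.
        self.successful_count += 1
        self.sum_completion_tokens += completion_tokens

    def to_saved_dict(self) -> dict:
        # Round exactly once, from the exact sum, at save time.
        avg = self.sum_completion_tokens / self.successful_count
        return {
            "successful_count": self.successful_count,
            "total_completion_tokens": self.sum_completion_tokens,
            "avg_completion_tokens": round(avg, 2),
        }


# The old approach, for contrast: rounding at every step lets the
# per-step rounding error leak into all subsequent updates.
def running_avg_rounded(values):
    avg = 0.0
    for n, v in enumerate(values, start=1):
        avg = round(avg + (v - avg) / n, 2)
    return avg
```

On three values the two methods happen to agree, but over thousands of updates only the sum-based version is guaranteed to match the true mean to 2 dp.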
When chain jobs requeue (e.g., due to Slurm TIMEOUT), previously in-flight requests are re-sent. The response_stats interceptor now detects these retries via `cache_key` (SHA256 of the request content) and excludes them from token stats to avoid inflating averages.

Observed impact: NeMoTron MMLU-Pro had count=12639 vs expected 12032 (607 retries), inflating `avg_completion_tokens` from 7711.3 to 7776.98.

Changes:
- Track seen cache_keys in memory and persist them across allocations
- Retry requests still increment `count` (total API calls) but NOT `successful_count` or token sums
- Add a `retry_count` field to the `eval_factory_metrics.json` output
- Backward compatible: old cached stats without `seen_cache_keys` or `retry_count` are handled gracefully

Tests:
- `test_retry_deduplication`: verifies retries are detected and excluded
- `test_retry_deduplication_persists_across_reloads`: verifies the `seen_cache_keys` set survives cache save/load (chain job scenario)

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
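The retry-detection scheme can be sketched like this. It is a minimal illustration under stated assumptions: `cache_key`, `RetryAwareStats`, and `record_success` are invented names, not the interceptor's real API, and the canonical-JSON hashing is one plausible way to derive a content hash:

```python
import hashlib
import json

def cache_key(request_body: dict) -> str:
    # SHA256 over canonicalized request content (illustrative).
    payload = json.dumps(request_body, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


class RetryAwareStats:
    def __init__(self):
        self.count = 0             # every API call made
        self.successful_count = 0  # unique successful requests only
        self.retry_count = 0
        self.sum_completion_tokens = 0
        self.seen_cache_keys = set()

    def record_success(self, request_body: dict, completion_tokens: int) -> None:
        self.count += 1
        key = cache_key(request_body)
        if key in self.seen_cache_keys:
            # Re-sent request from a requeued allocation: count the API
            # call, but keep it out of the token sums so averages stay true.
            self.retry_count += 1
            return
        self.seen_cache_keys.add(key)
        self.successful_count += 1
        self.sum_completion_tokens += completion_tokens
```

Recording the same request body twice increments `count` and `retry_count` on the second call, leaving `successful_count` and the token sums untouched.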
Force-pushed from 7e3b5cb to 425aa1e.
Power Review
Reviewed 2 files, left 7 findings: 1 error, 5 warnings, 1 suggestion.
Dimensions Analyzed
| Dimension | 🔴 | ⚠️ | 💡 |
|---|---|---|---|
| anti_patterns | 0 | 1 | 0 |
| architecture | 1 | 2 | 0 |
| testing-quality | 0 | 2 | 1 |
| Total | 1 | 5 | 1 |
The diff brings change to eliminate floating-point error in ground,
With 7 things our review has found.
Machines can scan but miss the subtle thread,
A human touch ensures the code is read.
Automated review by Power Review — findings are advisory.
```python
output_dir=str(tmp_path), url="http://test.api.com/v1/chat/completions"
)

def make_response(req_id, cache_key, completion_tokens):
```
[Power Review] 🔴 Duplicate make_response helper defined identically in two test methods · 97% confidence
The architecture document flags 'Duplicated Utility Functions' as an ERROR: 'AI often regenerates logic that already exists in the codebase. Flag functions that duplicate existing helpers.' The make_response inner function at line 574 (inside test_retry_deduplication) and again at line 633 (inside test_retry_deduplication_persists_across_reloads) are byte-for-byte identical — same signature, same mock construction, same rctx.cache_key assignment, same return.
💡 Suggestion: Extract make_response to a shared static method on the enclosing test class, or to a module-level fixture/factory function. Both test methods can then call the shared helper. Example:

```python
@staticmethod
def _make_cache_keyed_response(req_id, cache_key, completion_tokens, tmp_path):
    ...
```

```python
output_dir=str(tmp_path), url="http://test.api.com/v1/chat/completions"
)

def make_response(req_id, cache_key, completion_tokens):
```
[Power Review] make_response helper duplicated across two test methods · 97% confidence
The make_response nested function defined inside test_retry_deduplication (lines 574–588) is byte-for-byte identical to the one defined inside test_retry_deduplication_persists_across_reloads (lines 633–647). This matches the Copy-Paste Code Blocks anti-pattern: duplicate logic harms maintainability and means any future change (e.g., adding a new field to the mock payload) must be applied in both places.
💡 Suggestion: Extract make_response as a shared module-level helper or a @staticmethod / pytest.fixture on the enclosing test class. Both test methods can then call the single canonical implementation. Example:

```python
def _make_cache_keyed_response(req_id, cache_key, completion_tokens, tmp_path):
    mock_resp = Mock(spec=requests.Response)
    mock_resp.status_code = 200
    mock_resp.headers = {}
    mock_resp.json.return_value = {
        "usage": {
            "prompt_tokens": 50,
            "total_tokens": 50 + completion_tokens,
            "completion_tokens": completion_tokens,
        },
        "choices": [{"finish_reason": "stop", "message": {}}],
    }
    rctx = AdapterRequestContext(request_id=req_id)
    rctx.cache_key = cache_key
    return AdapterResponse(r=mock_resp, rctx=rctx, latency_ms=10.0)
```

```python
assert interceptor2._stats["successful_count"] == 3
assert interceptor2._stats["sum_completion_tokens"] == 600

def test_total_fields_in_saved_file(self, tmp_path):
```
[Power Review] test_total_fields_in_saved_file does not assert total_total_tokens or the absence of sum_total_tokens · 82% confidence
The PR description explicitly lists total_total_tokens as a new output field. The implementation adds it via stats[f'total_{token_type}'] = stats.pop(f'sum_{token_type}', 0) for token_type in ['prompt_tokens', 'total_tokens', 'completion_tokens']. The test asserts total_completion_tokens and total_prompt_tokens but omits total_total_tokens. It also verifies sum_completion_tokens and sum_prompt_tokens are absent from the output but silently skips sum_total_tokens. A test whose name implies complete coverage of saved-file fields should verify all new fields.
💡 Suggestion: Add `assert rs['total_total_tokens'] == sum(50 + ct for ct in token_values)` and `assert 'sum_total_tokens' not in rs` to the test body.
```python
status_codes[key] = value
aggregated_stats["status_codes"] = status_codes

# Backward compatibility: if cached stats lack sum_* fields (from
```
[Power Review] _load_aggregated_cached_stats (Geological Layers) · 78% confidence
The architecture document warns against 'Geological Layers: new code bolted on top of old code without integrating — duplicate data paths, parallel config mechanisms, wrapper-on-wrapper stacking.' Lines 225–257 add ~35 lines of multi-branch migration logic (back-computing sum_* from avg_*, handling total_*, ensuring retry_count) inline at the top of the existing load method rather than extracting it into a focused helper. This creates a layered accumulation of unrelated responsibilities in one method.
💡 Suggestion: Extract the migration block into a private method such as _migrate_legacy_cached_stats(aggregated_stats: dict) -> dict. _load_aggregated_cached_stats then calls it as a single delegation step, keeping the load method's intent readable:

```python
aggregated_stats = self._migrate_legacy_cached_stats(aggregated_stats)
```

```python
assert interceptor._stats["count"] == expected_total_responses
assert interceptor._stats["successful_count"] == expected_successful_responses

def test_average_precision_with_many_requests(self, tmp_path):
```
[Power Review] Legacy cache-file migration path has no test coverage
The knowledge document warns against happy-path-only test suites: 'test invalid input, empty collections, null/None values, max values, and expected exceptions.' The implementation adds substantial migration logic (lines 225–265 in the interceptor) that back-computes sum_* fields from avg * successful_count when loading old-format cache files that lack sum_* keys. All four new tests create fresh interceptors or load newly-written cache data. None simulate loading a legacy cache file (one without sum_prompt_tokens, sum_completion_tokens, sum_latency_ms, or retry_count). A regression in that migration branch would go undetected.
💡 Suggestion: Add a test that manually writes a cache JSON file in the old format (containing only avg_* and successful_count, no sum_*), constructs a new ResponseStatsInterceptor pointing at that cache dir, and asserts that the loaded sum_* fields are correctly back-computed (e.g., sum_completion_tokens ≈ avg_completion_tokens * successful_count).
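A minimal sketch of the back-computation such a test would exercise. The `migrate_legacy_stats` helper below is a hypothetical stand-in for the interceptor's migration logic, not its actual method:

```python
# Hypothetical stand-in for the interceptor's legacy-cache migration:
# back-compute sum_* fields from avg_* when a cached stats dict predates
# the sum-based format.
def migrate_legacy_stats(stats: dict) -> dict:
    n = stats.get("successful_count", 0)
    for token_type in ("prompt_tokens", "completion_tokens", "total_tokens"):
        sum_key, avg_key = f"sum_{token_type}", f"avg_{token_type}"
        if sum_key not in stats and avg_key in stats:
            # Exactness is limited by the 2 dp rounding already baked
            # into the stored avg_* value.
            stats[sum_key] = stats[avg_key] * n
    stats.setdefault("retry_count", 0)
    return stats


def test_legacy_cache_backcomputes_sums():
    legacy = {"successful_count": 4, "avg_completion_tokens": 7711.3}
    migrated = migrate_legacy_stats(legacy)
    assert abs(migrated["sum_completion_tokens"] - 4 * 7711.3) < 1e-6
    assert migrated["retry_count"] == 0
```

A real test would instead write an old-format JSON cache file and construct a fresh `ResponseStatsInterceptor` over it, as the reviewer suggests.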
```python
return resp
status_code = resp.r.status_code

# Detect retries: if we've already counted a successful response for
```
[Power Review] intercept_response grows significantly beyond 50 lines with retry logic added inline · 72% confidence
The architecture document warns: 'Monolithic Functions (>50 lines): AI generates long, flat functions instead of composing smaller units.' The diff adds ~55 lines of new code to intercept_response (retry detection at lines 462–468, retry branch at 480–488, restructured non-retry block at 490–519). Even if the function was borderline before, these additions push it well past the ~50-line guideline and mix three distinct concerns: retry detection, retry accounting, and normal stats collection.
💡 Suggestion: Extract the retry detection and retry-path handling into a dedicated private method, e.g. _handle_retry_response(resp, cache_key, context). The main intercept_response orchestrates at a higher level:

```python
is_retry = self._is_retry(resp)
if is_retry:
    self._handle_retry(resp, cache_key)
else:
    self._handle_successful_response(resp, context, status_code)
```

```python
output_dir=str(tmp_path), url="http://test.api.com/v1/chat/completions"
)

def make_response(req_id, cache_key, completion_tokens):
```
[Power Review] 💡 Identical make_response helper duplicated across two test methods · 92% confidence
The knowledge document explicitly flags duplicated test setup as a testing quality issue: 'AI copies setup code into every test function instead of using fixtures or setUp methods. Instead: extract shared setup into fixtures or helper functions.' The make_response inner function at lines 574–588 is byte-for-byte identical to the one at lines 633–647.
💡 Suggestion: Extract make_response into a module-level helper function or a pytest fixture shared by both test_retry_deduplication and test_retry_deduplication_persists_across_reloads.
Problem

The `ResponseStatsInterceptor` has two issues that cause `eval_factory_metrics.json` to report incorrect token statistics:

1. Floating-point error from running averages

`avg_completion_tokens` is computed using a running average with `round(..., 2)` applied at each step. Over thousands of requests, the rounding error compounds.

Observed impact: on a 12K-request MMLU-Pro evaluation, `avg_completion_tokens` was reported as 7776.98 vs the exact value 7711.3 (a 0.85% error).

2. Retried requests double-counted across chain job allocations

When Slurm chain jobs requeue (e.g., on TIMEOUT), previously in-flight requests are re-sent. The response_stats interceptor loads aggregated stats from the previous allocation and then counts these re-sent requests again, inflating both `count` and `successful_count`.

Observed impact:

- `count=12639` vs expected `12032` (607 retries double-counted)
- `count=3732` vs expected `3632` (100 retries double-counted)

Fix

Commit 1: Precision fix

- Track exact sums (`sum_prompt_tokens`, `sum_completion_tokens`, etc.) internally
- Compute averages as `sum/count` instead of an incremental running average
- Add `total_completion_tokens`, `total_prompt_tokens`, `total_total_tokens` fields to the output

Commit 2: Retry deduplication

- Persist seen `cache_key`s (SHA256 hash of request content) across allocations
- Retries still increment `count` (total API calls made) but not `successful_count` or token sums
- Add a `retry_count` field to the output
- The `seen_cache_keys` set is persisted in the interceptor state cache, surviving across chain job allocations

Backward compatibility

- Old cached stats without `sum_*`/`retry_count`/`seen_cache_keys` are automatically migrated
- New output fields (`total_*`, `retry_count`) are additive

Tests

- `test_average_precision_with_many_requests`: verifies exact precision over 10K requests
- `test_total_fields_in_saved_file`: verifies `total_*` fields in output, absence of internal `sum_*` fields
- `test_retry_deduplication`: verifies retries (same cache_key) are detected and excluded from token stats
- `test_retry_deduplication_persists_across_reloads`: verifies `seen_cache_keys` survive a save/load cycle (chain job scenario)

All 34 tests pass (2 pre-existing flaky timing tests excluded).
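The `seen_cache_keys` persistence that the chain-job scenario depends on can be sketched as a JSON round-trip. The file format and function names here are illustrative assumptions, not the interceptor's actual cache layout:

```python
import json
from pathlib import Path

# Illustrative persistence of stats plus seen cache keys across allocations.
def save_state(path: Path, stats: dict, seen_cache_keys: set) -> None:
    # Sets are not JSON-serializable; store a sorted list for stable output.
    state = dict(stats, seen_cache_keys=sorted(seen_cache_keys))
    path.write_text(json.dumps(state))

def load_state(path: Path) -> tuple:
    state = json.loads(path.read_text())
    # Tolerate old-format caches written before seen_cache_keys existed.
    seen = set(state.pop("seen_cache_keys", []))
    return state, seen
```

A requeued allocation that calls `load_state` recovers the full set of already-counted cache keys, so re-sent requests are recognized as retries rather than fresh successes.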