
fix: eliminate floating-point error in response stats averages #783

Open

gchlebus wants to merge 2 commits into main from gchlebus/fix/response-stats-precision

Conversation

@gchlebus (Contributor) commented Feb 27, 2026

Problem

The ResponseStatsInterceptor has two issues that cause eval_factory_metrics.json to report incorrect token statistics:

1. Floating-point error from running averages

avg_completion_tokens is computed using a running average with round(..., 2) at each step. Over thousands of requests, the rounding error compounds.

Observed impact:

  • MMLU-Pro (12K requests): avg drifted by 65.7 tokens/request (0.85%)
  • LiveCodeBench (3.7K requests): avg drifted by 335.2 tokens/request (1.34%)
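
The drift is easy to reproduce in isolation. Below is a minimal sketch of the failure mode (illustrative values, not the interceptor's actual code): once the per-request increment falls below the 0.005 rounding threshold, a running average rounded at every step freezes short of the true mean.

```python
# Illustrative repro: a running average rounded to 2 decimal places at
# every step freezes, while the exact sum/count mean keeps converging.
values = [100] + [101] * 10_000

running_avg = 0.0
for n, x in enumerate(values, start=1):
    # Incremental mean update with per-step rounding (the buggy pattern).
    running_avg = round(running_avg + (x - running_avg) / n, 2)

exact_avg = sum(values) / len(values)

# running_avg freezes at 100.93; the exact mean is ~101.0.
print(running_avg, round(exact_avg, 2))
```

With token counts in the thousands, the same mechanism produces the tens-of-tokens drift reported above.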

2. Retried requests double-counted across chain job allocations

When Slurm chain jobs requeue (e.g., on TIMEOUT), previously in-flight requests are re-sent. The response_stats interceptor loads aggregated stats from the previous allocation and then counts these re-sent requests again, inflating both count and successful_count.

Observed impact:

  • NeMoTron MMLU-Pro: count=12639 vs expected 12032 (607 retries double-counted)
  • NeMoTron LiveCodeBench: count=3732 vs expected 3632 (100 retries double-counted)

Fix

Commit 1: Precision fix

  • Track exact sums (sum_prompt_tokens, sum_completion_tokens, etc.) internally
  • Compute averages as sum/count instead of incremental running average
  • Round to 2dp only at file-save time, not during accumulation
  • Add total_completion_tokens, total_prompt_tokens, total_total_tokens fields to output
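
In outline, the accumulation change looks like the following sketch (hypothetical helper names; field names taken from the PR description, not the actual interceptor code):

```python
# Sketch of sum-based accumulation: exact integer sums internally,
# rounding applied only when producing the saved view.
stats = {"successful_count": 0, "sum_prompt_tokens": 0, "sum_completion_tokens": 0}

def record_usage(usage: dict) -> None:
    # No rounding during accumulation; sums stay exact.
    stats["successful_count"] += 1
    stats["sum_prompt_tokens"] += usage["prompt_tokens"]
    stats["sum_completion_tokens"] += usage["completion_tokens"]

def saved_view() -> dict:
    # Averages computed as sum/count and rounded once, at save time;
    # exact sums are exposed as total_* fields.
    n = stats["successful_count"]
    return {
        "successful_count": n,
        "avg_completion_tokens": round(stats["sum_completion_tokens"] / n, 2) if n else 0.0,
        "total_prompt_tokens": stats["sum_prompt_tokens"],
        "total_completion_tokens": stats["sum_completion_tokens"],
    }
```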

Commit 2: Retry deduplication

  • Track seen cache_keys (SHA256 hash of request content) across allocations
  • When a response arrives for an already-seen cache_key, it's a retry:
    • Still increments count (total API calls made)
    • Does NOT increment successful_count or token sums
  • Add retry_count field to output
  • seen_cache_keys set is persisted in the interceptor state cache, surviving across chain job allocations
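
The dedup logic can be sketched as follows (illustrative only; the real interceptor persists the seen-key set in its state cache rather than module globals):

```python
import hashlib

# Sketch of cache_key-based retry deduplication across allocations.
seen_cache_keys: set[str] = set()
stats = {"count": 0, "successful_count": 0, "retry_count": 0, "sum_completion_tokens": 0}

def cache_key_for(request_body: bytes) -> str:
    # SHA256 of the request content identifies a logical request.
    return hashlib.sha256(request_body).hexdigest()

def record_response(cache_key: str, completion_tokens: int) -> None:
    stats["count"] += 1  # every API call made is counted
    if cache_key in seen_cache_keys:
        # Re-sent request after a requeue: count it as a retry only.
        stats["retry_count"] += 1
        return
    seen_cache_keys.add(cache_key)
    stats["successful_count"] += 1
    stats["sum_completion_tokens"] += completion_tokens
```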

Backward compatibility

  • Old cached stats without sum_* / retry_count / seen_cache_keys are automatically migrated
  • Existing field names and rounding preserved in output
  • New fields (total_*, retry_count) are additive
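
The migration path can be sketched as follows (hypothetical function name; the back-computation of sums from avg * successful_count follows the commit message):

```python
# Sketch of legacy-cache migration: back-compute missing sum_* fields
# from the rounded averages, and default retry_count to zero.
def migrate_legacy_stats(stats: dict) -> dict:
    n = stats.get("successful_count", 0)
    for token_type in ("prompt_tokens", "completion_tokens", "total_tokens"):
        if f"sum_{token_type}" not in stats:
            # Best-effort reconstruction; precision is limited by the
            # old 2-decimal rounding of the stored averages.
            avg = stats.get(f"avg_{token_type}", 0)
            stats[f"sum_{token_type}"] = round(avg * n)
    stats.setdefault("retry_count", 0)
    return stats
```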

Tests

  • test_average_precision_with_many_requests: verifies exact precision over 10K requests
  • test_total_fields_in_saved_file: verifies total_* fields in output, absence of internal sum_* fields
  • test_retry_deduplication: verifies retries (same cache_key) are detected and excluded from token stats
  • test_retry_deduplication_persists_across_reloads: verifies seen_cache_keys survive save/load cycle (chain job scenario)

All 34 tests pass (2 pre-existing flaky timing tests excluded).

@gchlebus requested review from a team as code owners on February 27, 2026 at 15:05
copy-pr-bot commented Feb 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Replace running average computation (which applied round(..., 2) at each
step, compounding error over thousands of requests) with exact sum-based
tracking. Averages are now computed as sum/count at save time.

Changes:
- Track exact sums (sum_prompt_tokens, sum_completion_tokens, etc.)
  internally alongside running averages
- Compute avg_* from sum/count instead of incremental running average
- Round avg_* to 2dp only at file-save time (not during accumulation)
- Add total_completion_tokens, total_prompt_tokens, total_total_tokens
  to eval_factory_metrics.json output for exact totals
- Backward compatible: loads old cached stats by back-computing sums
  from avg * successful_count

Impact:
For large evaluations (>10K requests), the old running average could
drift by 60+ tokens/request from the true mean. For example, on a
12K-request MMLU-Pro evaluation, avg_completion_tokens was 7776.98
vs the exact value of 7711.3 — a 0.85% error that compounds with
request count.

The new total_* fields enable consumers to compute exact averages
without relying on the rounded avg_* values.

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
When chain jobs requeue (e.g., due to Slurm TIMEOUT), previously
in-flight requests are re-sent. The response_stats interceptor now
detects these retries via cache_key (SHA256 of request content) and
excludes them from token stats to avoid inflating averages.

Observed impact: NeMoTron MMLU-Pro had count=12639 vs expected 12032
(607 retries), inflating avg_completion_tokens from 7711.3 to 7776.98.

Changes:
- Track seen cache_keys in memory and persist across allocations
- Retry requests still increment count (total API calls) but NOT
  successful_count or token sums
- Add retry_count field to eval_factory_metrics.json output
- Backward compatible: old cached stats without seen_cache_keys
  or retry_count are handled gracefully

Tests:
- test_retry_deduplication: verifies retries are detected and excluded
- test_retry_deduplication_persists_across_reloads: verifies the
  seen_cache_keys set survives cache save/load (chain job scenario)

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
@gchlebus force-pushed the gchlebus/fix/response-stats-precision branch from 7e3b5cb to 425aa1e on February 27, 2026 at 15:30
@pribalta (Collaborator) left a comment:

test

@pribalta (Collaborator) left a comment:
Power Review

Reviewed 2 files, left 7 suggestions: 1 error, 5 warnings, 1 suggestion.

Dimensions Analyzed

| Dimension       | 🔴 | ⚠️ | 💡 |
|-----------------|----|----|----|
| anti_patterns   | 0  | 1  | 0  |
| architecture    | 1  | 2  | 0  |
| testing-quality | 0  | 2  | 1  |
| Total           | 1  | 5  | 1  |

The diff brings change to eliminate floating-point error in ground,
With 7 things our review has found.
Machines can scan but miss the subtle thread,
A human touch ensures the code is read.


Automated review by Power Review — findings are advisory.

```python
output_dir=str(tmp_path), url="http://test.api.com/v1/chat/completions"
)

def make_response(req_id, cache_key, completion_tokens):
```

[Power Review] 🔴 Duplicate make_response helper defined identically in two test methods · 97% confidence

The architecture document flags 'Duplicated Utility Functions' as an ERROR: 'AI often regenerates logic that already exists in the codebase. Flag functions that duplicate existing helpers.' The make_response inner function at line 574 (inside test_retry_deduplication) and again at line 633 (inside test_retry_deduplication_persists_across_reloads) are byte-for-byte identical — same signature, same mock construction, same rctx.cache_key assignment, same return.

💡 Suggestion: Extract make_response to a shared static method on the enclosing test class, or to a module-level fixture/factory function. Both test methods can then call the shared helper. Example:

```python
@staticmethod
def _make_cache_keyed_response(req_id, cache_key, completion_tokens, tmp_path):
    ...
```

```python
output_dir=str(tmp_path), url="http://test.api.com/v1/chat/completions"
)

def make_response(req_id, cache_key, completion_tokens):
```

[Power Review] ⚠️ Identical make_response helper duplicated across two test methods · 97% confidence

The make_response nested function defined inside test_retry_deduplication (lines 574–588) is byte-for-byte identical to the one defined inside test_retry_deduplication_persists_across_reloads (lines 633–647). This matches the Copy-Paste Code Blocks anti-pattern: duplicate logic harms maintainability and means any future change (e.g., adding a new field to the mock payload) must be applied in both places.

💡 Suggestion: Extract make_response as a shared module-level helper or a @staticmethod / pytest.fixture on the enclosing test class. Both test methods can then call the single canonical implementation. Example:

```python
def _make_cache_keyed_response(req_id, cache_key, completion_tokens, tmp_path):
    mock_resp = Mock(spec=requests.Response)
    mock_resp.status_code = 200
    mock_resp.headers = {}
    mock_resp.json.return_value = {
        "usage": {
            "prompt_tokens": 50,
            "total_tokens": 50 + completion_tokens,
            "completion_tokens": completion_tokens,
        },
        "choices": [{"finish_reason": "stop", "message": {}}],
    }
    rctx = AdapterRequestContext(request_id=req_id)
    rctx.cache_key = cache_key
    return AdapterResponse(r=mock_resp, rctx=rctx, latency_ms=10.0)
```

```python
assert interceptor2._stats["successful_count"] == 3
assert interceptor2._stats["sum_completion_tokens"] == 600

def test_total_fields_in_saved_file(self, tmp_path):
```

[Power Review] ⚠️ test_total_fields_in_saved_file does not assert total_total_tokens or the absence of sum_total_tokens · 82% confidence

The PR description explicitly lists total_total_tokens as a new output field. The implementation adds it via stats[f'total_{token_type}'] = stats.pop(f'sum_{token_type}', 0) for token_type in ['prompt_tokens', 'total_tokens', 'completion_tokens']. The test asserts total_completion_tokens and total_prompt_tokens but omits total_total_tokens. It also verifies sum_completion_tokens and sum_prompt_tokens are absent from the output but silently skips sum_total_tokens. A test whose name implies complete coverage of saved-file fields should verify all new fields.

💡 Suggestion: Add assert rs['total_total_tokens'] == sum(50 + ct for ct in token_values) and assert 'sum_total_tokens' not in rs to the test body.

```python
status_codes[key] = value
aggregated_stats["status_codes"] = status_codes

# Backward compatibility: if cached stats lack sum_* fields (from
```

[Power Review] ⚠️ Backward-compatibility migration bolted inline into _load_aggregated_cached_stats (Geological Layers) · 78% confidence

The architecture document warns against 'Geological Layers: new code bolted on top of old code without integrating — duplicate data paths, parallel config mechanisms, wrapper-on-wrapper stacking.' Lines 225–257 add ~35 lines of multi-branch migration logic (back-computing sum_* from avg_*, handling total_*, ensuring retry_count) inline at the top of the existing load method rather than extracting it into a focused helper. This creates a layered accumulation of unrelated responsibilities in one method.

💡 Suggestion: Extract the migration block into a private method such as _migrate_legacy_cached_stats(aggregated_stats: dict) -> dict. _load_aggregated_cached_stats then calls it as a single delegation step, keeping the load method's intent readable:

```python
aggregated_stats = self._migrate_legacy_cached_stats(aggregated_stats)
```

```python
assert interceptor._stats["count"] == expected_total_responses
assert interceptor._stats["successful_count"] == expected_successful_responses

def test_average_precision_with_many_requests(self, tmp_path):
```

[Power Review] ⚠️ No test covers the backward-compatibility migration path for old cached stats · 72% confidence

The knowledge document warns against happy-path-only test suites: 'test invalid input, empty collections, null/None values, max values, and expected exceptions.' The implementation adds substantial migration logic (lines 225–265 in the interceptor) that back-computes sum_* fields from avg * successful_count when loading old-format cache files that lack sum_* keys. All four new tests create fresh interceptors or load newly-written cache data. None simulate loading a legacy cache file (one without sum_prompt_tokens, sum_completion_tokens, sum_latency_ms, or retry_count). A regression in that migration branch would go undetected.

💡 Suggestion: Add a test that manually writes a cache JSON file in the old format (containing only avg_* and successful_count, no sum_*), constructs a new ResponseStatsInterceptor pointing at that cache dir, and asserts that the loaded sum_* fields are correctly back-computed (e.g., sum_completion_tokens ≈ avg_completion_tokens * successful_count).

```python
return resp
status_code = resp.r.status_code

# Detect retries: if we've already counted a successful response for
```

[Power Review] ⚠️ intercept_response grows significantly beyond 50 lines with retry logic added inline · 72% confidence

The architecture document warns: 'Monolithic Functions (>50 lines): AI generates long, flat functions instead of composing smaller units.' The diff adds ~55 lines of new code to intercept_response (retry detection at lines 462–468, retry branch at 480–488, restructured non-retry block at 490–519). Even if the function was borderline before, these additions push it well past the ~50-line guideline and mix three distinct concerns: retry detection, retry accounting, and normal stats collection.

💡 Suggestion: Extract the retry detection and retry-path handling into a dedicated private method, e.g. _handle_retry_response(resp, cache_key, context). The main intercept_response orchestrates at a higher level:

```python
is_retry = self._is_retry(resp)
if is_retry:
    self._handle_retry(resp, cache_key)
else:
    self._handle_successful_response(resp, context, status_code)
```

```python
output_dir=str(tmp_path), url="http://test.api.com/v1/chat/completions"
)

def make_response(req_id, cache_key, completion_tokens):
```

[Power Review] 💡 Identical make_response helper duplicated across two test methods · 92% confidence

The knowledge document explicitly flags duplicated test setup as a testing quality issue: 'AI copies setup code into every test function instead of using fixtures or setUp methods. Instead: extract shared setup into fixtures or helper functions.' The make_response inner function at lines 574–588 is byte-for-byte identical to the one at lines 633–647.

💡 Suggestion: Extract make_response into a module-level helper function or a pytest fixture shared by both test_retry_deduplication and test_retry_deduplication_persists_across_reloads.
