
fix: eliminate floating-point error in response stats averages #783

Open

gchlebus wants to merge 2 commits into main from gchlebus/fix/response-stats-precision

Conversation

@gchlebus (Contributor) commented Feb 27, 2026

Problem

The ResponseStatsInterceptor has two issues that cause eval_factory_metrics.json to report incorrect token statistics:

1. Floating-point error from running averages

avg_completion_tokens is computed using a running average with round(..., 2) at each step. Over thousands of requests, the rounding error compounds.

Observed impact:

  • MMLU-Pro (12K requests): avg drifted by 65.7 tokens/request (0.85%)
  • LiveCodeBench (3.7K requests): avg drifted by 335.2 tokens/request (1.34%)
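
The drift is easy to reproduce in isolation. Below is a minimal sketch of the failure mode (illustrative values, not the interceptor's actual code): once the per-request increment falls below the 0.005 rounding threshold, a running average rounded at every step freezes short of the true mean.

```python
# Illustrative repro: a running average rounded to 2 decimal places at
# every step freezes, while the exact sum/count mean keeps converging.
values = [100] + [101] * 10_000

running_avg = 0.0
for n, x in enumerate(values, start=1):
    # Incremental mean update with per-step rounding (the buggy pattern).
    running_avg = round(running_avg + (x - running_avg) / n, 2)

exact_avg = sum(values) / len(values)

# running_avg freezes at 100.93; the exact mean is ~101.0.
print(running_avg, round(exact_avg, 2))
```

With token counts in the thousands, the same mechanism produces the tens-of-tokens drift reported above.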

2. Retried requests double-counted across chain job allocations

When Slurm chain jobs requeue (e.g., on TIMEOUT), previously in-flight requests are re-sent. The response_stats interceptor loads aggregated stats from the previous allocation and then counts these re-sent requests again, inflating both count and successful_count.

Observed impact:

  • NeMoTron MMLU-Pro: count=12639 vs expected 12032 (607 retries double-counted)
  • NeMoTron LiveCodeBench: count=3732 vs expected 3632 (100 retries double-counted)

Fix

Commit 1: Precision fix

  • Track exact sums (sum_prompt_tokens, sum_completion_tokens, etc.) internally
  • Compute averages as sum/count instead of incremental running average
  • Round to 2dp only at file-save time, not during accumulation
  • Add total_completion_tokens, total_prompt_tokens, total_total_tokens fields to output
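
In outline, the accumulation change looks like the following sketch (hypothetical helper names; field names taken from the PR description, not the actual interceptor code):

```python
# Sketch of sum-based accumulation: exact integer sums internally,
# rounding applied only when producing the saved view.
stats = {"successful_count": 0, "sum_prompt_tokens": 0, "sum_completion_tokens": 0}

def record_usage(usage: dict) -> None:
    # No rounding during accumulation; sums stay exact.
    stats["successful_count"] += 1
    stats["sum_prompt_tokens"] += usage["prompt_tokens"]
    stats["sum_completion_tokens"] += usage["completion_tokens"]

def saved_view() -> dict:
    # Averages computed as sum/count and rounded once, at save time;
    # exact sums are exposed as total_* fields.
    n = stats["successful_count"]
    return {
        "successful_count": n,
        "avg_completion_tokens": round(stats["sum_completion_tokens"] / n, 2) if n else 0.0,
        "total_prompt_tokens": stats["sum_prompt_tokens"],
        "total_completion_tokens": stats["sum_completion_tokens"],
    }
```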

Commit 2: Retry deduplication

  • Track seen cache_keys (SHA256 hash of request content) across allocations
  • When a response arrives for an already-seen cache_key, it's a retry:
    • Still increments count (total API calls made)
    • Does NOT increment successful_count or token sums
  • Add retry_count field to output
  • seen_cache_keys set is persisted in the interceptor state cache, surviving across chain job allocations
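
The dedup logic can be sketched as follows (illustrative only; the real interceptor persists the seen-key set in its state cache rather than module globals):

```python
import hashlib

# Sketch of cache_key-based retry deduplication across allocations.
seen_cache_keys: set[str] = set()
stats = {"count": 0, "successful_count": 0, "retry_count": 0, "sum_completion_tokens": 0}

def cache_key_for(request_body: bytes) -> str:
    # SHA256 of the request content identifies a logical request.
    return hashlib.sha256(request_body).hexdigest()

def record_response(cache_key: str, completion_tokens: int) -> None:
    stats["count"] += 1  # every API call made is counted
    if cache_key in seen_cache_keys:
        # Re-sent request after a requeue: count it as a retry only.
        stats["retry_count"] += 1
        return
    seen_cache_keys.add(cache_key)
    stats["successful_count"] += 1
    stats["sum_completion_tokens"] += completion_tokens
```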

Backward compatibility

  • Old cached stats without sum_* / retry_count / seen_cache_keys are automatically migrated
  • Existing field names and rounding preserved in output
  • New fields (total_*, retry_count) are additive
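
The migration path can be sketched as follows (hypothetical function name; the back-computation of sums from avg * successful_count follows the commit message):

```python
# Sketch of legacy-cache migration: back-compute missing sum_* fields
# from the rounded averages, and default retry_count to zero.
def migrate_legacy_stats(stats: dict) -> dict:
    n = stats.get("successful_count", 0)
    for token_type in ("prompt_tokens", "completion_tokens", "total_tokens"):
        if f"sum_{token_type}" not in stats:
            # Best-effort reconstruction; precision is limited by the
            # old 2-decimal rounding of the stored averages.
            avg = stats.get(f"avg_{token_type}", 0)
            stats[f"sum_{token_type}"] = round(avg * n)
    stats.setdefault("retry_count", 0)
    return stats
```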

Tests

  • test_average_precision_with_many_requests: verifies exact precision over 10K requests
  • test_total_fields_in_saved_file: verifies total_* fields in output, absence of internal sum_* fields
  • test_retry_deduplication: verifies retries (same cache_key) are detected and excluded from token stats
  • test_retry_deduplication_persists_across_reloads: verifies seen_cache_keys survive save/load cycle (chain job scenario)

All 34 tests pass (2 pre-existing flaky timing tests excluded).

@gchlebus requested review from a team as code owners on February 27, 2026 at 15:05
copy-pr-bot commented Feb 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Replace running average computation (which applied round(..., 2) at each
step, compounding error over thousands of requests) with exact sum-based
tracking. Averages are now computed as sum/count at save time.

Changes:
- Track exact sums (sum_prompt_tokens, sum_completion_tokens, etc.)
  internally alongside running averages
- Compute avg_* from sum/count instead of incremental running average
- Round avg_* to 2dp only at file-save time (not during accumulation)
- Add total_completion_tokens, total_prompt_tokens, total_total_tokens
  to eval_factory_metrics.json output for exact totals
- Backward compatible: loads old cached stats by back-computing sums
  from avg * successful_count

Impact:
For large evaluations (>10K requests), the old running average could
drift by 60+ tokens/request from the true mean. For example, on a
12K-request MMLU-Pro evaluation, avg_completion_tokens was 7776.98
vs the exact value of 7711.3 — a 0.85% error that compounds with
request count.

The new total_* fields enable consumers to compute exact averages
without relying on the rounded avg_* values.

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
When chain jobs requeue (e.g., due to Slurm TIMEOUT), previously
in-flight requests are re-sent. The response_stats interceptor now
detects these retries via cache_key (SHA256 of request content) and
excludes them from token stats to avoid inflating averages.

Observed impact: NeMoTron MMLU-Pro had count=12639 vs expected 12032
(607 retries), inflating avg_completion_tokens from 7711.3 to 7776.98.

Changes:
- Track seen cache_keys in memory and persist across allocations
- Retry requests still increment count (total API calls) but NOT
  successful_count or token sums
- Add retry_count field to eval_factory_metrics.json output
- Backward compatible: old cached stats without seen_cache_keys
  or retry_count are handled gracefully

Tests:
- test_retry_deduplication: verifies retries are detected and excluded
- test_retry_deduplication_persists_across_reloads: verifies the
  seen_cache_keys set survives cache save/load (chain job scenario)

Signed-off-by: Grzegorz Chlebus <gchlebus@nvidia.com>
@gchlebus force-pushed the gchlebus/fix/response-stats-precision branch from 7e3b5cb to 425aa1e on February 27, 2026 at 15:30
@pribalta (Collaborator) left a comment:

test

@pribalta (Collaborator) left a comment:
Power Review

Reviewed 2 files, left 7 suggestions: 1 error, 5 warnings, 1 suggestion.

Dimensions Analyzed

| Dimension       | 🔴 | ⚠️ | 💡 |
|-----------------|----|----|----|
| anti_patterns   | 0  | 1  | 0  |
| architecture    | 1  | 2  | 0  |
| testing-quality | 0  | 2  | 1  |
| Total           | 1  | 5  | 1  |

The diff brings change to eliminate floating-point error in ground,
With 7 things our review has found.
Machines can scan but miss the subtle thread,
A human touch ensures the code is read.


Automated review by Power Review — findings are advisory.

```python
output_dir=str(tmp_path), url="http://test.api.com/v1/chat/completions"
)

def make_response(req_id, cache_key, completion_tokens):
```

[Power Review] 🔴 Duplicate make_response helper defined identically in two test methods · 97% confidence

The architecture document flags 'Duplicated Utility Functions' as an ERROR: 'AI often regenerates logic that already exists in the codebase. Flag functions that duplicate existing helpers.' The make_response inner function at line 574 (inside test_retry_deduplication) and again at line 633 (inside test_retry_deduplication_persists_across_reloads) are byte-for-byte identical — same signature, same mock construction, same rctx.cache_key assignment, same return.

💡 Suggestion: Extract make_response to a shared static method on the enclosing test class, or to a module-level fixture/factory function. Both test methods can then call the shared helper. Example:

```python
@staticmethod
def _make_cache_keyed_response(req_id, cache_key, completion_tokens, tmp_path):
    ...
```

```python
output_dir=str(tmp_path), url="http://test.api.com/v1/chat/completions"
)

def make_response(req_id, cache_key, completion_tokens):
```

[Power Review] ⚠️ Identical make_response helper duplicated across two test methods · 97% confidence

The make_response nested function defined inside test_retry_deduplication (lines 574–588) is byte-for-byte identical to the one defined inside test_retry_deduplication_persists_across_reloads (lines 633–647). This matches the Copy-Paste Code Blocks anti-pattern: duplicate logic harms maintainability and means any future change (e.g., adding a new field to the mock payload) must be applied in both places.

💡 Suggestion: Extract make_response as a shared module-level helper or a @staticmethod / pytest.fixture on the enclosing test class. Both test methods can then call the single canonical implementation. Example:

```python
def _make_cache_keyed_response(req_id, cache_key, completion_tokens, tmp_path):
    mock_resp = Mock(spec=requests.Response)
    mock_resp.status_code = 200
    mock_resp.headers = {}
    mock_resp.json.return_value = {
        "usage": {
            "prompt_tokens": 50,
            "total_tokens": 50 + completion_tokens,
            "completion_tokens": completion_tokens,
        },
        "choices": [{"finish_reason": "stop", "message": {}}],
    }
    rctx = AdapterRequestContext(request_id=req_id)
    rctx.cache_key = cache_key
    return AdapterResponse(r=mock_resp, rctx=rctx, latency_ms=10.0)
```

```python
assert interceptor2._stats["successful_count"] == 3
assert interceptor2._stats["sum_completion_tokens"] == 600

def test_total_fields_in_saved_file(self, tmp_path):
```

[Power Review] ⚠️ test_total_fields_in_saved_file does not assert total_total_tokens or the absence of sum_total_tokens · 82% confidence

The PR description explicitly lists total_total_tokens as a new output field. The implementation adds it via stats[f'total_{token_type}'] = stats.pop(f'sum_{token_type}', 0) for token_type in ['prompt_tokens', 'total_tokens', 'completion_tokens']. The test asserts total_completion_tokens and total_prompt_tokens but omits total_total_tokens. It also verifies sum_completion_tokens and sum_prompt_tokens are absent from the output but silently skips sum_total_tokens. A test whose name implies complete coverage of saved-file fields should verify all new fields.

💡 Suggestion: Add assert rs['total_total_tokens'] == sum(50 + ct for ct in token_values) and assert 'sum_total_tokens' not in rs to the test body.

```python
status_codes[key] = value
aggregated_stats["status_codes"] = status_codes

# Backward compatibility: if cached stats lack sum_* fields (from
```

[Power Review] ⚠️ Backward-compatibility migration bolted inline into _load_aggregated_cached_stats (Geological Layers) · 78% confidence

The architecture document warns against 'Geological Layers: new code bolted on top of old code without integrating — duplicate data paths, parallel config mechanisms, wrapper-on-wrapper stacking.' Lines 225–257 add ~35 lines of multi-branch migration logic (back-computing sum_* from avg_*, handling total_*, ensuring retry_count) inline at the top of the existing load method rather than extracting it into a focused helper. This creates a layered accumulation of unrelated responsibilities in one method.

💡 Suggestion: Extract the migration block into a private method such as _migrate_legacy_cached_stats(aggregated_stats: dict) -> dict. _load_aggregated_cached_stats then calls it as a single delegation step, keeping the load method's intent readable:

```python
aggregated_stats = self._migrate_legacy_cached_stats(aggregated_stats)
```

```python
assert interceptor._stats["count"] == expected_total_responses
assert interceptor._stats["successful_count"] == expected_successful_responses

def test_average_precision_with_many_requests(self, tmp_path):
```

[Power Review] ⚠️ No test covers the backward-compatibility migration path for old cached stats · 72% confidence

The knowledge document warns against happy-path-only test suites: 'test invalid input, empty collections, null/None values, max values, and expected exceptions.' The implementation adds substantial migration logic (lines 225–265 in the interceptor) that back-computes sum_* fields from avg * successful_count when loading old-format cache files that lack sum_* keys. All four new tests create fresh interceptors or load newly-written cache data. None simulate loading a legacy cache file (one without sum_prompt_tokens, sum_completion_tokens, sum_latency_ms, or retry_count). A regression in that migration branch would go undetected.

💡 Suggestion: Add a test that manually writes a cache JSON file in the old format (containing only avg_* and successful_count, no sum_*), constructs a new ResponseStatsInterceptor pointing at that cache dir, and asserts that the loaded sum_* fields are correctly back-computed (e.g., sum_completion_tokens ≈ avg_completion_tokens * successful_count).

```python
return resp
status_code = resp.r.status_code

# Detect retries: if we've already counted a successful response for
```

[Power Review] ⚠️ intercept_response grows significantly beyond 50 lines with retry logic added inline · 72% confidence

The architecture document warns: 'Monolithic Functions (>50 lines): AI generates long, flat functions instead of composing smaller units.' The diff adds ~55 lines of new code to intercept_response (retry detection at lines 462–468, retry branch at 480–488, restructured non-retry block at 490–519). Even if the function was borderline before, these additions push it well past the ~50-line guideline and mix three distinct concerns: retry detection, retry accounting, and normal stats collection.

💡 Suggestion: Extract the retry detection and retry-path handling into a dedicated private method, e.g. _handle_retry_response(resp, cache_key, context). The main intercept_response orchestrates at a higher level:

```python
is_retry = self._is_retry(resp)
if is_retry:
    self._handle_retry(resp, cache_key)
else:
    self._handle_successful_response(resp, context, status_code)
```

```python
output_dir=str(tmp_path), url="http://test.api.com/v1/chat/completions"
)

def make_response(req_id, cache_key, completion_tokens):
```

[Power Review] 💡 Identical make_response helper duplicated across two test methods · 92% confidence

The knowledge document explicitly flags duplicated test setup as a testing quality issue: 'AI copies setup code into every test function instead of using fixtures or setUp methods. Instead: extract shared setup into fixtures or helper functions.' The make_response inner function at lines 574–588 is byte-for-byte identical to the one at lines 633–647.

💡 Suggestion: Extract make_response into a module-level helper function or a pytest fixture shared by both test_retry_deduplication and test_retry_deduplication_persists_across_reloads.
