
fix: stabilize recording hash normalization to reduce flaky integration tests#5252

Closed
leseb wants to merge 2 commits into llamastack:main from leseb:fix-recording-hash-normalization

Conversation

@leseb
Collaborator

@leseb leseb commented Mar 23, 2026

Summary

  • Enhance _normalize_body_for_hash() in api_recorder.py to produce stable hashes across semantically equivalent but structurally different request bodies
  • Delete 22 stale recordings for test_mcp_invocation that accumulated from hash drift — they will regenerate on the next CI run

Problem

test_mcp_invocation fails ~17% of the time in CI (docker, ollama, base) because the recording/replay hash changes as the code evolves. The same test had accumulated 19-22 recordings, each capturing a request-body variant that is semantically identical to the others.

Normalizations added

| Field | Before | After |
| --- | --- | --- |
| `max_tokens` | `None` vs `0` produce different hashes | Both dropped from hash |
| `tool_choice` | `None` vs `"auto"` produce different hashes | Both dropped from hash |
| Message content | `[{"type": "text", "text": "X"}]` vs `"X"` | Collapsed to `"X"` |
| Tool call IDs | `call_c1tlwvxc` vs `call_oezek4up` | Replaced with stable placeholder |
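The rules above might look roughly like this (a sketch only; the actual `_normalize_body_for_hash()` in `api_recorder.py` may differ in structure and field coverage):

```python
import copy
import hashlib
import json


def _normalize_body_for_hash(body: dict) -> dict:
    """Sketch of the normalization rules in the table above."""
    body = copy.deepcopy(body)  # never mutate the caller's request

    # max_tokens: None and 0 are equivalent -> drop the field entirely
    if body.get("max_tokens") in (None, 0):
        body.pop("max_tokens", None)

    # tool_choice: None and "auto" are equivalent -> drop the field entirely
    if body.get("tool_choice") in (None, "auto"):
        body.pop("tool_choice", None)

    for msg in body.get("messages", []):
        # Collapse [{"type": "text", "text": "X"}] to the plain string "X"
        content = msg.get("content")
        if (
            isinstance(content, list)
            and len(content) == 1
            and isinstance(content[0], dict)
            and content[0].get("type") == "text"
        ):
            msg["content"] = content[0]["text"]

        # Replace random call_xxx IDs with a stable, position-based placeholder
        for i, call in enumerate(msg.get("tool_calls") or []):
            call["id"] = f"call_{i}"
        if msg.get("tool_call_id"):
            msg["tool_call_id"] = "call_0"

    return body


def request_hash(body: dict) -> str:
    # Canonical JSON (sorted keys) so key order never affects the hash
    canonical = json.dumps(_normalize_body_for_hash(body), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
```

With this, two structurally different but semantically equivalent bodies, such as `{"max_tokens": None, "messages": [{"role": "user", "content": [{"type": "text", "text": "hi"}]}]}` and `{"max_tokens": 0, "messages": [{"role": "user", "content": "hi"}]}`, hash identically.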

Test plan

  • 10 new unit tests covering each normalization
  • All existing recording tests pass
  • CI

Signed-off-by: Sebastien Han <shan@redhat.com>

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 23, 2026
@leseb
Collaborator Author

leseb commented Mar 23, 2026

This is an attempt to reduce the flakiness of Integration Tests (docker, ollama, 3.12, client=latest, base).

The test_mcp_invocation test was flaky (~17% failure rate) because the
recording/replay hash changed across code versions due to semantically
equivalent but structurally different request bodies.

Add normalizations to _normalize_body_for_hash() for:
- max_tokens: treat None and 0 as equivalent (both dropped from hash)
- tool_choice: treat None and "auto" as equivalent (both dropped)
- Message content: collapse [{type: text, text: X}] to plain string "X"
- Tool call IDs: replace random call_xxx IDs with a stable placeholder

Delete stale recordings for test_mcp_invocation so they will be
re-generated on the next CI run with record-if-missing mode.

Add unit tests covering each normalization rule and a combined test
verifying that two structurally different but semantically equivalent
request bodies produce the same hash.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
@leseb leseb force-pushed the fix-recording-hash-normalization branch from 44a465a to 4c1de12 Compare March 23, 2026 15:07
Re-hash all 2625 integration test recordings using the new
normalization in _normalize_body_for_hash(). This collapses
927 duplicate recordings that differed only in semantically
equivalent fields (max_tokens null vs 0, tool_choice null vs
auto, content format, tool call IDs).

2625 recordings → 1698 after deduplication.

Signed-off-by: Sébastien Han <seb@redhat.com>
@github-actions
Contributor

Recording workflow finished with status: failure

Providers: gpt, azure, watsonx

Recording attempt finished. Check the workflow run for details.

View workflow run

Fork PR: Recordings will be committed if you have "Allow edits from maintainers" enabled.

@iamemilio
Contributor

any correlation between this: #5233 ?

@leseb
Collaborator Author

leseb commented Mar 24, 2026

> any correlation between this: #5233 ?

yes

@leseb leseb closed this Mar 24, 2026