fix: stabilize recording hash normalization to reduce flaky integration tests #5252
Closed
leseb wants to merge 2 commits into llamastack:main from
Conversation
Collaborator
Author
this is an attempt to reduce flakiness of
The test_mcp_invocation test was flaky (~17% failure rate) because the
recording/replay hash changed across code versions due to semantically
equivalent but structurally different request bodies.
Add normalizations to _normalize_body_for_hash() for:
- max_tokens: treat None and 0 as equivalent (both dropped from hash)
- tool_choice: treat None and "auto" as equivalent (both dropped)
- Message content: collapse [{type: text, text: X}] to plain string "X"
- Tool call IDs: replace random call_xxx IDs with a stable placeholder
Delete stale recordings for test_mcp_invocation so they will be
re-generated on the next CI run with record-if-missing mode.
Add unit tests covering each normalization rule and a combined test
verifying that two structurally different but semantically equivalent
request bodies produce the same hash.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
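The four rules above can be sketched roughly as follows. This is an illustrative approximation, not the actual llama-stack implementation; the function name mirrors _normalize_body_for_hash() from the description, but field handling and the placeholder value are assumptions.

```python
import copy
import hashlib
import json


def normalize_body_for_hash(body: dict) -> dict:
    """Sketch of the normalization rules described above (names assumed)."""
    b = copy.deepcopy(body)
    # max_tokens: treat None and 0 as equivalent -> drop from the hash
    if b.get("max_tokens") in (None, 0):
        b.pop("max_tokens", None)
    # tool_choice: treat None and "auto" as equivalent -> drop from the hash
    if b.get("tool_choice") in (None, "auto"):
        b.pop("tool_choice", None)
    for msg in b.get("messages", []):
        content = msg.get("content")
        # collapse [{"type": "text", "text": X}] to the plain string X
        if (isinstance(content, list) and len(content) == 1
                and isinstance(content[0], dict)
                and content[0].get("type") == "text"):
            msg["content"] = content[0]["text"]
        # replace random call_xxx IDs with a stable placeholder
        if msg.get("tool_call_id"):
            msg["tool_call_id"] = "__call_id__"
        for tc in msg.get("tool_calls") or []:
            tc["id"] = "__call_id__"
    return b


def request_hash(body: dict) -> str:
    """Hash the normalized body so equivalent requests collide on purpose."""
    canonical = json.dumps(normalize_body_for_hash(body), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
```

With these rules, a body carrying max_tokens=None, tool_choice="auto", and list-form text content hashes identically to one carrying max_tokens=0 and plain-string content.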
44a465a to 4c1de12
Re-hash all 2625 integration test recordings using the new normalization in _normalize_body_for_hash(). This collapses 927 duplicate recordings that differed only in semantically equivalent fields (max_tokens null vs 0, tool_choice null vs auto, content format, tool call IDs). 2625 recordings → 1698 after deduplication.

Signed-off-by: Sébastien Han <seb@redhat.com>
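The re-hash-and-deduplicate pass can be sketched as below. The recording layout (a list of dicts with a "request" key) and the subset of normalization rules shown are assumptions for illustration, not the project's actual storage format.

```python
import hashlib
import json


def normalize(body: dict) -> dict:
    # Assumed subset of the normalization rules: drop fields whose
    # default-equivalent values should not affect the hash.
    b = dict(body)
    if b.get("max_tokens") in (None, 0):
        b.pop("max_tokens", None)
    if b.get("tool_choice") in (None, "auto"):
        b.pop("tool_choice", None)
    return b


def rehash_and_dedupe(recordings: list) -> dict:
    # Recompute each recording's hash under the new normalization and
    # keep one recording per hash, collapsing equivalent duplicates.
    kept = {}
    for rec in recordings:
        canonical = json.dumps(normalize(rec["request"]), sort_keys=True)
        h = hashlib.sha256(canonical.encode()).hexdigest()
        kept.setdefault(h, rec)
    return kept
```

Applied to the full corpus, a pass like this is what takes 2625 recordings down to 1698: equivalent variants now hash to the same key, so only the first survivor per key is kept.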
Contributor
Recording workflow finished with status: failure. Providers: gpt, azure, watsonx. Recording attempt finished; check the workflow run for details. Fork PR: recordings will be committed if you have "Allow edits from maintainers" enabled.
Contributor
Any correlation with #5233?
Collaborator
Author
yes
Summary
- Normalize request bodies in _normalize_body_for_hash() in api_recorder.py to produce stable hashes across semantically equivalent but structurally different request bodies
- Delete stale recordings for test_mcp_invocation that accumulated from hash drift; they will regenerate on the next CI run

Problem
test_mcp_invocation fails ~17% of the time in CI (docker, ollama, base) because the recording/replay hash changes when code evolves. The same test had 19-22 different recordings representing different request body variants that are semantically identical.

Normalizations added
- max_tokens: None vs 0 produce different hashes
- tool_choice: None vs "auto" produce different hashes
- Message content: [{"type": "text", "text": "X"}] vs "X" both normalize to "X"
- Tool call IDs: call_c1tlwvxc vs call_oezek4up (random per run)

Test plan
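The combined unit test described in the commit message ("two structurally different but semantically equivalent request bodies produce the same hash") can be sketched for the tool-call-ID rule alone like this; the regex, placeholder, and helper name are illustrative assumptions:

```python
import hashlib
import json
import re

# Assumed pattern for the randomly generated tool call IDs (call_xxx).
CALL_ID = re.compile(r"call_[a-z0-9]+")


def stable_hash(body: dict) -> str:
    # Serialize deterministically, mask the random IDs, then hash.
    text = json.dumps(body, sort_keys=True)
    return hashlib.sha256(CALL_ID.sub("call_<id>", text).encode()).hexdigest()
```

Two replies that differ only in their randomly assigned call IDs now map to the same recording key, while any real content difference still changes the hash.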
Signed-off-by: Sebastien Han <shan@redhat.com>