fix(ccr): store pre-protection original, not tag placeholder, in CCR#1208
Open
yulin0629 wants to merge 1 commit into
Open
fix(ccr): store pre-protection original, not tag placeholder, in CCR#1208yulin0629 wants to merge 1 commit into
yulin0629 wants to merge 1 commit into
Conversation
ContentRouter protects custom tags (e.g. <system-reminder>) into
{{HEADROOM_TAG_N}} placeholders before invoking Kompress. The Kompress
CCR store persisted that placeholdered intermediate as original_content,
so a later full retrieve returned {{HEADROOM_TAG_0}} and the protected
tag block was lost — even though restore_tags correctly fixed the
immediate upstream output.
Pass the pre-protection content as ccr_original through
compress()/compress_batch() so CCR stores the real source text; the
model still receives the placeholdered text. Covers all four CCR store
sites (inline compress, single-content batch delegation, batch
sequential fallback, batch GPU path). Router only adds the kwarg when
tags were actually protected, keeping direct callers unaffected.
AI-Agent: Claude Code
AI-Session-IDs: d065f19d-7b35-45df-8adb-cf42f1bcdd10
Contributor
PR governanceThis PR follows the template and is marked ready for human review. |
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes a CCR data corruption edge case where Kompress inputs that were tag-protected (e.g. <system-reminder>…</system-reminder> → {{HEADROOM_TAG_N}}) could cause CCR to persist the placeholder intermediate as original_content, breaking later full retrieval/expansion.
Changes:
- Thread the pre-tag-protection source through the Kompress API (
ccr_original/ccr_originals) so CCR stores the real original text while the model still sees placeholders. - Update Kompress CCR store sites (single + batch paths) to store
ccr_original*when provided, and recompute stored-token counts from the stored text. - Add a regression test suite to pin router→compressor plumbing and CCR store-site behavior without loading the full model.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
headroom/transforms/content_router.py |
Passes ccr_original=content only when tag protection is active to keep CCR originals lossless. |
headroom/transforms/kompress_compressor.py |
Adds ccr_original* parameters and uses them at CCR store sites to avoid persisting tag placeholders as originals. |
tests/test_ccr_tag_placeholder_regression.py |
New regression coverage for router forwarding, API validation, and CCR store-site correctness. |
Comments suppressed due to low confidence (2)
headroom/transforms/kompress_compressor.py:988
- When
ccr_originalis provided, CCR storesccr_source(and you computeccr_source_tokens), but the retrieval marker still reportsn_wordsfrom the placeholdercontent. This makes the marker’s "N items" inconsistent with what will be returned on full retrieval and can skew downstream marker-based accounting.
result.compressed += (
f"\n[{n_words} items compressed to {compressed_count}."
f" Retrieve more: hash={cache_key}]"
)
headroom/transforms/kompress_compressor.py:1265
- Same issue on the batched
compress_batch()CCR store path: whenccr_originalsis used, the marker still reportsn_wordsfrom the placeholdercontent, not the storedccr_sourcetoken count. The marker should reflect the size of the stored original to keep retrieval/metrics consistent.
result.compressed += (
f"\n[{n_words} items compressed to {compressed_count}."
f" Retrieve more: hash={cache_key}]"
)
Comment on lines
+1
to
+5
| """Regression: CCR must store the pre-protection original, not the | ||
| ``{{HEADROOM_TAG_N}}`` placeholder intermediate, for tag-protected Kompress inputs. | ||
|
|
||
| Before the fix, ``ContentRouter._try_ml_compressor`` passed the tag-protected | ||
| (placeholdered) text into ``KompressCompressor.compress`` without an original, so |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
When
ContentRouterprotects custom tags (e.g.<system-reminder>) into{{HEADROOM_TAG_N}}placeholders before invoking Kompress, CCR can persist the protected placeholder intermediate as the entry'soriginal_contentinstead of the pre-protection source text. A later full retrieve (or proactive expansion / model-initiated retrieve) of such an entry then returns{{HEADROOM_TAG_0}}and the real protected block is lost from the retrieval path. The immediate upstream request is unaffected —restore_tagscorrectly restores the compressed output before it goes upstream; the confirmed corruption is in CCR storage and only surfaces on later retrieval/expansion.This threads the pre-protection
contentthrough asccr_originalso CCR stores the real source text while the model still sees the placeholdered text.Closes #1209
Type of Change
Changes Made
headroom/transforms/content_router.py:_try_ml_compressorpassesccr_original=contenttocompressor.compress(...)only when tags were actually protected (untagged callers keep the historic call shape — backward compatible).headroom/transforms/kompress_compressor.py:compress()gains accr_originalkwarg;compress_batch()gains a per-itemccr_originalslist (validated againstlen(contents)).ccr_originalwhen present, elsecontent: inlinecompress(), single-contentcompress()→compress_batchdelegation,compress_batchsequential fallback, andcompress_batchbatched/GPU path. The stored original's token count is recomputed from the stored text.tests/test_ccr_tag_placeholder_regression.py(new, 5 tests): router boundary forwarding, untagged backward-compat,ccr_originalslength validation, and two store-site tests driving the realcompress()/ batchedcompress_batch()all the way to_store_in_ccr(a tiny fake model stands in for the 274MB ModernBERT).Testing
pytest)ruff check .)mypy headroom)Test Output
Fail-before / pass-after was confirmed against a freshly built Rust
_core: with the fix reverted the new tests fail (router forwards noccr_original→None/placeholder reaches the store;compress_batchrejects the unknownccr_originalskwarg withTypeError); with the fix applied all 5 pass. The surrounding kompress/ccr/router suites stay green (8 unrelated failures are pre-existing — identical with the patch stashed — from missing optional test deps such aspytest-asyncio, not caused by this change).Real Behavior Proof
_coreviamaturin develop, pytest 9.1.1.maturin developto build_core, thenpython -m pytest tests/test_ccr_tag_placeholder_regression.py -q.Nonereaches_store_in_ccr;compress_batchrejectsccr_originals).ruff/mypynot run locally.Review Readiness
Checklist
Additional Notes
Docs/CHANGELOG unchanged: this is an internal CCR correctness fix with no public API or user-facing behavior change beyond correct full-retrieve content.
ruff/mypywere not run in the local build environment.