fix(ccr): store pre-protection original, not tag placeholder, in CCR by yulin0629 · Pull Request #1208 · chopratejas/headroom

yulin0629 · 2026-06-20T19:15:52Z

Description

When ContentRouter protects custom tags (e.g. <system-reminder>) into {{HEADROOM_TAG_N}} placeholders before invoking Kompress, CCR can persist the protected placeholder intermediate as the entry's original_content instead of the pre-protection source text. A later full retrieve (or proactive expansion / model-initiated retrieve) of such an entry then returns {{HEADROOM_TAG_0}} and the real protected block is lost from the retrieval path. The immediate upstream request is unaffected — restore_tags correctly restores the compressed output before it goes upstream; the confirmed corruption is in CCR storage and only surfaces on later retrieval/expansion.

This threads the pre-protection content through as ccr_original so CCR stores the real source text while the model still sees the placeholdered text.
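To make the failure mode concrete, here is a minimal, self-contained sketch of the before/after behavior. It is not the Headroom implementation: `protect_tags`, the toy `compress`, and the dict-based `ccr_store` are hypothetical stand-ins, and only the names `restore_tags`, `ccr_original`, and the `{{HEADROOM_TAG_N}}` placeholder shape come from this PR.

```python
import re

# Hypothetical pattern for one protected custom tag.
TAG_RE = re.compile(r"<system-reminder>.*?</system-reminder>", re.DOTALL)


def protect_tags(text):
    """Swap each protected block for a {{HEADROOM_TAG_N}} placeholder."""
    saved = []

    def _swap(match):
        saved.append(match.group(0))
        return "{{HEADROOM_TAG_%d}}" % (len(saved) - 1)

    return TAG_RE.sub(_swap, text), saved


def restore_tags(text, saved):
    """Put the protected blocks back in place of their placeholders."""
    for i, block in enumerate(saved):
        text = text.replace("{{HEADROOM_TAG_%d}}" % i, block)
    return text


ccr_store = {}  # toy stand-in for the CCR store


def compress(content, ccr_original=None):
    """Toy compressor: keep the first three words, persist an original."""
    compressed = " ".join(content.split()[:3])
    # Store the pre-protection text when the caller supplies it.
    ccr_store["key"] = ccr_original if ccr_original is not None else content
    return compressed


source = "<system-reminder>be careful</system-reminder> a long tool output here"
protected, saved = protect_tags(source)

compress(protected)  # old call shape: the placeholder text gets stored
buggy_original = ccr_store["key"]

out = compress(protected, ccr_original=source)  # new call shape
fixed_original = ccr_store["key"]
upstream = restore_tags(out, saved)  # the upstream output was always correct
```

In this sketch `buggy_original` still contains `{{HEADROOM_TAG_0}}`, while `fixed_original` equals `source`; `upstream` is restored correctly in both cases, which matches the observation that only later retrieval was affected.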

Closes #1209

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Performance improvement
Code refactoring (no functional changes)

Changes Made

headroom/transforms/content_router.py: _try_ml_compressor passes ccr_original=content to compressor.compress(...) only when tags were actually protected (untagged callers keep the historic call shape — backward compatible).
headroom/transforms/kompress_compressor.py: compress() gains a ccr_original kwarg; compress_batch() gains a per-item ccr_originals list (validated against len(contents)).
All four CCR store sites store ccr_original when present, else content: inline compress(), single-content compress()→compress_batch delegation, compress_batch sequential fallback, and compress_batch batched/GPU path. The stored original's token count is recomputed from the stored text.
tests/test_ccr_tag_placeholder_regression.py (new, 5 tests): router boundary forwarding, untagged backward-compat, ccr_originals length validation, and two store-site tests driving the real compress() / batched compress_batch() all the way to _store_in_ccr (a tiny fake model stands in for the 274MB ModernBERT).
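The store-site rule described above (store `ccr_original` when present, else `content`, and recompute the token count from whatever is stored) can be sketched as follows. This is an illustrative outline under stated assumptions, not the code in `kompress_compressor.py`: the whitespace tokenizer, the `store_in_ccr` helper, the dict-shaped store, and the choice of `ValueError` for the length check are all hypothetical.

```python
from typing import Optional


def count_tokens(text: str) -> int:
    # Hypothetical tokenizer stand-in: whitespace split.
    return len(text.split())


def store_in_ccr(store: dict, original: str, compressed: str) -> str:
    """Persist one entry; the token count is derived from the stored text."""
    key = f"k{len(store)}"
    store[key] = {
        "original_content": original,
        "original_tokens": count_tokens(original),
        "compressed": compressed,
    }
    return key


def compress_batch(
    contents: list[str],
    store: dict,
    ccr_originals: Optional[list[Optional[str]]] = None,
) -> list[str]:
    # Per-item originals must line up one-to-one with contents.
    # (The exception type is an assumption; the PR only says "validated".)
    if ccr_originals is not None and len(ccr_originals) != len(contents):
        raise ValueError("ccr_originals must have the same length as contents")

    keys = []
    for i, content in enumerate(contents):
        compressed = " ".join(content.split()[:2])  # toy compression
        override = ccr_originals[i] if ccr_originals is not None else None
        # Store the pre-protection text when present, else the content itself.
        ccr_source = override if override is not None else content
        keys.append(store_in_ccr(store, ccr_source, compressed))
    return keys
```

Keeping the override optional per item means untagged items in a mixed batch fall back to their own content, which mirrors the backward-compatible call shape for untagged callers.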

Testing

Unit tests pass (pytest)
Linting passes (ruff check .)
Type checking passes (mypy headroom)
New tests added for new functionality
Manual testing performed

Test Output

$ .venv/bin/python -m pytest tests/test_ccr_tag_placeholder_regression.py -q
============================= test session starts ==============================
platform darwin -- Python 3.12.12, pytest-9.1.1, pluggy-1.6.0
rootdir: /Users/.../headroom.worktrees/ccr-tag-placeholder
configfile: pyproject.toml
plugins: anyio-4.14.0
collected 5 items

tests/test_ccr_tag_placeholder_regression.py .....                       [100%]

========================= 5 passed, 1 warning in 0.15s =========================

Fail-before / pass-after was confirmed against a freshly built Rust _core: with the fix reverted the new tests fail (router forwards no ccr_original → None/placeholder reaches the store; compress_batch rejects the unknown ccr_originals kwarg with TypeError); with the fix applied all 5 pass. The surrounding kompress/ccr/router suites stay green (8 unrelated failures are pre-existing — identical with the patch stashed — from missing optional test deps such as pytest-asyncio, not caused by this change).

Real Behavior Proof

Environment: macOS (darwin), Python 3.12.12, locally built Rust _core via maturin develop, pytest 9.1.1.
Exact command / steps: maturin develop to build _core, then python -m pytest tests/test_ccr_tag_placeholder_regression.py -q.
Observed result: 5 passed with the fix applied; the same suite fails before the fix (placeholder/None reaches _store_in_ccr; compress_batch rejects ccr_originals).
Not tested: end-to-end live proxy full-retrieve against a 274MB ModernBERT model (tests use a fake model to keep them deterministic and offline); ruff/mypy not run locally.

Note: this fixes new CCR writes. Pre-existing entries written before the fix keep their placeholder original_content until they expire.
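For anyone who wants to estimate how many pre-fix entries are affected, a rough check is to look for a residual placeholder in the stored original. The sketch below assumes that `N` in `{{HEADROOM_TAG_N}}` is a decimal integer and that entries can be iterated as key/text pairs; both are assumptions, and the `entries` dict is made-up sample data rather than real CCR contents.

```python
import re

# Assumes the placeholder index is a decimal integer.
PLACEHOLDER_RE = re.compile(r"\{\{HEADROOM_TAG_\d+\}\}")


def has_stale_placeholder(original_content: str) -> bool:
    """True when a stored original still contains a tag placeholder."""
    return PLACEHOLDER_RE.search(original_content) is not None


# Made-up sample entries: key -> stored original_content.
entries = {
    "a1": "{{HEADROOM_TAG_0}} tool output written before the fix",
    "b2": "<system-reminder>note</system-reminder> written after the fix",
}
stale = sorted(key for key, text in entries.items() if has_stale_placeholder(text))
```

This is a heuristic only: an original that legitimately quotes the placeholder text (for example, a copy of this PR description) would also match.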

Review Readiness

I have performed a self-review
This PR is ready for human review

Checklist

My code follows the project's style guidelines
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have updated the CHANGELOG.md if applicable

Additional Notes

Docs/CHANGELOG unchanged: this is an internal CCR correctness fix with no public API or user-facing behavior change beyond correct full-retrieve content. ruff/mypy were not run in the local build environment.

ContentRouter protects custom tags (e.g. <system-reminder>) into {{HEADROOM_TAG_N}} placeholders before invoking Kompress. The Kompress CCR store persisted that placeholdered intermediate as original_content, so a later full retrieve returned {{HEADROOM_TAG_0}} and the protected tag block was lost, even though restore_tags correctly fixed the immediate upstream output.

Pass the pre-protection content as ccr_original through compress()/compress_batch() so CCR stores the real source text; the model still receives the placeholdered text. Covers all four CCR store sites (inline compress, single-content batch delegation, batch sequential fallback, batch GPU path). Router only adds the kwarg when tags were actually protected, keeping direct callers unaffected.

AI-Agent: Claude Code
AI-Session-IDs: d065f19d-7b35-45df-8adb-cf42f1bcdd10

github-actions · 2026-06-20T19:16:06Z

PR governance

This PR follows the template and is marked ready for human review.

Copilot

Pull request overview

This PR fixes a CCR data corruption edge case where Kompress inputs that were tag-protected (e.g. <system-reminder>…</system-reminder> → {{HEADROOM_TAG_N}}) could cause CCR to persist the placeholder intermediate as original_content, breaking later full retrieval/expansion.

Changes:

Thread the pre-tag-protection source through the Kompress API (ccr_original / ccr_originals) so CCR stores the real original text while the model still sees placeholders.
Update Kompress CCR store sites (single + batch paths) to store ccr_original* when provided, and recompute stored-token counts from the stored text.
Add a regression test suite to pin router→compressor plumbing and CCR store-site behavior without loading the full model.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `headroom/transforms/content_router.py` | Passes `ccr_original=content` only when tag protection is active to keep CCR originals lossless. |
| `headroom/transforms/kompress_compressor.py` | Adds `ccr_original*` parameters and uses them at CCR store sites to avoid persisting tag placeholders as originals. |
| `tests/test_ccr_tag_placeholder_regression.py` | New regression coverage for router forwarding, API validation, and CCR store-site correctness. |

Comments suppressed due to low confidence (2)

headroom/transforms/kompress_compressor.py:988

When ccr_original is provided, CCR stores ccr_source (and you compute ccr_source_tokens), but the retrieval marker still reports n_words from the placeholder content. This makes the marker’s "N items" inconsistent with what will be returned on full retrieval and can skew downstream marker-based accounting.

                    result.compressed += (
                        f"\n[{n_words} items compressed to {compressed_count}."
                        f" Retrieve more: hash={cache_key}]"
                    )

headroom/transforms/kompress_compressor.py:1265

Same issue on the batched compress_batch() CCR store path: when ccr_originals is used, the marker still reports n_words from the placeholder content, not the stored ccr_source token count. The marker should reflect the size of the stored original to keep retrieval/metrics consistent.

                    result.compressed += (
                        f"\n[{n_words} items compressed to {compressed_count}."
                        f" Retrieve more: hash={cache_key}]"
                    )
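One possible way to address both suppressed comments is to derive the marker's item count from the text that is actually stored. The sketch below is a hypothetical helper, not part of this PR: `build_marker` and its parameters are illustrative names, and counting items by whitespace split is an assumption about how `n_words` is computed.

```python
from typing import Optional


def build_marker(
    content: str,
    compressed_count: int,
    cache_key: str,
    ccr_original: Optional[str] = None,
) -> str:
    """Build the retrieval marker from the text that CCR will return."""
    # Count against the stored original, not the placeholdered content.
    ccr_source = ccr_original if ccr_original is not None else content
    n_words = len(ccr_source.split())  # assumption: items are whitespace words
    return (
        f"\n[{n_words} items compressed to {compressed_count}."
        f" Retrieve more: hash={cache_key}]"
    )


placeholdered = "{{HEADROOM_TAG_0}} tail"
original = "<system-reminder>a b c</system-reminder> tail"
marker_before = build_marker(placeholdered, 1, "abc123")
marker_after = build_marker(placeholdered, 1, "abc123", ccr_original=original)
```

With these inputs the first marker reports 2 items and the second reports 4, so the second count agrees with what a full retrieve of the stored original would return.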

+"""Regression: CCR must store the pre-protection original, not the
+``{{HEADROOM_TAG_N}}`` placeholder intermediate, for tag-protected Kompress inputs.
+
+Before the fix, ``ContentRouter._try_ml_compressor`` passed the tag-protected
+(placeholdered) text into ``KompressCompressor.compress`` without an original, so


Copilot AI review requested due to automatic review settings, June 20, 2026 19:15

github-actions bot added the "status: needs author action" label (pull request body or readiness checklist still needs author updates), Jun 20, 2026

Copilot started reviewing on behalf of yulin0629, June 20, 2026 19:16

yulin0629 mentioned this pull request, Jun 20, 2026, in issue #1209 (Open): CCR full retrieve returns tag placeholders ({{HEADROOM_TAG_N}}) instead of original content

Copilot AI reviewed, Jun 20, 2026

github-actions bot added the "status: ready for review" label (pull request body is complete and the author marked it ready for human review) and removed the "status: needs author action" label, Jun 20, 2026

yulin0629 wants to merge 1 commit into chopratejas:main from yulin0629:fix/ccr-tag-placeholder
