Investigate prompt-cache hit-rate regression (DeepSeek cost up): per-turn <turn_meta> block may be busting the cacheable prefix

## Summary

A user reports CodeWhale is **costing them more than before**, which points to a likely **prompt-cache hit-rate regression**. With DeepSeek context caching, a stable prompt prefix should yield a high cache-hit rate, and cached input tokens are ~10x cheaper than uncached ones. If recent changes perturb the cacheable prefix, cost rises even when usage is unchanged.
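To make the stakes concrete, here is a minimal sketch of blended input cost as a function of cache-hit ratio. It assumes the ~10x discount quoted above (a cached token costs one tenth of an uncached one) and expresses cost relative to the uncached price rather than in real DeepSeek list prices.

```rust
/// Price of a cached input token relative to an uncached one.
/// Assumption: the ~10x discount quoted in this issue; actual ratios depend on model and price list.
const CACHED_PRICE_RATIO: f64 = 0.1;

/// Blended input cost per token, relative to the uncached price,
/// for a cache-hit ratio `hit_ratio` (clamped to [0, 1]).
fn blended_input_cost(hit_ratio: f64) -> f64 {
    let h = hit_ratio.clamp(0.0, 1.0);
    h * CACHED_PRICE_RATIO + (1.0 - h)
}

fn main() {
    for h in [0.9, 0.5, 0.1] {
        println!(
            "hit ratio {:>3.0}% -> {:.2}x the uncached input price",
            h * 100.0,
            blended_input_cost(h)
        );
    }
}
```

Under that assumption, a drop from 90% to 10% cache hits moves input cost from about 0.19x to 0.91x of the uncached price, roughly 4.8x more per input token for the same usage.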

## Why this likely regressed
DeepSeek (and most providers) cache on the **longest common prefix** of the request. Anything dynamic inserted **early** in the message order invalidates the cache for everything after it.
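A toy illustration of why placement matters: compare two consecutive requests and count how many leading bytes they share. This is a simplification (DeepSeek matches cached prefixes in token units, not bytes), and the request strings are made up; they are not CodeWhale's real serialization.

```rust
/// Length in bytes of the longest common prefix of two serialized requests.
fn common_prefix_len(a: &str, b: &str) -> usize {
    a.bytes().zip(b.bytes()).take_while(|(x, y)| x == y).count()
}

fn main() {
    // Hypothetical stand-in for system prompt + tools + prior turns.
    let history = "SYSTEM|tools|user:fix the bug|assistant:done|";

    // Dynamic field placed before the history: requests diverge almost immediately.
    let early_1 = format!("meta:turn=1|{history}user:next");
    let early_2 = format!("meta:turn=2|{history}user:next again");

    // Dynamic field appended after the newest message: the history stays shared.
    let late_1 = format!("{history}user:next|meta:turn=1");
    let late_2 = format!("{history}user:next again|meta:turn=2");

    println!("early: {} shared bytes", common_prefix_len(&early_1, &early_2));
    println!("late:  {} shared bytes", common_prefix_len(&late_1, &late_2));
}
```

With the early placement only the 10 bytes of `meta:turn=` are shared; with the late placement the entire history is shared, so only the new tail is billed at the uncached rate.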

Prime suspects (recent per-turn additions):
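1. **`<turn_meta>` block (#3623, commit 1494481).** Active mode + runtime mode policy is now injected into **every user turn**. If this block (or anything near it) varies per turn (timestamps, changing mode label, counters, dynamic policy text), it busts the cache whenever the varying text lands before the newest user message: for example, if earlier turns' blocks are re-rendered from current state on each request, or if the block is placed ahead of the history. A block that appears only in the newest message and is replayed verbatim afterwards leaves the prefix intact. **Check exact placement and whether any field is non-constant turn-to-turn.**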
2. **System-prompt / tool-definition ordering.** If tool schemas, model catalog data, or skills/MCP descriptors are injected in a non-deterministic order or with dynamic content, the cacheable system prefix breaks.
3. **Mode policy text reordering** from the recent mode-sync work (#3722) — verify the per-turn policy snippet is stable when the mode hasn't changed.
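On suspect 2, one common source of nondeterminism in Rust is iterating a `HashMap`: the default `RandomState` hasher means iteration order can differ between map instances and between runs. A minimal sketch, assuming tool definitions are kept as name-to-definition pairs; `ToolDef`, `tool_map`, and the tool names are hypothetical, not CodeWhale's actual types.

```rust
use std::collections::HashMap;

/// Hypothetical tool definition: a name plus its JSON schema as text.
struct ToolDef {
    name: String,
    schema: String,
}

/// Build a tool map by inserting `names` in the given order, each with an empty schema.
fn tool_map(names: &[&str]) -> HashMap<String, ToolDef> {
    names
        .iter()
        .map(|n| {
            let def = ToolDef { name: n.to_string(), schema: "{}".to_string() };
            (n.to_string(), def)
        })
        .collect()
}

/// Render tool definitions in a stable order, independent of HashMap iteration order.
fn render_tools(tools: &HashMap<String, ToolDef>) -> String {
    let mut defs: Vec<&ToolDef> = tools.values().collect();
    // HashMap order is not stable across instances or runs; sort by name first.
    defs.sort_by(|a, b| a.name.cmp(&b.name));
    defs.iter()
        .map(|t| format!("{}:{}", t.name, t.schema))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let a = render_tools(&tool_map(&["read_file", "grep", "apply_patch"]));
    let b = render_tools(&tool_map(&["apply_patch", "read_file", "grep"]));
    // Same tools, different insertion order: the rendered prefix is identical.
    assert_eq!(a, b);
    println!("{a}");
}
```

Sorting at render time (or storing definitions in a `BTreeMap`) keeps the system prefix byte-stable without relying on how callers built the collection.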

## Investigation steps (distributed prompt)
- Use the existing **`/cache` debug surface** (per-turn cache-telemetry ring, see `crates/tui/src/tui/app.rs` ~1019) to measure cached vs. uncached input tokens across turns on DeepSeek.
- Diff the actual request bodies of two consecutive turns (same mode, no file changes) and confirm the prefix is **byte-identical** up to the new user message. Any diff before the last user message = cache breaker.
- Bisect: compare cache-hit rate on a build before #3623 vs. after, then before/after #3722.
- Confirm `cache_hit_tokens` / `prompt_cache_hit_tokens` are being read from the DeepSeek usage payload and surfaced in cost math (verify the cost calc applies the cached-token discount).
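For the last step, a minimal sketch of cost math that applies the cached-token discount. It assumes the DeepSeek `usage` object reports `prompt_cache_hit_tokens` and `prompt_cache_miss_tokens` (confirm the field names against the API reference); `Usage` and `Prices` are hypothetical structs, not CodeWhale's types, and prices are per million tokens and supplied by the caller.

```rust
/// Subset of a DeepSeek chat-completion `usage` object (assumed field names).
#[derive(Debug, Clone, Copy)]
struct Usage {
    prompt_cache_hit_tokens: u64,
    prompt_cache_miss_tokens: u64,
    completion_tokens: u64,
}

/// Per-million-token prices for the model in use (caller-supplied).
#[derive(Debug, Clone, Copy)]
struct Prices {
    input_cache_hit: f64,
    input_cache_miss: f64,
    output: f64,
}

/// Fraction of prompt tokens served from cache (0.0 when there were no prompt tokens).
fn cache_hit_ratio(u: &Usage) -> f64 {
    let total = u.prompt_cache_hit_tokens + u.prompt_cache_miss_tokens;
    if total == 0 {
        0.0
    } else {
        u.prompt_cache_hit_tokens as f64 / total as f64
    }
}

/// Turn cost with hit and miss tokens priced separately. Charging every prompt
/// token at the miss price would make the reported cost insensitive to hit rate.
fn turn_cost(u: &Usage, p: &Prices) -> f64 {
    (u.prompt_cache_hit_tokens as f64 * p.input_cache_hit
        + u.prompt_cache_miss_tokens as f64 * p.input_cache_miss
        + u.completion_tokens as f64 * p.output)
        / 1_000_000.0
}

fn main() {
    // Illustrative relative prices only (miss = 10x hit); not real DeepSeek list prices.
    let prices = Prices { input_cache_hit: 1.0, input_cache_miss: 10.0, output: 0.0 };
    let warm = Usage { prompt_cache_hit_tokens: 90_000, prompt_cache_miss_tokens: 10_000, completion_tokens: 0 };
    let cold = Usage { prompt_cache_hit_tokens: 10_000, prompt_cache_miss_tokens: 90_000, completion_tokens: 0 };
    for (label, u) in [("warm", warm), ("cold", cold)] {
        println!(
            "{label}: hit ratio {:.2}, cost {:.3}",
            cache_hit_ratio(&u),
            turn_cost(&u, &prices)
        );
    }
}
```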

## Acceptance
- Two consecutive same-mode turns show a high cached-prefix ratio on DeepSeek.
- Cost-per-turn with caching returns to pre-regression levels.
- Add a regression/telemetry assertion that the cacheable prefix is stable across turns when mode/context are unchanged.
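The regression assertion could take roughly this shape: build two consecutive same-mode requests and assert that the first request's messages are an exact prefix of the second's. Everything here is a hypothetical stand-in (`Message`, `Conversation`, and the `<turn_meta>` format are not CodeWhale's real builder); the design point is that each user turn's meta text is rendered once, when the turn is created, and replayed verbatim afterwards.

```rust
/// Hypothetical chat message as it would be serialized into a request.
#[derive(Debug, Clone, PartialEq)]
struct Message {
    role: &'static str,
    content: String,
}

/// Hypothetical conversation state with per-turn meta frozen at creation time.
struct Conversation {
    system: String,
    history: Vec<Message>,
}

impl Conversation {
    fn new(system: &str) -> Self {
        Conversation { system: system.to_string(), history: Vec::new() }
    }

    /// Append a user turn; `mode` is captured now and never re-rendered later.
    fn push_user(&mut self, text: &str, mode: &str) {
        let content = format!("{text}\n<turn_meta>mode={mode}</turn_meta>");
        self.history.push(Message { role: "user", content });
    }

    fn push_assistant(&mut self, text: &str) {
        self.history.push(Message { role: "assistant", content: text.to_string() });
    }

    /// Messages for the next request: system prompt followed by stored history.
    fn request(&self) -> Vec<Message> {
        let mut msgs = vec![Message { role: "system", content: self.system.clone() }];
        msgs.extend(self.history.iter().cloned());
        msgs
    }
}

/// Index of the first message at which `next` stops extending `prev`,
/// or `None` if `prev` is an exact prefix of `next` (the cache-friendly case).
fn first_divergence(prev: &[Message], next: &[Message]) -> Option<usize> {
    match prev.iter().zip(next).position(|(a, b)| a != b) {
        Some(i) => Some(i),
        None if prev.len() > next.len() => Some(next.len()),
        None => None,
    }
}

fn main() {
    let mut conv = Conversation::new("system prompt + tools");
    conv.push_user("fix the failing test", "agent");
    let turn1 = conv.request();
    conv.push_assistant("done");
    conv.push_user("now update the docs", "agent");
    let turn2 = conv.request();
    // Same mode, no context change: turn 2 must extend turn 1 without altering it.
    assert_eq!(first_divergence(&turn1, &turn2), None);
    println!("turn 2 extends turn 1 without touching its prefix");
}
```

In a real test, `request()` would be replaced by whatever CodeWhale actually serializes, and a `Some(i)` result pinpoints the first message that broke the cacheable prefix.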

Refs #3623, #3722.

Investigate prompt-cache hit-rate regression (DeepSeek cost up): per-turn <turn_meta> block may be busting the cacheable prefix #3738