A Pi coding-agent extension that summarizes completed tool-call batches, replaces raw tool outputs with short stubs in future context, and lets the LLM recover any original via the context_tree_query tool.
The session JSONL file is never modified β pruning only affects what each next request sees.
Fork of championswimmer/pi-context-prune with additional pre-flush safeguards, agent-message batching, and tag-pinned release flow.
π For the algorithm, design rationale, prompt-cache interaction, and the research behind summarization-based context management, see PRUNING.md.
This fork is consumed as a pi package via a git tag pin β same scheme as sibling pi-superpowers.
User scope (all repos under your pi profile):
pi install git:github.com/jjuraszek/pi-context-prune@v1.0.0Project scope (current repo only, committable via .pi/settings.json):
pi install -l git:github.com/jjuraszek/pi-context-prune@v1.0.0Try without installing:
pi -e git:github.com/jjuraszek/pi-context-prune@v1.0.0From a local checkout (for hacking on the extension itself):
git clone git@github.com:jjuraszek/pi-context-prune.git ~/repos/pi-context-prune
cd ~/path/to/your/repo
pi install -l ~/repos/pi-context-prune
# or one-shot, no install:
pi -e ~/repos/pi-context-prune/index.tsUpgrade by re-running pi install with a newer @vX.Y.Z. Remove with pi remove pi-context-prune. Once installed, the extension auto-loads on every pi invocation; no flags needed.
Upstream
championswimmer/pi-context-prunedoes publish to npm. This fork does not β pin a tag instead. See CHANGELOG.md for what diverges.
/pruner on # enable pruning
/pruner status # see current mode + cumulative cost
/pruner model openai/gpt-4.1-mini # pick a cheap summarizer
/pruner now # flush pending batches immediatelyBy default the extension is off. Enable it once and it stays enabled across sessions in the same pi agent directory.
Two trigger modes. The mode controls when summarization fires; the algorithm is the same in each.
| Mode | Trigger | Cache impact | Use when |
|---|---|---|---|
agent-message (default) |
When the agent sends a final text-only reply | One cache rewrite per task batch | Normal coding-agent work β best balance |
on-demand |
Only when you run /pruner now |
None until you ask | Long investigations; manual control |
Why agent-message is the default: provider prefix caches (Anthropic, OpenAI, Bedrock, vLLM) only hit when the prompt prefix matches exactly. Every prune rewrites that prefix. Batching tool turns and pruning once per agent reply means roughly one cache miss per task instead of one per turn. See PRUNING.md Β§ The Sweet Spot for the full argument.
Settings live under the contextPrune key in <agent-dir>/settings.json (i.e. pi's own settings file). <agent-dir> is $PI_CODING_AGENT_DIR if set, otherwise ~/.pi/agent. Each pi preset gets its own settings, so you can run different summarizer models per preset.
{
"contextPrune": {
"enabled": false,
"showPruneStatusLine": true,
"summarizerModel": "default",
"summarizerThinking": "default",
"pruneOn": "agent-message",
"batchingMode": "turn",
"quietOversizedSkips": false,
"minBatchChars": 1000,
"protectedTools": [],
"dedupByContentHash": true,
"autoBudgetThreshold": null,
"spillThreshold": 65536,
"spillPreviewBytes": 2048,
"budgetTurnDelta": null,
"chainCompression": {
"enabled": true,
"rollingWindow": 3,
"stripFinalAssistantThinking": true,
"fuseRangeSummary": true
},
"thinkingStrip": {
"enabled": true,
"keepLastTurns": 16
}
}
}| Key | Values | Default | Notes |
|---|---|---|---|
enabled |
true / false |
false |
Master switch |
showPruneStatusLine |
true / false |
true |
Footer widget + queued-turn notifications |
summarizerModel |
"default" or "provider/model-id" |
"default" |
default = your active pi model. See Choosing a summarizer model |
summarizerThinking |
default/off/minimal/low/medium/high/xhigh |
default |
Provider-specific reasoning effort knob |
pruneOn |
see table above | agent-message |
Trigger mode |
batchingMode |
turn / agent-message |
turn |
How coarse each summary is (independent of pruneOn) |
quietOversizedSkips |
true / false |
false |
Silences skipped-oversized / skipped-trivial info notifications |
minBatchChars |
non-negative integer, 0 disables |
1000 |
Pre-flush guard β batches smaller than this skip the LLM entirely |
protectedTools |
string[] |
[] |
Never-pruned tool names (e.g. ["todowrite","todoread"]). When a protected tool's chain is range-compressed, its output is preserved verbatim inside the <compressed-chain> block as <protected-output> β protected outputs are never lost. |
dedupByContentHash |
true / false |
true |
Re-reads of identical (toolName, content) skip the LLM and alias the original |
autoBudgetThreshold |
fraction 0β1, or null |
null |
Token-budget auto-flush: force a prune when context usage reaches this share of the window, regardless of pruneOn. 0.8 = 80%, not 80. null = off. See Token-budget auto-flush |
spillThreshold |
positive integer | 65536 |
Minimum chars (resultText.length) for a single tool result to be spilled eagerly to a sidecar file at capture time rather than waiting for normal summarization. Non-positive / invalid values fall back to the default; to effectively disable spilling, set it above any result you expect. See Spilled outputs |
spillPreviewBytes |
non-negative integer | 2048 |
Head preview (bytes) kept inline in the stub and index record for a spilled result. Full body is on disk. |
budgetTurnDelta |
fraction 0β1, or null |
null |
Force a flush when a single turn's context-usage fraction jumps by at least this amount, ORed with autoBudgetThreshold. Catches sudden spikes a static threshold would miss until the next turn. null = off. |
chainCompression.enabled |
true / false |
true |
Master toggle for chain-level range compression |
chainCompression.rollingWindow |
positive integer | 3 |
Keep this many most-recent closed chains raw; compress older ones |
chainCompression.stripFinalAssistantThinking |
true / false |
true |
Strip thinking blocks from the kept final text-only assistant when compressing |
chainCompression.fuseRangeSummary |
true / false |
true |
Fuse a compressed chain's per-batch summaries into one cohesive LLM summary (one extra summarizer call per multi-batch span); off keeps the per-batch concatenation |
purgeErrors.enabled |
true / false |
true |
Replace failed toolCall argument bodies with compact stubs after cooldown |
purgeErrors.cooldownTurns |
positive integer | 2 |
Turns to wait after a tool error before purging its argument body |
purgeErrors.minArgChars |
non-negative integer | 500 |
Only purge arg bodies at least this many characters long |
thinkingStrip.enabled |
true / false |
true |
Strip thinking blocks from assistant turns older than the last keepLastTurns |
thinkingStrip.keepLastTurns |
positive integer | 16 |
Keep thinking on the last N assistant turns; strip older. Counts assistant turns, not chains. No-op under N turns |
See PRUNING.md Β§ Chain Compression, PRUNING.md Β§ Error Purge, and PRUNING.md Β§ Main-loop Thinking Strip for the full algorithms.
The three pre-flush features (minBatchChars, protectedTools, dedupByContentHash) are explained in PRUNING.md Β§ Pre-flush Pipeline & Safeguards. They run BEFORE any summarizer LLM call and can each drop a batch outright while still advancing the prune frontier.
When autoBudgetThreshold is set to a value in (0, 1], the extension checks context usage at the end of every tool-using turn. If tokens / contextWindow reaches the threshold, ALL pending batches are flushed immediately β regardless of pruneOn mode. This is an additional trigger layered on top of pruneOn, not a replacement.
0.8means 80% of the context window β it is a fraction, not a percentage.0.8 β 80.- The trigger is a no-op when
tokensisnull(right after a provider-side compaction); it resumes once usage is known again. - Editable live via
/pruner settings(row "Auto-flush at context %", presets Off / 60 / 70 / 80 / 90%). - Default
null= off.
Inspired by DCP's maxContextLimit nudging; simplified to a single threshold that forces a flush rather than separate nudge/force levels.
Single tool results larger than spillThreshold chars are written to <session-dir>/<sessionId>-blobs/<toolCallId>.txt at capture time and replaced in context with a short stub (tool name, byte size, head preview, file path). The full body is recoverable via the native read tool at the embedded path (offset/limit supported) or via context_tree_query by id, which falls back to the inline preview if the sidecar is missing. Moving a session .jsonl without its -blobs/ directory loses only the giant-blob recovery path; bodies under spillThreshold stay inline in the index entry as usual.
The default setting reuses whatever model you have active in pi β convenient but wasteful, since summary writing doesn't need a top-tier coding model. Picking the smallest/fastest model on your plan saves both latency and cost.
| Plan | Suggested summarizer |
|---|---|
| OpenAI / Codex / Copilot | openai/gpt-4.1-mini, google/gemini-2.5-flash, xai/grok-3-fast |
| OpenRouter | openrouter/qwen/qwen3-30b-a3b (cheap MoE) |
| Anthropic direct | anthropic/claude-haiku-3-5 |
| Google AI direct | google/gemini-2.5-flash |
Set it from the slash command (saves immediately):
/pruner model openai/gpt-4.1-mini
/pruner thinking low
# or both in one go:
/pruner model openai/gpt-4.1-mini:low| Command | Effect |
|---|---|
/pruner |
Interactive picker over all subcommands |
/pruner settings |
Settings overlay (toggle / cycle every option) |
/pruner on / off |
Enable / disable pruning |
/pruner status |
Show mode, model, trigger, cumulative stats |
/pruner stats |
Detailed cumulative summarizer token/cost stats |
/pruner model [id\[:thinking\]] |
Get / set summarizer model (and optionally thinking level) |
/pruner thinking [level] |
Get / set summarizer reasoning effort |
/pruner prune-on [mode] |
Get / set trigger mode |
/pruner batching [mode] |
Get / set batching granularity (turn / agent-message) |
/pruner protected-tools [names] |
Show or edit the never-pruned tool allowlist (comma- or space-separated; none clears) |
/pruner min-batch-chars [n] |
Show or set the pre-flush trivial-batch threshold (0 disables) |
/pruner dedup [on|off|status] |
Toggle pre-flush content-hash dedup |
/pruner tree |
Foldable browser of pruned tool calls; Ctrl-O opens the full summary in an overlay |
/pruner compact |
Retroactively compress every eligible closed chain (bypasses rollingWindow) |
/pruner now |
Flush pending batches immediately with a multi-row progress widget above the input |
/pruner help |
Full help text |
context_tree_query β always available when the extension is loaded. Pruned summaries end with short refs like Summarized tool refs: \t1`, `t2`. Use `context_tree_query` with these refs to retrieve the original full outputs.The model passes those refs (or fulltoolCallId`s) and gets back the original tool result text from the session index. Content-hash-deduped duplicates resolve to the original's record automatically.
A footer widget shows the current state, controlled by showPruneStatusLine:
prune: OFF (On agent message)β disabled, showing what mode would activateprune: ON (On agent message)β active, no flushes yetprune: ON (On agent message) β β1.2k β340 $0.003β active with cumulative input/output tokens and costprune: 3 pendingβ batches queued, waiting for the triggerprune: summarizingβ¦β flush in progress
Setting showPruneStatusLine: false hides the widget and silences the queued-turn notice; pruning still runs.
- pi-context-usage β visualizes current context size and breaks it down by message type. Useful for seeing how much space pruning saved.
- pi-cache-graph β plots provider prefix-cache hits/misses in real time. Useful for tuning your
pruneOnmode against actual cache behavior.
- Pruning only applies to batches captured while enabled. Enabling mid-session does not retroactively summarize earlier turns.
- Summarizer calls are synchronous inside
turn_end(ormessage_endforagent-messagemode), so they add latency between turns proportional to the summarizer model's response time. Pick a fast model. - Content-hash dedup only matches against records already in the indexer (cross-flush). Two identical outputs within the same flush are not deduped β both go through the summarizer.
- The tree browser does not inline original tool outputs β use
context_tree_queryfor that.
- Anthropic prompt caching: https://docs.claude.com/en/docs/build-with-claude/prompt-caching
- AWS Bedrock prompt caching: https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html
- OpenAI prompt caching: https://platform.openai.com/docs/guides/prompt-caching
- Research backing summarization-based context management: see PRUNING.md Β§ Research Evidence