You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Opus 4.6 isn't dumber. Your context is poisoned and your tokens are being wasted. Here's what's actually happening and how to fix it.
The Two Problems Everyone Is Hitting
1. "Opus 4.6 gives worse answers than 4.5"
This is real — but it's not a model regression. It's context degradation from thinking block accumulation.
Opus 4.6 introduced adaptive thinking — the model generates internal reasoning blocks (thinking type) on every response. These blocks are large (2K-20K tokens each) and by default they stay in your conversation history. By turn 15-20 of a multi-turn session, you can have 100K+ tokens of stale thinking noise sitting in context alongside your actual conversation.
The model's effective reasoning degrades well before the 1M context limit. Users report quality drops at 20-40% context fill. This isn't a model issue — it's the same model drowning in its own prior reasoning traces.
Why Opus 4.5 didn't have this problem: No adaptive thinking. Shorter default context. Less accumulated noise per turn.
2. "My Max subscription runs out in hours"
Three compounding causes:
Thinking tokens bill as output. Every adaptive thinking response generates thousands of reasoning tokens at output pricing. With effort: high, a single response can burn 20K+ thinking tokens before writing a single line of code. Ten of those and you've consumed 200K output tokens from thinking you never see.
Tool definitions are invisible context bloat. MCP servers inject tool schemas into every request. Each server adds ~18K tokens. If you have 3 MCP servers configured, that's 54K tokens of tool definitions sent on every single message — before your actual conversation even starts.
Autocompact is miscalibrated. Claude Code's autocompact triggers at ~76K tokens despite the 1M context window. This threshold was calibrated for the old 200K window and was never updated. It fires too early, compacts aggressively, and then the next turn re-expands with tool definitions anyway.
Claude Code defaults to medium effort. This is deliberate — it balances reasoning quality against token consumption. Most third-party tools default to high or don't set it at all (which defaults to high). This alone can 2-3x your token burn rate.
This strips thinking blocks from prior turns in the conversation history before sending the next request. Without it, every prior thinking trace stays in context, accumulating noise and consuming input tokens on every subsequent request.
Adaptive thinking: adaptive, not enabled
{
"thinking": { "type": "adaptive" }
}
adaptive lets the model decide when to think and how deeply. enabled with a fixed budget forces thinking on every response. Most frameworks that "support thinking" use enabled with high budgets — burning tokens on trivial responses that don't benefit from extended reasoning.
Max tokens: 64000
Claude Code sends max_tokens: 64000. Some frameworks send 16000 or even 8192, which constrains the model's response space. Others send 128000, which inflates the thinking budget allocation. 64000 is the sweet spot the binary uses.
What dario Does About This
dario injects all of the above automatically. When a request comes through the proxy:
Parameter
What your client probably sends
What dario injects
thinking
{ type: "enabled", budget: 32000 }
{ type: "adaptive" }
output_config.effort
high (or nothing, defaults to high)
medium
context_management
Nothing
clear_thinking_20251015 with keep: all
max_tokens
8192-16000
64000
Billing tag
Nothing
Full Claude Code fingerprint
dario only sets defaults — it never overrides values your client explicitly sends. If you want effort: high for a specific task, send it and dario will pass it through.
Environment Variables You Can Set
From the Claude Code binary, these env vars control the behavior directly if you're running Claude Code (not through dario):
Variable
What it does
Default
CLAUDE_CODE_EFFORT_LEVEL
Override effort level (low, medium, high)
medium
CLAUDE_AUTOCOMPACT_PCT_OVERRIDE
Autocompact trigger threshold (0-100)
~76% of context window
CLAUDE_CODE_AUTO_COMPACT_WINDOW
Set the context window size for compaction calculation
200000
If you're using Claude Code directly and burning tokens fast, try:
export CLAUDE_CODE_EFFORT_LEVEL=medium
If you're using dario, you don't need to set anything — these are already the defaults.
The Actual Fix for "Opus 4.6 Is Dumb"
Clear thinking history. The API's context_management: clear_thinking does NOT reduce input token billing (verified: same input tokens with or without it). You must strip thinking blocks client-side, or use dario v2.9.0 which does it automatically. Restart conversations every 30-40 turns for best quality.
Use effort: medium. The quality difference between medium and high is marginal for most tasks. The token difference is ~2x (verified: 2.22x on complex prompts).
Audit your MCP servers. Run claude mcp list and count the tools. Each tool definition is ~500 tokens. 30 tools = 15K tokens per message, before you say anything. Remove servers you're not actively using.
Don't revert to Opus 4.5. The model is better. The defaults around it are worse. Fix the defaults.
If you're running agents through dario, all of this is handled automatically. npm install -g @askalf/dario
Documented from binary reverse engineering of Claude Code v2.1.100. See Discussion #8 for the full fingerprint analysis.
reacted with thumbs up emoji reacted with thumbs down emoji reacted with laugh emoji reacted with hooray emoji reacted with confused emoji reacted with heart emoji reacted with rocket emoji reacted with eyes emoji
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
TL;DR
Opus 4.6 isn't dumber. Your context is poisoned and your tokens are being wasted. Here's what's actually happening and how to fix it.
The Two Problems Everyone Is Hitting
1. "Opus 4.6 gives worse answers than 4.5"
This is real — but it's not a model regression. It's context degradation from thinking block accumulation.
Opus 4.6 introduced adaptive thinking — the model generates internal reasoning blocks (
thinkingtype) on every response. These blocks are large (2K-20K tokens each) and by default they stay in your conversation history. By turn 15-20 of a multi-turn session, you can have 100K+ tokens of stale thinking noise sitting in context alongside your actual conversation.The model's effective reasoning degrades well before the 1M context limit. Users report quality drops at 20-40% context fill. This isn't a model issue — it's the same model drowning in its own prior reasoning traces.
Why Opus 4.5 didn't have this problem: No adaptive thinking. Shorter default context. Less accumulated noise per turn.
2. "My Max subscription runs out in hours"
Three compounding causes:
Thinking tokens bill as output. Every adaptive thinking response generates thousands of reasoning tokens at output pricing. With
effort: high, a single response can burn 20K+ thinking tokens before writing a single line of code. Ten of those and you've consumed 200K output tokens from thinking you never see.Tool definitions are invisible context bloat. MCP servers inject tool schemas into every request. Each server adds ~18K tokens. If you have 3 MCP servers configured, that's 54K tokens of tool definitions sent on every single message — before your actual conversation even starts.
Autocompact is miscalibrated. Claude Code's autocompact triggers at ~76K tokens despite the 1M context window. This threshold was calibrated for the old 200K window and was never updated. It fires too early, compacts aggressively, and then the next turn re-expands with tool definitions anyway.
What Real Claude Code Actually Sends
From our reverse engineering of the Claude Code binary, here's what Claude Code does to manage this:
Effort:
medium, nothigh{ "output_config": { "effort": "medium" } }Claude Code defaults to
mediumeffort. This is deliberate — it balances reasoning quality against token consumption. Most third-party tools default tohighor don't set it at all (which defaults tohigh). This alone can 2-3x your token burn rate.Context management:
clear_thinking{ "context_management": { "edits": [{ "type": "clear_thinking_20251015", "keep": "all" }] } }This strips thinking blocks from prior turns in the conversation history before sending the next request. Without it, every prior thinking trace stays in context, accumulating noise and consuming input tokens on every subsequent request.
Adaptive thinking:
adaptive, notenabled{ "thinking": { "type": "adaptive" } }adaptivelets the model decide when to think and how deeply.enabledwith a fixed budget forces thinking on every response. Most frameworks that "support thinking" useenabledwith high budgets — burning tokens on trivial responses that don't benefit from extended reasoning.Max tokens:
64000Claude Code sends
max_tokens: 64000. Some frameworks send16000or even8192, which constrains the model's response space. Others send128000, which inflates the thinking budget allocation.64000is the sweet spot the binary uses.What dario Does About This
dario injects all of the above automatically. When a request comes through the proxy:
thinking{ type: "enabled", budget: 32000 }{ type: "adaptive" }output_config.efforthigh(or nothing, defaults tohigh)mediumcontext_managementclear_thinking_20251015withkeep: allmax_tokensdario only sets defaults — it never overrides values your client explicitly sends. If you want
effort: highfor a specific task, send it and dario will pass it through.Environment Variables You Can Set
From the Claude Code binary, these env vars control the behavior directly if you're running Claude Code (not through dario):
CLAUDE_CODE_EFFORT_LEVELlow,medium,high)mediumCLAUDE_AUTOCOMPACT_PCT_OVERRIDECLAUDE_CODE_AUTO_COMPACT_WINDOWIf you're using Claude Code directly and burning tokens fast, try:
export CLAUDE_CODE_EFFORT_LEVEL=mediumIf you're using dario, you don't need to set anything — these are already the defaults.
The Actual Fix for "Opus 4.6 Is Dumb"
Clear thinking history. The API's
context_management: clear_thinkingdoes NOT reduce input token billing (verified: same input tokens with or without it). You must strip thinking blocks client-side, or use dario v2.9.0 which does it automatically. Restart conversations every 30-40 turns for best quality.Use
effort: medium. The quality difference between medium and high is marginal for most tasks. The token difference is ~2x (verified: 2.22x on complex prompts).Audit your MCP servers. Run
claude mcp listand count the tools. Each tool definition is ~500 tokens. 30 tools = 15K tokens per message, before you say anything. Remove servers you're not actively using.Don't revert to Opus 4.5. The model is better. The defaults around it are worse. Fix the defaults.
If you're running agents through dario, all of this is handled automatically.
npm install -g @askalf/darioDocumented from binary reverse engineering of Claude Code v2.1.100. See Discussion #8 for the full fingerprint analysis.
Beta Was this translation helpful? Give feedback.
All reactions