TL;DR
When we wrote Discussion #13 on April 11, we listed 8 detection signals Anthropic's classifier uses to route requests to five_hour (subscription) vs overage. One of those was "system prompt: Exactly 3 blocks."
Re-testing today against CC v2.1.123 + Opus 4.7 says that's wrong. The system prompt isn't a fingerprint signal at all — content, length, and block count can change freely without flipping billing. Two test rounds, 16 real upstream requests, 100% routed to five_hour.
What that opens up: dario users can replace CC's verbose system prompt with their own, modify CC's behavioral defaults, or strip the verbosity caps entirely — all on subscription billing, no overage flip.
The classifier mechanism is still real. The signal list was inflated. The actual fingerprint is narrower than we said.
Test 1 — system prompt content invariance (7 variants)
Captured CC v2.1.123's actual outbound /v1/messages?beta=true body. Held everything else identical (model, tools, effort, max_tokens, body field order, metadata, anthropic-beta with oauth-2025-04-20 for OAuth path, OAuth bearer from a Max account). Mutated only system[]:
| # | Variant | Sys total chars | Billing |
|---|---------|-----------------|---------|
| 1 | Control (CC verbatim) | 27,310 | five_hour |
| 2 | Prepend single character | 27,311 | five_hour |
| 3 | Word substitution: concise → brief | 27,306 | five_hour |
| 4 | Remove a sentence | 27,049 | five_hour |
| 5 | Replace block 2 with 321-char custom prompt | 321 | five_hour |
| 6 | Add a 4th block (3 → 4) | 27,369 | five_hour |
| 7 | Length padding (+500 chars) | 27,812 | five_hour |
Variant #5 is load-bearing. We dropped CC's entire 27k system prompt for a 321-char custom one and kept subscription billing.
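As a sanity check on the Test 1 figures, the character delta of each variant against the control can be recomputed from the reported totals. This is a minimal sketch that uses only the numbers listed above; the names `CONTROL_CHARS`, `test1Variants`, and `charDeltas` are illustrative and do not come from the test scripts.

```javascript
// Reported "Sys total chars" for the control (variant 1).
const CONTROL_CHARS = 27310;

// Reported totals for the mutated variants (2 through 7).
const test1Variants = [
  { id: 2, name: "Prepend single character", chars: 27311 },
  { id: 3, name: "Word substitution: concise -> brief", chars: 27306 },
  { id: 4, name: "Remove a sentence", chars: 27049 },
  { id: 5, name: "Replace with 321-char custom prompt", chars: 321 },
  { id: 6, name: "Add a 4th block", chars: 27369 },
  { id: 7, name: "Length padding", chars: 27812 },
];

// Attach the signed difference from the control to each variant.
function charDeltas(control, variants) {
  return variants.map((v) => ({ ...v, delta: v.chars - control }));
}

const test1Deltas = charDeltas(CONTROL_CHARS, test1Variants);

for (const v of test1Deltas) {
  const signed = v.delta >= 0 ? `+${v.delta}` : `${v.delta}`;
  console.log(`#${v.id} ${v.name}: ${signed} chars`);
}
```

Computed this way, the padding row differs from the control by 502 characters rather than the 500 in its label; the source does not say what accounts for the extra two.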
Test 2 — behavioral capability delta when CC's constraints are stripped
Test 1 established the classifier ignores system prompt content. Natural follow-on: with that ceiling lifted, what does the model actually DO when CC's behavioral constraints are removed?
Two strip levels:
Partial strip: removes the "# Tone and style" and "# Text output" sections and several scope-discipline / commenting bullets in "# Doing tasks." Keeps every "IMPORTANT:" alignment line, keeps tool descriptions, keeps "# Executing actions with care."
Aggressive strip: additionally removes prompt-level alignment reminders (the "IMPORTANT:" lines that restate RLHF-trained refusal categories) and most of "# Executing actions with care." Critically, this does not remove RLHF: alignment is trained, not prompted, so the model keeps refusing harmful content because that is where alignment lives.
Three test prompts that should hit verbosity / format / scope-discipline constraints:
| Prompt | Control chars | Partial-strip chars | Aggressive-strip chars |
|--------|---------------|---------------------|------------------------|
| Code-with-comments task | 2,970 | 0 (model picked tool_use) | 5,675 |
| Detailed technical explanation | 3,851 | 4,546 (+18%) | 4,585 (+19%) |
| Open-ended decision question | 401 | 1,092 (+172%) | 1,116 (+178%) |
All 9 routed to five_hour. Model behavior on benign tasks remained aligned across every variant.
Two findings worth pulling out:
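The big behavioral lever is the verbosity / format / scope-discipline language, not the alignment language. Removing "Tone and style" + "Text output" produces a 1.18-2.72× change in output length on the explanation and open-ended prompts (1.19-2.78× with the aggressive strip). On those two prompts, aggressive strip adds <3% over partial; the code-with-comments prompt is not comparable because the partial-strip run returned tool_use instead of text.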
Prompt-level alignment reminders contributed approximately zero to the model's behavior on these benign tasks. They appear redundant with RLHF: the model's "refuse to help with destructive techniques" is trained, and the prompt's "IMPORTANT: Refuse..." line just restates it. These rounds used benign prompts only, so the expectation that stripping the restatement doesn't unlock harmful behavior rests on the trained refusal being what enforces it, not on a direct test here.
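The ratios and percentages quoted for Test 2 can be recomputed from the raw character counts. The sketch below uses only the two rows where all three runs returned text (the code-with-comments row is left out because its partial-strip run produced 0 characters via tool_use); the names `test2Rows`, `lengthRatio`, and `pctChange` are illustrative, not taken from the test scripts.

```javascript
// Character counts from the Test 2 results for the two comparable prompts.
const test2Rows = [
  { prompt: "Detailed technical explanation", control: 3851, partial: 4546, aggressive: 4585 },
  { prompt: "Open-ended decision question", control: 401, partial: 1092, aggressive: 1116 },
];

// Output length of a stripped run as a multiple of the baseline run.
function lengthRatio(stripped, baseline) {
  return stripped / baseline;
}

// Percent change relative to the baseline, rounded to the nearest integer.
function pctChange(stripped, baseline) {
  return Math.round((stripped / baseline - 1) * 100);
}

for (const r of test2Rows) {
  const partial = `${lengthRatio(r.partial, r.control).toFixed(2)}x (+${pctChange(r.partial, r.control)}%)`;
  const aggressive = `${lengthRatio(r.aggressive, r.control).toFixed(2)}x (+${pctChange(r.aggressive, r.control)}%)`;
  // Extra growth of the aggressive strip measured against the partial strip.
  const extra = ((lengthRatio(r.aggressive, r.partial) - 1) * 100).toFixed(1);
  console.log(`${r.prompt}: partial ${partial}, aggressive ${aggressive}, aggressive over partial +${extra}%`);
}
```

Measuring the aggressive strip against the partial strip, rather than against the control, is what the "<3% over partial" comparison refers to.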
What this changes about #13
Wrong: "System prompt: Exactly 3 blocks" as a detection signal.
Still presumed-fingerprint (untested by these rounds): tool names, billing tag, JSON field order, effort default (now xhigh for Opus 4.7 / high for others per Anthropic's April 23 postmortem), max_tokens, thinking type, non-CC field scrubbing.
Mechanism still real: behavioral classifier exists, five_hour vs overage routing depends on request shape. We just over-counted inputs, and the actual axis is narrower than #13 implied.
What dario users gain
Three opt-in feature surfaces, each backed by the data above:
Replace CC's 27k system prompt with your own: saves up to ~27,000 characters of system prompt per non-cached request (27,310 → 321 in variant #5); significant on multi-agent stacks.
Strip CC's behavioral defaults — verbosity caps, no-comments-by-default, scope discipline. CC's UX choices, not model alignment. Your tool, your defaults.
Add operator instructions on top of CC's prompt — agent personas, tool-use heuristics, output formatters; alongside the rest of dario's fingerprint replay.
All three preserve subscription billing. Aggressive alignment-language stripping isn't part of the recommended surface — the data shows it doesn't add useful capability, and "your tool, your defaults" is a cleaner story than "drop safety."
Landing in dario [vX.Y.Z] (separate ship, watch the next release).
Methodology
Test scripts (committed in PR #171):
scripts/test-system-prompt-mods.mjs: Test 1, the 7-variant system prompt mutation ladder.
scripts/test-constraint-removal.mjs: Test 2, the 3-prompt × 3-strip-level capability ladder.
scripts/capture-full-body.mjs: captures CC's actual wire values for any installed CC version.
Each captures CC's outbound body via loopback MITM, mutates it per variant, sends it to api.anthropic.com with the OAuth bearer (read directly from ~/.claude/.credentials.json), and reads the anthropic-ratelimit-unified-representative-claim response header. Reproducible in <60s against any installed CC + Max OAuth.
Maintainer-only diagnostics: not part of npm test, not invoked from CI, not on dario's runtime path.
Caveats
Tested against Opus 4.7 only (CC's current default). Sonnet 4.6 and other models pending.
Test prompts in test 2 are single-turn. Multi-turn / tool-use sequences may behave differently; pending.