Re-testing #13: the system prompt is not a fingerprint signal #172

askalf · 2026-04-29T18:55:10Z

askalf
Apr 29, 2026
Maintainer

TL;DR

When we wrote Discussion #13 on April 11, we listed 8 detection signals Anthropic's classifier uses to route requests to five_hour (subscription) vs overage. One of those was "system prompt: Exactly 3 blocks."

Re-testing today against CC v2.1.123 + Opus 4.7 says that's wrong. The system prompt isn't a fingerprint signal at all — content, length, and block count can change freely without flipping billing. Two test rounds, 16 real upstream requests, 100% routed to five_hour.

What that opens up: dario users can replace CC's verbose system prompt with their own, modify CC's behavioral defaults, or strip the verbosity caps entirely — all on subscription billing, no overage flip.

The classifier mechanism is still real. The signal list was inflated. The actual fingerprint is narrower than we said.

Test 1 — system prompt content invariance (7 variants)

Captured CC v2.1.123's actual outbound /v1/messages?beta=true body. Held everything else identical (model, tools, effort, max_tokens, body field order, metadata, anthropic-beta with oauth-2025-04-20 for OAuth path, OAuth bearer from a Max account). Mutated only system[]:

#	Variant	Sys total chars	Billing
1	Control (CC verbatim)	27,310	`five_hour`
2	Prepend single character	27,311	`five_hour`
3	Word substitution: `concise` → `brief`	27,306	`five_hour`
4	Remove a sentence	27,049	`five_hour`
5	Replace block 2 with 321-char custom prompt	321	`five_hour`
6	Add a 4th block (3 → 4)	27,369	`five_hour`
7	Length padding (+500 chars)	27,812	`five_hour`

Variant #5 is load-bearing. We dropped CC's entire 27k system prompt for a 321-char custom one and kept subscription billing.
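To make the "Sys total chars" column concrete, here is a minimal sketch of how a mutation ladder over `system[]` might be built and measured. It is not the code in scripts/test-system-prompt-mods.mjs: the `control` blocks are hypothetical placeholders rather than CC's prompt, and `totalChars`, `mapBlock`, and the variant names are illustrative.

```javascript
// Hypothetical stand-in for a captured system[] array (placeholder text only).
const control = [
  { type: "text", text: "block one" },
  { type: "text", text: "Be concise. Keep answers concise." },
  { type: "text", text: "block three" },
];

// Sum of characters across all system blocks (the "Sys total chars" column).
function totalChars(blocks) {
  return blocks.reduce((sum, b) => sum + b.text.length, 0);
}

// Return a copy of `blocks` with the block at `index` transformed by `fn`.
// The input array and its objects are left untouched.
function mapBlock(blocks, index, fn) {
  return blocks.map((b, i) => (i === index ? { ...b, text: fn(b.text) } : b));
}

// A few rungs of the ladder, each derived from the same control.
const variants = {
  control: control,
  prependChar: mapBlock(control, 0, (t) => "x" + t),
  wordSwap: mapBlock(control, 1, (t) => t.replaceAll("concise", "brief")),
  addBlock: [...control, { type: "text", text: "extra block" }],
  pad500: mapBlock(control, 2, (t) => t + " ".repeat(500)),
};

for (const [name, blocks] of Object.entries(variants)) {
  console.log(name, "blocks:", blocks.length, "chars:", totalChars(blocks));
}
```

Each `concise` → `brief` swap shortens the text by 2 characters, so the 4-character drop in variant 3 of the table would be consistent with two occurrences of the word, assuming nothing else changed.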

Test 2 — behavioral capability delta when CC's constraints are stripped

Test 1 established the classifier ignores system prompt content. Natural follow-on: with that ceiling lifted, what does the model actually DO when CC's behavioral constraints are removed?

Two strip levels:

Partial strip: removes the "# Tone and style" and "# Text output" sections and several scope-discipline / commenting bullets in "# Doing tasks". Keeps every "IMPORTANT:" alignment line, the tool descriptions, and "# Executing actions with care".
Aggressive strip: additionally removes the prompt-level alignment reminders (the "IMPORTANT:" lines that restate RLHF-trained refusal categories) and most of "# Executing actions with care". Critically, this does not remove RLHF: alignment is trained, not prompted, so the model keeps refusing harmful content.
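The section-removal part of the partial strip can be sketched as a small text filter. This is an illustration under assumptions, not the logic in scripts/test-constraint-removal.mjs: it assumes sections start with a line of the form `# Title`, the `stripSections` helper is a hypothetical name, and the sample prompt holds placeholder text rather than CC's wording.

```javascript
// Remove every "# Title" section (heading plus body) whose title is in `titles`.
// A section runs from its heading line up to the next "# " heading.
function stripSections(prompt, titles) {
  const drop = new Set(titles);
  const kept = [];
  let skipping = false;
  for (const line of prompt.split("\n")) {
    const heading = /^# (.+)$/.exec(line);
    if (heading) skipping = drop.has(heading[1].trim());
    if (!skipping) kept.push(line);
  }
  return kept.join("\n");
}

// Hypothetical miniature prompt; the section bodies are placeholders.
const sample = [
  "# Doing tasks",
  "placeholder task guidance",
  "# Tone and style",
  "placeholder style guidance",
  "# Text output",
  "placeholder output guidance",
  "# Executing actions with care",
  "placeholder care guidance",
].join("\n");

const partial = stripSections(sample, ["Tone and style", "Text output"]);
console.log(partial);
```

The bullet-level edits inside "# Doing tasks" are not covered here; they would need a per-line filter rather than a per-section one.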

Three test prompts that should hit verbosity / format / scope-discipline constraints:

Prompt	Control chars	Partial-strip chars	Aggressive-strip chars
Code-with-comments task	2,970	0 (model picked tool_use)	5,675
Detailed technical explanation	3,851	4,546 (+18%)	4,585 (+19%)
Open-ended decision question	401	1,092 (+172%)	1,116 (+178%)

All 9 routed to five_hour. Model behavior on benign tasks remained aligned across every variant.

Two findings worth pulling out:

The big behavioral lever is the verbosity / format / scope-discipline language, not the alignment language. Removing "Tone and style" + "Text output" (partial strip) makes responses 1.18-2.72× longer on the explanation and open-ended prompts. Aggressive strip adds <3% over partial (1.19-2.78× vs control).
Prompt-level alignment reminders contribute approximately zero to the model's refusal behavior on benign tasks. They're redundant with RLHF — the model's "refuse to help with destructive techniques" is trained, and the prompt's "IMPORTANT: Refuse..." line is just restating it. Stripping the restatement doesn't unlock harmful behavior because the trained refusal is what enforces it.
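The ratios behind these findings can be recomputed from the Test 2 table. A minimal sketch: `rows` and `ratio` are illustrative names, and the code-with-comments row is left out because its partial-strip response was a tool_use with 0 characters of text.

```javascript
// Response lengths in characters, copied from the Test 2 table.
const rows = [
  { prompt: "Detailed technical explanation", control: 3851, partial: 4546, aggressive: 4585 },
  { prompt: "Open-ended decision question", control: 401, partial: 1092, aggressive: 1116 },
];

// Ratio of one response length to another, rounded to two decimals.
function ratio(length, baseline) {
  return Math.round((length / baseline) * 100) / 100;
}

for (const r of rows) {
  console.log(r.prompt, {
    partialVsControl: ratio(r.partial, r.control),
    aggressiveVsControl: ratio(r.aggressive, r.control),
    aggressiveVsPartial: ratio(r.aggressive, r.partial),
  });
}
```

On these numbers the partial strip comes out at 1.18× and 2.72× of control, the aggressive strip at 1.19× and 2.78×, and aggressive over partial at 1.01× and 1.02×.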

What this changes about #13

Wrong: "System prompt: Exactly 3 blocks" as a detection signal.

Still presumed-fingerprint (untested by these rounds): tool names, billing tag, JSON field order, effort default (now xhigh for Opus 4.7 / high for others per Anthropic's April 23 postmortem), max_tokens, thinking type, non-CC field scrubbing.

Mechanism still real: behavioral classifier exists, five_hour vs overage routing depends on request shape. We just over-counted inputs, and the actual axis is narrower than #13 implied.

What dario users gain

Three opt-in feature surfaces, each backed by the data above:

Replace CC's 27k system prompt with your own, saving up to ~27,000 characters of input per non-cached request (Test 1 measured characters, not tokens, so the token saving is smaller); significant on multi-agent stacks.
Strip CC's behavioral defaults — verbosity caps, no-comments-by-default, scope discipline. CC's UX choices, not model alignment. Your tool, your defaults.
Add operator instructions on top of CC's prompt — agent personas, tool-use heuristics, output formatters; alongside the rest of dario's fingerprint replay.

All three preserve subscription billing. Aggressive alignment-language stripping isn't part of the recommended surface — the data shows it doesn't add useful capability, and "your tool, your defaults" is a cleaner story than "drop safety."

Landing in dario [vX.Y.Z] (separate ship, watch the next release).

Methodology

Test scripts (committed in PR #171):

scripts/test-system-prompt-mods.mjs — Test 1, the 7-variant system prompt mutation ladder.
scripts/test-constraint-removal.mjs — Test 2, the 3-prompt × 3-strip-level capability ladder.
scripts/capture-full-body.mjs — captures CC's actual wire values for any installed CC version.

Each script captures CC's outbound body via loopback MITM, mutates it per variant, sends it to api.anthropic.com with the OAuth bearer (read directly from ~/.claude/.credentials.json), and reads the anthropic-ratelimit-unified-representative-claim response header. Reproducible in <60s against any installed CC + Max OAuth.
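The last step, reading the claim header and counting outcomes, might look like the sketch below. Only the header name and the `five_hour` value come from this write-up; `readClaim`, `tally`, and the recorded responses are hypothetical, and a `Map` stands in for a fetch-style `Headers` object (anything with a `.get(name)` method works). No network call is made.

```javascript
// Header name as given in the methodology above.
const CLAIM_HEADER = "anthropic-ratelimit-unified-representative-claim";

// Read the billing claim from a headers-like object; null if the header is absent.
function readClaim(headers) {
  return headers.get(CLAIM_HEADER) ?? null;
}

// Count how many results landed on each claim value.
function tally(results) {
  const counts = {};
  for (const { claim } of results) {
    const key = claim ?? "missing";
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}

// Hypothetical recorded response headers standing in for real upstream replies.
const recorded = [
  new Map([[CLAIM_HEADER, "five_hour"]]),
  new Map([[CLAIM_HEADER, "five_hour"]]),
  new Map(), // a response with no claim header
];

const results = recorded.map((headers, i) => ({ variant: i + 1, claim: readClaim(headers) }));
console.log(tally(results));
```

Counting a missing header as its own bucket keeps a failed or malformed response from being mistaken for either route.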

Maintainer-only diagnostics — not part of npm test, not invoked from CI, not on dario's runtime path.

Caveats

Tested against Opus 4.7 only (CC's current default). Sonnet 4.6 and other models pending.
Test prompts in test 2 are single-turn. Multi-turn / tool-use sequences may behave differently — pending.
The system prompt is one of #13's claimed 8 signals. The other 7 axes (tool array, billing tag, body field order, etc.) remain presumed-fingerprint until they get the same treatment.

What's next

Same ladder against Sonnet 4.6 (model invariance)
Tool-array modification ladder (likely the highest-fingerprinted axis, based on #13's data)
Multi-turn / tool-use conversation tests (current is single-turn)
dario feature ship: opt-in flags for system prompt customization
