[Anthropic][Frontend] auto-extract system messages from messages array #43959

Closed
aleksandaryanakiev wants to merge 4 commits into vllm-project:main from aleksandaryanakiev:fix/anthropic-system-message-extraction

Conversation

@aleksandaryanakiev
Contributor

Purpose

Fix: Auto-extract system messages from Anthropic messages array

In Claude Code v2.1.156, the CLI puts a system message in the
messages array instead of the top-level system array. This causes a Pydantic validation error:

Input should be 'user' or 'assistant'

This PR adds a model_validator(mode="before") to AnthropicMessagesRequest that silently extracts
system messages from the messages array and moves them to the system field, making vLLM more
compatible with this change.
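
The extraction described above can be sketched without Pydantic as a plain function over the raw request dict. This is a minimal illustration, not the PR's code: the name hoist_system_messages is hypothetical, and it assumes that system may be a string or a list of text blocks and that extracted messages are appended after any existing top-level system content. In the PR, equivalent logic runs inside a model_validator(mode="before") on AnthropicMessagesRequest.

```python
from typing import Any


def hoist_system_messages(body: dict[str, Any]) -> dict[str, Any]:
    """Move role == "system" entries from body["messages"] into body["system"].

    Hypothetical helper for illustration only.
    """
    messages = body.get("messages")
    if not isinstance(messages, list):
        return body  # leave malformed input for normal validation to report

    kept, extracted = [], []
    for msg in messages:
        if isinstance(msg, dict) and msg.get("role") == "system":
            extracted.append(msg.get("content"))
        else:
            kept.append(msg)
    if not extracted:
        return body

    def as_blocks(content: Any) -> list[dict[str, Any]]:
        # Normalize a string or a list of content blocks to a list of blocks.
        if content is None:
            return []
        if isinstance(content, str):
            return [{"type": "text", "text": content}]
        return list(content)

    # Assumption: existing top-level system content stays first.
    blocks = as_blocks(body.get("system"))
    for content in extracted:
        blocks.extend(as_blocks(content))

    result = dict(body)  # shallow copy so the caller's dict is not mutated
    result["messages"] = kept
    result["system"] = blocks
    return result


request = {
    "system": "You are helpful.",
    "messages": [
        {"role": "user", "content": "hi"},
        {"role": "system", "content": "Extra context."},
    ],
}
fixed = hoist_system_messages(request)
print(fixed["system"])
print(fixed["messages"])
```

After the call, fixed["messages"] contains only the user turn, so the role check for 'user' or 'assistant' would no longer trip on it.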

Test Plan

I tested this change on our machine running vLLM against Claude Code v2.1.156, and it works properly. I also added 6 tests and ran them with: pytest tests/entrypoints/anthropic/test_anthropic_messages_conversion.py

Test Result

The 400 BAD REQUEST error is gone, and everything is working as expected.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: Aleksandar Yanakiev <alexander.yanakiev@discretestack.com>
Signed-off-by: Aleksandar Yanakiev <alexander.yanakiev@discretestack.com>
@bbrowning
Collaborator

Instead of extracting and converting in a validator, could we simply allow system at

role: Literal["user", "assistant"]

I believe _convert_messages in anthropic/serving.py would then already handle turning that into a system message for Chat Completions.

@aleksandaryanakiev
Contributor Author

Instead of extracting and converting in a validator, could we simply allow system at

role: Literal["user", "assistant"]

I believe _convert_messages in anthropic/serving.py would then already handle turning that into a system message for Chat Completions.

It's possible; I considered that before writing this fix. My only concern is that the message would leak into the conversation (I'm not sure how Claude Code would handle it).

@bbrowning
Collaborator

Hmm - with just allowing system in the roles list like above, I added some logging to dump what we're getting in requests in a multi-turn scenario from claude code:

(APIServer pid=4008422) !!! anthropic messages:
(APIServer pid=4008422)       role: user, len(content): 5
(APIServer pid=4008422)       role: system, len(content): 4177
(APIServer pid=4008422)       role: assistant, len(content): 2
(APIServer pid=4008422)       role: user, len(content): 1
(APIServer pid=4008422) !!! openai messages:
(APIServer pid=4008422)       role: system, len(content): 6292
(APIServer pid=4008422)       role: user, len(content): 5
(APIServer pid=4008422)       role: system, len(content): 4177
(APIServer pid=4008422)       role: assistant, len(content): 1628
(APIServer pid=4008422)       role: user, len(content): 111
(APIServer pid=4008422) INFO:     10.14.217.25:39452 - "POST /v1/messages?beta=true HTTP/1.1" 200 OK
(Worker_TP0 pid=4009673) WARNING 05-29 09:28:51 [jit_monitor.py:103] Triton kernel JIT compilation during inference: _compute_slot_mapping_kernel. This causes a latency spike; consider extending warmup to cover this shape/config.
(Worker_TP0 pid=4009673) WARNING 05-29 09:28:51 [jit_monitor.py:103] Triton kernel JIT compilation during inference: kernel_unified_attention. This causes a latency spike; consider extending warmup to cover this shape/config.
(APIServer pid=4008422) INFO 05-29 09:29:04 [loggers.py:271] Engine 000: Avg prompt throughput: 2274.9 tokens/s, Avg generation throughput: 23.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.6%, Prefix cache hit rate: 0.0%
(APIServer pid=4008422) INFO 05-29 09:29:14 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 8.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=4008422) INFO 05-29 09:29:24 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=4008422) !!! anthropic messages:
(APIServer pid=4008422)       role: user, len(content): 5
(APIServer pid=4008422)       role: system, len(content): 4177
(APIServer pid=4008422)       role: assistant, len(content): 2
(APIServer pid=4008422)       role: user, len(content): 111
(APIServer pid=4008422)       role: assistant, len(content): 2
(APIServer pid=4008422)       role: user, len(content): 1
(APIServer pid=4008422) !!! openai messages:
(APIServer pid=4008422)       role: system, len(content): 6292
(APIServer pid=4008422)       role: user, len(content): 5
(APIServer pid=4008422)       role: system, len(content): 4177
(APIServer pid=4008422)       role: assistant, len(content): 1628
(APIServer pid=4008422)       role: user, len(content): 111
(APIServer pid=4008422)       role: assistant, len(content): 388
(APIServer pid=4008422)       role: user, len(content): 69
(APIServer pid=4008422) INFO:     10.14.217.25:34068 - "POST /v1/messages?beta=true HTTP/1.1" 200 OK
(APIServer pid=4008422) INFO 05-29 09:38:04 [loggers.py:271] Engine 000: Avg prompt throughput: 13.4 tokens/s, Avg generation throughput: 2.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.6%, Prefix cache hit rate: 49.8%
(APIServer pid=4008422) INFO 05-29 09:38:14 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 49.8%
(APIServer pid=4008422) INFO 05-29 09:38:24 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 49.8%
(APIServer pid=4008422) !!! anthropic messages:
(APIServer pid=4008422)       role: user, len(content): 5
(APIServer pid=4008422)       role: system, len(content): 4177
(APIServer pid=4008422)       role: assistant, len(content): 2
(APIServer pid=4008422)       role: user, len(content): 111
(APIServer pid=4008422)       role: assistant, len(content): 2
(APIServer pid=4008422)       role: user, len(content): 69
(APIServer pid=4008422)       role: assistant, len(content): 2
(APIServer pid=4008422)       role: user, len(content): 1
(APIServer pid=4008422) !!! openai messages:
(APIServer pid=4008422)       role: system, len(content): 6292
(APIServer pid=4008422)       role: user, len(content): 5
(APIServer pid=4008422)       role: system, len(content): 4177
(APIServer pid=4008422)       role: assistant, len(content): 1628
(APIServer pid=4008422)       role: user, len(content): 111
(APIServer pid=4008422)       role: assistant, len(content): 388
(APIServer pid=4008422)       role: user, len(content): 69
(APIServer pid=4008422)       role: assistant, len(content): 60
(APIServer pid=4008422)       role: user, len(content): 17

I don't see any additional system turns accumulating. You can see how the original top-level system property arrives in front of the first user role. But then there's that other system message being sent within the messages list. Perhaps I need to inspect what's in those messages, but I can confirm we don't keep adding additional system messages with every turn as we round-trip here.
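
The logging code behind the dump above is not shown in the thread. A minimal sketch of a helper that would produce lines in the same shape follows; the name summarize_messages is hypothetical. Note that len(content) counts characters when content is a string and counts blocks when it is a list, which would explain why the same turn can report different lengths in the Anthropic and OpenAI views.

```python
from typing import Any


def summarize_messages(label: str, messages: list[dict[str, Any]]) -> list[str]:
    """Build one summary line per message: its role and the length of its content.

    Hypothetical debugging helper, not taken from vLLM.
    """
    lines = [f"!!! {label} messages:"]
    for msg in messages:
        # A missing or None content is treated as empty.
        content = msg.get("content") or ""
        lines.append(f"      role: {msg.get('role')}, len(content): {len(content)}")
    return lines


sample = [
    {"role": "user", "content": [{"type": "text", "text": "a"}, {"type": "text", "text": "b"}]},
    {"role": "system", "content": "abcd"},
]
summary = summarize_messages("anthropic", sample)
for line in summary:
    print(line)
```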

@aleksandaryanakiev
Contributor Author

This is the system message that is in the messages array:

{
"role":"system",
"content":"SessionStart hook additional context: <EXTREMELY_IMPORTANT>\nYou have superpowers.\n\nBelow is the full content of your 'superpowers:using-superpowers' skill - your introduction to using skills. For all other skills, use the 'Skill' tool:\n\n---\nname: using-superpowers\ndescription: Use when starting any conversation - establishes how to find and use skills, requiring Skill tool invocation before ANY response including clarifying questions\n---\n\n\nIf you were dispatched as a subagent to execute a specific task, skip this skill.\n\n\n\nIf you think there is even a 1% chance a skill might apply to what you are doing, you ABSOLUTELY MUST invoke the skill.\n\nIF A SKILL APPLIES TO YOUR TASK, YOU DO NOT HAVE A CHOICE. YOU MUST USE IT.\n\nThis is not negotiable. This is not optional. You cannot rationalize your way out of this.\n\n\n## Instruction Priority\n\nSuperpowers skills override default system prompt behavior, but user instructions always take precedence:\n\n1. User's explicit instructions (CLAUDE.md, GEMINI.md, AGENTS.md, direct requests) — highest priority\n2. Superpowers skills — override default system behavior where they conflict\n3. Default system prompt — lowest priority\n\nIf CLAUDE.md, GEMINI.md, or AGENTS.md says "don't use TDD" and a skill says "always use TDD," follow the user's instructions. The user is in control.\n\n## How to Access Skills\n\nIn Claude Code: Use the Skill tool. When you invoke a skill, its content is loaded and presented to you—follow it directly. Never use the Read tool on skill files.\n\nIn Copilot CLI: Use the skill tool. Skills are auto-discovered from installed plugins. The skill tool works the same as Claude Code's Skill tool.\n\nIn Gemini CLI: Skills activate via the activate_skill tool. 
Gemini loads skill metadata at session start and activates the full content on demand.\n\nIn other environments: Check your platform's documentation for how skills are loaded.\n\n## Platform Adaptation\n\nSkills use Claude Code tool names. Non-CC platforms: see references/copilot-tools.md (Copilot CLI), references/codex-tools.md (Codex) for tool equivalents. Gemini CLI users get the tool mapping loaded automatically via GEMINI.md.\n\n# Using Skills\n\n## The Rule\n\nInvoke relevant or requested skills BEFORE any response or action. Even a 1% chance a skill might apply means that you should invoke the skill to check. If an invoked skill turns out to be wrong for the situation, you don't need to use it.\n\ndot\ndigraph skill_flow {\n "User message received" [shape=doublecircle];\n "About to EnterPlanMode?" [shape=doublecircle];\n "Already brainstormed?" [shape=diamond];\n "Invoke brainstorming skill" [shape=box];\n "Might any skill apply?" [shape=diamond];\n "Invoke Skill tool" [shape=box];\n "Announce: 'Using [skill] to [purpose]'" [shape=box];\n "Has checklist?" [shape=diamond];\n "Create TodoWrite todo per item" [shape=box];\n "Follow skill exactly" [shape=box];\n "Respond (including clarifications)" [shape=doublecircle];\n\n "About to EnterPlanMode?" -> "Already brainstormed?";\n "Already brainstormed?" -> "Invoke brainstorming skill" [label="no"];\n "Already brainstormed?" -> "Might any skill apply?" [label="yes"];\n "Invoke brainstorming skill" -> "Might any skill apply?";\n\n "User message received" -> "Might any skill apply?";\n "Might any skill apply?" -> "Invoke Skill tool" [label="yes, even 1%"];\n "Might any skill apply?" -> "Respond (including clarifications)" [label="definitely not"];\n "Invoke Skill tool" -> "Announce: 'Using [skill] to [purpose]'";\n "Announce: 'Using [skill] to [purpose]'" -> "Has checklist?";\n "Has checklist?" -> "Create TodoWrite todo per item" [label="yes"];\n "Has checklist?" 
-> "Follow skill exactly" [label="no"];\n "Create TodoWrite todo per item" -> "Follow skill exactly";\n}\n\n\n## Red Flags\n\nThese thoughts mean STOP—you're rationalizing:\n\n| Thought | Reality |\n|---------|---------|\n| "This is just a simple question" | Questions are tasks. Check for skills. |\n| "I need more context first" | Skill check comes BEFORE clarifying questions. |\n| "Let me explore the codebase first" | Skills tell you HOW to explore. Check first. |\n| "I can check git/files quickly" | Files lack conversation context. Check for skills. |\n| "Let me gather information first" | Skills tell you HOW to gather information. |\n| "This doesn't need a formal skill" | If a skill exists, use it. |\n| "I remember this skill" | Skills evolve. Read current version. |\n| "This doesn't count as a task" | Action = task. Check for skills. |\n| "The skill is overkill" | Simple things become complex. Use it. |\n| "I'll just do this one thing first" | Check BEFORE doing anything. |\n| "This feels productive" | Undisciplined action wastes time. Skills prevent this. |\n| "I know what that means" | Knowing the concept ≠ using the skill. Invoke it. |\n\n## Skill Priority\n\nWhen multiple skills could apply, use this order:\n\n1. Process skills first (brainstorming, debugging) - these determine HOW to approach the task\n2. Implementation skills second (frontend-design, mcp-builder) - these guide execution\n\n"Let's build X" → brainstorming first, then implementation skills.\n"Fix this bug" → debugging first, then domain-specific skills.\n\n## Skill Types\n\nRigid (TDD, debugging): Follow exactly. Don't adapt away discipline.\n\nFlexible (patterns): Adapt principles to context.\n\nThe skill itself tells you which.\n\n## User Instructions\n\nInstructions say WHAT, not HOW. 
"Add X" or "Fix Y" doesn't mean skip workflows.\n\n\n</EXTREMELY_IMPORTANT>\n\n# MCP Server Instructions\n\nThe following MCP servers have provided instructions for how to use their tools and resources:\n\n## codegraph\n# Codegraph — code intelligence over an indexed knowledge graph\n\nCodegraph is a SQLite knowledge graph of every symbol, edge, and file\nin the workspace. Reads are sub-millisecond; the index lags writes by\nabout a second through the file watcher. Consult it BEFORE writing or\nediting code, not during.\n\n## Answer directly — don't delegate exploration\n\nFor "how does X work", architecture, trace, or where-is-X questions,\nanswer DIRECTLY using 2-3 codegraph calls: codegraph_context first,\nthen ONE codegraph_explore for the source of the symbols it surfaces.\nCodegraph IS the pre-built search index — so delegating the lookup to a\nseparate file-reading sub-task/agent, or running your own grep + read\nloop, repeats work codegraph already did and costs more for the same\nanswer. Reach for raw Read/Grep only to confirm a specific detail\ncodegraph didn't cover. A direct codegraph answer is typically a handful\nof calls; a grep/read exploration is dozens.\n\n## Tool selection by intent\n\n- "What is the symbol named X?"codegraph_search\n- "What's the deal with this task / feature / area?"codegraph_context (PRIMARY — composes search + node + callers + callees in one call)\n- "How does X reach/become Y? 
/ trace the flow / the path from X to Y"codegraph_trace (ONE call returns the whole call path, including dynamic-dispatch hops — callbacks, React re-render, JSX children — that grep can't follow)\n- "What calls this?"codegraph_callers\n- "What does this call?"codegraph_callees\n- "What would changing this break?"codegraph_impact\n- "Show me this symbol's source / signature / docstring."codegraph_node\n- "Show me several related symbols' source / survey an area."codegraph_explore (ONE capped call; prefer over many codegraph_node/Read)\n- "What's in directory X?"codegraph_files\n- "Is the index ready / what's its size?"codegraph_status\n\n## Common chains\n\n- Flow / "how does X reach Y": codegraph_trace from→to FIRST — one call returns the entire path … [truncated]\n\n## plugin:context7:context7\nUse this server to fetch current documentation whenever the user asks about a library, framework, SDK, API, CLI tool, or cloud service -- even well-known ones like React, Next.js, Prisma, Express, Tailwind, Django, or Spring Boot. This includes API syntax, configuration, version migration, library-specific debugging, setup instructions, and CLI tool usage. Use even when you think you know the answer -- your training data may not reflect recent changes. Prefer this over web search for library docs.\n\nDo not use for: refactoring, writing scripts from scratch, debugging business logic, code review, or general programming concepts.\n\nThe following skills are available for use with the Skill tool:\n\n- find-docs: Retrieves up-to-date documentation, API references, and code examples for any developer technology. Use this skill whenever the user asks about a specific library, framework, SDK, CLI tool, or cloud service -- even for well-known ones like React, Next.js, Prisma, Express, Tailwind, Django, or Spring Boot. 
Your training data may not reflect recent API changes or version updates.\nAlways use for: API syntax questions, configuration options, version migration issues, "how do I" questions mentioning a library name, debugging that involves library-specific behavior, setup instructions, and CLI tool usage.\nUse even when you think you know the answer -- do not rely on training data for API details, signatures, or configuration options as they are frequently outdated. Always verify against current docs. Prefer this over web search for library documentation and API details.\n- deep-research: Deep research harness — fan-out web searches, fetch sources, adversarially verify claims, synthesize a cited report. - When the user wants a deep, multi-source, fact-checked research report on any topic. BEFORE invoking, check if the question is specific enough to research directly — if underspecified (e.g., "what car to buy" without budget/use-case/region), ask 2-3 clarifying questions to narrow scope. Then pass the refined question as args, weaving the answers in.\n- superpowers:brainstorming: You MUST use this before any creative work - creating features, building components, adding functionality, or modifying behavior. 
Explores user intent, requirements and design before implementation.\n- superpowers:dispatching-parallel-agents\n- superpowers:executing-plans\n- superpowers:finishing-a-development-branch\n- superpowers:receiving-code-review\n- superpowers:requesting-code-review: Use when completing tasks, implementing major features, or before merging to verify work meets requirements\n- superpowers:subagent-driven-development: Use when executing implementation plans with independent tasks in the current session\n- superpowers:systematic-debugging: Use when encountering any bug, test failure, or unexpected behavior, before proposing fixes\n- superpowers:test-driven-development\n- superpowers:using-git-worktrees\n- superpowers:using-superpowers: Use when starting any conversation - establishes how to find and use skills, requiring Skill tool invocation before ANY response including clarifying questions\n- superpowers:verification-before-completion\n- superpowers:writing-plans\n- superpowers:writing-skills\n- update-config: Use this skill to configure the Claude Code harness via settings.json. Automated behaviors ("from now on when X", "each time X", "whenever X", "before/after X") require hooks configured in settings.json - the harness executes these, not Claude, so memory/preferences cannot fulfill them. Also use for: permissions ("allow X", "add permission", "move permission to"), env vars ("set X=Y"), hook troubleshooting, or any changes to settings.json/settings.local.json files. Examples: "allow npm commands", "add bq permission to global settings", "move permission to user settings", "set DEBUG=true", "when claude stops show X". For simple settings like theme/model, suggest the /config command.\n- keybindings-help: Use when the user wants to customize keyboard shortcuts, rebind keys, add chord bindings, or modify ~/.claude/keybindings.json. 
Examples: "rebind ctrl+s", "add a chord shortcut", "change the submit key", "customize keybindings".\n- verify: Verify that a code change actually does what it's supposed to by running the app and observing behavior. Use when asked to verify a PR, confirm a fix works, test a change manually, check that a feature works, or validate local changes before pushing.\n- code-review: Review the current diff for correctness bugs and reuse/simplification/efficiency cleanups at the given effort level (low/medium: fewer, high-confidence findings; high→max: broader coverage, may include uncertain findings). Pass --comment to post findings as inline PR comments, or --fix to apply the findings to the working tree after the review.\n- simplify: Review the changed code for reuse, simplification, efficiency, and altitude cleanups, then apply the fixes. Quality only — it does not hunt for bugs; use /code-review for that.\n- fewer-permission-prompts: Scan your transcripts for common read-only Bash and MCP tool calls, then add a prioritized allowlist to project .claude/settings.json to reduce permission prompts.\n- loop: Run a prompt or slash command on a recurring interval (e.g. /loop 5m /foo). Omit the interval to let the model self-pace. - When the user wants to set up a recurring task, poll for status, or run something repeatedly on an interval (e.g. "check the deploy every 5 minutes", "keep running /babysit-prs"). Do NOT invoke for one-off tasks.\n- claude-api: Build, debug, and optimize Claude API / Anthropic SDK apps. Apps built with this skill should include prompt caching. 
Also handles migrating existing Claude API code between Claude model versions (4.5 → 4.6, 4.6 → 4.7, retired-model replacements).\nTRIGGER when: code imports anthropic/@anthropic-ai/sdk; user asks for the Claude API, Anthropic SDK, or Managed Agents; user adds/modifies/tunes a Claude feature (caching, thinking, compaction, tool use, batch, files, citations, memory) or model (Opus/Sonnet/Haiku) in a file; questions about prompt caching / cache hit rate in an Anthropic SDK project.\nSKIP: file imports openai/other-provider SDK, filename like *-openai.py/*-generic.py, provider-neutral code, general programming/ML.\n- run: Launch and drive this project's app to see a change working. Use when asked to run, start, or screenshot the app, or to confirm a change works in the real app (not just tests). First looks for a project skill that already covers launching the app; otherwise falls back to built-in patterns per project type (CLI, server, TUI, Electron, browser-driven, library).\n- init\n- review\n- security-review"
}

I think it should be in the system prompts, and I think your approach (allowing system) looks better than mine, so I can either change this PR to allow system in the roles or open an entirely new PR and close this one.

@bbrowning
Collaborator

Ahh my system role in messages has entirely different content than yours, and both look related to skills or other things installed. I think we can start by just allowing system messages if you want to adjust this and the test.

Some models, such as Qwen 3 variants, will complain if there is any system message not in the first position. But there are other PRs open to handle that more generally, since we have the same problem with Responses API and Codex CLI usage with those models, so the Messages API case here is no different.

@bbrowning
Collaborator

Hmm - testing my simple proposed fix, I don't think having system messages interleaved after user messages is going to work for typical open weight models. A non-trivial number of them actually expect all system messages to be in the very first message, and that gets treated differently by the chat template with special tokens for the system message that won't be present for later system messages.
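
To make the concern concrete, here is a small sketch that flags system messages appearing after a non-system turn in an OpenAI-style message list. The function name late_system_indices is hypothetical, and the sample roles mirror the "openai messages" order from the log earlier in this thread.

```python
def late_system_indices(messages):
    """Return indices of system messages that follow any non-system message."""
    seen_non_system = False
    late = []
    for index, msg in enumerate(messages):
        if msg.get("role") == "system":
            if seen_non_system:
                late.append(index)
        else:
            seen_non_system = True
    return late


# Role order observed in the logged OpenAI-style conversion (content omitted).
logged = [
    {"role": "system"},
    {"role": "user"},
    {"role": "system"},
    {"role": "assistant"},
    {"role": "user"},
]
late = late_system_indices(logged)
print(late)
```

An empty result would mean every system message precedes the first user or assistant turn, which is the layout many chat templates appear to expect.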

@bbrowning bbrowning added the "verified" label (Run pre-commit for new contributors without triggering other tests) on May 29, 2026
bbrowning added a commit to bbrowning/vllm that referenced this pull request May 29, 2026
Claude Code sends the system prompt as messages with role "system"
in the messages array instead of the top-level system field. Extract
them during Pydantic validation and merge into the system field so
models that expect system content before user messages work correctly.

Based on vllm-project#43959.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Collaborator

@bbrowning bbrowning left a comment

After testing this in the real world, your original proposed fix is better than my suggestion. With my simpler suggestion I was getting some odd trajectories with a self-hosted Nemotron 3 Super model and Claude Code, while your fix of hoisting all of these into the original system looks more stable. I haven't had time to do a full before/after eval, but given that we're returning a 400 error today with the latest Claude Code, I will defer that until later.

I pulled this locally and confirmed the new unit tests pass. I also tested against a live server, and things look reasonable. Thanks!

@bbrowning bbrowning added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on May 29, 2026
Collaborator

@sfeng33 sfeng33 left a comment

Nice fix! One gap I noticed: AnthropicCountTokensRequest in the same file has the same messages: list[AnthropicMessage] field, so it would hit the same Pydantic validation error if a client sends system messages in the messages array to /v1/messages/count_tokens.

I think something like this can be done:

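from pydantic import BaseModel, model_validator


def _extract_system_from_messages(request_body: dict) -> dict:
    # Shared helper so both request models apply the same extraction.
    # ... existing logic ...
    return request_body


class AnthropicMessagesRequest(BaseModel):
    # Runs before field validation, on the raw request payload.
    @model_validator(mode="before")
    @classmethod
    def extract_system_messages(cls, v):
        return _extract_system_from_messages(v)


class AnthropicCountTokensRequest(BaseModel):
    # Same hook, so /v1/messages/count_tokens accepts the same input.
    @model_validator(mode="before")
    @classmethod
    def extract_system_messages(cls, v):
        return _extract_system_from_messages(v)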

Signed-off-by: Aleksandar Yanakiev <alexander.yanakiev@discretestack.com>
@aleksandaryanakiev aleksandaryanakiev force-pushed the fix/anthropic-system-message-extraction branch from 4c4e126 to 7532441 on June 1, 2026 06:09
Collaborator

@chaunceyjiang chaunceyjiang left a comment

In Claude Code v2.1.156, the CLI puts a system message in the
messages array instead of the top-level system array.

I captured the request sent by CC. Both the system message and the top-level system array are present at the same time.

{
    "model": "Qwen/Qwen3.5-27B-FP8",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<system-reminder>\n.....</system-reminder>\n\n"
                },
                {
                    "type": "text",
                    "text": "help?",
                    "cache_control": {
                        "type": "ephemeral"
                    }
                }
            ]
        },
        {
            "role": "system",
            "content": "....."
        }
    ],
    "system": [
        {
            "type": "text",
            "text": "x-anthropic-billing-header: cc_version=2.1.160.bca; cc_entrypoint=cli; cch=d1d48;"
        },
        {
            "type": "text",
            "text": "You are Claude Code, Anthropic's official CLI for Claude.",
            "cache_control": {
                "type": "ephemeral"
            }
        },
        {
            "type": "text",
            "text": "....",
            "cache_control": {
                "type": "ephemeral"
            }
        }
    ],
    "tools": []
}

Just a question:
Why not handle it in def _convert_system_message?
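
For comparison, here is a sketch of what handling this at conversion time, rather than during validation, might look like. It does not reflect the actual _convert_system_message in vLLM, whose behavior is not shown in this thread. The names to_chat_messages and _text_of are hypothetical, only text blocks are considered, and non-system content is passed through unchanged, whereas a real conversion would translate it.

```python
from typing import Any


def _text_of(content: Any) -> str:
    """Flatten a string or a list of text blocks into one string."""
    if isinstance(content, str):
        return content
    parts = []
    for block in content or []:
        if isinstance(block, dict) and block.get("type") == "text":
            parts.append(block.get("text", ""))
    return "\n".join(parts)


def to_chat_messages(body: dict[str, Any]) -> list[dict[str, Any]]:
    """Build a Chat Completions style list with one leading system message.

    Hypothetical illustration: top-level system text comes first, followed by
    the text of any role == "system" entries found in the messages array.
    """
    system_parts = [_text_of(body.get("system"))]
    chat = []
    for msg in body.get("messages", []):
        if msg.get("role") == "system":
            system_parts.append(_text_of(msg.get("content")))
        else:
            # Content is passed through as-is in this sketch.
            chat.append({"role": msg["role"], "content": msg.get("content")})
    system_text = "\n\n".join(part for part in system_parts if part)
    if system_text:
        chat.insert(0, {"role": "system", "content": system_text})
    return chat


body = {
    "system": [{"type": "text", "text": "A"}, {"type": "text", "text": "B"}],
    "messages": [
        {"role": "user", "content": "help?"},
        {"role": "system", "content": "C"},
    ],
}
chat = to_chat_messages(body)
print([m["role"] for m in chat])
```

Either placement yields a single system message in first position; the difference is whether the request model itself ever accepts role "system" in the messages array.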

@chaunceyjiang
Collaborator

@aleksandaryanakiev Thanks for this PR. I feel that this implementation is not very clear in terms of semantics. I have submitted a new PR #44283 and added you as a Co-Authored-By. Could you take a look? I think this new implementation should be cleaner.

bbrowning added a commit to bbrowning/vllm that referenced this pull request Jun 2, 2026
Claude Code sends the system prompt as messages with role "system"
in the messages array instead of the top-level system field. Extract
them during Pydantic validation and merge into the system field so
models that expect system content before user messages work correctly.

Based on vllm-project#43959.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Ben Browning <bbrownin@redhat.com>

Labels

frontend ready ONLY add when PR is ready to merge/full CI is needed verified Run pre-commit for new contributors without triggering other tests
