OpenAI-Compatible Server

vllm-mlx provides a FastAPI server with full OpenAI API compatibility.

Starting the Server

Simple Mode (Default)

Maximum throughput for a single user:

vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000

Continuous Batching Mode

For multiple concurrent users:

vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching

With Paged Cache

Memory-efficient caching for production:

vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching --use-paged-cache

Server Options

Option Description Default
--port Server port 8000
--host Server host 0.0.0.0
--api-key API key for authentication None
--rate-limit Requests per minute per client (0 = disabled) 0
--timeout Request timeout in seconds 300
--enable-metrics Expose Prometheus metrics on /metrics False
--continuous-batching Enable continuous batching for multiple concurrent users False
--use-paged-cache Enable paged KV cache False
--cache-memory-mb Cache memory limit in MB Auto
--cache-memory-percent Fraction of RAM for cache 0.20
--max-tokens Default max tokens 32768
--default-temperature Default temperature when not specified None
--default-top-p Default top_p when not specified None
--stream-interval Tokens per stream chunk 1
--mcp-config Path to MCP config file None
--reasoning-parser Parser for reasoning models (qwen3, deepseek_r1) None
--embedding-model Pre-load an embedding model at startup None
--enable-auto-tool-choice Enable automatic tool calling False
--tool-call-parser Tool call parser (see Tool Calling) None
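
When the server is started with --api-key, clients authenticate by sending that key as a Bearer token. A minimal sketch with the OpenAI Python client, assuming a placeholder key of your-secret-key:

from openai import OpenAI

# The client sends the key as "Authorization: Bearer <api_key>".
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-key",  # must match the value passed to --api-key
)

# Any authenticated call works as a quick check, e.g. listing models.
print([m.id for m in client.models.list().data])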

API Endpoints

Chat Completions

POST /v1/chat/completions
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Non-streaming
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)

# Streaming
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Completions

POST /v1/completions
response = client.completions.create(
    model="default",
    prompt="The capital of France is",
    max_tokens=50
)

Models

GET /v1/models

Returns available models.
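
With the OpenAI Python client:

for model in client.models.list().data:
    print(model.id)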

Embeddings

POST /v1/embeddings
response = client.embeddings.create(
    model="mlx-community/multilingual-e5-small-mlx",
    input="Hello world"
)
print(response.data[0].embedding[:5])  # First 5 dimensions

See Embeddings Guide for details.

Health Check

GET /health

Returns server status.
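
A quick check from Python (a 200 response means the server is up; the body shape is not specified here):

import requests

resp = requests.get("http://localhost:8000/health")
print(resp.status_code, resp.text)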

Metrics

GET /metrics

Prometheus scrape endpoint for server, cache, scheduler, and request metrics. The endpoint is disabled by default and is enabled with --enable-metrics.

vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit \
  --enable-metrics

/metrics is intentionally unauthenticated. Expose it only on a trusted network or behind a reverse proxy / firewall that limits who can scrape it.
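
A quick way to inspect the exposed metrics from Python (Prometheus plain-text exposition format; an illustrative sketch):

import requests

text = requests.get("http://localhost:8000/metrics").text
# Print just the HELP lines for an overview of available metrics.
for line in text.splitlines():
    if line.startswith("# HELP"):
        print(line)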

Anthropic Messages API

POST /v1/messages

Anthropic-compatible endpoint that allows tools like Claude Code and OpenCode to connect directly to vllm-mlx. Internally it translates Anthropic requests to OpenAI format, runs inference through the engine, and converts the response back to Anthropic format.

Capabilities:

  • Non-streaming and streaming responses (SSE)
  • System messages (plain string or list of content blocks)
  • Multi-turn conversations with user and assistant messages
  • Tool calling with tool_use / tool_result content blocks
  • Token counting for budget tracking
  • Multimodal content (images via source blocks)
  • Client disconnect detection (returns HTTP 499)
  • Automatic special token filtering in streamed output

Non-streaming

from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")

response = client.messages.create(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.content[0].text)
# Response includes: response.id, response.model, response.stop_reason,
# response.usage.input_tokens, response.usage.output_tokens

Streaming

Streaming follows the Anthropic SSE event protocol. Events are emitted in this order: message_start -> content_block_start -> content_block_delta (repeated) -> content_block_stop -> message_delta -> message_stop

with client.messages.stream(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Tell me a story"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="")

System messages

System messages can be a plain string or a list of content blocks:

# Plain string
response = client.messages.create(
    model="default",
    max_tokens=256,
    system="You are a helpful coding assistant.",
    messages=[{"role": "user", "content": "Write a hello world in Python"}]
)

# List of content blocks
response = client.messages.create(
    model="default",
    max_tokens=256,
    system=[
        {"type": "text", "text": "You are a helpful assistant."},
        {"type": "text", "text": "Be concise in your answers."},
    ],
    messages=[{"role": "user", "content": "What is 2+2?"}]
)

Tool calling

Define tools with name, description, and input_schema. The model returns tool_use content blocks when it wants to call a tool. Send results back as tool_result blocks.

# Step 1: Send request with tools
response = client.messages.create(
    model="default",
    max_tokens=1024,
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "name": "get_weather",
        "description": "Get weather for a city",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }]
)

# Step 2: Check if model wants to use tools
for block in response.content:
    if block.type == "tool_use":
        print(f"Tool: {block.name}, Input: {block.input}, ID: {block.id}")
        # response.stop_reason will be "tool_use"

# Step 3: Send tool result back
response = client.messages.create(
    model="default",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"},
        {"role": "assistant", "content": response.content},
        {"role": "user", "content": [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": "Sunny, 22C"
            }
        ]}
    ],
    tools=[{
        "name": "get_weather",
        "description": "Get weather for a city",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }]
)
print(response.content[0].text)  # "The weather in Paris is sunny, 22C."

Tool choice modes:

tool_choice Behavior
{"type": "auto"} Model decides whether to call tools (default)
{"type": "any"} Model must call at least one tool
{"type": "tool", "name": "get_weather"} Model must call the specified tool
{"type": "none"} Model will not call any tools

Multi-turn conversations

messages = [
    {"role": "user", "content": "My name is Alice."},
    {"role": "assistant", "content": "Nice to meet you, Alice!"},
    {"role": "user", "content": "What's my name?"},
]

response = client.messages.create(
    model="default",
    max_tokens=100,
    messages=messages
)

Token counting

POST /v1/messages/count_tokens

Counts input tokens for an Anthropic request using the model's tokenizer. Useful for budget tracking before sending a request. Counts tokens from system messages, conversation messages, tool_use inputs, tool_result content, and tool definitions (name, description, input_schema).

import requests

resp = requests.post("http://localhost:8000/v1/messages/count_tokens", json={
    "model": "default",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "system": "You are helpful.",
    "tools": [{
        "name": "search",
        "description": "Search the web",
        "input_schema": {"type": "object", "properties": {"q": {"type": "string"}}}
    }]
})
print(resp.json())  # {"input_tokens": 42}

curl examples

Non-streaming:

curl http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Streaming:

curl http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "max_tokens": 256,
    "stream": true,
    "messages": [{"role": "user", "content": "Tell me a joke"}]
  }'

Token counting:

curl http://localhost:8000/v1/messages/count_tokens \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
# {"input_tokens": 12}

Request fields

Field Type Required Default Description
model string yes - Model name (use "default" for the loaded model)
messages list yes - Conversation messages with role and content
max_tokens int yes - Maximum number of tokens to generate
system string or list no null System prompt (string or list of {"type": "text", "text": "..."} blocks)
stream bool no false Enable SSE streaming
temperature float no 0.7 Sampling temperature (0.0 = deterministic, 1.0 = creative)
top_p float no 0.9 Nucleus sampling threshold
top_k int no null Top-k sampling
stop_sequences list no null Sequences that stop generation
tools list no null Tool definitions with name, description, input_schema
tool_choice dict no null Tool selection mode (auto, any, tool, none)
metadata dict no null Arbitrary metadata (passed through, not used by server)
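
The sampling fields can also be sent in a raw request; a sketch with the requests library (values are illustrative):

import requests

resp = requests.post("http://localhost:8000/v1/messages", json={
    "model": "default",
    "max_tokens": 128,
    "temperature": 0.2,
    "top_p": 0.95,
    "stop_sequences": ["\n\n"],
    "messages": [{"role": "user", "content": "Give one word that means 'happy'."}]
})
print(resp.json()["content"][0]["text"])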

Response format

Non-streaming response:

{
  "id": "msg_abc123...",
  "type": "message",
  "role": "assistant",
  "model": "default",
  "content": [
    {"type": "text", "text": "Hello! How can I help?"}
  ],
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 12,
    "output_tokens": 8
  }
}

When tools are called, content includes tool_use blocks and stop_reason is "tool_use":

{
  "content": [
    {"type": "text", "text": "Let me check the weather."},
    {
      "type": "tool_use",
      "id": "call_abc123",
      "name": "get_weather",
      "input": {"city": "Paris"}
    }
  ],
  "stop_reason": "tool_use"
}

Stop reasons:

stop_reason Meaning
end_turn Model finished naturally
tool_use Model wants to call a tool
max_tokens Hit the max_tokens limit
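
A typical way to branch on stop_reason in client code (assuming response comes from a prior messages.create call):

if response.stop_reason == "tool_use":
    print("Model requested a tool call; execute it and send a tool_result back.")
elif response.stop_reason == "max_tokens":
    print("Output was truncated; retry with a larger max_tokens if needed.")
else:  # "end_turn"
    print(response.content[0].text)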

Using with Claude Code

Point Claude Code directly at your vllm-mlx server:

# Start the server
vllm-mlx serve mlx-community/Qwen3-Coder-Next-235B-A22B-4bit \
  --continuous-batching \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

# In another terminal, configure Claude Code
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude

Server Status

GET /v1/status

Real-time monitoring endpoint that returns server-wide statistics and per-request details. Useful for debugging performance, tracking cache efficiency, and monitoring Metal GPU memory.

curl -s http://localhost:8000/v1/status | python -m json.tool

Example response:

{
  "status": "running",
  "model": "mlx-community/Qwen3-8B-4bit",
  "uptime_s": 342.5,
  "steps_executed": 1247,
  "num_running": 1,
  "num_waiting": 0,
  "total_requests_processed": 15,
  "total_prompt_tokens": 28450,
  "total_completion_tokens": 3200,
  "metal": {
    "active_memory_gb": 5.2,
    "peak_memory_gb": 8.1,
    "cache_memory_gb": 2.3
  },
  "cache": {
    "type": "memory_aware_cache",
    "entries": 5,
    "hit_rate": 0.87,
    "memory_mb": 2350
  },
  "requests": [
    {
      "request_id": "req_abc123",
      "phase": "generation",
      "tokens_per_second": 45.2,
      "ttft_s": 0.8,
      "progress": 0.35,
      "cache_hit_type": "prefix",
      "cached_tokens": 1200,
      "generated_tokens": 85,
      "max_tokens": 256
    }
  ]
}

Response fields:

Field Description
status Server state: running, stopped, or not_loaded
model Name of the loaded model
uptime_s Seconds since the server started
steps_executed Total inference steps executed
num_running Number of requests currently generating tokens
num_waiting Number of requests queued for prefill
total_requests_processed Total requests completed since startup
total_prompt_tokens Total prompt tokens processed since startup
total_completion_tokens Total completion tokens generated since startup
metal.active_memory_gb Current Metal GPU memory in use (GB)
metal.peak_memory_gb Peak Metal GPU memory usage (GB)
metal.cache_memory_gb Metal cache memory usage (GB)
cache Cache statistics (type, entries, hit rate, memory usage)
requests List of active requests with per-request details

Per-request fields in requests:

Field Description
request_id Unique request identifier
phase Current phase: queued, prefill, or generation
tokens_per_second Generation throughput for this request
ttft_s Time to first token (seconds)
progress Completion percentage (0.0 to 1.0)
cache_hit_type Cache match type: exact, prefix, supersequence, lcp, or miss
cached_tokens Number of tokens served from cache
generated_tokens Tokens generated so far
max_tokens Maximum tokens requested
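
A small monitoring sketch that polls /v1/status and prints per-request throughput (field names as documented above):

import requests

status = requests.get("http://localhost:8000/v1/status").json()
print(f"{status['num_running']} running, {status['num_waiting']} waiting")
for req in status.get("requests", []):
    print(f"  {req['request_id']}: {req['phase']}, "
          f"{req.get('tokens_per_second', 0):.1f} tok/s, "
          f"{req.get('generated_tokens', 0)}/{req.get('max_tokens', 0)} tokens")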

Tool Calling

Enable OpenAI-compatible tool calling with --enable-auto-tool-choice:

vllm-mlx serve mlx-community/Devstral-Small-2507-4bit \
  --enable-auto-tool-choice \
  --tool-call-parser mistral

Use the --tool-call-parser option to select the parser for your model:

Parser Models
auto Auto-detect (tries all parsers)
mistral Mistral, Devstral
qwen Qwen, Qwen3
llama Llama 3.x, 4.x
hermes Hermes, NousResearch
deepseek DeepSeek V3, R1
kimi Kimi K2, Moonshot
granite IBM Granite 3.x, 4.x
nemotron NVIDIA Nemotron
xlam Salesforce xLAM
functionary MeetKai Functionary
glm47 GLM-4.7, GLM-4.7-Flash

Once tool calling is enabled, pass tool definitions in the standard OpenAI format:

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
            }
        }
    }]
)

if response.choices[0].message.tool_calls:
    for tc in response.choices[0].message.tool_calls:
        print(f"{tc.function.name}: {tc.function.arguments}")

See Tool Calling Guide for full documentation.

Reasoning Models

For models that show their thinking process (Qwen3, DeepSeek-R1), use --reasoning-parser to separate reasoning from the final answer:

# Qwen3 models
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3

# DeepSeek-R1 models
vllm-mlx serve mlx-community/DeepSeek-R1-Distill-Qwen-7B-4bit --reasoning-parser deepseek_r1

The API response includes a reasoning field with the model's thought process:

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is 17 × 23?"}]
)

print(response.choices[0].message.reasoning)  # Step-by-step thinking
print(response.choices[0].message.content)    # Final answer

For streaming, reasoning chunks arrive first, followed by content chunks:

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is 17 × 23?"}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.reasoning:
        print(f"[Thinking] {delta.reasoning}")
    if delta.content:
        print(delta.content, end="")

See Reasoning Models Guide for full details.

Structured Output (JSON Mode)

Force the model to return valid JSON using response_format:

JSON Object Mode

Returns any valid JSON:

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "List 3 colors"}],
    response_format={"type": "json_object"}
)
# Output: {"colors": ["red", "blue", "green"]}

JSON Schema Mode

Returns JSON matching a specific schema:

import json

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "List 3 colors"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "colors",
            "schema": {
                "type": "object",
                "properties": {
                    "colors": {
                        "type": "array",
                        "items": {"type": "string"}
                    }
                },
                "required": ["colors"]
            }
        }
    }
)
# Output validated against schema
data = json.loads(response.choices[0].message.content)
assert "colors" in data

Curl Example

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "List 3 colors"}],
    "response_format": {"type": "json_object"}
  }'

Curl Examples

Chat

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

Streaming

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Streaming Configuration

Control streaming behavior with --stream-interval:

Value Behavior
1 (default) Send every token immediately
2-5 Batch tokens before sending
10+ Maximum throughput, chunkier output
# Smooth streaming
vllm-mlx serve model --continuous-batching --stream-interval 1

# Batched streaming (better for high-latency networks)
vllm-mlx serve model --continuous-batching --stream-interval 5

Open WebUI Integration

# 1. Start vllm-mlx server
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000

# 2. Start Open WebUI
docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=not-needed \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

# 3. Open http://localhost:3000

Production Deployment

With systemd

Create /etc/systemd/system/vllm-mlx.service:

[Unit]
Description=vLLM-MLX Server
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/vllm-mlx serve mlx-community/Qwen3-0.6B-8bit \
  --continuous-batching --use-paged-cache --port 8000
Restart=always

[Install]
WantedBy=multi-user.target

Then enable and start the service:

sudo systemctl enable vllm-mlx
sudo systemctl start vllm-mlx

Recommended Settings

For production with 50+ concurrent users:

vllm-mlx serve mlx-community/Qwen3-0.6B-8bit \
  --continuous-batching \
  --use-paged-cache \
  --api-key your-secret-key \
  --rate-limit 60 \
  --timeout 120 \
  --port 8000