vllm-mlx provides a FastAPI server with full OpenAI API compatibility.
Maximum throughput for a single user:

```bash
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000
```

For multiple concurrent users:

```bash
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching
```

Memory-efficient caching for production:

```bash
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching --use-paged-cache
```

| Option | Description | Default |
|---|---|---|
| `--port` | Server port | 8000 |
| `--host` | Server host | 0.0.0.0 |
| `--api-key` | API key for authentication | None |
| `--rate-limit` | Requests per minute per client (0 = disabled) | 0 |
| `--timeout` | Request timeout in seconds | 300 |
| `--enable-metrics` | Expose Prometheus metrics on `/metrics` | False |
| `--continuous-batching` | Enable batching for multiple users | False |
| `--use-paged-cache` | Enable paged KV cache | False |
| `--cache-memory-mb` | Cache memory limit in MB | Auto |
| `--cache-memory-percent` | Fraction of RAM for cache | 0.20 |
| `--max-tokens` | Default max tokens | 32768 |
| `--default-temperature` | Default temperature when not specified | None |
| `--default-top-p` | Default top_p when not specified | None |
| `--stream-interval` | Tokens per stream chunk | 1 |
| `--mcp-config` | Path to MCP config file | None |
| `--reasoning-parser` | Parser for reasoning models (qwen3, deepseek_r1) | None |
| `--embedding-model` | Pre-load an embedding model at startup | None |
| `--enable-auto-tool-choice` | Enable automatic tool calling | False |
| `--tool-call-parser` | Tool call parser (see Tool Calling) | None |
`POST /v1/chat/completions`

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Non-streaming
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100
)

# Streaming
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

`POST /v1/completions`

```python
response = client.completions.create(
    model="default",
    prompt="The capital of France is",
    max_tokens=50
)
```

`GET /v1/models`

Returns available models.
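You can verify what is loaded with the OpenAI SDK's standard model listing (a minimal sketch; the returned IDs depend on the model you started the server with):

```python
# List the models the server exposes; IDs depend on what was loaded at startup
models = client.models.list()
for m in models.data:
    print(m.id)
```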
`POST /v1/embeddings`

```python
response = client.embeddings.create(
    model="mlx-community/multilingual-e5-small-mlx",
    input="Hello world"
)
print(response.data[0].embedding[:5])  # First 5 dimensions
```

See Embeddings Guide for details.
`GET /health`

Returns server status.
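A quick liveness check from Python (a minimal sketch; it only assumes the endpoint answers HTTP GET, since the exact response body is not specified here):

```python
import requests

# A 200 response means the server is up; the body is printed as-is
resp = requests.get("http://localhost:8000/health", timeout=5)
print(resp.status_code, resp.text)
```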
`GET /metrics`

Prometheus scrape endpoint for server, cache, scheduler, and request metrics.
The endpoint is disabled by default and is enabled with `--enable-metrics`.

```bash
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit \
    --enable-metrics
```

`/metrics` is intentionally unauthenticated. Expose it only on a trusted
network or behind a reverse proxy / firewall that limits who can scrape it.
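To confirm metrics are being exported, you can fetch the raw Prometheus text yourself (a minimal sketch; the specific metric names are not listed here, so it simply prints the first lines):

```python
import requests

# Print the first lines of the Prometheus exposition text
resp = requests.get("http://localhost:8000/metrics", timeout=5)
for line in resp.text.splitlines()[:10]:
    print(line)
```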
`POST /v1/messages`

Anthropic-compatible endpoint that allows tools like Claude Code and OpenCode to connect directly to vllm-mlx. Internally it translates Anthropic requests to OpenAI format, runs inference through the engine, and converts the response back to Anthropic format.
Capabilities:
- Non-streaming and streaming responses (SSE)
- System messages (plain string or list of content blocks)
- Multi-turn conversations with user and assistant messages
- Tool calling with `tool_use`/`tool_result` content blocks
- Token counting for budget tracking
- Multimodal content (images via `source` blocks)
- Client disconnect detection (returns HTTP 499)
- Automatic special token filtering in streamed output
```python
from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")

response = client.messages.create(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.content[0].text)

# Response includes: response.id, response.model, response.stop_reason,
# response.usage.input_tokens, response.usage.output_tokens
```

Streaming follows the Anthropic SSE event protocol. Events are emitted in this order:

`message_start` -> `content_block_start` -> `content_block_delta` (repeated) -> `content_block_stop` -> `message_delta` -> `message_stop`
```python
with client.messages.stream(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Tell me a story"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="")
```

System messages can be a plain string or a list of content blocks:
```python
# Plain string
response = client.messages.create(
    model="default",
    max_tokens=256,
    system="You are a helpful coding assistant.",
    messages=[{"role": "user", "content": "Write a hello world in Python"}]
)

# List of content blocks
response = client.messages.create(
    model="default",
    max_tokens=256,
    system=[
        {"type": "text", "text": "You are a helpful assistant."},
        {"type": "text", "text": "Be concise in your answers."},
    ],
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
```

Define tools with `name`, `description`, and `input_schema`. The model returns `tool_use` content blocks when it wants to call a tool. Send results back as `tool_result` blocks.
```python
# Step 1: Send request with tools
response = client.messages.create(
    model="default",
    max_tokens=1024,
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "name": "get_weather",
        "description": "Get weather for a city",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }]
)

# Step 2: Check if model wants to use tools
for block in response.content:
    if block.type == "tool_use":
        print(f"Tool: {block.name}, Input: {block.input}, ID: {block.id}")
# response.stop_reason will be "tool_use"

# Step 3: Send tool result back
response = client.messages.create(
    model="default",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"},
        {"role": "assistant", "content": response.content},
        {"role": "user", "content": [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": "Sunny, 22C"
            }
        ]}
    ],
    tools=[{
        "name": "get_weather",
        "description": "Get weather for a city",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }]
)
print(response.content[0].text)  # "The weather in Paris is sunny, 22C."
```

Tool choice modes:
| `tool_choice` | Behavior |
|---|---|
| `{"type": "auto"}` | Model decides whether to call tools (default) |
| `{"type": "any"}` | Model must call at least one tool |
| `{"type": "tool", "name": "get_weather"}` | Model must call the specified tool |
| `{"type": "none"}` | Model will not call any tools |
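For example, forcing a specific tool (a minimal sketch that reuses the `get_weather` definition from the example above; the prompt is illustrative):

```python
# Force the model to call get_weather regardless of how the prompt is phrased
response = client.messages.create(
    model="default",
    max_tokens=1024,
    messages=[{"role": "user", "content": "How's Paris today?"}],
    tools=[{
        "name": "get_weather",
        "description": "Get weather for a city",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
        }
    }],
    tool_choice={"type": "tool", "name": "get_weather"},
)
print(response.stop_reason)  # "tool_use"
```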
Multi-turn conversations pass the full message history:

```python
messages = [
    {"role": "user", "content": "My name is Alice."},
    {"role": "assistant", "content": "Nice to meet you, Alice!"},
    {"role": "user", "content": "What's my name?"},
]
response = client.messages.create(
    model="default",
    max_tokens=100,
    messages=messages
)
```

`POST /v1/messages/count_tokens`

Counts input tokens for an Anthropic request using the model's tokenizer. Useful for budget tracking before sending a request. Counts tokens from system messages, conversation messages, `tool_use` inputs, `tool_result` content, and tool definitions (name, description, input_schema).
```python
import requests

resp = requests.post("http://localhost:8000/v1/messages/count_tokens", json={
    "model": "default",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "system": "You are helpful.",
    "tools": [{
        "name": "search",
        "description": "Search the web",
        "input_schema": {"type": "object", "properties": {"q": {"type": "string"}}}
    }]
})
print(resp.json())  # {"input_tokens": 42}
```

Non-streaming:
```bash
curl http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Streaming:

```bash
curl http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "max_tokens": 256,
    "stream": true,
    "messages": [{"role": "user", "content": "Tell me a joke"}]
  }'
```

Token counting:

```bash
curl http://localhost:8000/v1/messages/count_tokens \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
# {"input_tokens": 12}
```

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `model` | string | yes | - | Model name (use "default" for the loaded model) |
| `messages` | list | yes | - | Conversation messages with role and content |
| `max_tokens` | int | yes | - | Maximum number of tokens to generate |
| `system` | string or list | no | null | System prompt (string or list of `{"type": "text", "text": "..."}` blocks) |
| `stream` | bool | no | false | Enable SSE streaming |
| `temperature` | float | no | 0.7 | Sampling temperature (0.0 = deterministic, 1.0 = creative) |
| `top_p` | float | no | 0.9 | Nucleus sampling threshold |
| `top_k` | int | no | null | Top-k sampling |
| `stop_sequences` | list | no | null | Sequences that stop generation |
| `tools` | list | no | null | Tool definitions with name, description, input_schema |
| `tool_choice` | dict | no | null | Tool selection mode (auto, any, tool, none) |
| `metadata` | dict | no | null | Arbitrary metadata (passed through, not used by server) |
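A raw request that exercises several of these fields (a minimal sketch; the sampling values and stop sequence are illustrative):

```python
import requests

# Illustrative /v1/messages request using temperature, top_p, and stop_sequences
resp = requests.post("http://localhost:8000/v1/messages", json={
    "model": "default",
    "max_tokens": 128,
    "temperature": 0.2,
    "top_p": 0.95,
    "stop_sequences": ["\n\n"],
    "messages": [{"role": "user", "content": "Write one haiku about the sea."}],
})
print(resp.json()["content"][0]["text"])
```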
Non-streaming response:
```json
{
  "id": "msg_abc123...",
  "type": "message",
  "role": "assistant",
  "model": "default",
  "content": [
    {"type": "text", "text": "Hello! How can I help?"}
  ],
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 12,
    "output_tokens": 8
  }
}
```

When tools are called, `content` includes `tool_use` blocks and `stop_reason` is `"tool_use"`:
```json
{
  "content": [
    {"type": "text", "text": "Let me check the weather."},
    {
      "type": "tool_use",
      "id": "call_abc123",
      "name": "get_weather",
      "input": {"city": "Paris"}
    }
  ],
  "stop_reason": "tool_use"
}
```

Stop reasons:
| `stop_reason` | Meaning |
|---|---|
| `end_turn` | Model finished naturally |
| `tool_use` | Model wants to call a tool |
| `max_tokens` | Hit the max_tokens limit |
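In client code this usually drives a small branch on how generation ended (a minimal sketch using the `response` object from the examples above):

```python
# Branch on how generation ended
if response.stop_reason == "tool_use":
    ...  # run the requested tool and send a tool_result message back
elif response.stop_reason == "max_tokens":
    ...  # output was truncated; retry with a larger max_tokens if needed
else:  # "end_turn"
    print(response.content[0].text)
```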
Point Claude Code directly at your vllm-mlx server:
```bash
# Start the server
vllm-mlx serve mlx-community/Qwen3-Coder-Next-235B-A22B-4bit \
    --continuous-batching \
    --enable-auto-tool-choice \
    --tool-call-parser hermes

# In another terminal, configure Claude Code
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude
```

`GET /v1/status`

Real-time monitoring endpoint that returns server-wide statistics and per-request details. Useful for debugging performance, tracking cache efficiency, and monitoring Metal GPU memory.
```bash
curl -s http://localhost:8000/v1/status | python -m json.tool
```

Example response:

```json
{
  "status": "running",
  "model": "mlx-community/Qwen3-8B-4bit",
  "uptime_s": 342.5,
  "steps_executed": 1247,
  "num_running": 1,
  "num_waiting": 0,
  "total_requests_processed": 15,
  "total_prompt_tokens": 28450,
  "total_completion_tokens": 3200,
  "metal": {
    "active_memory_gb": 5.2,
    "peak_memory_gb": 8.1,
    "cache_memory_gb": 2.3
  },
  "cache": {
    "type": "memory_aware_cache",
    "entries": 5,
    "hit_rate": 0.87,
    "memory_mb": 2350
  },
  "requests": [
    {
      "request_id": "req_abc123",
      "phase": "generation",
      "tokens_per_second": 45.2,
      "ttft_s": 0.8,
      "progress": 0.35,
      "cache_hit_type": "prefix",
      "cached_tokens": 1200,
      "generated_tokens": 85,
      "max_tokens": 256
    }
  ]
}
```

Response fields:
| Field | Description |
|---|---|
| `status` | Server state: `running`, `stopped`, or `not_loaded` |
| `model` | Name of the loaded model |
| `uptime_s` | Seconds since the server started |
| `steps_executed` | Total inference steps executed |
| `num_running` | Number of requests currently generating tokens |
| `num_waiting` | Number of requests queued for prefill |
| `total_requests_processed` | Total requests completed since startup |
| `total_prompt_tokens` | Total prompt tokens processed since startup |
| `total_completion_tokens` | Total completion tokens generated since startup |
| `metal.active_memory_gb` | Current Metal GPU memory in use (GB) |
| `metal.peak_memory_gb` | Peak Metal GPU memory usage (GB) |
| `metal.cache_memory_gb` | Metal cache memory usage (GB) |
| `cache` | Cache statistics (type, entries, hit rate, memory usage) |
| `requests` | List of active requests with per-request details |
Per-request fields in `requests`:

| Field | Description |
|---|---|
| `request_id` | Unique request identifier |
| `phase` | Current phase: `queued`, `prefill`, or `generation` |
| `tokens_per_second` | Generation throughput for this request |
| `ttft_s` | Time to first token (seconds) |
| `progress` | Completion percentage (0.0 to 1.0) |
| `cache_hit_type` | Cache match type: `exact`, `prefix`, `supersequence`, `lcp`, or `miss` |
| `cached_tokens` | Number of tokens served from cache |
| `generated_tokens` | Tokens generated so far |
| `max_tokens` | Maximum tokens requested |
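These fields make it straightforward to build a small live monitor (a minimal sketch; the polling interval and formatting are arbitrary):

```python
import time
import requests

# Poll /v1/status and print a one-line summary per active request
while True:
    status = requests.get("http://localhost:8000/v1/status", timeout=5).json()
    print(f"running={status['num_running']} waiting={status['num_waiting']} "
          f"metal_gb={status['metal']['active_memory_gb']:.1f}")
    for req in status.get("requests", []):
        print(f"  {req['request_id']}: {req['phase']} "
              f"{req['tokens_per_second']:.1f} tok/s "
              f"progress={req['progress']:.0%}")
    time.sleep(2)
```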
Enable OpenAI-compatible tool calling with `--enable-auto-tool-choice`:

```bash
vllm-mlx serve mlx-community/Devstral-Small-2507-4bit \
    --enable-auto-tool-choice \
    --tool-call-parser mistral
```

Use the `--tool-call-parser` option to select the parser for your model:
| Parser | Models |
|---|---|
| `auto` | Auto-detect (tries all parsers) |
| `mistral` | Mistral, Devstral |
| `qwen` | Qwen, Qwen3 |
| `llama` | Llama 3.x, 4.x |
| `hermes` | Hermes, NousResearch |
| `deepseek` | DeepSeek V3, R1 |
| `kimi` | Kimi K2, Moonshot |
| `granite` | IBM Granite 3.x, 4.x |
| `nemotron` | NVIDIA Nemotron |
| `xlam` | Salesforce xLAM |
| `functionary` | MeetKai Functionary |
| `glm47` | GLM-4.7, GLM-4.7-Flash |
```python
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
            }
        }
    }]
)

if response.choices[0].message.tool_calls:
    for tc in response.choices[0].message.tool_calls:
        print(f"{tc.function.name}: {tc.function.arguments}")
```

See Tool Calling Guide for full documentation.
For models that show their thinking process (Qwen3, DeepSeek-R1), use `--reasoning-parser` to separate reasoning from the final answer:

```bash
# Qwen3 models
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3

# DeepSeek-R1 models
vllm-mlx serve mlx-community/DeepSeek-R1-Distill-Qwen-7B-4bit --reasoning-parser deepseek_r1
```

The API response includes a `reasoning` field with the model's thought process:
```python
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is 17 × 23?"}]
)
print(response.choices[0].message.reasoning)  # Step-by-step thinking
print(response.choices[0].message.content)    # Final answer
```

For streaming, reasoning chunks arrive first, followed by content chunks:
```python
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.reasoning:
        print(f"[Thinking] {delta.reasoning}")
    if delta.content:
        print(delta.content, end="")
```

See Reasoning Models Guide for full details.
Force the model to return valid JSON using `response_format`:

Returns any valid JSON:

```python
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "List 3 colors"}],
    response_format={"type": "json_object"}
)
# Output: {"colors": ["red", "blue", "green"]}
```

Returns JSON matching a specific schema:
```python
import json

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "List 3 colors"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "colors",
            "schema": {
                "type": "object",
                "properties": {
                    "colors": {
                        "type": "array",
                        "items": {"type": "string"}
                    }
                },
                "required": ["colors"]
            }
        }
    }
)

# Output validated against schema
data = json.loads(response.choices[0].message.content)
assert "colors" in data
```

JSON mode with curl:

```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "List 3 colors"}],
"response_format": {"type": "json_object"}
}'curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'Control streaming behavior with --stream-interval:
| Value | Behavior |
|---|---|
| 1 (default) | Send every token immediately |
| 2-5 | Batch tokens before sending |
| 10+ | Maximum throughput, chunkier output |
```bash
# Smooth streaming
vllm-mlx serve model --continuous-batching --stream-interval 1

# Batched streaming (better for high-latency networks)
vllm-mlx serve model --continuous-batching --stream-interval 5
```

To use the server with Open WebUI:

```bash
# 1. Start vllm-mlx server
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000
# 2. Start Open WebUI
docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=not-needed \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

# 3. Open http://localhost:3000
```

Create `/etc/systemd/system/vllm-mlx.service`:
```ini
[Unit]
Description=vLLM-MLX Server
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/vllm-mlx serve mlx-community/Qwen3-0.6B-8bit \
    --continuous-batching --use-paged-cache --port 8000
Restart=always

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl enable vllm-mlx
sudo systemctl start vllm-mlx
```

For production with 50+ concurrent users:
```bash
vllm-mlx serve mlx-community/Qwen3-0.6B-8bit \
    --continuous-batching \
    --use-paged-cache \
    --api-key your-secret-key \
    --rate-limit 60 \
    --timeout 120 \
    --port 8000
```
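Clients then authenticate with the same key (a minimal sketch; `your-secret-key` mirrors the `--api-key` value above):

```python
from openai import OpenAI

# The api_key must match the --api-key value the server was started with
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-secret-key")
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=50,
)
print(response.choices[0].message.content)
```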