Local-first LLM proxy for Claude Code. Routes simple tasks through your local model (Lemonade) and escalates complex reasoning to the Anthropic API. Save up to 90% on Anthropic token spend.
"Why spend a dollar when a penny will do?" — the architect
Claude Code ──→ proxy:8443 ──→ Local LLM (Lemonade/Qwen) ──→ response
                    │
                    └──→ Anthropic API (when local can't handle it)
The proxy intercepts Anthropic API calls and decides where to route them:
- Local first: Simple prompts, single tool calls, short context → Lemonade
- API escalation: Multi-step reasoning, long context, tool chains, local model uncertainty → Anthropic
- Smart retry: If local response quality is low, automatically retry with Anthropic
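The routing rules above can be sketched as a small decision function. This is an illustrative sketch, not the proxy's actual implementation; the `Request` shape, function name, and thresholds (which mirror the `[routing]` config defaults) are assumptions.

```python
from dataclasses import dataclass

# Thresholds mirroring the [routing] config section below (assumed defaults)
MAX_LOCAL_PROMPT_TOKENS = 8000
MAX_LOCAL_TOOLS = 3
ALWAYS_API_PATTERNS = ("plan", "architect", "refactor")

@dataclass
class Request:
    prompt_tokens: int   # estimated size of the incoming prompt
    tool_count: int      # tools offered in this turn
    text: str = ""       # user-visible prompt text

def route(req: Request) -> str:
    """Return 'api' or 'local' per the local-first rules."""
    if req.prompt_tokens > MAX_LOCAL_PROMPT_TOKENS:
        return "api"     # long context → escalate
    if req.tool_count > MAX_LOCAL_TOOLS:
        return "api"     # tool chains → escalate
    if any(p in req.text.lower() for p in ALWAYS_API_PATTERNS):
        return "api"     # keyword triggers → escalate
    return "local"       # simple prompt → Lemonade

print(route(Request(prompt_tokens=500, tool_count=1, text="fix this typo")))   # local
print(route(Request(prompt_tokens=500, tool_count=1, text="plan a refactor"))) # api
```

The "smart retry" path would re-run the same request with `route` forced to `"api"` when the local response scores below a quality threshold.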
- Lemonade SDK with a loaded model
- Claude Code CLI
- Python 3.12+
# Start the proxy
claude-hybrid-proxy --local http://localhost:13305 --api-key $ANTHROPIC_API_KEY
# Point Claude Code at it
ANTHROPIC_BASE_URL=http://localhost:8443 claude

# ~/.config/claude-hybrid-proxy/config.toml
[local]
url = "http://localhost:13305"
model = "Qwen3.5-35B-A3B-GGUF"
max_tokens = 4096 # local model context budget per turn
[anthropic]
# Falls back to ANTHROPIC_API_KEY env var
model = "claude-sonnet-4-20250514" # default escalation model
[routing]
# Thresholds for local vs API
max_local_prompt_tokens = 8000 # over this → API
max_local_tools = 3 # more simultaneous tools → API
always_api_patterns = ["plan", "architect", "refactor"] # keywords that trigger API
escalate_on_uncertainty = true # retry with API if local seems unsure

Built for the halo-ai ecosystem on AMD Strix Halo.
- Lemonade SDK (local LLM backend)
- FastAPI (proxy server)
- Anthropic Python SDK (API translation)
MIT
Designed and built by the architect.