feat: add LiteLLM as AI gateway backend#18
Conversation
mogwai
left a comment
There was a problem hiding this comment.
Thanks for this — nicely tested and it works end-to-end. Before merge I'd like to reshape it a bit, because a LiteLLM proxy is just an OpenAI-compatible /v1/chat/completions endpoint, and vui already has an OpenAI-compatible client: VLLMBackend. As written, LiteLLMBackend re-implements ~110 lines of it (stream, complete, list_models, _record_stats are near byte-identical), which means future fixes have to be made in two places.
There are only three genuine differences between vLLM-local and a cloud gateway, so this can subclass VLLMBackend and override just those:
class LiteLLMBackend(VLLMBackend):
"""OpenAI-compatible cloud gateway (LiteLLM proxy -> 100+ providers).
Same wire protocol as vLLM; differs only in dropping vLLM-only body
knobs, skipping prefill (no local KV to warm), and supplying auth.
Run the proxy with `drop_params: true` to let LiteLLM strip params
individual providers don't support.
"""
name = "litellm"
def __init__(
self,
model: str = "openai/gpt-4o-mini",
base_url: str = "http://localhost:4000",
*,
sampling: dict | None = None,
):
super().__init__(model=model, base_url=base_url, sampling=sampling)
def _client_inst(self) -> httpx.AsyncClient:
client = super()._client_inst()
key = os.environ.get("VUI_LITELLM_KEY")
if key:
client.headers["Authorization"] = f"Bearer {key}"
return client
def _body(self, messages, **kw) -> dict:
body = super()._body(messages, **kw)
body.pop("top_k", None) # not OpenAI-standard
body.pop("chat_template_kwargs", None) # vLLM-specific
return body
async def prefill(self, messages) -> None:
return # remote provider: no local KV to warm; a real call would just bill
async def set_model(self, name: str) -> None:
self.model = namectx_max then inherits vLLM's 8192 default, which keeps the context-fill gauge meaningful.
The substantive issues this addresses:
1. (blocker) prefill bills the cloud provider on a hot path. The inherited base prefill does complete(max_tokens=1) — a real round-trip. It's called in llm.py, voice_turn.py and thoughts.py, and per the comment on _client_inst spec-prefill fires "every few hundred ms during user speech." For Ollama/vLLM that's free local KV warming; against a cloud provider it's a billed request every few hundred ms with no benefit (there's no warmable KV behind a proxy). Needs the no-op override above.
2. (blocker) No auth. _client_inst() sends no Authorization header and there's no key env var, so the localhost:4000 default only works against an unauthenticated local proxy. Any real setup — a proxy with master_key/virtual keys, a remote proxy, or a provider directly — 401s. The VUI_LITELLM_KEY override above fixes this; please also document it (the README/docstring currently mention only VUI_LITELLM_URL and VUI_LITELLM_MODEL).
3. Provider param compatibility belongs in the proxy. Rather than hand-omitting params on the client (and the _resolve_sampling() call currently computes top_k/top_p/presence_penalty then discards them), drop only the two non-OpenAI knobs (top_k, chat_template_kwargs) and let LiteLLM's drop_params: true strip the rest per-provider — that's what it's for.
Minor: test_litellm_body_includes_drop_params is misleadingly named — there's no drop_params in the body; it actually asserts presence_penalty is absent. The subclass keeps your existing make_backend branch and tests working as-is (the body assertions still hold).
Summary
Add LiteLLM as a third LLM backend alongside Ollama and vLLM, enabling access to 100+ cloud LLM providers (OpenAI, Anthropic, Azure, Bedrock, Vertex, Groq, etc.) through the LiteLLM proxy.
Currently vui only supports local inference (Ollama, vLLM). This adds cloud provider support via
VUI_LLM_BACKEND=litellm.Changes
src/vui/serving/stream/llm_backend.pyLiteLLMBackendclass withstream(),complete(),list_models(),set_model()/v1/chat/completionsendpoint/v1/modelslitellmbranch inmake_backend()factoryVUI_LLM_BACKEND=litellm,VUI_LITELLM_URL,VUI_LITELLM_MODELtests/test_litellm_backend.py- 5 unit testsTests
Unit tests (5/5 pass):
Live E2E (LiteLLM proxy -> Claude Sonnet via Azure AI Foundry):
Complete, streaming, and model discovery all verified end-to-end.
Example usage