feat: add LiteLLM as AI gateway backend #18

Open
RheagalFire wants to merge 2 commits into fluxions-ai:main from RheagalFire:feat/add-litellm-provider

@RheagalFire

Summary

Add LiteLLM as a third LLM backend alongside Ollama and vLLM, enabling access to 100+ cloud LLM providers (OpenAI, Anthropic, Azure, Bedrock, Vertex, Groq, etc.) through the LiteLLM proxy.

vui currently supports only local inference (Ollama, vLLM). This PR adds cloud provider support, selected with VUI_LLM_BACKEND=litellm.

Changes

  • src/vui/serving/stream/llm_backend.py
    • Added LiteLLMBackend class with stream(), complete(), list_models(), set_model()
    • Uses OpenAI-compatible /v1/chat/completions endpoint
    • Only sends provider-safe params (temperature, max_tokens) to avoid cross-provider rejection
    • Auto-discovers available models via /v1/models
    • Added litellm branch in make_backend() factory
    • Config: VUI_LLM_BACKEND=litellm, VUI_LITELLM_URL, VUI_LITELLM_MODEL
  • tests/test_litellm_backend.py - 5 unit tests
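The configuration surface listed above can be pictured roughly as follows. This is a minimal sketch and not the PR's code: `resolve_litellm_config` is a hypothetical helper, the `http://localhost:4000` fallback follows the default mentioned in the review below, and the `openai/gpt-4o-mini` fallback is an assumption borrowed from the reviewer's sketch.

```python
import os
from typing import Mapping, Optional


def resolve_litellm_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: read the three variables named in the PR description."""
    env = os.environ if env is None else env
    return {
        # The backend is only active when explicitly selected.
        "enabled": env.get("VUI_LLM_BACKEND", "") == "litellm",
        # Assumed fallbacks; the PR's real defaults may differ.
        "base_url": env.get("VUI_LITELLM_URL", "http://localhost:4000"),
        "model": env.get("VUI_LITELLM_MODEL", "openai/gpt-4o-mini"),
    }


cfg = resolve_litellm_config(
    {
        "VUI_LLM_BACKEND": "litellm",
        "VUI_LITELLM_MODEL": "anthropic/claude-sonnet-4-6",
    }
)
print(cfg)
```

Passing the environment in as a mapping keeps the helper easy to unit test without patching `os.environ`.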

Tests

Unit tests (5/5 pass):

test_make_backend_litellm PASSED
test_litellm_body_includes_drop_params PASSED
test_litellm_body_with_tools PASSED
test_litellm_default_model PASSED
test_litellm_set_model PASSED

Live E2E (LiteLLM proxy -> Claude Sonnet via Azure AI Foundry):

Response: 4
Usage: {'prompt': 18, 'completion': 5, 'ctx_used': 23, 'ctx_max': 0}
Stream: Hi|!
Models: ['anthropic/claude-sonnet-4-6']

Complete, streaming, and model discovery all verified end-to-end.

Example usage

# Start LiteLLM proxy
pip install 'litellm[proxy]'
litellm --model anthropic/claude-sonnet-4-6

# Run vui with LiteLLM backend
VUI_LLM_BACKEND=litellm VUI_LITELLM_MODEL=anthropic/claude-sonnet-4-6 vui serve

# Or use the backend from Python
import asyncio

from vui.serving.stream.llm_backend import make_backend


async def main() -> None:
    backend = make_backend("litellm", model="anthropic/claude-sonnet-4-6")

    # Non-streaming
    result = await backend.complete(
        [{"role": "user", "content": "Hello!"}],
        max_tokens=100,
    )
    print(result["content"])

    # Streaming
    async for token in backend.stream(
        [{"role": "user", "content": "Tell me a story"}],
        max_tokens=200,
    ):
        print(token, end="", flush=True)


# complete() and stream() are awaited, so drive them from an event loop
asyncio.run(main())

@mogwai left a comment

Contributor

Thanks for this — nicely tested and it works end-to-end. Before merge I'd like to reshape it a bit, because a LiteLLM proxy is just an OpenAI-compatible /v1/chat/completions endpoint, and vui already has an OpenAI-compatible client: VLLMBackend. As written, LiteLLMBackend re-implements ~110 lines of it (stream, complete, list_models, _record_stats are near byte-identical), which means future fixes have to be made in two places.

There are only three genuine differences between vLLM-local and a cloud gateway, so this can subclass VLLMBackend and override just those:

class LiteLLMBackend(VLLMBackend):
    """OpenAI-compatible cloud gateway (LiteLLM proxy -> 100+ providers).

    Same wire protocol as vLLM; differs only in dropping vLLM-only body
    knobs, skipping prefill (no local KV to warm), and supplying auth.
    Run the proxy with `drop_params: true` to let LiteLLM strip params
    individual providers don't support.
    """

    name = "litellm"

    def __init__(
        self,
        model: str = "openai/gpt-4o-mini",
        base_url: str = "http://localhost:4000",
        *,
        sampling: dict | None = None,
    ):
        super().__init__(model=model, base_url=base_url, sampling=sampling)

    def _client_inst(self) -> httpx.AsyncClient:
        client = super()._client_inst()
        key = os.environ.get("VUI_LITELLM_KEY")
        if key:
            client.headers["Authorization"] = f"Bearer {key}"
        return client

    def _body(self, messages, **kw) -> dict:
        body = super()._body(messages, **kw)
        body.pop("top_k", None)                  # not OpenAI-standard
        body.pop("chat_template_kwargs", None)   # vLLM-specific
        return body

    async def prefill(self, messages) -> None:
        return  # remote provider: no local KV to warm; a real call would just bill

    async def set_model(self, name: str) -> None:
        self.model = name

ctx_max then inherits vLLM's 8192 default, which keeps the context-fill gauge meaningful.
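To see why the inherited default matters: the E2E run in the PR description reports `ctx_max: 0`, and a fill ratio cannot be computed against a zero-sized window. A minimal sketch, assuming the gauge is simply `ctx_used` divided by `ctx_max` (the function name `context_fill` is hypothetical and not part of vui):

```python
from typing import Optional


def context_fill(ctx_used: int, ctx_max: int) -> Optional[float]:
    """Fraction of the context window in use, or None when ctx_max is unknown."""
    if ctx_max <= 0:
        return None  # nothing meaningful to show on the gauge
    return min(ctx_used / ctx_max, 1.0)


fill_unknown = context_fill(23, 0)      # ctx_max as reported in the PR's E2E run
fill_default = context_fill(23, 8192)   # ctx_max inherited from the vLLM default
print(fill_unknown, fill_default)
```

With a non-zero `ctx_max` the gauge at least moves as the conversation grows, even though 8192 is not the true window of every cloud model.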

The substantive issues this addresses:

1. (blocker) prefill bills the cloud provider on a hot path. The inherited base prefill does complete(max_tokens=1), which is a real round-trip. It's called in llm.py, voice_turn.py and thoughts.py, and per the comment on _client_inst, spec-prefill fires "every few hundred ms during user speech." For Ollama/vLLM that's free local KV warming; against a cloud provider it's a billed request every few hundred ms with no benefit (there's no warmable KV behind a proxy). Needs the no-op override above.
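The cost difference can be illustrated with a small simulation. This is a sketch under stated assumptions: `CountingBackend` and `GatewayBackend` are hypothetical stand-ins (not vui classes), and the base `prefill` simply mirrors the 1-token completion described above.

```python
import asyncio


class CountingBackend:
    """Stand-in for the base backend; counts requests instead of sending them."""

    def __init__(self) -> None:
        self.requests = 0

    async def complete(self, messages, max_tokens: int = 256) -> dict:
        self.requests += 1  # a real backend would do an HTTP round trip here
        return {"content": ""}

    async def prefill(self, messages) -> None:
        # Base behaviour as described in the review: a 1-token completion.
        await self.complete(messages, max_tokens=1)


class GatewayBackend(CountingBackend):
    async def prefill(self, messages) -> None:
        return  # no local KV cache to warm, so skip the billed request


async def speculative_prefills(backend: CountingBackend, n: int) -> int:
    """Simulate n spec-prefill calls during user speech; return requests made."""
    for _ in range(n):
        await backend.prefill([{"role": "user", "content": "partial transcript"}])
    return backend.requests


local_requests = asyncio.run(speculative_prefills(CountingBackend(), 10))
gateway_requests = asyncio.run(speculative_prefills(GatewayBackend(), 10))
print(local_requests, gateway_requests)
```

Ten speculative prefills cost ten requests with the inherited behaviour and none with the override, which is the whole point of the no-op.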

2. (blocker) No auth. _client_inst() sends no Authorization header and there's no key env var, so the localhost:4000 default only works against an unauthenticated local proxy. Any real setup — a proxy with master_key/virtual keys, a remote proxy, or a provider directly — 401s. The VUI_LITELLM_KEY override above fixes this; please also document it (the README/docstring currently mention only VUI_LITELLM_URL and VUI_LITELLM_MODEL).
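The header logic itself is small. A minimal sketch, assuming the key is read from `VUI_LITELLM_KEY` as proposed above; `auth_headers` is a hypothetical helper and the key value is a placeholder:

```python
from typing import Mapping


def auth_headers(env: Mapping[str, str]) -> dict:
    """Return a Bearer header when VUI_LITELLM_KEY is set, else no extra headers."""
    key = env.get("VUI_LITELLM_KEY")
    # An empty or missing key keeps the unauthenticated local-proxy case working.
    return {"Authorization": f"Bearer {key}"} if key else {}


with_key = auth_headers({"VUI_LITELLM_KEY": "sk-example"})  # placeholder key
without_key = auth_headers({})
print(with_key, without_key)
```

Leaving the header out entirely when no key is configured means the existing localhost setup behaves exactly as before.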

3. Provider param compatibility belongs in the proxy. Rather than hand-omitting params on the client (and the _resolve_sampling() call currently computes top_k/top_p/presence_penalty then discards them), drop only the two non-OpenAI knobs (top_k, chat_template_kwargs) and let LiteLLM's drop_params: true strip the rest per-provider — that's what it's for.
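As a rough illustration of that split of responsibilities, the client removes only the two non-OpenAI keys and forwards everything else for the proxy to filter. This is a sketch, not vui code: `gateway_body` is a hypothetical function and the example body values are made up for illustration.

```python
# Keys the review identifies as not part of the OpenAI-compatible schema.
VLLM_ONLY_KEYS = ("top_k", "chat_template_kwargs")


def gateway_body(vllm_body: dict) -> dict:
    """Copy a vLLM-style request body and drop the vLLM-only keys."""
    body = dict(vllm_body)  # shallow copy so the caller's dict is untouched
    for key in VLLM_ONLY_KEYS:
        body.pop(key, None)
    return body


example = {
    "model": "anthropic/claude-sonnet-4-6",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "chat_template_kwargs": {"enable_thinking": False},  # illustrative value
}
cleaned = gateway_body(example)
print(sorted(cleaned))
```

Standard sampling fields such as `top_p` stay in the body, and whether a given provider accepts them is left to the proxy's `drop_params` setting.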

Minor: test_litellm_body_includes_drop_params is misleadingly named — there's no drop_params in the body; it actually asserts presence_penalty is absent. The subclass keeps your existing make_backend branch and tests working as-is (the body assertions still hold).


2 participants