llmstack

Chat with any codebase. Locally. Privately. Free.

The open-source alternative to Cursor and Copilot — runs entirely on your machine with Ollama.
Plus: smart routing, fine-tuning, AI agents, MCP server, and full observability.

Ask • Gateway • Smart Routing • Agents & MCP • Fine-tuning • Observability

pip install llmstack-cli
llmstack ask -i ./src/    # start chatting with your codebase

Ask Your Codebase Anything

llmstack ask "How does authentication work?" ./src/

One command. No API keys. No cloud. No Docker. No $20/month subscription. Just Ollama + your files.

  llmstack ask  model=llama3.2  embeddings=nomic-embed-text

  Git: main (15 recent commits)
  Index cached: 847 chunks (0 files changed)
  Embeddings loaded from cache

  Answer:
  Authentication works through API key validation in the FastAPI gateway
  middleware. Each request must include an `Authorization: Bearer <key>`
  header. The middleware validates keys against the stored list in
  llmstack.yaml [src/gateway/middleware/auth.py:23-45]. Rate limiting
  is tied to the API key — each key gets its own token bucket tracked
  in Redis [src/gateway/middleware/rate_limit.py:12-38].

  ┌──────────────────── Sources ─────────────────────┐
  │ File                              Lines   Score  │
  │ gateway/middleware/auth.py        23-45   0.0142 │
  │ gateway/middleware/rate_limit.py  12-38   0.0098 │
  │ config/schema.py                  89-102  0.0076 │
  └──────────────────────────────────────────────────┘

Why this is better than Cursor/Copilot/Aider

	llmstack ask	Cursor	Copilot	Aider	Khoj
AST-aware code chunking	Yes	Yes	-	Partial	No
Hybrid search (BM25 + vector)	Yes	?	-	No	No
Persistent incremental index	Yes	Yes	-	No	Yes
Git-aware context	Yes	Yes	-	Yes	No
Interactive conversation	Yes	Yes	-	Yes	Yes
20+ file types (PDF, DOCX, logs...)	Yes	No	No	No	Yes
100% local, 100% private	Yes	No	No	No	Yes
100% free, forever	Yes	$20/mo	$10/mo	API costs	Free
Zero config CLI	Yes	IDE only	IDE only	Config needed	Server needed

Key features

Persistent index — first query indexes your project (~30s). Every query after that: ~0.1s. Only re-embeds files that changed (SHA-256 hash diff).
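
A minimal sketch of how a SHA-256 hash diff can decide which files need re-embedding. This illustrates the idea only and is not llmstack's actual code; `file_digest` and `diff_index` are hypothetical names, and the cache is assumed to be a plain mapping of relative path to hex digest.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def diff_index(root: Path, cached: dict[str, str]) -> tuple[list[str], list[str], dict[str, str]]:
    """Compare files under root against cached digests (hypothetical helper).

    Returns (changed_or_new, deleted, fresh_digests). Only the files in the
    first list would need to be re-chunked and re-embedded.
    """
    fresh = {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    changed = [name for name, digest in fresh.items() if cached.get(name) != digest]
    deleted = [name for name in cached if name not in fresh]
    return changed, deleted, fresh
```

On a warm cache with no edits, `changed` and `deleted` are both empty, which corresponds to the "0 files changed" line in the sample output above.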

AST-aware chunking — Python files split by functions and classes using the ast module. Large classes (>50 lines) split into individual methods. JS/TS/Go/Rust/Java use regex boundary detection. No more broken chunks mid-function.
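
The chunking rule described above can be sketched with the standard `ast` module. This is an illustration under assumptions, not the project's implementation: `chunk_python` is a hypothetical name, chunks are reported as `(name, start_line, end_line)`, and module-level statements outside functions and classes are ignored for brevity.

```python
import ast

FUNCTION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def chunk_python(source: str, max_class_lines: int = 50) -> list[tuple[str, int, int]]:
    """Split Python source into (name, start_line, end_line) chunks."""
    chunks: list[tuple[str, int, int]] = []
    for node in ast.parse(source).body:
        if isinstance(node, FUNCTION_NODES):
            # Top-level functions become one chunk each.
            chunks.append((node.name, node.lineno, node.end_lineno))
        elif isinstance(node, ast.ClassDef):
            length = node.end_lineno - node.lineno + 1
            methods = [n for n in node.body if isinstance(n, FUNCTION_NODES)]
            if length > max_class_lines and methods:
                # Large classes are split into one chunk per method.
                chunks.extend((f"{node.name}.{m.name}", m.lineno, m.end_lineno) for m in methods)
            else:
                chunks.append((node.name, node.lineno, node.end_lineno))
    return chunks
```

Because chunk boundaries come from the parser, a chunk never starts or ends in the middle of a function body.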

Hybrid search — combines BM25 keyword matching (catches exact function names, error messages) with vector cosine similarity (catches meaning and intent). Merged via Reciprocal Rank Fusion. Better recall than either alone.
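
Reciprocal Rank Fusion itself is small enough to show in full. The sketch below assumes each retriever returns an ordered list of chunk IDs and uses `k = 60`, a commonly used constant; whether llmstack uses the same value is an assumption, and `reciprocal_rank_fusion` is a hypothetical name.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists: each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["auth.py", "schema.py", "rate_limit.py"]
vector_hits = ["auth.py", "rate_limit.py", "cache.py"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

A chunk that appears in both lists outranks one that appears high in only one of them, which is where the recall gain comes from.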

Git-aware — the LLM sees your current branch, recent commits, and changed files. Ask "what changed this week?" and get real answers.

Interactive mode — multi-turn conversation with your codebase. Context preserved across questions.

# Interactive conversation with your project
llmstack ask -i ./src/
# You: How does the cache work?
# Assistant: The cache uses Redis with SHA-256 keys...
# You: What happens when Redis goes down?
# Assistant: There's an in-memory fallback in rate_limit.py...

# Single question
llmstack ask "Find security vulnerabilities" ./src/ --model llama3.1:70b

# Ask about any file type
llmstack ask "Summarize the key findings" report.pdf
llmstack ask "What went wrong at 3am?" error.log
cat contract.pdf | llmstack ask "Are there any risks?"

# Skip cache for fresh re-index
llmstack ask "What's new?" ./src/ --no-cache

# Without git context
llmstack ask "Explain the architecture" ./src/ --no-git

20+ file types: Python, JavaScript, TypeScript, Go, Rust, Java, C/C++, Ruby, PHP, Swift, Kotlin, PDF, DOCX, Markdown, HTML, JSON, YAML, TOML, CSV, logs, and more.

Quick Start

# Install
pip install llmstack-cli

# Chat with your codebase (just needs Ollama)
llmstack ask -i ./src/

# Full LLM stack with smart routing
llmstack init --preset router
llmstack up

Universal Gateway

Route every request through a single OpenAI-compatible endpoint. llmstack picks the best provider and model automatically.

6 cloud providers + local inference:

Provider	Models	Pricing tracked
OpenAI	GPT-4o, GPT-4.1, o3, o4-mini, GPT-4.1-nano	Per-token
Anthropic	Claude Opus 4, Sonnet 4, Haiku 4	Per-token
Google	Gemini 2.5 Pro/Flash, Gemini 2.0 Flash	Per-token
Groq	Llama 3.3 70B, Llama 3.1 8B, Mixtral	Per-token
Together	Llama 405B, DeepSeek R1/V3, Qwen 72B	Per-token
Mistral	Mistral Large/Small, Codestral, Pixtral	Per-token
Local	Ollama / vLLM (any GGUF model)	Free

Fallback chains: if OpenAI returns 429/503, the request automatically retries on Anthropic, then falls back to local.

# llmstack.yaml
providers:
  enabled: true
  strategy: cost          # cost | quality | balanced | latency
  providers:
    - name: openai
      api_key_env: OPENAI_API_KEY
      models:
        - name: gpt-4.1-nano
          tier: simple
          cost_per_m_input: 0.10
        - name: gpt-4o
          tier: medium
          cost_per_m_input: 2.50
      fallback: [anthropic, local]
    - name: anthropic
      api_key_env: ANTHROPIC_API_KEY
    - name: local
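
A rough sketch of what a fallback chain like `[openai, anthropic, local]` does at request time. The names here (`ProviderError`, `complete_with_fallback`, the `senders` mapping) are hypothetical and only model the behavior described above: retry on the next provider when the current one returns 429 or 503.

```python
from typing import Callable

RETRYABLE_STATUS = {429, 503}


class ProviderError(Exception):
    """Hypothetical error carrying the HTTP status a provider returned."""

    def __init__(self, provider: str, status: int):
        super().__init__(f"{provider} returned HTTP {status}")
        self.provider = provider
        self.status = status


def complete_with_fallback(
    prompt: str,
    chain: list[str],
    senders: dict[str, Callable[[str], str]],
) -> tuple[str, str]:
    """Try each provider in order; move on only for retryable statuses."""
    last_error: ProviderError | None = None
    for name in chain:
        try:
            return name, senders[name](prompt)
        except ProviderError as err:
            if err.status not in RETRYABLE_STATUS:
                raise  # e.g. 401: falling back would hide a config problem
            last_error = err
    raise RuntimeError("all providers in the fallback chain failed") from last_error
```

Non-retryable errors are re-raised on purpose in this sketch, so that a bad API key surfaces instead of silently shifting traffic to another provider.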

Response headers tell you exactly what happened:

X-Provider: openai
X-Model-Router: gpt-4.1-nano
X-Query-Tier: simple
X-Cost-USD: 0.000003
X-Cache: MISS

Smart Routing

The classifier scores every request in < 2ms using 7 heuristic signals (no ML model needed), then picks the cheapest adequate model.

Request → Classify (7 signals, <2ms) → Route to optimal model + provider

Signal	What it measures
Token count	Message length
Task markers	"hello" vs "implement distributed consensus"
Code detection	Code blocks, programming terms
Conversation depth	Turn count
System prompt	Complexity of instructions
Language mix	Multilingual content
Question patterns	Simple fact vs multi-constraint reasoning
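
To make the idea of heuristic scoring concrete, here is a toy classifier that uses three of the seven signals (token count, code detection, conversation depth). The weights, thresholds, and keyword list are invented for this sketch and are not the values llmstack uses.

```python
import re

# Hypothetical code/programming hints; "`{3}" matches a fenced code block marker.
CODE_HINTS = re.compile(
    r"`{3}|\b(def|class|function|import|implement|refactor|python|algorithm)\b",
    re.IGNORECASE,
)


def classify(messages: list[dict[str, str]]) -> str:
    """Return 'simple', 'medium', or 'complex' for a chat request."""
    last = messages[-1]["content"]
    score = 0.0
    score += min(len(last.split()) / 200, 1.0)       # token count (word count as a proxy)
    score += 1.0 if CODE_HINTS.search(last) else 0.0  # code detection
    score += min(len(messages) / 20, 1.0)             # conversation depth
    if score < 0.5:
        return "simple"
    return "medium" if score < 1.5 else "complex"
```

Everything here is string matching and arithmetic, which is why this style of classification can run in well under a few milliseconds without a model.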

Real results (CPU-only, no GPU):

Query	Tier	Model	Latency
"Hello!"	Simple	llama3.2:1b	1.6s
"What is 2+2?"	Simple	llama3.2:1b	5.9s
"Write binary search in Python"	Medium	llama3.2:3b	52.2s

71% of requests were routed to the small model, for a reported 71% compute saving.

With cloud providers, cost savings are even bigger — simple queries go to gpt-4.1-nano ($0.10/M) instead of gpt-4o ($2.50/M).
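
A back-of-the-envelope calculation of that saving, using the two input prices quoted above. It assumes the 71% simple share carries over to cloud traffic, that every other request goes to gpt-4o, and that requests have similar input sizes; output-token pricing is ignored.

```python
def blended_input_cost(
    simple_share: float,
    cheap_per_m: float = 0.10,    # gpt-4.1-nano, USD per million input tokens
    default_per_m: float = 2.50,  # gpt-4o, USD per million input tokens
) -> float:
    """Average input price per million tokens when simple traffic uses the cheap model."""
    return simple_share * cheap_per_m + (1 - simple_share) * default_per_m


cost = blended_input_cost(0.71)  # 0.71 * 0.10 + 0.29 * 2.50 = 0.796
saving = 1 - cost / 2.50         # roughly 0.68
```

Under these assumptions the blended input price is about $0.80/M instead of $2.50/M, a saving of roughly 68%.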

AI Agents & MCP

Agents with tool use

llmstack agent "Find all TODO comments in this repo and summarize them"

The agent uses a ReAct loop — it plans, calls tools, observes results, and iterates until the task is done.

6 built-in tools: read_file, write_file, list_directory, grep, shell, http_get
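
The ReAct loop can be sketched in a few lines. This is a generic illustration, not llmstack's agent: the model is assumed to be a callable that returns a dict with `action` and `input` keys, and `react_loop` is a hypothetical name.

```python
from typing import Callable


def react_loop(
    task: str,
    model: Callable[[str], dict[str, str]],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Plan, call a tool, observe the result, and repeat until the model finishes."""
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        step = model("\n".join(transcript))  # the model sees every prior observation
        if step["action"] == "finish":
            return step["input"]
        observation = tools[step["action"]](step["input"])
        transcript.append(f"Action: {step['action']}({step['input']!r})")
        transcript.append(f"Observation: {observation}")
    raise RuntimeError("agent did not finish within max_steps")


# Scripted stand-in for an LLM, so the loop can be exercised offline.
script = iter([
    {"action": "grep", "input": "TODO"},
    {"action": "finish", "input": "Found 1 TODO in app.py"},
])
answer = react_loop(
    "Find all TODO comments",
    model=lambda prompt: next(script),
    tools={"grep": lambda pattern: "app.py:3: # TODO: handle timeouts"},
)
```

The `max_steps` cap is a design choice worth keeping in any real loop, since a model that never emits a final answer would otherwise run indefinitely.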

# Use specific tools only
llmstack agent "Check if tests pass" --tools shell,read_file

# Use a larger model for complex tasks
llmstack agent "Refactor auth.py to use JWT tokens" --model llama3.1:70b

MCP Server

Connect any MCP-compatible AI client (Claude Code, Cursor, VS Code) to your local LLM:

llmstack mcp --model llama3.2

// .claude/claude_desktop_config.json
{
  "mcpServers": {
    "llmstack": {
      "command": "llmstack",
      "args": ["mcp", "--model", "llama3.2"]
    }
  }
}

8 tools exposed via MCP: all agent tools + llmstack_chat (LLM inference) + llmstack_ask (file RAG with citations).

Fine-tuning

Fine-tune a model on your data in one command. No Jupyter. No boilerplate. No ML expertise.

llmstack finetune data.jsonl --base llama3.2:1b --export-ollama my-model

What happens:

1. Auto data prep: detects the format (CSV/JSON/JSONL/TXT/Parquet), auto-maps columns (instruction/output, prompt/completion, question/answer, chat messages), and splits train/eval.
2. Auto hyperparameters: epochs, LoRA rank, batch size, and learning rate are all selected from dataset size and model.
3. Training: LoRA/QLoRA via unsloth (2x faster) or HuggingFace PEFT.
4. Export: GGUF conversion + ollama create → model ready to serve.

# Override any hyperparameter
llmstack finetune data.csv --base llama3.2:1b --epochs 5 --lr 1e-4 --lora-r 32

# Export to GGUF with custom quantization
llmstack finetune data.jsonl --base llama3.2:1b --export-gguf --quant q5_k_m

# Full pipeline: train + export + register in Ollama
llmstack finetune emails.jsonl --base llama3.2:1b --export-ollama email-assistant
# → ollama run email-assistant

Auto hyperparameter selection:

Dataset size	Epochs	LoRA rank	Learning rate
< 100	5	8	1e-4
100–500	3	16	2e-4
500–5K	2	16	2e-4
5K+	1	32+	2e-4

AI Observability

Every non-streaming response is scored in real time. Quality drift triggers alerts. Compare models with A/B testing.

Quality scoring (every response, < 1ms)

5 metrics scored on every non-streaming response:

Metric	What it measures
Coherence	Structural quality (length, sentences, formatting)
Relevance	Does the response address the query?
Refusal	"I can't help with that" detection
Toxicity	Harmful content flags
Repetition	Looping / repetitive output
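
Metrics like these can be computed without a model, which is what makes sub-millisecond scoring plausible. As one example, a repetition score can be approximated by the share of duplicated word n-grams. This is a generic heuristic shown for illustration; the formula llmstack uses is not documented here.

```python
def repetition_score(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that repeat an earlier one (0.0 = no repetition)."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n words cannot repeat an n-gram
    return 1.0 - len(set(ngrams)) / len(ngrams)
```

A looping output such as "a b c a b c a b c" scores about 0.57, while ordinary prose stays close to 0.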

Drift detection & alerts

Quality drops below 0.4 → CRITICAL alert
Quality trending negative over 50 requests → WARNING alert

# Check live quality from the gateway
llmstack eval --gateway-url http://localhost:8000

┌──────────── Quality Summary ────────────┐
│ Metric     Mean    Recent  Trend  Count  │
│ overall    0.7821  0.7534  -0.02   1042  │
│ coherence  0.8912  0.8845  +0.01   1042  │
│ relevance  0.6834  0.6223  -0.06   1042  │  ← trending down
│ refusal    0.0124  0.0098  -0.00   1042  │
│ repetition 0.0231  0.0187  -0.00   1042  │
└─────────────────────────────────────────┘

A/B testing

# Create a test via API
curl -X POST http://localhost:8000/v1/observe/ab-test \
  -H "Content-Type: application/json" \
  -d '{"name":"gpt4o-vs-sonnet","model_a":"gpt-4o","model_b":"claude-sonnet-4-20250514","traffic_split":0.5}'

# Check results
curl http://localhost:8000/v1/observe/ab-test/gpt4o-vs-sonnet

{
  "winner": "claude-sonnet-4-20250514",
  "confidence": "high",
  "avg_quality_a": 0.7821,
  "avg_quality_b": 0.8234,
  "requests_a": 523,
  "requests_b": 519,
  "avg_cost_a_usd": 0.000034,
  "avg_cost_b_usd": 0.000089
}

Request tracing

Every request is traced end-to-end:

GET /v1/observe/traces?model=gpt-4o&limit=10

Each trace captures: prompt, routing decision, provider, model, response, latency, tokens, cost, quality scores.

More about `llmstack ask`

See the top of this README for the full feature breakdown. Under the hood:

Files → AST chunker (functions/classes) → Embed (Ollama) → Persistent SQLite index
                                                                    ↓
Question → BM25 keyword search ──┐
                                 ├── Reciprocal Rank Fusion → Top-K context → LLM → Streamed answer
Question → Vector cosine search ─┘
                                                                    ↑
                                                          Git context (branch, commits, diff)

Full Stack Architecture

llmstack up
    │
    ├── Qdrant (vector DB)          :6333
    ├── Redis (cache + rate limit)  :6379
    ├── Ollama / vLLM (inference)   :11434
    ├── TEI (embeddings)            :8002
    ├── Gateway                     :8000
    │   ├── Smart Router (<2ms classification)
    │   ├── Provider Registry (6 cloud + local)
    │   ├── Semantic Cache (Redis, <1ms hit)
    │   ├── Circuit Breaker (3-state, exponential backoff)
    │   ├── Rate Limiter (token bucket, Redis + Lua)
    │   ├── Quality Scorer (5 metrics, every response)
    │   ├── Trace Store (5K rolling window)
    │   ├── RAG Pipeline (ingest + query)
    │   └── Web UI (chat, RAG, dashboard)
    ├── Prometheus (metrics)
    └── Grafana (dashboard)         :8080

Auto hardware detection:

Hardware	Backend	Why
NVIDIA GPU 16GB+	vLLM	PagedAttention, max throughput
NVIDIA GPU < 16GB	Ollama	Lower memory overhead
Apple Silicon	Ollama	Metal acceleration
CPU only	Ollama	GGUF quantized models

CLI Reference

Command	Description
`llmstack ask <question> [path]`	Ask questions about local files (persistent index, hybrid search)
`llmstack ask -i [path]`	Interactive conversation with your codebase
`llmstack init [--preset]`	Create config (presets: chat, rag, router, agent)
`llmstack up`	Start all services
`llmstack down`	Stop all services
`llmstack status`	Health check
`llmstack chat`	Interactive terminal chat
`llmstack agent <task>`	Run an AI agent with tools
`llmstack mcp`	Start MCP server for AI clients
`llmstack finetune <data>`	Fine-tune a model on your data
`llmstack eval`	Evaluate model quality
`llmstack bench`	Benchmark routing performance
`llmstack export`	Generate docker-compose.yml
`llmstack logs <service>`	Stream service logs
`llmstack doctor`	Diagnose system issues

API Endpoints

Endpoint	Description
`POST /v1/chat/completions`	Chat (auto-routed across providers)
`POST /v1/embeddings`	Text embeddings
`GET /v1/models`	List models from all providers (with pricing)
`POST /v1/rag/ingest`	Ingest documents for RAG
`POST /v1/rag/query`	RAG query with citations
`GET /v1/observe/traces`	Request traces with quality scores
`GET /v1/observe/quality`	Quality summary with drift detection
`GET /v1/observe/alerts`	Active quality alerts
`POST /v1/observe/ab-test`	Create A/B test
`GET /v1/observe/ab-test/{name}`	A/B test results
`GET /healthz`	System health
`GET /metrics`	Prometheus metrics

Comparison

Codebase Q&A

	llmstack ask	Cursor	Aider	Khoj	Simon's llm
AST code chunking	Yes	Yes	Partial	No	No
Hybrid search (BM25 + vector)	Yes	?	No	No	No
Persistent incremental index	Yes	Yes	No	Yes	Manual
Git-aware context	Yes	Yes	Yes	No	No
Interactive conversation	Yes	Yes	Yes	Yes	No
20+ file types	Yes	No	No	Yes	No
100% local + free	Yes	No	No	Yes	Yes
Zero config CLI	Yes	No	No	No	Yes

LLM Platform

	llmstack	Ollama	LiteLLM	LocalAI	LangSmith
Multi-provider gateway	Yes	-	Yes	-	-
Smart cost-aware routing	Yes	-	-	-	-
Fallback chains	Yes	-	Yes	-	-
AI quality scoring	Yes	-	-	-	Yes
Drift detection + alerts	Yes	-	-	-	Yes
A/B testing	Yes	-	-	-	Yes
One-command fine-tuning	Yes	-	-	-	-
AI agents with tools	Yes	-	-	-	-
MCP server	Yes	-	-	-	-
Local inference	Yes	Yes	-	Yes	-
Self-hosted / free	Yes	Yes	Partial	Yes	Paid

Configuration

# llmstack.yaml
version: "1"

models:
  chat:
    name: llama3.2
    backend: auto
  embeddings:
    name: bge-m3

providers:
  enabled: true
  strategy: cost
  providers:
    - name: openai
      api_key_env: OPENAI_API_KEY
      fallback: [anthropic, local]
    - name: anthropic
      api_key_env: ANTHROPIC_API_KEY
    - name: local

observe:
  quality_tracking: true
  alert_threshold: 0.4
  drift_threshold: -0.1

gateway:
  port: 8000
  auth: api_key
  rate_limit: 100/min
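
For intuition about what `rate_limit: 100/min` means with a token bucket, here is an in-memory sketch. The gateway's limiter is described as Redis + Lua, so this single-process version only illustrates the algorithm; the class and parameter names are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """In-memory token bucket: bursts up to `capacity`, refilled continuously."""

    def __init__(self, capacity: int, refill_per_second: float, clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# 100 requests per minute: a burst of 100, refilled at 100/60 tokens per second.
bucket = TokenBucket(capacity=100, refill_per_second=100 / 60)
```

Keeping one bucket per API key gives each key its own independent limit.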

Requirements

Python 3.11+
llmstack ask: Ollama running locally. No Docker needed.
Full stack (llmstack up): Docker
Fine-tuning: pip install "llmstack-cli[finetune]" (adds PyTorch, PEFT, TRL; the quotes keep shells such as zsh from expanding the brackets)

Contributing

See CONTRIBUTING.md for development setup. PRs welcome.

License

Apache-2.0
