Run AI models privately on your own hardware — offline, free, under your control.
| Concern | Cloud API | Local |
|---|---|---|
| Privacy | Data leaves machine | 100% local |
| Cost | Per-token billing | Hardware only |
| Internet | Required | Not needed |
| Size | Min VRAM (Q4) | Speed (t/s) |
|---|---|---|
| 3B | 2GB | 60-120 |
| 7B | 4-5GB | 30-60 |
| 13B | 8GB | 15-30 |
| 70B | 40GB | 2-8 |
| Tool | Best For | API |
|---|---|---|
| Ollama | Easiest setup | OpenAI-compatible REST |
| LM Studio | GUI desktop | OpenAI-compatible REST |
| llama.cpp | Max performance | CLI |
| vLLM | Production serving | REST |
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.2
ollama run llama3.2| Use Case | Model | Size |
|---|---|---|
| Fast chat | Llama 3.2 3B | 2GB |
| Quality chat | Mistral 7B | 4GB |
| Code | DeepSeek Coder 6.7B | 4GB |
| Format | Size vs FP16 | Quality |
|---|---|---|
| Q8_0 | 50% | Minimal loss |
| Q4_K_M | 28% | Sweet spot |
| Q2_K | 14% | Noticeable loss |