Run llama.cpp in a GPU-accelerated Docker container.
By default, the service requires a CUDA-capable GPU with at least 8 GB of VRAM. If you don't have an Nvidia GPU with CUDA, the CPU version will be built and used instead.
```sh
make build
make llama-2-13b
make up
```
After starting up, the chat server will be available at http://localhost:8080.
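To check that the server is responding, you can query the completion endpoint exposed by the upstream llama.cpp server (the `/completion` route and its JSON fields come from llama.cpp itself, not this project):

```sh
# Request a short completion from the running server (assumes the default port 8080)
curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The capital of France is", "n_predict": 16}'
```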
Options can be specified as environment variables in the `docker-compose.yml` file. Environment variables that are prefixed with `LLAMA_` are converted to command line arguments for the llama.cpp server. For example, `LLAMA_CTX_SIZE` is converted to `--ctx-size`. By default, the following options are set:
- `GGML_CUDA_NO_PINNED`: Disable pinned memory for compatibility (default is 1)
- `LLAMA_CTX_SIZE`: The context size to use (default is 2048)
- `LLAMA_MODEL`: The name of the model to use (default is `/models/llama-2-13b-chat.Q5_K_M.gguf`)
- `LLAMA_N_GPU_LAYERS`: The number of layers to run on the GPU (default is 99)
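For example, the `environment` section of the service in `docker-compose.yml` might look like the sketch below. The service name and the 4096 context size are illustrative; each variable maps to the matching llama.cpp server flag as described above.

```yaml
# Sketch of a docker-compose.yml excerpt; adapt names and values to your setup
services:
  llama:                                  # illustrative service name
    environment:
      GGML_CUDA_NO_PINNED: 1              # keep pinned memory disabled for compatibility
      LLAMA_CTX_SIZE: 4096                # becomes --ctx-size 4096
      LLAMA_MODEL: /models/llama-2-13b-chat.Q5_K_M.gguf
      LLAMA_N_GPU_LAYERS: 99              # becomes --n-gpu-layers 99
```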
See the llama.cpp documentation for the complete list of server options.
The `docker-entrypoint.sh` script has targets for downloading popular models. Run `./docker-entrypoint.sh --help` to list the available models.
Download models by running `./docker-entrypoint.sh <model>` or `make <model>`, where `<model>` is the name of the model. By default, these will download the `Q5_K_M.gguf` versions of the models. These models are quantized to 5 bits, which provides a good balance between speed and accuracy.
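For instance, to fetch the 5-bit build of Llama 3 8B using its target name from the table below:

```sh
# Download the Q5_K_M quantization of Llama 3 8B (equivalent: make llama-3-8b)
./docker-entrypoint.sh llama-3-8b
```

After downloading, point `LLAMA_MODEL` in `docker-compose.yml` at the new model file before running `make up`.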
Confused about which model to use? Below is a list of popular models, ranked by ELO rating. Generally, the higher the ELO rating, the better the model.
| Target | Model | Parameters | Size | ~Score | ~ELO | Notes |
|---|---|---|---|---|---|---|
| `gemma-2-9b` | gemma-2-9b-it | 9B | 6.65 GB | 28.90 | 1187 | Google's best small model |
| `llama-3-8b` | meta-llama-3.1-8b-instruct | 8B | 5.73 GB | 26.59 | 1162 | The overall best small model |
| `mistral-7b` | mistral-7b-instruct-v0.2 | 7B | 5.13 GB | 18.44 | 1072 | The most popular 7B model |
| `phi-3-mini` | phi-3-mini-4k-instruct | 3B | 3.13 GB | 25.97 | 1066 | The current best tiny model |
| `llama-2-13b` | llama-2-13b-chat | 13B | 9.23 GB | 11.00 | 1063 | The original open LLM |