tl;dr: Deliver GGUF-format LLMs via Dockerfile.
GGUF is a file format for storing models for inference with GGML and executors based on GGML. GGUF is a binary format that is designed for fast loading and saving of models, and for ease of reading. Models are traditionally developed using PyTorch or another framework, and then converted to GGUF for use in GGML.
GGUF Packer aids in building Large Language Model (LLM) distributions. All you need is Docker (or BuildKit daemon).
- Efficient Distribution: GGUF Packer uses the BuildKit frontend to streamline the building of LLM distributions.
- Docker Integration: It leverages Docker and BuildKit for a seamless build process, allowing you to use a Dockerfile directly instead of the Ollama Model File.
- Cloud-Native Support: It aligns with cloud-native practices, referencing KEP-4639 OCI VolumeSource PoC.
Install Docker and GGUF Packer.
To get started, create a Dockerfile with the following content:
# syntax=gpustack/gguf-packer:latest
ARG BASE=scratch
ARG QUANTIZE_TYPE=Q5_K_M
ARG CHAT_TEMPLATE="{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
FROM scratch AS f16
ADD https://huggingface.co/Qwen/Qwen2-0.5B-Instruct.git Qwen2-0.5B-Instruct
CONVERT --type=F16 Qwen2-0.5B-Instruct Qwen2-0.5B-Instruct.F16.gguf
FROM ${BASE}
LABEL gguf.model.from="Hugging Face"
QUANTIZE --from=f16 --type=${QUANTIZE_TYPE} Qwen2-0.5B-Instruct.F16.gguf Qwen2-0.5B-Instruct.${QUANTIZE_TYPE}.gguf
CAT <<EOF system-prompt.txt
You are an AI assistant that follows instruction extremely well. Help as much as you can.
In answering questions, follow these steps:
1. Entity Identification: Identify the main entities involved in the query. Entities can include people, objects, events, or abstract concepts.
2. Relationship Comprehension: Determine the relationships between these entities. Relationships may be explicit in the text or may need to be inferred based on context and general knowledge.
3. Implication Understanding: Understand the general implications of these relationships. These implications can be based on established facts, principles, or rules related to the identified relationships.
4. Question Contextualization: Align the implications of the relationships with the context of the query. This alignment should guide your response to the query.
5. Answer Generation: Based on the understanding of the entities, their relationships, and implications, generate an appropriate response to the query.
EOF
CMD ["-m", "Qwen2-0.5B-Instruct.${QUANTIZE_TYPE}.gguf", "-c", "8192", "--system-prompt-file", "system-prompt.txt", "--chat-template", "${CHAT_TEMPLATE}"]
The provided Dockerfile builds a distribution package for the Qwen2-0.5B-Instruct large language model (LLM). The model is quantized with the Q5_K_M quantization technique, which reduces model size and inference latency without significantly impacting accuracy.
- ARG BASE=...: The base image for the build; default is scratch. Both BASE and QUANTIZE_TYPE can be overridden at build time (see the sketch after this list).
- ARG QUANTIZE_TYPE=...: The quantize type for the model; default is Q5_K_M.
- ARG CHAT_TEMPLATE="...": The chat template for the model; the default is the ChatML-style template.
- FROM scratch AS f16: The first stage, named f16, converts the model to F16 format.
- ADD ...: Add the model from the Hugging Face repository.
- CONVERT ...: Convert the model to F16 format.
- FROM ${BASE}: The second stage quantizes the model produced by the first stage, and sets the system prompt and chat template.
- LABEL ...: Add metadata to the image.
- QUANTIZE ...: Quantize the model from the first stage.
- CAT ...: Concatenate the system prompt into a file.
- CMD ...: Specify the default commands.
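Both build arguments can be overridden at build time. A minimal sketch, assuming the git-lfs builder created in the next section; the tag name here is illustrative:
$ docker build --builder git-lfs --build-arg BASE=ubuntu:22.04 --build-arg QUANTIZE_TYPE=Q4_K_M --tag my-qwen2:demo --load $(pwd)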
The ADD instruction allows you to clone the model from a Git repository. However, please note that Git LFS (Large File Storage) is not yet supported (see moby/buildkit#5212). To achieve this functionality, you can use a development version of BuildKit.
First, set up a BuildKit daemon by running the command:
$ docker buildx create --name "git-lfs" --driver "docker-container" --driver-opt "image=thxcode/buildkit:v0.15.1-git-lfs" --buildkitd-flags "--allow-insecure-entitlement security.insecure --allow-insecure-entitlement network.host" --bootstrap
Next, build and publish your model. By including the --push argument, the built model is automatically published to the Docker registry:
$ export REPO="YOUR_REPOSITORY"
$ docker build --builder git-lfs --tag ${REPO}/qwen2:0.5b-instruct-q5-k-m-demo --load --push $(pwd)
Once the build is complete, you can use gguf-packer to estimate the model's resource usage:
$ gguf-packer estimate ${REPO}/qwen2:0.5b-instruct-q5-k-m-demo
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+---------------------------------------------+---------------------------------------+
| ARCH | CONTEXT SIZE | BATCH SIZE (L / P) | FLASH ATTENTION | MMAP LOAD | EMBEDDING ONLY | DISTRIBUTABLE | OFFLOAD LAYERS | FULL OFFLOADED | RAM | VRAM 0 |
| | | | | | | | | +--------------------+-----------+------------+----------------+-----------+----------+
| | | | | | | | | | LAYERS (I + T + O) | UMA | NONUMA | LAYERS (T + O) | UMA | NONUMA |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+--------------------+-----------+------------+----------------+-----------+----------+
| qwen2 | 8192 | 2048 / 512 | Disabled | Enabled | No | Unsupported | 25 (24 + 1) | Yes | 1 + 0 + 0 | 89.19 MiB | 239.19 MiB | 24 + 1 | 96.58 MiB | 1.03 GiB |
+-------+--------------+--------------------+-----------------+-----------+----------------+---------------+----------------+----------------+--------------------+-----------+------------+----------------+-----------+----------+
You can build the model with various quantization types by setting the QUANTIZE_TYPE argument:
$ export QUANTIZE_TYPE="Q4_K_M"
$ docker build --builder git-lfs --tag ${REPO}/qwen2:0.5b-instruct-$(echo "${QUANTIZE_TYPE}" | tr '[:upper:]' '[:lower:]' | sed 's/_/-/g')-demo --build-arg QUANTIZE_TYPE=${QUANTIZE_TYPE} --load --push $(pwd)
Thanks to the build cache, the F16 conversion stage is reused and the total build time is reduced.
You can retrieve the published models from the Docker registry using gguf-packer:
$ gguf-packer pull ${REPO}/qwen2:0.5b-instruct-q5-k-m-demo
$ gguf-packer pull ${REPO}/qwen2:0.5b-instruct-q4-k-m-demo
$ gguf-packer list
NAME TAG ID ARCH PARAMS BPW TYPE CREATED SIZE
${REPO}/qwen2 0.5b-instruct-q4-k-m-demo a0d46ab8fd9f qwen2 494.03 M 6.35 bpw IQ2_XXS/Q4_K_M 19 minutes ago 379.38 MiB
${REPO}/qwen2 0.5b-instruct-q5-k-m-demo 269bac3c0e20 qwen2 494.03 M 6.71 bpw IQ3_XXS/Q5_K_M 30 minutes ago 400.62 MiB
To run a local model using ghcr.io/ggerganov/llama.cpp, you can use gguf-packer:
$ gguf-packer run ${REPO}/qwen2:0.5b-instruct-q5-k-m-demo -- --flash-attn
You can preview the command to be executed by using the --dry-run option:
$ gguf-packer run ${REPO}/qwen2:0.5b-instruct-q5-k-m-demo --dry-run -- --flash-attn
docker run --rm --interactive --tty --privileged --publish 8080:8080 --volume ${GGUF_PACKER_STORE_PATH}/models/layers/sha256/269bac3c0e202559a2e75f88d087df3324f95b6aaf108e9e70e8b8895aaa8561:/gp-849d4691 ghcr.io/ggerganov/llama.cpp:server -m /gp-849d4691/Qwen2-0.5B-Instruct.Q5_K_M.gguf -c 8192 --system-prompt-file /gp-849d4691/system-prompt.txt --chat-template "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>' + '\\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\\n' }}{% endif %}" --flash-attn --host 0.0.0.0
Since ${REPO}/qwen2:0.5b-instruct-q5-k-m-demo is a standard OCI Artifact, you can reference it with the FROM instruction in other Dockerfiles.
First, rebuild the model on top of ubuntu:22.04:
$ docker build --builder git-lfs --tag ${REPO}/qwen2:0.5b-instruct-q5-k-m-demo2 --build-arg BASE=ubuntu:22.04 --load --push $(pwd)
To proceed, create a file named Dockerfile.infer with the following content:
# syntax=docker/dockerfile:1.7-labs
ARG REPO=""
FROM ${REPO}/qwen2:0.5b-instruct-q5-k-m-demo2
RUN apt-get update && \
apt-get install -y libcurl4-openssl-dev libgomp1 curl
ENV LC_ALL=C.utf8
COPY --from=ghcr.io/ggerganov/llama.cpp:server /llama-server /
ENTRYPOINT [ "/llama-server" ]
# reuse model file and system prompt file from the base image
CMD ["-m", "Qwen2-0.5B-Instruct.Q5_K_M.gguf", "-c", "8192", "--system-prompt-file", "system-prompt.txt"]
- ARG REPO=...: The repository of the model image.
- FROM ...: The base image for the build.
- RUN ...: Install the dependencies.
- ENV ...: Set the locale.
- COPY --from=... ...: Copy the llama-server binary from the llama.cpp image.
- ENTRYPOINT ...: Specify the entrypoint executable.
- CMD ...: Specify the default commands.
Once the Dockerfile.infer is created, you can build the container image using the following command:
$ docker build --builder git-lfs --tag ${REPO}/qwen2:0.5b-instruct-q5-k-m-demo2-infer --build-arg REPO=${REPO} --file Dockerfile.infer --load $(pwd)
And you can run the built image with docker run:
$ docker run --rm --interactive --tty ${REPO}/qwen2:0.5b-instruct-q5-k-m-demo2-infer
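Once the server is up, a hedged sketch of querying it: since arguments placed after the image name replace the default CMD, the model arguments are repeated here; publishing port 8080 (llama-server's default) and the --host override are assumptions:
$ docker run --rm --publish 8080:8080 ${REPO}/qwen2:0.5b-instruct-q5-k-m-demo2-infer -m Qwen2-0.5B-Instruct.Q5_K_M.gguf -c 8192 --system-prompt-file system-prompt.txt --host 0.0.0.0
$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Hello"}]}'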
GGUFPackerfile is the preferred file name for the GGUF Packer frontend. Put simply, a Dockerfile that starts with the specific syntax directive is equivalent to a GGUFPackerfile.
| | Command | Content |
|---|---|---|
| Dockerfile | $ docker build --tag ${TAG} | # syntax=gpustack/gguf-packer:latest<br>FROM scratch |
| GGUFPackerfile | | FROM scratch |
See examples for more details.
GGUF Packer is a high-level language implemented on top of BuildKit LLB, which supports the following instructions:
| Instruction | Description |
|---|---|
| ADD | Add local or remote files and directories. |
| ARG | Use build-time variables; allows one GGUFPackerfile to build various models. |
| CAT | Concatenate content to a file; useful for writing a system prompt by hand. |
| CMD | Specify default commands; declares the main model, drafter, multimodal projector, and so on. |
| COPY | Copy files and directories. |
| CONVERT | Convert safetensors model files to a GGUF model file. |
| FROM | Set the base image for the build. |
| LABEL | Add metadata to an image. |
| QUANTIZE | Quantize a GGUF file. |
The format follows the Dockerfile definition; here is an example:
# Comment
INSTRUCTION arguments
BuildKit treats lines that begin with # as comments, unless the line is a valid parser directive.
When using a Dockerfile, the # syntax=gpustack/gguf-packer:latest directive must be added at the top of the file; see Usage.
The ADD instruction copies new files or directories from <src> and adds them to the filesystem of the image at the path <dest>. Files and directories can be copied from the build context, a remote URL, or a Git repository.
# syntax=gpustack/gguf-packer:latest
# add from http
ADD https://huggingface.co/QuantFactory/Qwen2-0.5B-Instruct-GGUF/resolve/main/Qwen2-0.5B-Instruct.Q5_K_M.gguf /app/Qwen2-0.5B-Instruct.Q5_K_M.gguf
# add from git repository
ADD https://huggingface.co/Qwen/Qwen2-0.5B-Instruct.git /app/Qwen2-0.5B-Instruct
- ADD [--keep-git-dir=<boolean>] <src> ... <dir>: preserve the .git directory when adding from a Git repository.
- ADD [--checksum=<hash>] <src> ... <dir>: only supports HTTP/HTTPS URLs; the checksum is formatted as <algorithm>:<hash>, and the supported algorithms are sha256, sha384, and sha512 (see the sketch after this list).
- ADD [--chown=<user>:<group>] [--chmod=<perms> ...] <src> ... <dest>: refer to Dockerfile/COPY --chown --chmod.
- ADD [--link[=<boolean>]] <src> ... <dest>: refer to Dockerfile/COPY --link.
- ADD [--exclude=<path> ...] <src> ... <dest>: refer to Dockerfile/COPY --exclude.
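A minimal sketch of --checksum verification; the digest below is an all-zero placeholder, not the real hash of this file:
# syntax=gpustack/gguf-packer:latest
# NOTE: replace the placeholder digest with the file's actual sha256 sum.
ADD --checksum=sha256:0000000000000000000000000000000000000000000000000000000000000000 https://huggingface.co/QuantFactory/Qwen2-0.5B-Instruct-GGUF/resolve/main/Qwen2-0.5B-Instruct.Q5_K_M.gguf /app/Qwen2-0.5B-Instruct.Q5_K_M.gguf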
The ARG instruction defines a variable that users can pass at build time to the builder with the docker build command using the --build-arg <varname>=<value> flag.
# syntax=gpustack/gguf-packer:latest
ARG REPO=QuantFactory
ARG MODEL=Qwen2-0.5B-Instruct
ARG QUANTIZE_TYPE=Q5_K_M
ADD https://huggingface.co/${REPO}/${MODEL}-GGUF/resolve/main/${MODEL}.${QUANTIZE_TYPE}.gguf /app/${MODEL}.${QUANTIZE_TYPE}.gguf
GGUF Packer supports global ARGs, which means you can use the same ARG in multiple stages, as shown in the sketch below.
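A minimal sketch of a global ARG shared by two stages; the model choice is illustrative:
# syntax=gpustack/gguf-packer:latest
ARG MODEL=Qwen2-0.5B-Instruct
FROM scratch AS f16
ADD https://huggingface.co/Qwen/${MODEL}.git ${MODEL}
CONVERT --type=F16 ${MODEL} ${MODEL}.F16.gguf
FROM scratch
QUANTIZE --from=f16 --type=Q5_K_M ${MODEL}.F16.gguf ${MODEL}.Q5_K_M.gguf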
The CAT instruction allows you to concatenate content to a file.
# syntax=gpustack/gguf-packer:latest
CAT <<EOF /app/system-prompt.txt
You are an AI assistant that follows instruction extremely well. Help as much as you can.
In answering questions, follow these steps:
1. Entity Identification: Identify the main entities involved in the query. Entities can include people, objects, events, or abstract concepts.
2. Relationship Comprehension: Determine the relationships between these entities. Relationships may be explicit in the text or may need to be inferred based on context and general knowledge.
3. Implication Understanding: Understand the general implications of these relationships. These implications can be based on established facts, principles, or rules related to the identified relationships.
4. Question Contextualization: Align the implications of the relationships with the context of the query. This alignment should guide your response to the query.
5. Answer Generation: Based on the understanding of the entities, their relationships, and implications, generate an appropriate response to the query.
EOF
# llama-server like application's arguments
CMD ["-c", "8192", "--system-prompt-file", "/app/system-prompt.txt"]
The CMD instruction sets the command to be executed. There can only be one CMD instruction in a Dockerfile; if you list more than one CMD, only the last one takes effect.
# syntax=gpustack/gguf-packer:latest
ARG CHAT_TEMPLATE="{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
CMD ["-m", "/app/Qwen2-0.5B-Instruct.Q5_K_M.gguf", "-c", "8192", "--system-prompt-file", "/app/system-prompt.txt", "--chat-template", "${CHAT_TEMPLATE}"]
The COPY instruction copies new files or directories from <src> and adds them to the filesystem of the image at the path <dest>. Files and directories can be copied from the build context, a build stage, a named context, or an image.
# syntax=gpustack/gguf-packer:latest
# copy from local
COPY Qwen2-0.5B-Instruct.Q5_K_M.gguf /app/
# copy from other stage
COPY --from=original /app/Qwen2-0.5B-Instruct.Q5_K_M.gguf /app/
- COPY [--from=<image|stage|context>] <src> ... <dest>: by default, the COPY instruction copies files from the build context; the --from flag lets you copy files from an image, a build stage, or a named context instead.
- COPY [--parents[=<boolean>]] <src> ... <dest>: preserve parent directories for <src> entries (see the sketch after this list).
- COPY [--chown=<user>:<group>] [--chmod=<perms> ...] <src> ... <dest>: refer to Dockerfile/COPY --chown --chmod.
- COPY [--link[=<boolean>]] <src> ... <dest>: refer to Dockerfile/COPY --link.
- COPY [--exclude=<path> ...] <src> ... <dest>: refer to Dockerfile/COPY --exclude.
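A minimal sketch of --parents, assuming a models/qwen2 directory exists in the build context:
# syntax=gpustack/gguf-packer:latest
# With --parents, the file lands at /app/models/qwen2/... instead of being flattened into /app/.
COPY --parents models/qwen2/*.gguf /app/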
The CONVERT instruction allows you to convert safetensors model files to a GGUF model file.
# syntax=gpustack/gguf-packer:latest
# convert safetensors model files from current stage
ADD https://huggingface.co/Qwen/Qwen2-0.5B-Instruct.git /app/Qwen2-0.5B-Instruct
CONVERT --type=F16 /app/Qwen2-0.5B-Instruct /app/Qwen2-0.5B-Instruct.F16.gguf
# convert from other stage
CONVERT --from=other-stage --type=F16 /app/Qwen2-0.5B-Instruct /app/Qwen2-0.5B-Instruct.F16.gguf
# convert from build context
CONVERT --from=context --type=F16 /app/Qwen2-0.5B-Instruct /app/Qwen2-0.5B-Instruct.F16.gguf
# convert a PEFT LoRA adapter to GGUF file
ADD https://huggingface.co/inflaton/Qwen2-1.5B-MAC-lora.git Qwen2-1.5B-MAC-lora
ADD https://huggingface.co/Qwen/Qwen2-1.5B.git Qwen2-1.5B
CONVERT --type=F16 --class=lora --base=Qwen2-1.5B Qwen2-1.5B-MAC-lora Qwen2-1.5B-MAC-lora.F16.gguf
- CONVERT [--from=<image|stage|context>] <src> <dest>: by default, the CONVERT instruction converts files from the build context; the --from flag lets you convert files from an image, a build stage, or a named context instead.
- CONVERT [--class=<model|lora>] <src> <dest>: specify the class of the model; default is model.
- CONVERT --class=lora --base=<path> <src> <dest>: convert a PEFT LoRA adapter to a GGUF file; the base model must be provided.
- CONVERT [--type=<type>] <src> <dest>: specify the output type for <dest>, selected from F32, F16, BF16, Q8_0, TQ1_0, and TQ2_0; default is F16.
The FROM instruction initializes a new build stage and sets the base image for subsequent instructions. As such, a valid Dockerfile must start with a FROM instruction. The image can be any valid image.
FROM can appear multiple times within a single Dockerfile to create multiple images or to use one build stage as a dependency for another.
Each FROM instruction clears any state created by previous instructions.
# syntax=gpustack/gguf-packer:latest
FROM scratch
# reference another image
FROM thxcode/qwen2:0.5b-instruct-q5-k-m
The LABEL instruction adds metadata to an image. A LABEL is a key-value pair. To include spaces within a LABEL value, use quotes and backslashes as you would in command-line parsing.
# syntax=gpustack/gguf-packer:latest
LABEL org.opencontainers.image.title="Qwen2-0.5B-Instruct" \
org.opencontainers.image.description="Qwen2 0.5B Instruct model" \
org.opencontainers.image.url="https://huggingface.co/Qwen/Qwen2-0.5B-Instruct" \
org.opencontainers.image.source="https://huggingface.co/Qwen/Qwen2-0.5B-Instruct"
Since GGUF model files record general metadata, GGUF Packer can retrieve these values and export them as labels:
- gguf.model.architecture: The architecture of the model.
- gguf.model.parameters: The parameter count of the model.
- gguf.model.bpw: The bits per weight of the model.
- gguf.model.filetype: The file type of the model.
- gguf.model.name: The name of the model, if specified.
- gguf.model.vendor: The vendor of the model, if specified.
- gguf.model.authors: The authors of the model, if specified.
- gguf.model.url: The URL of the model, if specified.
- gguf.model.description: The description of the model, if specified.
- gguf.model.licenses: The licenses of the model, if specified.
- gguf.model.usage: The usage of the model; default is text-to-text.
All labels can be overridden in the Dockerfile/GGUFPackerfile.
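A hedged sketch of reading the exported labels from a locally loaded image; assumes jq is installed:
$ docker inspect --format '{{ json .Config.Labels }}' ${REPO}/qwen2:0.5b-instruct-q5-k-m-demo | jq .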
The QUANTIZE instruction allows you to quantize a GGUF file.
# syntax=gpustack/gguf-packer:latest
# quantize a GGUF file from current stage
CONVERT --type=F16 /app/Qwen2-0.5B-Instruct /app/Qwen2-0.5B-Instruct.F16.gguf
QUANTIZE --type=Q5_K_M /app/Qwen2-0.5B-Instruct.F16.gguf /app/Qwen2-0.5B-Instruct.Q5_K_M.gguf
# quantize from other stage
QUANTIZE --from=other-stage --type=Q5_K_M /app/Qwen2-0.5B-Instruct.F16.gguf /app/Qwen2-0.5B-Instruct.Q5_K_M.gguf
# quantize from build context
QUANTIZE --from=context --type=Q5_K_M /app/Qwen2-0.5B-Instruct.F16.gguf /app/Qwen2-0.5B-Instruct.Q5_K_M.gguf
- QUANTIZE [--from=<image|stage|context>] <src> <dest>: by default, the QUANTIZE instruction quantizes a file from the build context; the --from flag lets you quantize a file from an image, a build stage, or a named context instead.
- QUANTIZE [--type=<type>] <src> <dest>: specify the output type for <dest>, referring to llama.cpp/quantize, in upper case; default is Q5_K_M.
- QUANTIZE [--pure] <src> <dest>: disable k-quant mixtures and quantize all tensors to the same type.
- QUANTIZE [--imatrix=<path>] <src> <dest>: introduce a file as an importance matrix for quantization optimizations (see the sketch after this list).
- QUANTIZE --imatrix=<path> [--include-weights=<tensor_name,...>] <src> <dest>: use the importance matrix for these tensors only.
- QUANTIZE --imatrix=<path> [--exclude-weights=<tensor_name,...>] <src> <dest>: use the importance matrix, but exclude these tensors.
- QUANTIZE [--leave-output-tensor] <src> <dest>: do not quantize the output.weight tensor.
- QUANTIZE [--output-tensor-type=<type>] <src> <dest>: indicate the output tensor type, referring to llama.cpp/ggml, in upper case.
- QUANTIZE [--token-embedding-type=<type>] <src> <dest>: indicate the token embedding type, referring to llama.cpp/ggml, in upper case.
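A minimal sketch of quantizing with an importance matrix; it assumes an imatrix.dat file, generated beforehand (for example with llama.cpp's llama-imatrix tool), is available in the build context alongside the F16 GGUF:
# syntax=gpustack/gguf-packer:latest
# Assumption: both imatrix.dat and the F16 GGUF resolve from the build context.
QUANTIZE --imatrix=imatrix.dat --type=Q4_K_M Qwen2-0.5B-Instruct.F16.gguf /app/Qwen2-0.5B-Instruct.Q4_K_M.gguf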
In the Large Language Model (LLM) world, three projects stand out: GGML, llama.cpp, and Ollama. llama.cpp is built on GGML, and Ollama extends llama.cpp.
GGML offers an alternative for engineers who prefer to avoid Python because of common issues like environment configuration, regional limitations, and installation complexities: a tensor computing library rooted in C/C++. With its quantized model file format, GGUF, GGML empowers edge devices to run LLMs efficiently.
llama.cpp encapsulates various prominent LLM architectures and, with its flagship llama-server, enables users to set up an OpenAI GPT-like service on their laptops with ease.
However, managing a chat LLM service involves numerous parameters and model file management challenges. Ollama addresses this by introducing a Model File, which facilitates the distribution of the model file and its parameters, much like a Dockerfile.
While the Ollama Model File is a Dockerfile-like tool for building and distributing Ollama models only, it does not align well with the cloud-native ecosystem. Let's explore the reasons why.
Take, for example, the renowned DockerHub registry's alpine image. We can retrieve its manifest using crane:
$ crane manifest docker.io/library/alpine:latest | jq .
{
"manifests": [
{
"digest": "sha256:eddacbc7e24bf8799a4ed3cdcfa50d4b88a323695ad80f317b6629883b2c2a78",
"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
"platform": {
"architecture": "amd64",
"os": "linux"
},
"size": 528
},
{
"digest": "sha256:5c7e326e3c8a8c51654a6c5d94dac98d7f6fc4b2a762d86aaf67b7e76a6aee46",
"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
"platform": {
"architecture": "arm",
"os": "linux",
"variant": "v6"
},
"size": 528
},
...
],
"mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
"schemaVersion": 2
}
The mediaType of the alpine:latest image manifest is application/vnd.docker.distribution.manifest.list.v2+json, indicating a manifest list for multiple platforms. For OCI compatibility, the corresponding mediaType is application/vnd.oci.image.index.v1+json.
Delving deeper into the linux/amd64 platform manifest for alpine:latest:
$ crane manifest docker.io/library/alpine@sha256:eddacbc7e24bf8799a4ed3cdcfa50d4b88a323695ad80f317b6629883b2c2a78 | jq .
{
"schemaVersion": 2,
"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
"config": {
"mediaType": "application/vnd.docker.container.image.v1+json",
"size": 1471,
"digest": "sha256:324bc02ae1231fd9255658c128086395d3fa0aedd5a41ab6b034fd649d1a9260"
},
"layers": [
{
"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
"size": 3622892,
"digest": "sha256:c6a83fedfae6ed8a4f5f7cbb6a7b6f1c1ec3d86fea8cb9e5ba2e5e6673fde9f6"
}
]
}
Here, the mediaType is application/vnd.docker.distribution.manifest.v2+json, which translates to application/vnd.oci.image.manifest.v1+json for OCI compatibility.
The manifest includes a special config field, referencing the image configuration: a JSON object detailing the image's settings.
$ crane blob docker.io/library/alpine@sha256:324bc02ae1231fd9255658c128086395d3fa0aedd5a41ab6b034fd649d1a9260 | jq .
{
"architecture": "amd64",
"config": {
"Hostname": "",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
],
"Cmd": [
"/bin/sh"
],
"Image": "sha256:3e25db883ea289c8b0d3006e7a6a03c56be09c50f03f6b228ba2fe74fd8194d2",
"Volumes": null,
"WorkingDir": "",
"Entrypoint": null,
"OnBuild": null,
"Labels": null
},
"container": "fc33bc50844a0d6cdfc682fcf85647ec60783decbd3850e58ab7e013ef892cfb",
"container_config": {...},
"created": "2024-07-22T22:26:43.778747613Z",
"docker_version": "23.0.11",
"history": [...],
"os": "linux",
"rootfs": {...}
}
The OCI Distribution Specification defines a standard for image registries that store and serve images. Examples of OCI registries include Docker Registry, GitHub Container Registry, Harbor, Quay, Azure Container Registry, and Google Container Registry.
Initially designed for storing container images, OCI Registries now also support OCI Artifacts, with Helm charts being a prime example. Helm charts, once managed through Git repositories or released independently, can now be distributed as OCI Artifacts, streamlining operations to a single OCI registry management task.
We can use crane to retrieve a Helm chart's manifest and download the chart as below:
$ crane manifest ghcr.io/argoproj/argo-helm/argo-cd:7.3.11 | jq .
{
"schemaVersion": 2,
"config": {
"mediaType": "application/vnd.cncf.helm.config.v1+json",
"digest": "sha256:42242c5441612b0cedb4cfc87ad5c257ec062ff6fab8c27557a072739eff0d71",
"size": 940
},
"layers": [
{
"mediaType": "application/vnd.cncf.helm.chart.provenance.v1.prov",
"digest": "sha256:2920df17e16b736156075e5859b7cb09d127d91bcaefdafd63860fb775609df9",
"size": 1870
},
{
"mediaType": "application/vnd.cncf.helm.chart.content.v1.tar+gzip",
"digest": "sha256:4249ea76c915bb04f4dda095e608004c08f13a7d0e0da2d1836ffc57a8592f7b",
"size": 168713
}
]
}
$ crane pull ghcr.io/argoproj/argo-helm/argo-cd:7.3.11 argo-cd.tar
$ tar xf argo-cd.tar
$ ls -alth .
total 760
drwxr-xr-x 7 gpustack wheel 224B Jul 26 13:26 .
drwxrwxrwt 53 root wheel 1.7K Jul 26 13:26 ..
-rw-r--r--@ 1 gpustack wheel 172K Jul 26 13:23 argo-cd.tar
-rw-r--r-- 1 gpustack wheel 1.8K Jan 1 1970 2920df17e16b736156075e5859b7cb09d127d91bcaefdafd63860fb775609df9.tar.gz
-rw-r--r-- 1 gpustack wheel 165K Jan 1 1970 4249ea76c915bb04f4dda095e608004c08f13a7d0e0da2d1836ffc57a8592f7b.tar.gz
-rw-r--r-- 1 gpustack wheel 302B Jan 1 1970 manifest.json
-rw-r--r-- 1 gpustack wheel 940B Jan 1 1970 sha256:42242c5441612b0cedb4cfc87ad5c257ec062ff6fab8c27557a072739eff0d71
Examining the Ollama model, specifically the llama3.1:8b model, we initially assumed it conformed to the standard OCI Registry. However, attempts to retrieve its manifest with crane resulted in a 404 error, indicating non-compliance with OCI standards.
$ crane manifest ollama.com/library/llama3.1:8b | jq .
Error: fetching manifest ollama.com/library/llama3.1:8b: GET https://ollama.com/v2/: unexpected status code 404 Not Found: 404 page not found
$ curl https://ollama.com/v2/library/llama3.1/manifests/8b | jq .
{
"schemaVersion": 2,
"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
"config": {
"digest": "sha256:e711233e734332fe5f8a09b2407fb5a083e39ca7e0ba90788026414cd4c059af",
"mediaType": "application/vnd.docker.container.image.v1+json",
"size": 485
},
"layers": [
{
"digest": "sha256:87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87",
"mediaType": "application/vnd.ollama.image.model",
"size": 4661211808
},
{
"digest": "sha256:8cf247399e57085e6b34c345ebea38c1aa3e2b25c8294eecb746dd7b01dd9079",
"mediaType": "application/vnd.ollama.image.template",
"size": 1692
},
{
"digest": "sha256:f1cd752815fcf68c3c2e73b2b00b5396c5dffb9eebe49567573f275f9ec85fcd",
"mediaType": "application/vnd.ollama.image.license",
"size": 12321
},
{
"digest": "sha256:56bb8bd477a519ffa694fc449c2413c6f0e1d3b1c88fa7e3c9d88d3ae49d4dcb",
"mediaType": "application/vnd.ollama.image.params",
"size": 96
}
]
}
$ crane blob ollama.com/library/llama3.1@sha256:e711233e734332fe5f8a09b2407fb5a083e39ca7e0ba90788026414cd4c059af | jq .
Error: pulling layer ollama.com/library/llama3.1@sha256:e711233e734332fe5f8a09b2407fb5a083e39ca7e0ba90788026414cd4c059af: GET https://ollama.com/v2/: unexpected status code 404 Not Found: 404 page not found
$ curl https://ollama.com/v2/library/llama3.1/blobs/sha256:e711233e734332fe5f8a09b2407fb5a083e39ca7e0ba90788026414cd4c059af
<a href="https://dd20bb891979d25aebc8bec07b2b3bbc.r2.cloudflarestorage.com/ollama/docker/registry/v2/blobs/sha256/e7/e711233e734332fe5f8a09b2407fb5a083e39ca7e0ba90788026414cd4c059af/data?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=66040c77ac1b787c3af820529859349a%2F20240725%2Fauto%2Fs3%2Faws4_request&X-Amz-Date=20240725T145758Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=262290faed14709a9c0faf9f8b8d567d3fec84731b2877852d275bc979f0530f">Temporary Redirect</a>.
$ curl -L https://ollama.com/v2/library/llama3.1/blobs/sha256:e711233e734332fe5f8a09b2407fb5a083e39ca7e0ba90788026414cd4c059af
{"model_format":"gguf","model_family":"llama","model_families":["llama"],"model_type":"8.0B","file_type":"Q4_0","architecture":"amd64","os":"linux","rootfs":{"type":"layers","diff_ids":["sha256:87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87","sha256:11ce4ee3e170f6adebac9a991c22e22ab3f8530e154ee669954c4bc73061c258","sha256:f1cd752815fcf68c3c2e73b2b00b5396c5dffb9eebe49567573f275f9ec85fcd","sha256:56bb8bd477a519ffa694fc449c2413c6f0e1d3b1c88fa7e3c9d88d3ae49d4dcb"]}}
The Ollama model's distribution method is not good practice. Why?
Helm chart packages are usually very small, and the network cost of distributing them through an OCI registry is very low, so there is no need to disguise a Helm chart as a container image.
However, large model files, such as the 4.3 GB llama3.1:8b model, incur significant network costs when distributed without compression.
Moreover, DockerHub hosts many pre-downloaded Ollama images, which is inefficient, wasting storage and network resources.
In conclusion, while Ollama has gained popularity in managing LLM distributions, its approach diverges from best practices for OCI Artifacts. GGUF Packer, on the other hand, offers a contemporary solution that adheres to OCI standards, reducing both network and storage overhead.
MIT