feat: Add scripts for kubernetes dev env using vLLM and vLLM-p2p #60

Open · wants to merge 4 commits into base: dev

131 changes: 103 additions & 28 deletions DEVELOPMENT.md
@@ -37,7 +37,7 @@ serving resources.

Run the following:

```console
```bash
make environment.dev.kind
```

@@ -48,13 +48,15 @@ namespace.
There are several ways to access the gateway:

**Port forward**:

```sh
$ kubectl --context kind-gie-dev port-forward service/inference-gateway 8080:80
```

**NodePort `inference-gateway-istio`**
> **Warning**: This method doesn't work on `podman` correctly, as `podman` support
> with `kind` is not fully implemented yet.

```sh
# Determine the k8s node address
$ kubectl --context kind-gie-dev get node -o yaml | grep address
@@ -80,9 +82,10 @@ By default, the created inference gateway can be accessed on port 30080. This can
be overridden to any free port in the range of 30000 to 32767 by running the above
command as follows:

```console
```bash
GATEWAY_HOST_PORT=<selected-port> make environment.dev.kind
```

**Where:** `<selected-port>` is the port on your local machine you want to use to
access the inference gateway.

@@ -96,7 +99,7 @@ access the inference gateway.
To test your changes to the GIE in this environment, make your changes locally
and then run the following:

```console
```bash
make environment.dev.kind.update
```

@@ -122,7 +125,7 @@ the `default` namespace if the cluster is private/personal).
The following will deploy all the infrastructure-level requirements (e.g. CRDs,
Operators, etc) to support the namespace-level development environments:

```console
```bash
make environment.dev.kubernetes.infrastructure
```

@@ -140,7 +143,7 @@ To deploy a development environment to the cluster you'll need to explicitly
provide a namespace. This can be `default` if this is your personal cluster,
but on a shared cluster you should pick something unique. For example:

```console
```bash
export NAMESPACE=annas-dev-environment
```

@@ -149,10 +152,18 @@ export NAMESPACE=annas-dev-environment

Create the namespace:

```console
```bash
kubectl create namespace ${NAMESPACE}
```

Set the default namespace for `kubectl` commands:

```bash
kubectl config set-context --current --namespace="${NAMESPACE}"
```

> NOTE: If you are using OpenShift (oc CLI), use the following instead: `oc project "${NAMESPACE}"`

You'll need to provide a `Secret` with the login credentials for your private
repository (e.g. quay.io). It should look something like this:

@@ -168,51 +179,115 @@ type: kubernetes.io/dockerconfigjson
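
The full Secret manifest is collapsed in this view. As a sketch of one way to generate such a `secret.yaml` (the registry server, secret name, and credentials below are illustrative, not taken from this PR), you can let `kubectl` produce it:

```bash
# Generate a docker-registry pull secret manifest without applying it.
# Replace the server, username, and token with your own values.
kubectl create secret docker-registry anna-pull-secret \
  --docker-server=quay.io \
  --docker-username=<YOUR_USERNAME> \
  --docker-password=<YOUR_TOKEN> \
  --dry-run=client -o yaml > secret.yaml
```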

Apply that to your namespace:

```console
kubectl -n ${NAMESPACE} apply -f secret.yaml
```bash
kubectl apply -f secret.yaml
```

Export the name of the `Secret` to the environment:

```console
```bash
export REGISTRY_SECRET=anna-pull-secret
```

Now you need to provide several other environment variables. You'll need to
indicate the location and tag of the `vllm-sim` image:
Set the `VLLM_MODE` environment variable based on which version of vLLM you want to deploy:

```console
export VLLM_SIM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
export VLLM_SIM_TAG="<YOUR_TAG>"
* `vllm-sim`: Lightweight simulator for simple environments (default).
* `vllm`: Full vLLM model server, using GPU/CPU for inference.
* `vllm-p2p`: Full vLLM with LMCache P2P support to enable KV-Cache-aware routing.

```bash
export VLLM_MODE=vllm-sim # or vllm / vllm-p2p
```

The same thing will need to be done for the EPP:
- Set the Hugging Face token variable:

```console
export EPP_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
export EPP_TAG="<YOUR_TAG>"
```bash
export HF_TOKEN="<HF_TOKEN>"
```

**Warning**: For `vllm` mode, the default image uses Llama-3-8B. Make sure you have permission to access these files in their respective repositories.

**Note:** The model can be replaced. See [Environment Configuration](#environment-configuration) for model settings.
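
As a quick sanity check before deploying (a sketch assuming `curl` and `jq` are available; it calls Hugging Face's public `whoami-v2` API), you can confirm the token is valid:

```bash
# Should print your Hugging Face username if HF_TOKEN is valid
curl -s -H "Authorization: Bearer ${HF_TOKEN}" \
  https://huggingface.co/api/whoami-v2 | jq -r .name
```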

Once all this is set up, you can deploy the environment:

```console
```bash
make environment.dev.kubernetes
```

This will deploy the entire stack to whatever namespace you chose. You can test
by exposing the inference `Gateway` via port-forward:

```console
kubectl -n ${NAMESPACE} port-forward service/inference-gateway-istio 8080:80
```bash
kubectl port-forward service/inference-gateway 8080:80
```

And making requests with `curl`:

```console
curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
-d '{"model":"food-review","prompt":"hi","max_tokens":10,"temperature":0}' | jq
- For `vllm-sim`:

```bash
curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
-d '{"model":"food-review","prompt":"hi","max_tokens":10,"temperature":0}' | jq
```

- For `vllm` or `vllm-p2p`:

```bash
curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
-d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"hi","max_tokens":10,"temperature":0}' | jq
```

#### Environment Configuration

**1. Setting the EPP image and tag:**

You can optionally set a custom EPP image (otherwise, the default will be used):

```bash
export EPP_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
export EPP_TAG="<YOUR_TAG>"
```

**2. Setting the vLLM image and tag:**

Each vLLM mode has default image values, but you can override them:

**For `vllm-sim` mode:**

```bash
export VLLM_SIM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
export VLLM_SIM_TAG="<YOUR_TAG>"
```

**For `vllm` and `vllm-p2p` modes:**

```bash
export VLLM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
export VLLM_TAG="<YOUR_TAG>"
```

**3. Setting the model name and label:**

You can replace the model name and label that will be used in the system:

```bash
export MODEL_NAME="${MODEL_NAME:-mistralai/Mistral-7B-Instruct-v0.2}"
export MODEL_LABEL="${MODEL_LABEL:-mistral7b}"
```

It is also recommended to update the inference pool name accordingly so that it aligns with the model:

```bash
export POOL_NAME="${POOL_NAME:-vllm-Mistral-7B-Instruct}"
```

**4. Additional environment settings:**

More environment variable settings can be found in `scripts/kubernetes-dev-env.sh`.
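
To get a quick overview of what can be overridden (a sketch that assumes the script sets its defaults via plain `VAR=` or `export VAR=` assignments; adjust the pattern if the script is structured differently):

```bash
# List the environment variables the dev-env script defines or defaults
grep -nE '^(export )?[A-Z_]+=' scripts/kubernetes-dev-env.sh
```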



#### Development Cycle

> **WARNING**: This is a very manual process at the moment. We expect to make
@@ -221,19 +296,19 @@ curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: applicati
Make your changes locally and commit them. Then select an image tag based on
the `git` SHA:

```console
```bash
export EPP_TAG=$(git rev-parse HEAD)
```

Build the image:

```console
```bash
DEV_VERSION=$EPP_TAG make image-build
```

Tag the image for your private registry and push it:

```console
```bash
$CONTAINER_RUNTIME tag quay.io/vllm-d/gateway-api-inference-extension/epp:$TAG \
<MY_REGISTRY>/<MY_IMAGE>:$EPP_TAG
$CONTAINER_RUNTIME push <MY_REGISTRY>/<MY_IMAGE>:$EPP_TAG
@@ -245,7 +320,7 @@ $CONTAINER_RUNTIME push <MY_REGISTRY>/<MY_IMAGE>:$EPP_TAG
Then you can re-deploy the environment with the new changes (don't forget all
the required env vars):

```console
```bash
make environment.dev.kubernetes
```

7 changes: 2 additions & 5 deletions Makefile
@@ -780,11 +780,8 @@ environment.dev.kubernetes: check-kubectl check-kustomize check-envsubst
# ------------------------------------------------------------------------------
.PHONY: clean.environment.dev.kubernetes
clean.environment.dev.kubernetes: check-kubectl check-kustomize check-envsubst
ifndef NAMESPACE
$(error "Error: NAMESPACE is required but not set")
endif
@echo "INFO: cleaning up dev environment in $(NAMESPACE)"
kustomize build deploy/environments/dev/kubernetes-kgateway | envsubst | kubectl -n "${NAMESPACE}" delete -f -
@CLEAN=true ./scripts/kubernetes-dev-env.sh 2>&1
@echo "INFO: Finished cleanup of development environment for $(VLLM_MODE) mode in namespace $(NAMESPACE)"

# -----------------------------------------------------------------------------
# TODO: these are old aliases that we still need for the moment, but will be
Expand Down
10 changes: 9 additions & 1 deletion deploy/components/inference-gateway/deployments.yaml
@@ -22,7 +22,7 @@ spec:
imagePullPolicy: IfNotPresent
args:
- -poolName
- "vllm-llama3-8b-instruct"
- "${POOL_NAME}"
- -v
- "4"
- --zap-encoder
@@ -48,3 +48,11 @@ spec:
service: inference-extension
initialDelaySeconds: 5
periodSeconds: 10
env:

Collaborator:

should this depend on VLLM_MODE?
Will the Pods come up if HF_SECRET_* are not defined (e.g., when using simulator)?

Author:

For VLLM-SIM, it’s not needed, so I generate a dummy key in the script (you don’t need to define it).
My issue is that I separated the gateway creation from the vLLM creation - I think it’s cleaner this way.
This is the only dependency we have, so I don’t mind if all versions of EPP receive all the variables and choose whether to use them or not.

- name: KVCACHE_INDEXER_REDIS_ADDR
value: ${REDIS_HOST}:${REDIS_PORT}
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: ${HF_SECRET_NAME}
key: ${HF_SECRET_KEY}
2 changes: 1 addition & 1 deletion deploy/components/inference-gateway/httproutes.yaml
@@ -13,7 +13,7 @@ spec:
backendRefs:
- group: inference.networking.x-k8s.io
kind: InferencePool
name: vllm-llama3-8b-instruct
name: ${POOL_NAME}
port: 8000
timeouts:
request: 30s
12 changes: 11 additions & 1 deletion deploy/components/inference-gateway/inference-models.yaml
@@ -6,7 +6,17 @@ spec:
modelName: food-review
criticality: Critical
poolRef:
name: vllm-llama3-8b-instruct
name: ${POOL_NAME}
targetModels:
- name: food-review
weight: 100
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
name: base-model
spec:
modelName: ${MODEL_NAME}
criticality: Critical
poolRef:
name: ${POOL_NAME}
4 changes: 2 additions & 2 deletions deploy/components/inference-gateway/inference-pools.yaml
@@ -1,10 +1,10 @@
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
name: vllm-llama3-8b-instruct
name: ${POOL_NAME}
spec:
targetPortNumber: 8000
selector:
app: vllm-llama3-8b-instruct
app: ${POOL_NAME}
extensionRef:
name: endpoint-picker
2 changes: 2 additions & 0 deletions deploy/components/inference-gateway/kustomization.yaml
@@ -26,6 +26,8 @@ resources:
- deployments.yaml
- gateways.yaml
- httproutes.yaml
- secret.yaml


images:
- name: quay.io/vllm-d/gateway-api-inference-extension/epp
10 changes: 10 additions & 0 deletions deploy/components/inference-gateway/secret.yaml
@@ -0,0 +1,10 @@
apiVersion: v1
kind: Secret
metadata:
name: ${HF_SECRET_NAME}
labels:
app.kubernetes.io/name: vllm
app.kubernetes.io/component: secret
type: Opaque
data:
${HF_SECRET_KEY}: ${HF_TOKEN}
32 changes: 32 additions & 0 deletions deploy/components/vllm-p2p/kustomization.yaml
@@ -0,0 +1,32 @@
# ------------------------------------------------------------------------------
# vLLM P2P Deployment
#
# This deploys the full vLLM model server, capable of serving real models such
# as Llama 3.1-8B-Instruct via the OpenAI-compatible API. It is intended for
# environments with GPU resources and where full inference capabilities are
# required.
# In addition, it adds LMCache, an LLM serving-engine extension that uses Redis, to the vLLM image.
#
# The deployment can be customized using environment variables to set:
# - The container image and tag (VLLM_IMAGE, VLLM_TAG)
# - The model to load (MODEL_NAME)
#
# This setup is suitable for testing on Kubernetes (including
# GPU-enabled nodes or clusters with scheduling for `nvidia.com/gpu`).
# -----------------------------------------------------------------------------
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- vllm-deployment.yaml
- redis-deployment.yaml
- redis-service.yaml
- secret.yaml

images:
- name: vllm/vllm-openai
newName: ${VLLM_IMAGE}
newTag: ${VLLM_TAG}
- name: redis
newName: ${REDIS_IMAGE}
newTag: ${REDIS_TAG}