feat: Add scripts for kubernetes dev env using vLLM and vLLM-p2p #60

Open
wants to merge 3 commits into base: dev

Changes from 2 commits
55 changes: 44 additions & 11 deletions DEVELOPMENT.md
@@ -178,20 +178,40 @@ Export the name of the `Secret` to the environment:
export REGISTRY_SECRET=anna-pull-secret
```

Now you need to provide several other environment variables. You'll need to
indicate the location and tag of the `vllm-sim` image:
You can optionally set a custom EPP image (otherwise, the default will be used):

```console
export EPP_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
export EPP_TAG="<YOUR_TAG>"
```

Set the `VLLM_MODE` environment variable based on which version of vLLM you want to deploy:

- `vllm-sim`: Lightweight simulator for simple environments
- `vllm`: Full vLLM model server for real inference
Collaborator:
Suggested change
- `vllm`: Full vLLM model server for real inference
- `vllm`: Full vLLM model server, using GPU/CPU for inferencing

- `vllm-p2p`: Full vLLM with LMCache P2P support to enable KV-Cache-aware routing

```console
export VLLM_MODE=vllm-sim # or vllm / vllm-p2p
```
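The `VLLM_MODE` value is the only switch between the three deployment flavors. As an illustration only (the `vllm-sim` and `vllm` component paths below are assumptions; only `deploy/components/vllm-p2p` appears in this PR), a deploy script might dispatch on it like this:

```shell
#!/bin/sh
# Map VLLM_MODE to a kustomize component directory (paths are illustrative).
case "${VLLM_MODE:-vllm-sim}" in
  vllm-sim) COMPONENT_DIR="deploy/components/vllm-sim" ;;
  vllm)     COMPONENT_DIR="deploy/components/vllm" ;;
  vllm-p2p) COMPONENT_DIR="deploy/components/vllm-p2p" ;;
  *)
    echo "ERROR: unsupported VLLM_MODE '${VLLM_MODE}'" >&2
    exit 1
    ;;
esac
echo "deploying ${COMPONENT_DIR}"
```

Defaulting to `vllm-sim` keeps the no-GPU path the easy one; any unrecognized value fails fast instead of deploying the wrong stack.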
Each mode has default image values, but you can override them:

For vllm-sim:

```console
export VLLM_SIM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
export VLLM_SIM_TAG="<YOUR_TAG>"
```

Collaborator: are the images being set elsewhere, to match simulator/vLLM/vLLM-p2p?

The same thing will need to be done for the EPP:

For vllm and vllm-p2p:
- Set the vLLM image:
```console
export EPP_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
export EPP_TAG="<YOUR_TAG>"
export VLLM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
export VLLM_TAG="<YOUR_TAG>"
```
- Set hugging face token variable:
Collaborator:
Suggested change
- Set hugging face token variable:
- Set Hugging Face token variable:

```console
export HF_TOKEN="<HF_TOKEN>"
```
**Warning**: For vllm mode, the default image uses llama3-8b and vllm-mistral. Make sure you have permission to access these models in their respective repositories.

Once all this is set up, you can deploy the environment:

@@ -203,16 +223,29 @@ This will deploy the entire stack to whatever namespace you chose. You can test
by exposing the inference `Gateway` via port-forward:

```console
kubectl -n ${NAMESPACE} port-forward service/inference-gateway-istio 8080:80
kubectl -n ${NAMESPACE} port-forward service/inference-gateway 8080:80
```

Collaborator: is `-n` still needed if you apply the `kubectl config ...` statement? I know it won't hurt, but we should ensure this looks like a more cohesive end-to-end doc.

And making requests with `curl`:
- vllm-sim

```console
curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
-d '{"model":"food-review","prompt":"hi","max_tokens":10,"temperature":0}' | jq
```

- vllm

```console
curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
-d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"hi","max_tokens":10,"temperature":0}' | jq
```

- vllm-p2p

```console
curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
-d '{"model":"mistralai/Mistral-7B-Instruct-v0.2","prompt":"hi","max_tokens":10,"temperature":0}' | jq
```
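Whichever mode is deployed, the OpenAI-compatible API also serves `/v1/models`, which is a quick way to confirm which model name to pass in the requests above (assuming the same port-forward is active):

```shell
# List the model IDs served behind the gateway; the jq filter
# pulls just the "id" field from each entry in "data".
curl -s http://localhost:8080/v1/models | jq -r '.data[].id'
```

If the model name in a completion request does not match one of the listed IDs, the server will reject it, so this is a useful first debugging step.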
#### Development Cycle

> **WARNING**: This is a very manual process at the moment. We expect to make
7 changes: 2 additions & 5 deletions Makefile
@@ -780,11 +780,8 @@ environment.dev.kubernetes: check-kubectl check-kustomize check-envsubst
# ------------------------------------------------------------------------------
.PHONY: clean.environment.dev.kubernetes
clean.environment.dev.kubernetes: check-kubectl check-kustomize check-envsubst
ifndef NAMESPACE
Collaborator: Is this check no longer needed? NAMESPACE is used some lines below (in the information message).
$(error "Error: NAMESPACE is required but not set")
endif
@echo "INFO: cleaning up dev environment in $(NAMESPACE)"
kustomize build deploy/environments/dev/kubernetes-kgateway | envsubst | kubectl -n "${NAMESPACE}" delete -f -
@CLEAN=true ./scripts/kubernetes-dev-env.sh 2>&1
@echo "INFO: Finish cleanup development environment for $(VLLM_MODE) mode in namespace $(NAMESPACE)"
Collaborator:
Suggested change
@echo "INFO: Finish cleanup development environment for $(VLLM_MODE) mode in namespace $(NAMESPACE)"
@echo "INFO: Finished cleanup of development environment for $(VLLM_MODE) mode in namespace $(NAMESPACE)"


# -----------------------------------------------------------------------------
# TODO: these are old aliases that we still need for the moment, but will be
8 changes: 8 additions & 0 deletions deploy/components/inference-gateway/deployments.yaml
@@ -48,3 +48,11 @@ spec:
service: inference-extension
initialDelaySeconds: 5
periodSeconds: 10
env:
Collaborator: should this depend on VLLM_MODE? Will the Pods come up if HF_SECRET_* are not defined (e.g., when using the simulator)?
- name: KVCACHE_INDEXER_REDIS_ADDR
value: ${REDIS_HOST}:${REDIS_PORT}
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: ${HF_SECRET_NAME}
key: ${HF_SECRET_KEY}
32 changes: 31 additions & 1 deletion deploy/components/inference-gateway/inference-models.yaml
@@ -6,7 +6,37 @@ spec:
modelName: food-review
criticality: Critical
poolRef:
name: vllm-llama3-8b-instruct
name: ${POOL_NAME}
targetModels:
- name: food-review
weight: 100
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
name: base-model
spec:
modelName: meta-llama/Llama-3.1-8B-Instruct
criticality: Critical
poolRef:
name: ${POOL_NAME}
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
name: base-model-cpu
spec:
modelName: Qwen/Qwen2.5-1.5B-Instruct
criticality: Critical
poolRef:
name: ${POOL_NAME}
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
name: mistarli
Collaborator:
Suggested change
name: mistarli
name: mistral

Author: Done

Author: Actually, I removed it and use just a base model.

spec:
modelName: mistralai/Mistral-7B-Instruct-v0.2
criticality: Critical
poolRef:
name: ${POOL_NAME}
4 changes: 2 additions & 2 deletions deploy/components/inference-gateway/inference-pools.yaml
@@ -1,10 +1,10 @@
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
name: vllm-llama3-8b-instruct
name: ${POOL_NAME}
spec:
targetPortNumber: 8000
selector:
app: vllm-llama3-8b-instruct
app: ${POOL_NAME}
extensionRef:
name: endpoint-picker
55 changes: 55 additions & 0 deletions deploy/components/vllm-p2p/deployments/redis-deployment.yaml
@@ -0,0 +1,55 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: ${REDIS_SVC_NAME}
Member: I think this should be turned to REDIS_DEPLOYMENT_NAME or something like that, and in the Service CR add a -service extension. Otherwise the deployment's name will be weird.

Author: Done

labels:
app.kubernetes.io/name: redis
app.kubernetes.io/component: redis-lookup-server
spec:
replicas: ${REDIS_REPLICA_COUNT}
selector:
matchLabels:
app.kubernetes.io/name: redis
app.kubernetes.io/component: redis-lookup-server
template:
metadata:
labels:
app.kubernetes.io/name: redis
app.kubernetes.io/component: redis-lookup-server
spec:
containers:
- name: lookup-server
image: ${REDIS_IMAGE}:${REDIS_TAG}
imagePullPolicy: IfNotPresent
command:
- redis-server
ports:
- name: redis-port
containerPort: ${REDIS_TARGET_PORT}
protocol: TCP
resources:
limits:
cpu: "4"
memory: 10G
requests:
cpu: "4"
memory: 8G
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
restartPolicy: Always
terminationGracePeriodSeconds: 30
dnsPolicy: ClusterFirst
securityContext: {}
schedulerName: default-scheduler
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 25%
maxSurge: 25%
revisionHistoryLimit: 10
progressDeadlineSeconds: 600
# securityContext:
# allowPrivilegeEscalation: false
# capabilities:
# drop:
# - ALL
Collaborator: If it's helpful for future readers to see this commented-out bit, we should add a comment explaining why. Otherwise we should probably just remove it.

Author: Done

11 changes: 11 additions & 0 deletions deploy/components/vllm-p2p/deployments/secret.yaml
@@ -0,0 +1,11 @@
apiVersion: v1
kind: Secret
metadata:
name: ${HF_SECRET_NAME}
namespace: ${NAMESPACE}
labels:
app.kubernetes.io/name: vllm
app.kubernetes.io/component: secret
type: Opaque
data:
${HF_SECRET_KEY}: ${HF_TOKEN}
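One thing worth noting about this manifest: the `data` field of a Kubernetes `Secret` must contain base64-encoded values (plain text belongs under `stringData` instead), so the raw Hugging Face token has to be encoded before `envsubst` renders the template. A sketch, with a placeholder token value:

```shell
# Kubernetes Secret 'data' values must be base64-encoded.
# Encode the raw token once, then let envsubst substitute it.
RAW_TOKEN="hf_example_token"   # placeholder, not a real token
HF_TOKEN=$(printf '%s' "$RAW_TOKEN" | base64 | tr -d '\n')
export HF_TOKEN
```

The `tr -d '\n'` strips the trailing newline some `base64` implementations emit, which would otherwise corrupt the YAML value.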
123 changes: 123 additions & 0 deletions deploy/components/vllm-p2p/deployments/vllm-deployment.yaml
@@ -0,0 +1,123 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: ${VLLM_DEPLOYMENT_NAME}
labels:
app.kubernetes.io/name: vllm
app.kubernetes.io/model: ${MODEL_LABEL}
app.kubernetes.io/component: vllm
spec:
replicas: ${VLLM_REPLICA_COUNT}
selector:
matchLabels:
app.kubernetes.io/name: vllm
app.kubernetes.io/component: vllm
app.kubernetes.io/model: ${MODEL_LABEL}
app: ${POOL_NAME}
template:
metadata:
labels:
app.kubernetes.io/name: vllm
app.kubernetes.io/component: vllm
app.kubernetes.io/model: ${MODEL_LABEL}
app: ${POOL_NAME}
spec:
# securityContext:
# runAsUser: ${PROXY_UID}
# runAsNonRoot: true
# seccompProfile:
# type: RuntimeDefault
Collaborator: Similar to above: if this is an important breadcrumb for future readers, let's add a comment explaining why; otherwise, if it's just leftovers, let's remove it.

Author: Done

containers:
- name: vllm
image: ${VLLM_IMAGE}:${VLLM_TAG}
imagePullPolicy: IfNotPresent
command:
- /bin/sh
- "-c"
args:
- |
export LMCACHE_DISTRIBUTED_URL=$${${POD_IP}}:80 && \

Collaborator: Is this syntax correct? Shouldn't it be:
Suggested change
export LMCACHE_DISTRIBUTED_URL=$${${POD_IP}}:80 && \
export LMCACHE_DISTRIBUTED_URL=${POD_IP}:80 && \

Author: In theory, you are right; the end goal is that it becomes ${POD_IP}. But because I use envsubst, I had to write it as $${${POD_IP}} and export POD_IP="POD_IP", so that after envsubst runs the result is ${POD_IP} (it took me a lot of time to figure this out).

vllm serve ${MODEL_NAME} \
--host 0.0.0.0 \
--port 8000 \
--enable-chunked-prefill false \
--max-model-len ${MAX_MODEL_LEN} \
--kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_both"}'
ports:
- name: http
containerPort: 8000
protocol: TCP
- name: lmcache-dist # Assuming port 80 is used for LMCACHE_DISTRIBUTED_URL
containerPort: 80
protocol: TCP
livenessProbe:
failureThreshold: 3
httpGet:
path: /health
port: 8000
scheme: HTTP
initialDelaySeconds: 15
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
startupProbe:
failureThreshold: 60
httpGet:
path: /health
port: 8000
scheme: HTTP
initialDelaySeconds: 15
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
env:
- name: HF_HOME
value: /data
- name: POD_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.podIP
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: ${HF_SECRET_NAME}
key: ${HF_SECRET_KEY}
- name: LMCACHE_LOOKUP_URL
value: ${REDIS_HOST}:${REDIS_PORT}
- name: LMCACHE_ENABLE_DEBUG
value: "True"
- name: LMCACHE_ENABLE_P2P
value: "True"
- name: LMCACHE_LOCAL_CPU
value: "True"
- name: LMCACHE_MAX_LOCAL_CPU_SIZE
value: "20"
- name: LMCACHE_USE_EXPERIMENTAL
value: "True"
- name: VLLM_RPC_TIMEOUT
value: "1000000"
resources:
limits:
nvidia.com/gpu: "1"
requests:
cpu: "10"
memory: 40Gi
nvidia.com/gpu: "1"
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
securityContext:
runAsNonRoot: false
restartPolicy: Always
terminationGracePeriodSeconds: 30
dnsPolicy: ClusterFirst
securityContext: {}
schedulerName: default-scheduler
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 0
maxSurge: "100%"
revisionHistoryLimit: 10
progressDeadlineSeconds: 1200

18 changes: 18 additions & 0 deletions deploy/components/vllm-p2p/kustomization.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
apiVersion: kustomize.config.k8s.io/v1beta1
Collaborator: I noticed we added some documentation on the kustomization.yaml for the vllm component, but not for this one. Perhaps we should add a little comment here explaining how this one differs from the standard one.

Author: Done

kind: Kustomization

namespace: ${NAMESPACE}

resources:
- deployments/vllm-deployment.yaml
- deployments/redis-deployment.yaml
- service/redis-service.yaml
- deployments/secret.yaml

images:
- name: vllm/vllm-openai
newName: ${VLLM_IMAGE}
newTag: ${VLLM_TAG}
- name: redis
newName: ${REDIS_IMAGE}
newTag: ${REDIS_TAG}
17 changes: 17 additions & 0 deletions deploy/components/vllm-p2p/service/redis-service.yaml
@@ -0,0 +1,17 @@
apiVersion: v1
kind: Service
metadata:
name: ${REDIS_SVC_NAME}
labels:
app.kubernetes.io/name: redis
app.kubernetes.io/component: redis-lookup-server
spec:
ports:
- name: lookupserver-port
protocol: TCP
port: ${REDIS_PORT}
targetPort: ${REDIS_TARGET_PORT}
type: ${REDIS_SERVICE_TYPE}
selector:
app.kubernetes.io/name: redis
app.kubernetes.io/component: redis-lookup-server