Commit f67cc34

feat: Add scripts for kubernetes dev env using vLLM and vLLM-p2p (#60)
* [feat]: Add scripts for Kubernetes dev env using vLLM and vLLM-p2p (setup for kvcache-aware) Signed-off-by: Kfir Toledo <[email protected]>
1 parent 54c73a6 commit f67cc34

31 files changed: +750 −83 lines

DEVELOPMENT.md

+101-26
@@ -37,7 +37,7 @@ serving resources.
 
 Run the following:
 
-```console
+```bash
 make environment.dev.kind
 ```

@@ -48,13 +48,15 @@ namespace.
 
 There are several ways to access the gateway:
 
 **Port forward**:
+
 ```sh
 $ kubectl --context kind-gie-dev port-forward service/inference-gateway 8080:80
 ```
 
 **NodePort `inference-gateway-istio`**
 > **Warning**: This method doesn't work on `podman` correctly, as `podman` support
 > with `kind` is not fully implemented yet.
+
 ```sh
 # Determine the k8s node address
 $ kubectl --context kind-gie-dev get node -o yaml | grep address
@@ -80,9 +82,10 @@
 By default the created inference gateway can be accessed on port 30080. This can
 be overridden to any free port in the range of 30000 to 32767, by running the above
 command as follows:
 
-```console
+```bash
 GATEWAY_HOST_PORT=<selected-port> make environment.dev.kind
 ```
+
 **Where:** <selected-port> is the port on your local machine you want to use to
 access the inference gateway.

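Since the override must land in Kubernetes' NodePort range, a quick shell guard (a sketch of a hypothetical helper, not part of the repo's Makefile or scripts) can catch a bad value before running `make`:

```shell
# Sketch: validate GATEWAY_HOST_PORT before invoking make.
# This check is illustrative; the Makefile does not perform it itself.
GATEWAY_HOST_PORT=30080
if [ "$GATEWAY_HOST_PORT" -ge 30000 ] && [ "$GATEWAY_HOST_PORT" -le 32767 ]; then
  echo "port $GATEWAY_HOST_PORT is in the NodePort range"
else
  echo "port $GATEWAY_HOST_PORT is outside 30000-32767" >&2
fi
```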
@@ -96,7 +99,7 @@ access the inference gateway.
 
 To test your changes to the GIE in this environment, make your changes locally
 and then run the following:
 
-```console
+```bash
 make environment.dev.kind.update
 ```

@@ -122,7 +125,7 @@ the `default` namespace if the cluster is private/personal).
 
 The following will deploy all the infrastructure-level requirements (e.g. CRDs,
 Operators, etc) to support the namespace-level development environments:
 
-```console
+```bash
 make environment.dev.kubernetes.infrastructure
 ```

@@ -140,7 +143,7 @@
 To deploy a development environment to the cluster you'll need to explicitly
 provide a namespace. This can be `default` if this is your personal cluster,
 but on a shared cluster you should pick something unique. For example:
 
-```console
+```bash
 export NAMESPACE=annas-dev-environment
 ```

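Namespace names must be valid DNS-1123 labels (lowercase alphanumerics and `-`, at most 63 characters). A quick check, sketched here as a hypothetical helper, can verify a candidate before creating it:

```shell
# Sketch: verify a namespace name is a valid DNS-1123 label before use.
# Illustrative only; Kubernetes enforces this rule server-side anyway.
NAMESPACE="annas-dev-environment"
if printf '%s' "$NAMESPACE" | grep -Eq '^[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?$'; then
  echo "valid: $NAMESPACE"
fi
```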
@@ -149,10 +152,18 @@ export NAMESPACE=annas-dev-environment
 
 Create the namespace:
 
-```console
+```bash
 kubectl create namespace ${NAMESPACE}
 ```
 
+Set the default namespace for kubectl commands:
+
+```bash
+kubectl config set-context --current --namespace="${NAMESPACE}"
+```
+
+> NOTE: If you are using OpenShift (`oc` CLI), use the following instead: `oc project "${NAMESPACE}"`
+
 You'll need to provide a `Secret` with the login credentials for your private
 repository (e.g. quay.io). It should look something like this:

@@ -168,51 +179,115 @@ type: kubernetes.io/dockerconfigjson
 
 Apply that to your namespace:
 
-```console
-kubectl -n ${NAMESPACE} apply -f secret.yaml
+```bash
+kubectl apply -f secret.yaml
 ```
 
 Export the name of the `Secret` to the environment:
 
-```console
+```bash
 export REGISTRY_SECRET=anna-pull-secret
 ```
 
-Now you need to provide several other environment variables. You'll need to
-indicate the location and tag of the `vllm-sim` image:
+Set the `VLLM_MODE` environment variable based on which version of vLLM you want to deploy:
 
-```console
-export VLLM_SIM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
-export VLLM_SIM_TAG="<YOUR_TAG>"
+* `vllm-sim`: lightweight simulator for simple environments (default).
+* `vllm`: full vLLM model server, using GPU/CPU for inferencing.
+* `vllm-p2p`: full vLLM with LMCache P2P support to enable KV-Cache-aware routing.
+
+```bash
+export VLLM_MODE=vllm-sim # or vllm / vllm-p2p
 ```
 
-The same thing will need to be done for the EPP:
+Set the Hugging Face token variable:
 
-```console
-export EPP_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
-export EPP_TAG="<YOUR_TAG>"
+```bash
+export HF_TOKEN="<HF_TOKEN>"
 ```
 
+**Warning**: For `vllm` mode, the default image uses llama3-8b. Make sure you have permission to access these files in their respective repositories.
+
+**Note:** The model can be replaced. See [Environment Configuration](#environment-configuration) for model settings.
+
 Once all this is set up, you can deploy the environment:
 
-```console
+```bash
 make environment.dev.kubernetes
 ```
 
 This will deploy the entire stack to whatever namespace you chose. You can test
 by exposing the inference `Gateway` via port-forward:
 
-```console
-kubectl -n ${NAMESPACE} port-forward service/inference-gateway-istio 8080:80
+```bash
+kubectl port-forward service/inference-gateway 8080:80
 ```
 
 And making requests with `curl`:
 
-```console
+**vllm-sim:**
+
+```bash
 curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
 -d '{"model":"food-review","prompt":"hi","max_tokens":10,"temperature":0}' | jq
 ```
 
+**vllm or vllm-p2p:**
+
+```bash
+curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
+-d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"hi","max_tokens":10,"temperature":0}' | jq
+```
+
+#### Environment Configuration
+
+**1. Setting the EPP image and tag:**
+
+You can optionally set a custom EPP image (otherwise, the default will be used):
+
+```bash
+export EPP_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
+export EPP_TAG="<YOUR_TAG>"
+```
+
+**2. Setting the vLLM image and tag:**
+
+Each vLLM mode has default image values, but you can override them.
+
+For `vllm-sim` mode:
+
+```bash
+export VLLM_SIM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
+export VLLM_SIM_TAG="<YOUR_TAG>"
+```
+
+For `vllm` and `vllm-p2p` modes:
+
+```bash
+export VLLM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
+export VLLM_TAG="<YOUR_TAG>"
+```
+
+**3. Setting the model name and label:**
+
+You can replace the model name that will be used in the system:
+
+```bash
+export MODEL_NAME="${MODEL_NAME:-mistralai/Mistral-7B-Instruct-v0.2}"
+export MODEL_LABEL="${MODEL_LABEL:-mistral7b}"
+```
+
+It is also recommended to update the inference pool name accordingly so that it aligns with the model:
+
+```bash
+export POOL_NAME="${POOL_NAME:-vllm-Mistral-7B-Instruct}"
+```
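The pool name above mirrors the model name. As a sketch (the derivation convention here is an assumption, not something the scripts enforce), shell parameter expansion can build one from `MODEL_NAME`:

```shell
# Sketch: derive POOL_NAME from MODEL_NAME using shell parameter expansion.
# The "strip org prefix, drop version suffix" convention is an assumption.
MODEL_NAME="mistralai/Mistral-7B-Instruct-v0.2"
base="${MODEL_NAME#*/}"        # strips "mistralai/" -> Mistral-7B-Instruct-v0.2
POOL_NAME="vllm-${base%-v*}"   # drops "-v0.2" -> vllm-Mistral-7B-Instruct
echo "$POOL_NAME"
```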
+
+**4. Additional environment settings:**
+
+More environment variable settings can be found in `scripts/kubernetes-dev-env.sh`.
+
 #### Development Cycle
 
 > **WARNING**: This is a very manual process at the moment. We expect to make
@@ -221,19 +296,19 @@
 Make your changes locally and commit them. Then select an image tag based on
 the `git` SHA:
 
-```console
+```bash
 export EPP_TAG=$(git rev-parse HEAD)
 ```
 
 Build the image:
 
-```console
+```bash
 DEV_VERSION=$EPP_TAG make image-build
 ```
 
 Tag the image for your private registry and push it:
 
-```console
+```bash
 $CONTAINER_RUNTIME tag quay.io/vllm-d/gateway-api-inference-extension/epp:$TAG \
 <MY_REGISTRY>/<MY_IMAGE>:$EPP_TAG
 $CONTAINER_RUNTIME push <MY_REGISTRY>/<MY_IMAGE>:$EPP_TAG

@@ -245,7 +320,7 @@ $CONTAINER_RUNTIME push <MY_REGISTRY>/<MY_IMAGE>:$EPP_TAG
 Then you can re-deploy the environment with the new changes (don't forget all
 the required env vars):
 
-```console
+```bash
 make environment.dev.kubernetes
 ```
251326

Makefile

+2-5
@@ -784,11 +784,8 @@ environment.dev.kubernetes: check-kubectl check-kustomize check-envsubst
 # ------------------------------------------------------------------------------
 .PHONY: clean.environment.dev.kubernetes
 clean.environment.dev.kubernetes: check-kubectl check-kustomize check-envsubst
-ifndef NAMESPACE
-$(error "Error: NAMESPACE is required but not set")
-endif
-	@echo "INFO: cleaning up dev environment in $(NAMESPACE)"
-	kustomize build deploy/environments/dev/kubernetes-kgateway | envsubst | kubectl -n "${NAMESPACE}" delete -f -
+	@CLEAN=true ./scripts/kubernetes-dev-env.sh 2>&1
+	@echo "INFO: Finished cleanup of development environment for $(VLLM_MODE) mode in namespace $(NAMESPACE)"
 
 # -----------------------------------------------------------------------------
 # TODO: these are old aliases that we still need for the moment, but will be

deploy/components/inference-gateway/deployments.yaml

+1-1
@@ -22,7 +22,7 @@ spec:
       imagePullPolicy: IfNotPresent
       args:
       - -poolName
-      - "vllm-llama3-8b-instruct"
+      - "${POOL_NAME}"
      - -v
      - "4"
      - --zap-encoder

deploy/components/inference-gateway/httproutes.yaml

+1-1
@@ -13,7 +13,7 @@ spec:
   backendRefs:
   - group: inference.networking.x-k8s.io
     kind: InferencePool
-    name: vllm-llama3-8b-instruct
+    name: ${POOL_NAME}
    port: 8000
  timeouts:
    request: 30s

deploy/components/inference-gateway/inference-models.yaml

+11-1
@@ -6,7 +6,17 @@ spec:
   modelName: food-review
   criticality: Critical
   poolRef:
-    name: vllm-llama3-8b-instruct
+    name: ${POOL_NAME}
   targetModels:
   - name: food-review
     weight: 100
+---
+apiVersion: inference.networking.x-k8s.io/v1alpha2
+kind: InferenceModel
+metadata:
+  name: base-model
+spec:
+  modelName: ${MODEL_NAME}
+  criticality: Critical
+  poolRef:
+    name: ${POOL_NAME}
@@ -1,10 +1,10 @@
 apiVersion: inference.networking.x-k8s.io/v1alpha2
 kind: InferencePool
 metadata:
-  name: vllm-llama3-8b-instruct
+  name: ${POOL_NAME}
 spec:
   targetPortNumber: 8000
   selector:
-    app: vllm-llama3-8b-instruct
+    app: ${POOL_NAME}
   extensionRef:
     name: endpoint-picker
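These `${POOL_NAME}` placeholders are rendered at deploy time by piping manifests through `envsubst`, as the Makefile does with `kustomize build ... | envsubst`. The substitution step can be sketched with `sed` as a stand-in when `envsubst` is not installed:

```shell
# Sketch: render a ${POOL_NAME} placeholder the way the
# "kustomize build ... | envsubst" pipeline does, using sed as a stand-in.
POOL_NAME="vllm-Mistral-7B-Instruct"
template='name: ${POOL_NAME}'
echo "$template" | sed "s/\${POOL_NAME}/${POOL_NAME}/"
# -> name: vllm-Mistral-7B-Instruct
```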
@@ -0,0 +1,32 @@
+# ------------------------------------------------------------------------------
+# vLLM P2P Deployment
+#
+# This deploys the full vLLM model server, capable of serving real models such
+# as Llama 3.1-8B-Instruct via the OpenAI-compatible API. It is intended for
+# environments with GPU resources and where full inference capabilities are
+# required.
+# In addition, it adds LMCache, an LLM serving engine extension that uses
+# Redis, to the vLLM image.
+#
+# The deployment can be customized using environment variables to set:
+# - The container image and tag (VLLM_IMAGE, VLLM_TAG)
+# - The model to load (MODEL_NAME)
+#
+# This setup is suitable for testing on Kubernetes (including
+# GPU-enabled nodes or clusters with scheduling for `nvidia.com/gpu`).
+# -----------------------------------------------------------------------------
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+
+resources:
+- vllm-deployment.yaml
+- redis-deployment.yaml
+- redis-service.yaml
+- secret.yaml
+
+images:
+- name: vllm/vllm-openai
+  newName: ${VLLM_IMAGE}
+  newTag: ${VLLM_TAG}
+- name: redis
+  newName: ${REDIS_IMAGE}
+  newTag: ${REDIS_TAG}
@@ -0,0 +1,50 @@
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: ${REDIS_DEPLOYMENT_NAME}
+  labels:
+    app.kubernetes.io/name: redis
+    app.kubernetes.io/component: redis-lookup-server
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: redis
+      app.kubernetes.io/component: redis-lookup-server
+  template:
+    metadata:
+      labels:
+        app.kubernetes.io/name: redis
+        app.kubernetes.io/component: redis-lookup-server
+    spec:
+      containers:
+      - name: lookup-server
+        image: ${REDIS_IMAGE}:${REDIS_TAG}
+        imagePullPolicy: IfNotPresent
+        command:
+        - redis-server
+        ports:
+        - name: redis-port
+          containerPort: ${REDIS_TARGET_PORT}
+          protocol: TCP
+        resources:
+          limits:
+            cpu: "4"
+            memory: 10G
+          requests:
+            cpu: "4"
+            memory: 8G
+        terminationMessagePath: /dev/termination-log
+        terminationMessagePolicy: File
+      restartPolicy: Always
+      terminationGracePeriodSeconds: 30
+      dnsPolicy: ClusterFirst
+      securityContext: {}
+      schedulerName: default-scheduler
+  strategy:
+    type: RollingUpdate
+    rollingUpdate:
+      maxUnavailable: 25%
+      maxSurge: 25%
+  revisionHistoryLimit: 10
+  progressDeadlineSeconds: 600
