Commit b189362

[fix]: Small fixes for deployment and fix comments
Signed-off-by: Kfir Toledo <[email protected]>
1 parent 1a7fa8e commit b189362

25 files changed: +152 −120 lines

DEVELOPMENT.md (+60 −31)
@@ -152,6 +152,13 @@ Create the namespace:
 ```console
 kubectl create namespace ${NAMESPACE}
 ```
+Set the default namespace for kubectl commands:
+
+```console
+kubectl config set-context --current --namespace="${NAMESPACE}"
+```
+
+> NOTE: If you are using OpenShift (oc CLI), use the following instead: `oc project "${NAMESPACE}"`
 
 You'll need to provide a `Secret` with the login credentials for your private
 repository (e.g. quay.io). It should look something like this:
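For reference, a pull secret of this shape can also be created imperatively rather than hand-written; a minimal sketch, assuming quay.io credentials and the `anna-pull-secret` name used below:

```bash
# Hypothetical credentials; substitute your own registry login.
kubectl create secret docker-registry anna-pull-secret \
  --docker-server=quay.io \
  --docker-username="<YOUR_USERNAME>" \
  --docker-password="<YOUR_PASSWORD>" \
  -n "${NAMESPACE}"
```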
@@ -178,13 +185,6 @@ Export the name of the `Secret` to the environment:
 export REGISTRY_SECRET=anna-pull-secret
 ```
 
-You can optionally set a custom EPP image (otherwise, the default will be used):
-
-```console
-export EPP_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
-export EPP_TAG="<YOUR_TAG>"
-```
-
 Set the `VLLM_MODE` environment variable based on which version of vLLM you want to deploy:
 
 - `vllm-sim`: Lightweight simulator for simple environments
@@ -194,24 +194,10 @@ Set the `VLLM_MODE` environment variable based on which version of vLLM you want to deploy:
 ```console
 export VLLM_MODE=vllm-sim # or vllm / vllm-p2p
 ```
-Each mode has default image values, but you can override them:
 
-For vllm-sim:
-
-```console
-export VLLM_SIM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
-export VLLM_SIM_TAG="<YOUR_TAG>"
-```
-
-For vllm and vllm-p2p:
-- set Vllm image:
-```console
-export VLLM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
-export VLLM_TAG="<YOUR_TAG>"
-```
 - Set hugging face token variable:
 export HF_TOKEN="<HF_TOKEN>"
-**Warning**: For vllm mode, the default image uses llama3-8b and vllm-mistral. Make sure you have permission to access these files in their respective repositories.
+**Warning**: For vllm mode, the default image uses llama3-8b. Make sure you have permission to access these files in their respective repositories.
 
 Once all this is set up, you can deploy the environment:
 
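Taken together, a minimal simulator deployment through this flow looks like the following sketch (the namespace value is illustrative; the rest are the documented defaults):

```bash
# Illustrative end-to-end setup for the vllm-sim mode.
export NAMESPACE="dev-inference"          # placeholder namespace
export REGISTRY_SECRET="anna-pull-secret"
export VLLM_MODE="vllm-sim"
make environment.dev.kubernetes
```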
@@ -222,30 +208,73 @@ make environment.dev.kubernetes
 This will deploy the entire stack to whatever namespace you chose. You can test
 by exposing the inference `Gateway` via port-forward:
 
-```console
+```bash
 kubectl -n ${NAMESPACE} port-forward service/inference-gateway 8080:80
 ```
 
 And making requests with `curl`:
 - vllm-sim
 
-```console
+```bash
 curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
 -d '{"model":"food-review","prompt":"hi","max_tokens":10,"temperature":0}' | jq
 ```
 
-- vllm
+- vllm or vllm-p2p
 
-```console
+```bash
 curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
 -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"hi","max_tokens":10,"temperature":0}' | jq
 ```
+#### Environment Configuration
+
+##### **1. Setting the EPP image and tag:**
+
+You can optionally set a custom EPP image (otherwise, the default will be used):
+
+```bash
+export EPP_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
+export EPP_TAG="<YOUR_TAG>"
+```
+##### **2. Setting the vLLM image and tag:**
+
+Each vLLM mode has default image values, but you can override them:
+
+**For `vllm-sim` mode:**
+
+```bash
+export VLLM_SIM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
+export VLLM_SIM_TAG="<YOUR_TAG>"
+```
+
+**For `vllm` and `vllm-p2p` modes:**
+
+```bash
+export VLLM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
+export VLLM_TAG="<YOUR_TAG>"
+```
+
+##### **3. Setting the model name and label:**
+
+You can replace the model name that will be used in the system:
+
+```bash
+export MODEL_NAME="${MODEL_NAME:-mistralai/Mistral-7B-Instruct-v0.2}"
+export MODEL_LABEL="${MODEL_LABEL:-mistral7b}"
+```
+
+It is also recommended to update the pool name accordingly:
+
+```bash
+export POOL_NAME="${POOL_NAME:-vllm-Mistral-7B-Instruct}"
+```
+
+##### **4. Additional environment settings:**
+
+More environment variable settings can be found in `scripts/kubernetes-dev-env.sh`.
+
+
 
-- vllm-p2p
-```console
-curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
--d '{"model":"mistralai/Mistral-7B-Instruct-v0.2","prompt":"hi","max_tokens":10,"temperature":0}' | jq
-```
 #### Development Cycle
 
 > **WARNING**: This is a very manual process at the moment. We expect to make
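Following the DEVELOPMENT.md changes above: if `MODEL_NAME` is overridden as in step 3, the `curl` request body should reference that model. A sketch assuming the Mistral default shown there:

```bash
curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
  -d '{"model":"mistralai/Mistral-7B-Instruct-v0.2","prompt":"hi","max_tokens":10,"temperature":0}' | jq
```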

deploy/components/inference-gateway/deployments.yaml (+2 −2)
@@ -22,7 +22,7 @@ spec:
         imagePullPolicy: IfNotPresent
         args:
         - -poolName
-        - "vllm-llama3-8b-instruct"
+        - "${POOL_NAME}"
         - -v
         - "4"
         - --zap-encoder
@@ -55,4 +55,4 @@ spec:
         valueFrom:
           secretKeyRef:
             name: ${HF_SECRET_NAME}
-            key: ${HF_SECRET_KEY}
\ No newline at end of file
+            key: ${HF_SECRET_KEY}
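A note on the `${POOL_NAME}`-style placeholders used here: these manifests are evidently rendered through environment-variable substitution before being applied. A minimal sketch of that step, assuming `envsubst` (the actual mechanism is driven by the Makefile and `scripts/kubernetes-dev-env.sh`):

```bash
# Assumed rendering step; the repo's make targets handle this in practice.
export POOL_NAME="vllm-Mistral-7B-Instruct"
kubectl kustomize deploy/components/inference-gateway \
  | envsubst \
  | kubectl apply -n "${NAMESPACE}" -f -
```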

deploy/components/inference-gateway/httproutes.yaml (+1 −1)
@@ -13,7 +13,7 @@ spec:
     backendRefs:
     - group: inference.networking.x-k8s.io
       kind: InferencePool
-      name: vllm-llama3-8b-instruct
+      name: ${POOL_NAME}
       port: 8000
     timeouts:
       request: 30s

deploy/components/inference-gateway/inference-models.yaml (+1 −21)
@@ -16,27 +16,7 @@ kind: InferenceModel
 metadata:
   name: base-model
 spec:
-  modelName: meta-llama/Llama-3.1-8B-Instruct
+  modelName: ${MODEL_NAME}
   criticality: Critical
   poolRef:
     name: ${POOL_NAME}
----
-apiVersion: inference.networking.x-k8s.io/v1alpha2
-kind: InferenceModel
-metadata:
-  name: base-model-cpu
-spec:
-  modelName: Qwen/Qwen2.5-1.5B-Instruct
-  criticality: Critical
-  poolRef:
-    name: ${POOL_NAME}
----
-apiVersion: inference.networking.x-k8s.io/v1alpha2
-kind: InferenceModel
-metadata:
-  name: mistarli
-spec:
-  modelName: mistralai/Mistral-7B-Instruct-v0.2
-  criticality: Critical
-  poolRef:
-    name: ${POOL_NAME}
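With the two extra hardcoded models dropped, a single parameterized `InferenceModel` remains. Rendered with the Mistral defaults from DEVELOPMENT.md, it would look roughly like this (a sketch, assuming plain `${VAR}` substitution):

```bash
# Illustrative rendered manifest, applied via a heredoc.
cat <<'EOF' | kubectl apply -n "${NAMESPACE}" -f -
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: base-model
spec:
  modelName: mistralai/Mistral-7B-Instruct-v0.2
  criticality: Critical
  poolRef:
    name: vllm-Mistral-7B-Instruct
EOF
```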

deploy/components/inference-gateway/kustomization.yaml (+2)
@@ -26,6 +26,8 @@ resources:
 - deployments.yaml
 - gateways.yaml
 - httproutes.yaml
+- secret.yaml
+
 
 images:
 - name: quay.io/vllm-d/gateway-api-inference-extension/epp

deploy/components/vllm-p2p/deployments/secret.yaml renamed to deploy/components/inference-gateway/secret.yaml (−1)
@@ -2,7 +2,6 @@ apiVersion: v1
 kind: Secret
 metadata:
   name: ${HF_SECRET_NAME}
-  namespace: ${NAMESPACE}
   labels:
     app.kubernetes.io/name: vllm
     app.kubernetes.io/component: secret

deploy/components/vllm-p2p/kustomization.yaml (+20 −6)
@@ -1,13 +1,27 @@
+# ------------------------------------------------------------------------------
+# vLLM P2P Deployment
+#
+# This deploys the full vLLM model server, capable of serving real models such
+# as Llama 3.1-8B-Instruct via the OpenAI-compatible API. It is intended for
+# environments with GPU resources and where full inference capabilities are
+# required.
+# In addition, it adds LMCache, an LLM serving engine extension using Redis, to the vLLM image.
+#
+# The deployment can be customized using environment variables to set:
+# - The container image and tag (VLLM_IMAGE, VLLM_TAG)
+# - The model to load (MODEL_NAME)
+#
+# This setup is suitable for testing and production with Kubernetes (including
+# GPU-enabled nodes or clusters with scheduling for `nvidia.com/gpu`).
+# -----------------------------------------------------------------------------
 apiVersion: kustomize.config.k8s.io/v1beta1
 kind: Kustomization
 
-namespace: ${NAMESPACE}
-
 resources:
-- deployments/vllm-deployment.yaml
-- deployments/redis-deployment.yaml
-- service/redis-service.yaml
-- deployments/secret.yaml
+- vllm-deployment.yaml
+- redis-deployment.yaml
+- redis-service.yaml
+- secret.yaml
 
 images:
 - name: vllm/vllm-openai
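Since LMCache depends on the Redis lookup server deployed alongside vLLM, a quick connectivity check after deployment can be useful; a sketch, with the service name taken from `${REDIS_SVC_NAME}`:

```bash
# Spin up a throwaway redis-cli pod and ping the lookup server.
kubectl run redis-ping --rm -it --restart=Never -n "${NAMESPACE}" \
  --image=redis:7 -- redis-cli -h "${REDIS_SVC_NAME}" ping
```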

deploy/components/vllm-p2p/deployments/redis-deployment.yaml renamed to deploy/components/vllm-p2p/redis-deployment.yaml (+1 −6)
@@ -1,7 +1,7 @@
 apiVersion: apps/v1
 kind: Deployment
 metadata:
-  name: ${REDIS_SVC_NAME}
+  name: ${REDIS_DEPLOYMENT_NAME}
   labels:
     app.kubernetes.io/name: redis
     app.kubernetes.io/component: redis-lookup-server
@@ -48,8 +48,3 @@ spec:
       maxSurge: 25%
   revisionHistoryLimit: 10
   progressDeadlineSeconds: 600
-  # securityContext:
-  #   allowPrivilegeEscalation: false
-  #   capabilities:
-  #     drop:
-  #     - ALL
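The rename from `${REDIS_SVC_NAME}` to `${REDIS_DEPLOYMENT_NAME}` means the Deployment and the Service are now named by separate variables, so both need values when rendering manually (the names below are illustrative; defaults live in `scripts/kubernetes-dev-env.sh`):

```bash
# Illustrative names for the Redis lookup-server objects.
export REDIS_DEPLOYMENT_NAME="lookup-server"
export REDIS_SVC_NAME="lookup-server-service"
```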
deploy/components/vllm-p2p/secret.yaml (new file, +10)
@@ -0,0 +1,10 @@
+apiVersion: v1
+kind: Secret
+metadata:
+  name: ${HF_SECRET_NAME}
+  labels:
+    app.kubernetes.io/name: vllm
+    app.kubernetes.io/component: secret
+type: Opaque
+data:
+  ${HF_SECRET_KEY}: ${HF_TOKEN}
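One caveat: the `data` field of a Kubernetes Secret must hold base64-encoded values, so the substituted `${HF_TOKEN}` presumably has to be encoded before rendering (an assumption; the dev-env script may already handle this, and `stringData` would accept the raw token instead):

```bash
# Assumption: HF_TOKEN must be base64-encoded to be valid under `data:`.
export HF_TOKEN="$(printf '%s' '<HF_TOKEN>' | base64)"
```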

deploy/components/vllm-p2p/deployments/vllm-deployment.yaml renamed to deploy/components/vllm-p2p/vllm-deployment.yaml (−5)
@@ -22,11 +22,6 @@ spec:
         app.kubernetes.io/model: ${MODEL_LABEL}
         app: ${POOL_NAME}
     spec:
-      # securityContext:
-      #   runAsUser: ${PROXY_UID}
-      #   runAsNonRoot: true
-      #   seccompProfile:
-      #     type: RuntimeDefault
       containers:
       - name: vllm
         image: ${VLLM_IMAGE}:${VLLM_TAG}

deploy/components/vllm-sim/deployments.yaml (+3 −3)
@@ -3,16 +3,16 @@ kind: Deployment
 metadata:
   name: vllm-sim
   labels:
-    app: vllm-llama3-8b-instruct
+    app: ${POOL_NAME}
 spec:
   replicas: 1
   selector:
     matchLabels:
-      app: vllm-llama3-8b-instruct
+      app: ${POOL_NAME}
   template:
     metadata:
       labels:
-        app: vllm-llama3-8b-instruct
+        app: ${POOL_NAME}
         ai-aware-router-pod: "true"
     spec:
       containers:
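Since the `app` label is now derived from `POOL_NAME`, the rendered simulator pods can be checked with a label selector; a small sketch:

```bash
# List the vllm-sim pods by their parameterized app label.
kubectl get pods -n "${NAMESPACE}" -l app="${POOL_NAME}"
```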

deploy/components/vllm/configmap.yaml (+3 −3)
@@ -1,13 +1,13 @@
 apiVersion: v1
 kind: ConfigMap
 metadata:
-  name: vllm-llama3-8b-instruct-adapters
+  name: lora-adapters
 data:
   configmap.yaml: |
     vLLMLoRAConfig:
-      name: vllm-llama3-8b-instruct-adapters
+      name: lora-adapters
       port: 8000
-      defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct
+      defaultBaseModel: ${MODEL_NAME}
       ensureExist:
         models:
         - id: food-review-1

deploy/components/vllm/deployments.yaml (+4 −14)
@@ -24,7 +24,7 @@ spec:
         command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
         args:
         - "--model"
-        - "meta-llama/Llama-3.1-8B-Instruct"
+        - "${MODEL_NAME}"
         - "--tensor-parallel-size"
        - "1"
         - "--port"
@@ -48,8 +48,8 @@ spec:
         - name: HUGGING_FACE_HUB_TOKEN
           valueFrom:
             secretKeyRef:
-              name: hf-token
-              key: token
+              name: ${HF_SECRET_NAME}
+              key: ${HF_SECRET_KEY}
         - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
           value: "true"
         - name: XDG_CACHE_HOME
@@ -104,11 +104,6 @@ spec:
           name: shm
         - mountPath: /adapters
           name: adapters
-        securityContext:
-          allowPrivilegeEscalation: false
-          capabilities:
-            drop:
-            - ALL
       initContainers:
       - name: lora-adapter-syncer
         tty: true
@@ -122,11 +117,6 @@ spec:
         volumeMounts:
         - name: config-volume
           mountPath: /config
-        securityContext:
-          allowPrivilegeEscalation: false
-          capabilities:
-            drop:
-            - ALL
       restartPolicy: Always
       enableServiceLinks: false
       terminationGracePeriodSeconds: 130
@@ -140,4 +130,4 @@ spec:
         emptyDir: {}
       - name: config-volume
         configMap:
-          name: vllm-llama3-8b-instruct-adapters
+          name: lora-adapters

deploy/components/vllm/kustomization.yaml (+4 −1)
@@ -17,7 +17,6 @@ kind: Kustomization
 
 resources:
 - deployments.yaml
-- secret.yaml
 - configmap.yaml
 
 
@@ -26,6 +25,10 @@ images:
   newName: ${VLLM_IMAGE}
   newTag: ${VLLM_TAG}
 
+- name: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer
+  newName: ${LORA_ADAPTER_SYNCER_IMAGE}
+  newTag: ${LORA_ADAPTER_SYNCER_TAG}
+
 configMapGenerator:
 - name: vllm-model-config
   literals:
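With the lora-syncer image parameterized alongside the vLLM image, overriding it follows the same pattern as the other image variables (placeholders as in DEVELOPMENT.md):

```bash
# Optionally override the LoRA adapter syncer image and tag.
export LORA_ADAPTER_SYNCER_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
export LORA_ADAPTER_SYNCER_TAG="<YOUR_TAG>"
```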

deploy/components/vllm/secret.yaml (−1)
@@ -2,7 +2,6 @@ apiVersion: v1
 kind: Secret
 metadata:
   name: ${HF_SECRET_NAME}
-  namespace: ${NAMESPACE}
 labels:
     app.kubernetes.io/name: vllm
     app.kubernetes.io/component: secret
