feat: Add scripts for kubernetes dev env using vLLM and vLLM-p2p #60

Open · wants to merge 4 commits into base: dev

131 changes: 103 additions & 28 deletions DEVELOPMENT.md
@@ -37,7 +37,7 @@ serving resources.

Run the following:

```console
```bash
make environment.dev.kind
```

@@ -48,13 +48,15 @@ namespace.
There are several ways to access the gateway:

**Port forward**:

```sh
$ kubectl --context kind-gie-dev port-forward service/inference-gateway 8080:80
```

**NodePort `inference-gateway-istio`**
> **Warning**: This method doesn't work on `podman` correctly, as `podman` support
> with `kind` is not fully implemented yet.

```sh
# Determine the k8s node address
$ kubectl --context kind-gie-dev get node -o yaml | grep address
@@ -80,9 +82,10 @@ By default, the created inference gateway can be accessed on port 30080. This can
be overridden to any free port in the range of 30000 to 32767 by running the above
command as follows:

```console
```bash
GATEWAY_HOST_PORT=<selected-port> make environment.dev.kind
```

**Where:** `<selected-port>` is the port on your local machine you want to use to
access the inference gateway.

@@ -96,7 +99,7 @@ access the inference gateway.
To test your changes to the GIE in this environment, make your changes locally
and then run the following:

```console
```bash
make environment.dev.kind.update
```

@@ -122,7 +125,7 @@ the `default` namespace if the cluster is private/personal).
The following will deploy all the infrastructure-level requirements (e.g. CRDs,
Operators, etc) to support the namespace-level development environments:

```console
```bash
make environment.dev.kubernetes.infrastructure
```

@@ -140,7 +143,7 @@ To deploy a development environment to the cluster you'll need to explicitly
provide a namespace. This can be `default` if this is your personal cluster,
but on a shared cluster you should pick something unique. For example:

```console
```bash
export NAMESPACE=annas-dev-environment
```

@@ -149,10 +152,18 @@ export NAMESPACE=annas-dev-environment

Create the namespace:

```console
```bash
kubectl create namespace ${NAMESPACE}
```

Set the default namespace for `kubectl` commands:

```bash
kubectl config set-context --current --namespace="${NAMESPACE}"
```

> NOTE: If you are using OpenShift (oc CLI), use the following instead: `oc project "${NAMESPACE}"`

You'll need to provide a `Secret` with the login credentials for your private
repository (e.g. quay.io). It should look something like this:

@@ -168,51 +179,115 @@ type: kubernetes.io/dockerconfigjson
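
The full Secret manifest is collapsed in this view. As a sketch of one way to generate such a `secret.yaml` (the registry server, secret name, and credentials below are illustrative, not taken from this PR), you can let `kubectl` produce it:

```bash
# Generate a docker-registry pull secret manifest without applying it.
# Replace the server, username, and token with your own values.
kubectl create secret docker-registry anna-pull-secret \
  --docker-server=quay.io \
  --docker-username=<YOUR_USERNAME> \
  --docker-password=<YOUR_TOKEN> \
  --dry-run=client -o yaml > secret.yaml
```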

Apply that to your namespace:

```console
kubectl -n ${NAMESPACE} apply -f secret.yaml
```bash
kubectl apply -f secret.yaml
```

Export the name of the `Secret` to the environment:

```console
```bash
export REGISTRY_SECRET=anna-pull-secret
```

Now you need to provide several other environment variables. You'll need to
indicate the location and tag of the `vllm-sim` image:
Set the `VLLM_MODE` environment variable based on which version of vLLM you want to deploy:

```console
export VLLM_SIM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
export VLLM_SIM_TAG="<YOUR_TAG>"
* `vllm-sim`: Lightweight simulator for simple environments (default).
* `vllm`: Full vLLM model server, using GPU/CPU for inference.
* `vllm-p2p`: Full vLLM with LMCache P2P support to enable KV-Cache-aware routing.

```bash
export VLLM_MODE=vllm-sim # or vllm / vllm-p2p
```

The same thing will need to be done for the EPP:
- Set the Hugging Face token variable:

```console
export EPP_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
export EPP_TAG="<YOUR_TAG>"
```bash
export HF_TOKEN="<HF_TOKEN>"
```

**Warning**: For `vllm` mode, the default image uses Llama-3-8B. Make sure you have permission to access these files in their respective repositories.

**Note:** The model can be replaced. See [Environment Configuration](#environment-configuration) for model settings.
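
As a quick sanity check before deploying (a sketch assuming `curl` and `jq` are available; it calls Hugging Face's public `whoami-v2` API), you can confirm the token is valid:

```bash
# Should print your Hugging Face username if HF_TOKEN is valid
curl -s -H "Authorization: Bearer ${HF_TOKEN}" \
  https://huggingface.co/api/whoami-v2 | jq -r .name
```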

Once all this is set up, you can deploy the environment:

```console
```bash
make environment.dev.kubernetes
```

This will deploy the entire stack to whatever namespace you chose. You can test
by exposing the inference `Gateway` via port-forward:

```console
kubectl -n ${NAMESPACE} port-forward service/inference-gateway-istio 8080:80
```bash
kubectl port-forward service/inference-gateway 8080:80
```

And making requests with `curl`:

```console
curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
-d '{"model":"food-review","prompt":"hi","max_tokens":10,"temperature":0}' | jq
- For `vllm-sim`:

```bash
curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
-d '{"model":"food-review","prompt":"hi","max_tokens":10,"temperature":0}' | jq
```

- For `vllm` or `vllm-p2p`:

```bash
curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
-d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"hi","max_tokens":10,"temperature":0}' | jq
```

#### Environment Configuration

**1. Setting the EPP image and tag:**

You can optionally set a custom EPP image (otherwise, the default will be used):

```bash
export EPP_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
export EPP_TAG="<YOUR_TAG>"
```

**2. Setting the vLLM image and tag:**

Each vLLM mode has default image values, but you can override them:

**For `vllm-sim` mode:**

```bash
export VLLM_SIM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
export VLLM_SIM_TAG="<YOUR_TAG>"
```

**For `vllm` and `vllm-p2p` modes:**

```bash
export VLLM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
export VLLM_TAG="<YOUR_TAG>"
```

**3. Setting the model name and label:**

You can replace the model name and label that will be used in the system:

```bash
export MODEL_NAME="${MODEL_NAME:-mistralai/Mistral-7B-Instruct-v0.2}"
export MODEL_LABEL="${MODEL_LABEL:-mistral7b}"
```

It is also recommended to update the inference pool name accordingly so that it aligns with the model:

```bash
export POOL_NAME="${POOL_NAME:-vllm-Mistral-7B-Instruct}"
```

**4. Additional environment settings:**

More environment variable settings can be found in `scripts/kubernetes-dev-env.sh`.
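
To get a quick overview of what can be overridden (a sketch that assumes the script sets its defaults via plain `VAR=` or `export VAR=` assignments; adjust the pattern if the script is structured differently):

```bash
# List the environment variables the dev-env script defines or defaults
grep -nE '^(export )?[A-Z_]+=' scripts/kubernetes-dev-env.sh
```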



#### Development Cycle

> **WARNING**: This is a very manual process at the moment. We expect to make
@@ -221,19 +296,19 @@ curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: applicati
Make your changes locally and commit them. Then select an image tag based on
the `git` SHA:

```console
```bash
export EPP_TAG=$(git rev-parse HEAD)
```

Build the image:

```console
```bash
DEV_VERSION=$EPP_TAG make image-build
```

Tag the image for your private registry and push it:

```console
```bash
$CONTAINER_RUNTIME tag quay.io/vllm-d/gateway-api-inference-extension/epp:$TAG \
<MY_REGISTRY>/<MY_IMAGE>:$EPP_TAG
$CONTAINER_RUNTIME push <MY_REGISTRY>/<MY_IMAGE>:$EPP_TAG
@@ -245,7 +320,7 @@ $CONTAINER_RUNTIME push <MY_REGISTRY>/<MY_IMAGE>:$EPP_TAG
Then you can re-deploy the environment with the new changes (don't forget all
the required env vars):

```console
```bash
make environment.dev.kubernetes
```

7 changes: 2 additions & 5 deletions Makefile
@@ -780,11 +780,8 @@ environment.dev.kubernetes: check-kubectl check-kustomize check-envsubst
# ------------------------------------------------------------------------------
.PHONY: clean.environment.dev.kubernetes
clean.environment.dev.kubernetes: check-kubectl check-kustomize check-envsubst
ifndef NAMESPACE
$(error "Error: NAMESPACE is required but not set")
endif
@echo "INFO: cleaning up dev environment in $(NAMESPACE)"
kustomize build deploy/environments/dev/kubernetes-kgateway | envsubst | kubectl -n "${NAMESPACE}" delete -f -
@CLEAN=true ./scripts/kubernetes-dev-env.sh 2>&1
@echo "INFO: Finished cleanup of development environment for $(VLLM_MODE) mode in namespace $(NAMESPACE)"

# -----------------------------------------------------------------------------
# TODO: these are old aliases that we still need for the moment, but will be
Expand Down
10 changes: 9 additions & 1 deletion deploy/components/inference-gateway/deployments.yaml
@@ -22,7 +22,7 @@ spec:
imagePullPolicy: IfNotPresent
args:
- -poolName
- "vllm-llama3-8b-instruct"
- "${POOL_NAME}"
- -v
- "4"
- --zap-encoder
@@ -48,3 +48,11 @@ spec:
service: inference-extension
initialDelaySeconds: 5
periodSeconds: 10
env:

Collaborator:

should this depend on VLLM_MODE?
Will the Pods come up if HF_SECRET_* are not defined (e.g., when using simulator)?

Author:

For VLLM-SIM, it’s not needed, so I generate a dummy key in the script (you don’t need to define it).
My issue is that I separated the gateway creation from the vLLM creation - I think it’s cleaner this way.
This is the only dependency we have, so I don’t mind if all versions of EPP receive all the variables and choose whether to use them or not.

- name: KVCACHE_INDEXER_REDIS_ADDR
value: ${REDIS_HOST}:${REDIS_PORT}
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: ${HF_SECRET_NAME}
key: ${HF_SECRET_KEY}
2 changes: 1 addition & 1 deletion deploy/components/inference-gateway/httproutes.yaml
@@ -13,7 +13,7 @@ spec:
backendRefs:
- group: inference.networking.x-k8s.io
kind: InferencePool
name: vllm-llama3-8b-instruct
name: ${POOL_NAME}
port: 8000
timeouts:
request: 30s
12 changes: 11 additions & 1 deletion deploy/components/inference-gateway/inference-models.yaml
@@ -6,7 +6,17 @@ spec:
modelName: food-review
criticality: Critical
poolRef:
name: vllm-llama3-8b-instruct
name: ${POOL_NAME}
targetModels:
- name: food-review
weight: 100
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
name: base-model
spec:
modelName: ${MODEL_NAME}
criticality: Critical
poolRef:
name: ${POOL_NAME}
4 changes: 2 additions & 2 deletions deploy/components/inference-gateway/inference-pools.yaml
@@ -1,10 +1,10 @@
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
name: vllm-llama3-8b-instruct
name: ${POOL_NAME}
spec:
targetPortNumber: 8000
selector:
app: vllm-llama3-8b-instruct
app: ${POOL_NAME}
extensionRef:
name: endpoint-picker
2 changes: 2 additions & 0 deletions deploy/components/inference-gateway/kustomization.yaml
@@ -26,6 +26,8 @@ resources:
- deployments.yaml
- gateways.yaml
- httproutes.yaml
- secret.yaml


images:
- name: quay.io/vllm-d/gateway-api-inference-extension/epp
10 changes: 10 additions & 0 deletions deploy/components/inference-gateway/secret.yaml
@@ -0,0 +1,10 @@
apiVersion: v1
kind: Secret
metadata:
name: ${HF_SECRET_NAME}
labels:
app.kubernetes.io/name: vllm
app.kubernetes.io/component: secret
type: Opaque
data:
${HF_SECRET_KEY}: ${HF_TOKEN}
32 changes: 32 additions & 0 deletions deploy/components/vllm-p2p/kustomization.yaml
@@ -0,0 +1,32 @@
# ------------------------------------------------------------------------------
# vLLM P2P Deployment
#
# This deploys the full vLLM model server, capable of serving real models such
# as Llama 3.1-8B-Instruct via the OpenAI-compatible API. It is intended for
# environments with GPU resources and where full inference capabilities are
# required.
# In addition, it adds LMCache, an LLM serving-engine extension that uses Redis, to the vLLM image.
#
# The deployment can be customized using environment variables to set:
# - The container image and tag (VLLM_IMAGE, VLLM_TAG)
# - The model to load (MODEL_NAME)
#
# This setup is suitable for testing on Kubernetes (including
# GPU-enabled nodes or clusters with scheduling for `nvidia.com/gpu`).
# -----------------------------------------------------------------------------
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- vllm-deployment.yaml
- redis-deployment.yaml
- redis-service.yaml
- secret.yaml

images:
- name: vllm/vllm-openai
newName: ${VLLM_IMAGE}
newTag: ${VLLM_TAG}
- name: redis
newName: ${REDIS_IMAGE}
newTag: ${REDIS_TAG}