Commit b189362

[fix]: Small fixes for deployment and fix comments
Signed-off-by: Kfir Toledo <[email protected]>
1 parent 1a7fa8e commit b189362

25 files changed: +152 −120 lines

DEVELOPMENT.md (+60 −31)
@@ -152,6 +152,13 @@ Create the namespace:
 ```console
 kubectl create namespace ${NAMESPACE}
 ```
+Set the default namespace for kubectl commands:
+
+```console
+kubectl config set-context --current --namespace="${NAMESPACE}"
+```
+
+> NOTE: If you are using OpenShift (oc CLI), use the following instead: `oc project "${NAMESPACE}"`
 
 You'll need to provide a `Secret` with the login credentials for your private
 repository (e.g. quay.io). It should look something like this:
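For reference, a pull secret of this shape can also be created imperatively rather than hand-written; a minimal sketch, assuming quay.io credentials and the `anna-pull-secret` name used below:

```bash
# Hypothetical credentials; substitute your own registry login.
kubectl create secret docker-registry anna-pull-secret \
  --docker-server=quay.io \
  --docker-username="<YOUR_USERNAME>" \
  --docker-password="<YOUR_PASSWORD>" \
  -n "${NAMESPACE}"
```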
@@ -178,13 +185,6 @@ Export the name of the `Secret` to the environment:
 export REGISTRY_SECRET=anna-pull-secret
 ```
 
-You can optionally set a custom EPP image (otherwise, the default will be used):
-
-```console
-export EPP_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
-export EPP_TAG="<YOUR_TAG>"
-```
-
 Set the `VLLM_MODE` environment variable based on which version of vLLM you want to deploy:
 
 - `vllm-sim`: Lightweight simulator for simple environments
@@ -194,24 +194,10 @@ Set the `VLLM_MODE` environment variable based on which version of vLLM you want to deploy:
 ```console
 export VLLM_MODE=vllm-sim # or vllm / vllm-p2p
 ```
-Each mode has default image values, but you can override them:
 
-For vllm-sim:
-
-```console
-export VLLM_SIM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
-export VLLM_SIM_TAG="<YOUR_TAG>"
-```
-
-For vllm and vllm-p2p:
-- set Vllm image:
-```console
-export VLLM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
-export VLLM_TAG="<YOUR_TAG>"
-```
 - Set hugging face token variable:
 export HF_TOKEN="<HF_TOKEN>"
-**Warning**: For vllm mode, the default image uses llama3-8b and vllm-mistral. Make sure you have permission to access these files in their respective repositories.
+**Warning**: For vllm mode, the default image uses llama3-8b. Make sure you have permission to access these files in their respective repositories.
 
 Once all this is set up, you can deploy the environment:
 
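Taken together, a minimal simulator deployment through this flow looks like the following sketch (the namespace value is illustrative; the rest are the documented defaults):

```bash
# Illustrative end-to-end setup for the vllm-sim mode.
export NAMESPACE="dev-inference"          # placeholder namespace
export REGISTRY_SECRET="anna-pull-secret"
export VLLM_MODE="vllm-sim"
make environment.dev.kubernetes
```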
@@ -222,30 +208,73 @@ make environment.dev.kubernetes
 This will deploy the entire stack to whatever namespace you chose. You can test
 by exposing the inference `Gateway` via port-forward:
 
-```console
+```bash
 kubectl -n ${NAMESPACE} port-forward service/inference-gateway 8080:80
 ```
 
 And making requests with `curl`:
 - vllm-sim
 
-```console
+```bash
 curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
 -d '{"model":"food-review","prompt":"hi","max_tokens":10,"temperature":0}' | jq
 ```
 
-- vllm
+- vllm or vllm-p2p
 
-```console
+```bash
 curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
 -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"hi","max_tokens":10,"temperature":0}' | jq
 ```
+#### Environment Configuration
+
+##### **1. Setting the EPP image and tag:**
+
+You can optionally set a custom EPP image (otherwise, the default will be used):
+
+```bash
+export EPP_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
+export EPP_TAG="<YOUR_TAG>"
+```
+##### **2. Setting the vLLM image and tag:**
+
+Each vLLM mode has default image values, but you can override them:
+
+**For `vllm-sim` mode:**
+
+```bash
+export VLLM_SIM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
+export VLLM_SIM_TAG="<YOUR_TAG>"
+```
+
+**For `vllm` and `vllm-p2p` modes:**
+
+```bash
+export VLLM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
+export VLLM_TAG="<YOUR_TAG>"
+```
+
+##### **3. Setting the model name and label:**
+
+You can replace the model name that will be used in the system:
+
+```bash
+export MODEL_NAME="${MODEL_NAME:-mistralai/Mistral-7B-Instruct-v0.2}"
+export MODEL_LABEL="${MODEL_LABEL:-mistral7b}"
+```
+
+It is also recommended to update the pool name accordingly:
+
+```bash
+export POOL_NAME="${POOL_NAME:-vllm-Mistral-7B-Instruct}"
+```
+
+##### **4. Additional environment settings:**
+
+More environment variable settings can be found in `scripts/kubernetes-dev-env.sh`.
+
+
 
-- vllm-p2p
-```console
-curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
--d '{"model":"mistralai/Mistral-7B-Instruct-v0.2","prompt":"hi","max_tokens":10,"temperature":0}' | jq
-```
 #### Development Cycle
 
 > **WARNING**: This is a very manual process at the moment. We expect to make
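Following the DEVELOPMENT.md changes above: if `MODEL_NAME` is overridden as in step 3, the `curl` request body should reference that model. A sketch assuming the Mistral default shown there:

```bash
curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
  -d '{"model":"mistralai/Mistral-7B-Instruct-v0.2","prompt":"hi","max_tokens":10,"temperature":0}' | jq
```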

deploy/components/inference-gateway/deployments.yaml (+2 −2)
@@ -22,7 +22,7 @@ spec:
         imagePullPolicy: IfNotPresent
         args:
         - -poolName
-        - "vllm-llama3-8b-instruct"
+        - "${POOL_NAME}"
         - -v
         - "4"
         - --zap-encoder
@@ -55,4 +55,4 @@ spec:
         valueFrom:
           secretKeyRef:
             name: ${HF_SECRET_NAME}
-            key: ${HF_SECRET_KEY}
\ No newline at end of file
+            key: ${HF_SECRET_KEY}
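A note on the `${POOL_NAME}`-style placeholders used here: these manifests are evidently rendered through environment-variable substitution before being applied. A minimal sketch of that step, assuming `envsubst` (the actual mechanism is driven by the Makefile and `scripts/kubernetes-dev-env.sh`):

```bash
# Assumed rendering step; the repo's make targets handle this in practice.
export POOL_NAME="vllm-Mistral-7B-Instruct"
kubectl kustomize deploy/components/inference-gateway \
  | envsubst \
  | kubectl apply -n "${NAMESPACE}" -f -
```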

deploy/components/inference-gateway/httproutes.yaml (+1 −1)
@@ -13,7 +13,7 @@ spec:
     backendRefs:
     - group: inference.networking.x-k8s.io
       kind: InferencePool
-      name: vllm-llama3-8b-instruct
+      name: ${POOL_NAME}
       port: 8000
     timeouts:
       request: 30s

deploy/components/inference-gateway/inference-models.yaml (+1 −21)
@@ -16,27 +16,7 @@ kind: InferenceModel
 metadata:
   name: base-model
 spec:
-  modelName: meta-llama/Llama-3.1-8B-Instruct
+  modelName: ${MODEL_NAME}
   criticality: Critical
   poolRef:
     name: ${POOL_NAME}
----
-apiVersion: inference.networking.x-k8s.io/v1alpha2
-kind: InferenceModel
-metadata:
-  name: base-model-cpu
-spec:
-  modelName: Qwen/Qwen2.5-1.5B-Instruct
-  criticality: Critical
-  poolRef:
-    name: ${POOL_NAME}
----
-apiVersion: inference.networking.x-k8s.io/v1alpha2
-kind: InferenceModel
-metadata:
-  name: mistarli
-spec:
-  modelName: mistralai/Mistral-7B-Instruct-v0.2
-  criticality: Critical
-  poolRef:
-    name: ${POOL_NAME}
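With the two extra hardcoded models dropped, a single parameterized `InferenceModel` remains. Rendered with the Mistral defaults from DEVELOPMENT.md, it would look roughly like this (a sketch, assuming plain `${VAR}` substitution):

```bash
# Illustrative rendered manifest, applied via a heredoc.
cat <<'EOF' | kubectl apply -n "${NAMESPACE}" -f -
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: base-model
spec:
  modelName: mistralai/Mistral-7B-Instruct-v0.2
  criticality: Critical
  poolRef:
    name: vllm-Mistral-7B-Instruct
EOF
```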

deploy/components/inference-gateway/kustomization.yaml (+2)
@@ -26,6 +26,8 @@ resources:
 - deployments.yaml
 - gateways.yaml
 - httproutes.yaml
+- secret.yaml
+
 
 images:
 - name: quay.io/vllm-d/gateway-api-inference-extension/epp

deploy/components/vllm-p2p/deployments/secret.yaml renamed to deploy/components/inference-gateway/secret.yaml (−1)
@@ -2,7 +2,6 @@ apiVersion: v1
 kind: Secret
 metadata:
   name: ${HF_SECRET_NAME}
-  namespace: ${NAMESPACE}
   labels:
     app.kubernetes.io/name: vllm
     app.kubernetes.io/component: secret

deploy/components/vllm-p2p/kustomization.yaml (+20 −6)
@@ -1,13 +1,27 @@
+# ------------------------------------------------------------------------------
+# vLLM P2P Deployment
+#
+# This deploys the full vLLM model server, capable of serving real models such
+# as Llama 3.1-8B-Instruct via the OpenAI-compatible API. It is intended for
+# environments with GPU resources and where full inference capabilities are
+# required.
+# In addition, it adds LMCache, an LLM serving engine extension using Redis, to the vLLM image.
+#
+# The deployment can be customized using environment variables to set:
+# - The container image and tag (VLLM_IMAGE, VLLM_TAG)
+# - The model to load (MODEL_NAME)
+#
+# This setup is suitable for testing and production with Kubernetes (including
+# GPU-enabled nodes or clusters with scheduling for `nvidia.com/gpu`).
+# -----------------------------------------------------------------------------
 apiVersion: kustomize.config.k8s.io/v1beta1
 kind: Kustomization
 
-namespace: ${NAMESPACE}
-
 resources:
-- deployments/vllm-deployment.yaml
-- deployments/redis-deployment.yaml
-- service/redis-service.yaml
-- deployments/secret.yaml
+- vllm-deployment.yaml
+- redis-deployment.yaml
+- redis-service.yaml
+- secret.yaml
 
 images:
 - name: vllm/vllm-openai
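Since LMCache depends on the Redis lookup server deployed alongside vLLM, a quick connectivity check after deployment can be useful; a sketch, with the service name taken from `${REDIS_SVC_NAME}`:

```bash
# Spin up a throwaway redis-cli pod and ping the lookup server.
kubectl run redis-ping --rm -it --restart=Never -n "${NAMESPACE}" \
  --image=redis:7 -- redis-cli -h "${REDIS_SVC_NAME}" ping
```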

deploy/components/vllm-p2p/deployments/redis-deployment.yaml renamed to deploy/components/vllm-p2p/redis-deployment.yaml (+1 −6)
@@ -1,7 +1,7 @@
 apiVersion: apps/v1
 kind: Deployment
 metadata:
-  name: ${REDIS_SVC_NAME}
+  name: ${REDIS_DEPLOYMENT_NAME}
   labels:
     app.kubernetes.io/name: redis
     app.kubernetes.io/component: redis-lookup-server
@@ -48,8 +48,3 @@ spec:
       maxSurge: 25%
   revisionHistoryLimit: 10
   progressDeadlineSeconds: 600
-  # securityContext:
-  #   allowPrivilegeEscalation: false
-  #   capabilities:
-  #     drop:
-  #     - ALL
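The rename from `${REDIS_SVC_NAME}` to `${REDIS_DEPLOYMENT_NAME}` means the Deployment and the Service are now named by separate variables, so both need values when rendering manually (the names below are illustrative; defaults live in `scripts/kubernetes-dev-env.sh`):

```bash
# Illustrative names for the Redis lookup-server objects.
export REDIS_DEPLOYMENT_NAME="lookup-server"
export REDIS_SVC_NAME="lookup-server-service"
```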
deploy/components/vllm-p2p/secret.yaml (new file, +10)
@@ -0,0 +1,10 @@
+apiVersion: v1
+kind: Secret
+metadata:
+  name: ${HF_SECRET_NAME}
+  labels:
+    app.kubernetes.io/name: vllm
+    app.kubernetes.io/component: secret
+type: Opaque
+data:
+  ${HF_SECRET_KEY}: ${HF_TOKEN}
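One caveat: the `data` field of a Kubernetes Secret must hold base64-encoded values, so the substituted `${HF_TOKEN}` presumably has to be encoded before rendering (an assumption; the dev-env script may already handle this, and `stringData` would accept the raw token instead):

```bash
# Assumption: HF_TOKEN must be base64-encoded to be valid under `data:`.
export HF_TOKEN="$(printf '%s' '<HF_TOKEN>' | base64)"
```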

deploy/components/vllm-p2p/deployments/vllm-deployment.yaml renamed to deploy/components/vllm-p2p/vllm-deployment.yaml (−5)
@@ -22,11 +22,6 @@ spec:
         app.kubernetes.io/model: ${MODEL_LABEL}
         app: ${POOL_NAME}
     spec:
-      # securityContext:
-      #   runAsUser: ${PROXY_UID}
-      #   runAsNonRoot: true
-      #   seccompProfile:
-      #     type: RuntimeDefault
       containers:
       - name: vllm
         image: ${VLLM_IMAGE}:${VLLM_TAG}

deploy/components/vllm-sim/deployments.yaml (+3 −3)
@@ -3,16 +3,16 @@ kind: Deployment
 metadata:
   name: vllm-sim
   labels:
-    app: vllm-llama3-8b-instruct
+    app: ${POOL_NAME}
 spec:
   replicas: 1
   selector:
     matchLabels:
-      app: vllm-llama3-8b-instruct
+      app: ${POOL_NAME}
   template:
     metadata:
       labels:
-        app: vllm-llama3-8b-instruct
+        app: ${POOL_NAME}
         ai-aware-router-pod: "true"
     spec:
       containers:
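Since the `app` label is now derived from `POOL_NAME`, the rendered simulator pods can be checked with a label selector; a small sketch:

```bash
# List the vllm-sim pods by their parameterized app label.
kubectl get pods -n "${NAMESPACE}" -l app="${POOL_NAME}"
```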

deploy/components/vllm/configmap.yaml (+3 −3)
@@ -1,13 +1,13 @@
 apiVersion: v1
 kind: ConfigMap
 metadata:
-  name: vllm-llama3-8b-instruct-adapters
+  name: lora-adapters
 data:
   configmap.yaml: |
     vLLMLoRAConfig:
-      name: vllm-llama3-8b-instruct-adapters
+      name: lora-adapters
       port: 8000
-      defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct
+      defaultBaseModel: ${MODEL_NAME}
       ensureExist:
         models:
         - id: food-review-1

deploy/components/vllm/deployments.yaml (+4 −14)
@@ -24,7 +24,7 @@ spec:
         command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
         args:
         - "--model"
-        - "meta-llama/Llama-3.1-8B-Instruct"
+        - "${MODEL_NAME}"
         - "--tensor-parallel-size"
        - "1"
         - "--port"
@@ -48,8 +48,8 @@ spec:
         - name: HUGGING_FACE_HUB_TOKEN
           valueFrom:
             secretKeyRef:
-              name: hf-token
-              key: token
+              name: ${HF_SECRET_NAME}
+              key: ${HF_SECRET_KEY}
         - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
           value: "true"
         - name: XDG_CACHE_HOME
@@ -104,11 +104,6 @@ spec:
           name: shm
         - mountPath: /adapters
           name: adapters
-        securityContext:
-          allowPrivilegeEscalation: false
-          capabilities:
-            drop:
-            - ALL
       initContainers:
       - name: lora-adapter-syncer
         tty: true
@@ -122,11 +117,6 @@ spec:
         volumeMounts:
         - name: config-volume
           mountPath: /config
-        securityContext:
-          allowPrivilegeEscalation: false
-          capabilities:
-            drop:
-            - ALL
       restartPolicy: Always
       enableServiceLinks: false
       terminationGracePeriodSeconds: 130
@@ -140,4 +130,4 @@ spec:
         emptyDir: {}
       - name: config-volume
         configMap:
-          name: vllm-llama3-8b-instruct-adapters
+          name: lora-adapters

deploy/components/vllm/kustomization.yaml (+4 −1)
@@ -17,7 +17,6 @@ kind: Kustomization
 
 resources:
 - deployments.yaml
-- secret.yaml
 - configmap.yaml
 
 
@@ -26,6 +25,10 @@ images:
   newName: ${VLLM_IMAGE}
   newTag: ${VLLM_TAG}
 
+- name: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer
+  newName: ${LORA_ADAPTER_SYNCER_IMAGE}
+  newTag: ${LORA_ADAPTER_SYNCER_TAG}
+
 configMapGenerator:
 - name: vllm-model-config
   literals:
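With the lora-syncer image parameterized alongside the vLLM image, overriding it follows the same pattern as the other image variables (placeholders as in DEVELOPMENT.md):

```bash
# Optionally override the LoRA adapter syncer image and tag.
export LORA_ADAPTER_SYNCER_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
export LORA_ADAPTER_SYNCER_TAG="<YOUR_TAG>"
```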

deploy/components/vllm/secret.yaml (−1)
@@ -2,7 +2,6 @@ apiVersion: v1
 kind: Secret
 metadata:
   name: ${HF_SECRET_NAME}
-  namespace: ${NAMESPACE}
 labels:
     app.kubernetes.io/name: vllm
     app.kubernetes.io/component: secret
