diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md
index 8b6cf443..f6a7c466 100644
--- a/DEVELOPMENT.md
+++ b/DEVELOPMENT.md
@@ -37,7 +37,7 @@
serving resources. Run the following:

-```console
+```bash
make environment.dev.kind
```

@@ -48,6 +48,7 @@
namespace. There are several ways to access the gateway:

**Port forward**:
+
```sh
$ kubectl --context kind-gie-dev port-forward service/inference-gateway 8080:80
```

@@ -55,6 +56,7 @@ $ kubectl --context kind-gie-dev port-forward service/inference-gateway 8080:80
**NodePort `inference-gateway-istio`**
> **Warning**: This method doesn't work on `podman` correctly, as `podman` support
> with `kind` is not fully implemented yet.
+
```sh
# Determine the k8s node address
$ kubectl --context kind-gie-dev get node -o yaml | grep address

@@ -80,9 +82,10 @@
By default the created inference gateway can be accessed on port 30080. This can
be overridden to any free port in the range of 30000 to 32767 by running the above
command as follows:

-```console
+```bash
GATEWAY_HOST_PORT=<selected-port> make environment.dev.kind
```
+
**Where:** <selected-port> is the port on your local machine you want to use to
access the inference gateway.

@@ -96,7 +99,7 @@ access the inference gateway.
To test your changes to the GIE in this environment, make your changes locally
and then run the following:

-```console
+```bash
make environment.dev.kind.update
```

@@ -122,7 +125,7 @@ the `default` namespace if the cluster is private/personal).
The following will deploy all the infrastructure-level requirements (e.g. CRDs,
Operators, etc) to support the namespace-level development environments:

-```console
+```bash
make environment.dev.kubernetes.infrastructure
```

@@ -140,7 +143,7 @@
To deploy a development environment to the cluster you'll need to explicitly
provide a namespace. This can be `default` if this is your personal cluster, but
on a shared cluster you should pick something unique. For example:

-```console
+```bash
export NAMESPACE=annas-dev-environment
```

@@ -149,10 +152,18 @@ export NAMESPACE=annas-dev-environment
Create the namespace:

-```console
+```bash
kubectl create namespace ${NAMESPACE}
```

+Set the default namespace for `kubectl` commands:
+
+```bash
+kubectl config set-context --current --namespace="${NAMESPACE}"
+```
+
+> NOTE: If you are using OpenShift (`oc` CLI), use the following instead: `oc project "${NAMESPACE}"`
+
You'll need to provide a `Secret` with the login credentials for your private
repository (e.g. quay.io). It should look something like this:

@@ -168,51 +179,115 @@ type: kubernetes.io/dockerconfigjson
Apply that to your namespace:

-```console
-kubectl -n ${NAMESPACE} apply -f secret.yaml
+```bash
+kubectl apply -f secret.yaml
```

Export the name of the `Secret` to the environment:

-```console
+```bash
export REGISTRY_SECRET=anna-pull-secret
```

-Now you need to provide several other environment variables. You'll need to
-indicate the location and tag of the `vllm-sim` image:
+Set the `VLLM_MODE` environment variable based on which version of vLLM you want to deploy:

-```console
-export VLLM_SIM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
-export VLLM_SIM_TAG="<YOUR_TAG>"
+* `vllm-sim`: Lightweight simulator for simple environments (default).
+* `vllm`: Full vLLM model server, using GPU/CPU for inference
+* `vllm-p2p`: Full vLLM with LMCache P2P support to enable KV-Cache-aware routing
+
+```bash
+export VLLM_MODE=vllm-sim # or vllm / vllm-p2p
```

-The same thing will need to be done for the EPP:
+Set the Hugging Face token variable:

-```console
-export EPP_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
-export EPP_TAG="<YOUR_TAG>"
+```bash
+export HF_TOKEN="<HF_TOKEN>"
```

+**Warning**: For the `vllm` and `vllm-p2p` modes, the default deployment uses a Llama 3 8B model. Make sure you have permission to access these files in their respective repositories.
+
+**Note:** The model can be replaced. See [Environment Configuration](#environment-configuration) for model settings.
+
Once all this is set up, you can deploy the environment:

-```console
+```bash
make environment.dev.kubernetes
```

This will deploy the entire stack to whatever namespace you chose. You can test
by exposing the inference `Gateway` via port-forward:

-```console
-kubectl -n ${NAMESPACE} port-forward service/inference-gateway-istio 8080:80
+```bash
+kubectl port-forward service/inference-gateway 8080:80
```

And making requests with `curl`:

-```console
+**vllm-sim:**
+
+```bash
curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
-d '{"model":"food-review","prompt":"hi","max_tokens":10,"temperature":0}' | jq
```

+**vllm or vllm-p2p:**
+
+```bash
+curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
+ -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"hi","max_tokens":10,"temperature":0}' | jq
+```
+
+#### Environment Configuration
+
+**1. Setting the EPP image and tag:**
+
+You can optionally set a custom EPP image (otherwise, the default will be used):
+
+```bash
+export EPP_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
+export EPP_TAG="<YOUR_TAG>"
+```
+
+**2. Setting the vLLM image and tag:**
+
+Each vLLM mode has default image values, but you can override them:
+
+**For `vllm-sim` mode:**
+
+```bash
+export VLLM_SIM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
+export VLLM_SIM_TAG="<YOUR_TAG>"
+```
+
+**For `vllm` and `vllm-p2p` modes:**
+
+```bash
+export VLLM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
+export VLLM_TAG="<YOUR_TAG>"
+```
+
+**3. Setting the model name and label:**
+
+You can replace the model name that will be used in the system:
+
+```bash
+export MODEL_NAME="${MODEL_NAME:-mistralai/Mistral-7B-Instruct-v0.2}"
+export MODEL_LABEL="${MODEL_LABEL:-mistral7b}"
+```
+
+It is also recommended to update the inference pool name so that it aligns with the model:
+
+```bash
+export POOL_NAME="${POOL_NAME:-vllm-Mistral-7B-Instruct}"
+```
+
+**4. Additional environment settings:**
+
+More environment variable settings can be found in `scripts/kubernetes-dev-env.sh`.
+
#### Development Cycle

> **WARNING**: This is a very manual process at the moment. We expect to make

@@ -221,19 +296,19 @@ curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: applicati

Make your changes locally and commit them.
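At a glance, one pass through the cycle looks like the sketch below; each step is explained in the steps that follow. The `<MY_REGISTRY>/<MY_IMAGE>` placeholders stand for your private registry, and the sketch assumes `make image-build` tags the locally built EPP image with the value of `DEV_VERSION`:

```bash
# Condensed development loop (sketch): tag by git SHA, build, push, redeploy.
export EPP_TAG=$(git rev-parse HEAD)
DEV_VERSION=$EPP_TAG make image-build

# Re-tag for your private registry and push (placeholder registry/image names).
$CONTAINER_RUNTIME tag quay.io/vllm-d/gateway-api-inference-extension/epp:$EPP_TAG \
  <MY_REGISTRY>/<MY_IMAGE>:$EPP_TAG
$CONTAINER_RUNTIME push <MY_REGISTRY>/<MY_IMAGE>:$EPP_TAG

# Point the environment at the new image and redeploy.
export EPP_IMAGE=<MY_REGISTRY>/<MY_IMAGE>
make environment.dev.kubernetes
```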
Then select an image tag based on the `git` SHA: -```console +```bash export EPP_TAG=$(git rev-parse HEAD) ``` Build the image: -```console +```bash DEV_VERSION=$EPP_TAG make image-build ``` Tag the image for your private registry and push it: -```console +```bash $CONTAINER_RUNTIME tag quay.io/vllm-d/gateway-api-inference-extension/epp:$TAG \ <MY_REGISTRY>/<MY_IMAGE>:$EPP_TAG $CONTAINER_RUNTIME push <MY_REGISTRY>/<MY_IMAGE>:$EPP_TAG @@ -245,7 +320,7 @@ $CONTAINER_RUNTIME push <MY_REGISTRY>/<MY_IMAGE>:$EPP_TAG Then you can re-deploy the environment with the new changes (don't forget all the required env vars): -```console +```bash make environment.dev.kubernetes ``` diff --git a/Makefile b/Makefile index 641d6cf6..471e95a9 100644 --- a/Makefile +++ b/Makefile @@ -784,11 +784,8 @@ environment.dev.kubernetes: check-kubectl check-kustomize check-envsubst # ------------------------------------------------------------------------------ .PHONY: clean.environment.dev.kubernetes clean.environment.dev.kubernetes: check-kubectl check-kustomize check-envsubst -ifndef NAMESPACE - $(error "Error: NAMESPACE is required but not set") -endif - @echo "INFO: cleaning up dev environment in $(NAMESPACE)" - kustomize build deploy/environments/dev/kubernetes-kgateway | envsubst | kubectl -n "${NAMESPACE}" delete -f - + @CLEAN=true ./scripts/kubernetes-dev-env.sh 2>&1 + @echo "INFO: Finished cleanup of development environment for $(VLLM_MODE) mode in namespace $(NAMESPACE)" # ----------------------------------------------------------------------------- # TODO: these are old aliases that we still need for the moment, but will be diff --git a/deploy/components/inference-gateway/deployments.yaml b/deploy/components/inference-gateway/deployments.yaml index 0fc19d4d..afff8fd2 100644 --- a/deploy/components/inference-gateway/deployments.yaml +++ b/deploy/components/inference-gateway/deployments.yaml @@ -22,7 +22,7 @@ spec: imagePullPolicy: IfNotPresent args: - -poolName - - "vllm-llama3-8b-instruct" + - "${POOL_NAME}" - -v - "4" - --zap-encoder diff --git a/deploy/components/inference-gateway/httproutes.yaml b/deploy/components/inference-gateway/httproutes.yaml index 1115d13d..97eb2cf3 100644 --- a/deploy/components/inference-gateway/httproutes.yaml +++ b/deploy/components/inference-gateway/httproutes.yaml @@ -13,7 +13,7 @@ spec: backendRefs: - group: inference.networking.x-k8s.io kind: InferencePool - name: vllm-llama3-8b-instruct + name: ${POOL_NAME} port: 8000 timeouts: request: 30s diff --git a/deploy/components/inference-gateway/inference-models.yaml b/deploy/components/inference-gateway/inference-models.yaml index 12a51394..869be700 100644 --- a/deploy/components/inference-gateway/inference-models.yaml +++ b/deploy/components/inference-gateway/inference-models.yaml @@ -6,7 +6,17 @@ spec: modelName: food-review criticality: Critical poolRef: - name: vllm-llama3-8b-instruct + name: ${POOL_NAME} targetModels: - name: food-review weight: 100 +--- +apiVersion: inference.networking.x-k8s.io/v1alpha2 +kind: InferenceModel +metadata: + name: base-model +spec: + modelName: ${MODEL_NAME} + criticality: Critical + poolRef: + name: ${POOL_NAME} diff --git a/deploy/components/inference-gateway/inference-pools.yaml b/deploy/components/inference-gateway/inference-pools.yaml index ece6e500..3a981a14 100644 --- a/deploy/components/inference-gateway/inference-pools.yaml +++ b/deploy/components/inference-gateway/inference-pools.yaml @@ -1,10 +1,10 @@ apiVersion: inference.networking.x-k8s.io/v1alpha2 kind: InferencePool metadata: 
- name: vllm-llama3-8b-instruct + name: ${POOL_NAME} spec: targetPortNumber: 8000 selector: - app: vllm-llama3-8b-instruct + app: ${POOL_NAME} extensionRef: name: endpoint-picker diff --git a/deploy/components/vllm-p2p/kustomization.yaml b/deploy/components/vllm-p2p/kustomization.yaml new file mode 100644 index 00000000..1b4c0b28 --- /dev/null +++ b/deploy/components/vllm-p2p/kustomization.yaml @@ -0,0 +1,32 @@ +# ------------------------------------------------------------------------------ +# vLLM P2P Deployment +# +# This deploys the full vLLM model server, capable of serving real models such +# as Llama 3.1-8B-Instruct via the OpenAI-compatible API. It is intended for +# environments with GPU resources and where full inference capabilities are +# required. +# in additon it add LMcache a LLM serving engine extension using Redis to vLLM image +# +# The deployment can be customized using environment variables to set: +# - The container image and tag (VLLM_IMAGE, VLLM_TAG) +# - The model to load (MODEL_NAME) +# +# This setup is suitable for testing on Kubernetes (including +# GPU-enabled nodes or clusters with scheduling for `nvidia.com/gpu`). +# ----------------------------------------------------------------------------- +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +resources: + - vllm-deployment.yaml + - redis-deployment.yaml + - redis-service.yaml + - secret.yaml + +images: + - name: vllm/vllm-openai + newName: ${VLLM_IMAGE} + newTag: ${VLLM_TAG} + - name: redis + newName: ${REDIS_IMAGE} + newTag: ${REDIS_TAG} diff --git a/deploy/components/vllm-p2p/redis-deployment.yaml b/deploy/components/vllm-p2p/redis-deployment.yaml new file mode 100644 index 00000000..31b329e4 --- /dev/null +++ b/deploy/components/vllm-p2p/redis-deployment.yaml @@ -0,0 +1,50 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: ${REDIS_DEPLOYMENT_NAME} + labels: + app.kubernetes.io/name: redis + app.kubernetes.io/component: redis-lookup-server +spec: + replicas: 1 + selector: + matchLabels: + app.kubernetes.io/name: redis + app.kubernetes.io/component: redis-lookup-server + template: + metadata: + labels: + app.kubernetes.io/name: redis + app.kubernetes.io/component: redis-lookup-server + spec: + containers: + - name: lookup-server + image: ${REDIS_IMAGE}:${REDIS_TAG} + imagePullPolicy: IfNotPresent + command: + - redis-server + ports: + - name: redis-port + containerPort: ${REDIS_TARGET_PORT} + protocol: TCP + resources: + limits: + cpu: "4" + memory: 10G + requests: + cpu: "4" + memory: 8G + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + restartPolicy: Always + terminationGracePeriodSeconds: 30 + dnsPolicy: ClusterFirst + securityContext: {} + schedulerName: default-scheduler + strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 25% + maxSurge: 25% + revisionHistoryLimit: 10 + progressDeadlineSeconds: 600 diff --git a/deploy/components/vllm-p2p/redis-service.yaml b/deploy/components/vllm-p2p/redis-service.yaml new file mode 100644 index 00000000..a5d5fd00 --- /dev/null +++ b/deploy/components/vllm-p2p/redis-service.yaml @@ -0,0 +1,17 @@ +apiVersion: v1 +kind: Service +metadata: + name: ${REDIS_SVC_NAME} + labels: + app.kubernetes.io/name: redis + app.kubernetes.io/component: redis-lookup-server +spec: + ports: + - name: lookupserver-port + protocol: TCP + port: ${REDIS_PORT} + targetPort: ${REDIS_TARGET_PORT} + type: ${REDIS_SERVICE_TYPE} + selector: + app.kubernetes.io/name: redis + app.kubernetes.io/component: redis-lookup-server 
diff --git a/deploy/components/vllm-p2p/secret.yaml b/deploy/components/vllm-p2p/secret.yaml new file mode 100644 index 00000000..23fe9473 --- /dev/null +++ b/deploy/components/vllm-p2p/secret.yaml @@ -0,0 +1,10 @@ +apiVersion: v1 +kind: Secret +metadata: + name: ${HF_SECRET_NAME} + labels: + app.kubernetes.io/name: vllm + app.kubernetes.io/component: secret +type: Opaque +data: + ${HF_SECRET_KEY}: ${HF_TOKEN} diff --git a/deploy/components/vllm-p2p/vllm-deployment.yaml b/deploy/components/vllm-p2p/vllm-deployment.yaml new file mode 100644 index 00000000..19fd59c2 --- /dev/null +++ b/deploy/components/vllm-p2p/vllm-deployment.yaml @@ -0,0 +1,118 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: ${VLLM_DEPLOYMENT_NAME} + labels: + app.kubernetes.io/name: vllm + app.kubernetes.io/model: ${MODEL_LABEL} + app.kubernetes.io/component: vllm +spec: + replicas: ${VLLM_REPLICA_COUNT} + selector: + matchLabels: + app.kubernetes.io/name: vllm + app.kubernetes.io/component: vllm + app.kubernetes.io/model: ${MODEL_LABEL} + app: ${POOL_NAME} + template: + metadata: + labels: + app.kubernetes.io/name: vllm + app.kubernetes.io/component: vllm + app.kubernetes.io/model: ${MODEL_LABEL} + app: ${POOL_NAME} + spec: + containers: + - name: vllm + image: ${VLLM_IMAGE}:${VLLM_TAG} + imagePullPolicy: IfNotPresent + command: + - /bin/sh + - "-c" + args: + - | + export LMCACHE_DISTRIBUTED_URL=$${${POD_IP}}:80 && \ + vllm serve ${MODEL_NAME} \ + --host 0.0.0.0 \ + --port 8000 \ + --enable-chunked-prefill false \ + --max-model-len ${MAX_MODEL_LEN} \ + --kv-transfer-config '{"kv_connector":"LMCacheConnector","kv_role":"kv_both"}' + ports: + - name: http + containerPort: 8000 + protocol: TCP + - name: lmcache-dist # Assuming port 80 is used for LMCACHE_DISTRIBUTED_URL + containerPort: 80 + protocol: TCP + livenessProbe: + failureThreshold: 3 + httpGet: + path: /health + port: 8000 + scheme: HTTP + initialDelaySeconds: 15 + periodSeconds: 10 + successThreshold: 1 + timeoutSeconds: 1 + startupProbe: + failureThreshold: 60 + httpGet: + path: /health + port: 8000 + scheme: HTTP + initialDelaySeconds: 15 + periodSeconds: 10 + successThreshold: 1 + timeoutSeconds: 1 + env: + - name: HF_HOME + value: /data + - name: POD_IP + valueFrom: + fieldRef: + apiVersion: v1 + fieldPath: status.podIP + - name: HF_TOKEN + valueFrom: + secretKeyRef: + name: ${HF_SECRET_NAME} + key: ${HF_SECRET_KEY} + - name: LMCACHE_LOOKUP_URL + value: ${REDIS_HOST}:${REDIS_PORT} + - name: LMCACHE_ENABLE_DEBUG + value: "True" + - name: LMCACHE_ENABLE_P2P + value: "True" + - name: LMCACHE_LOCAL_CPU + value: "True" + - name: LMCACHE_MAX_LOCAL_CPU_SIZE + value: "20" + - name: LMCACHE_USE_EXPERIMENTAL + value: "True" + - name: VLLM_RPC_TIMEOUT + value: "1000000" + resources: + limits: + nvidia.com/gpu: "1" + requests: + cpu: "${VLLM_CPU_RESOURCES}" + memory: 40Gi + nvidia.com/gpu: "1" + terminationMessagePath: /dev/termination-log + terminationMessagePolicy: File + securityContext: + runAsNonRoot: false + restartPolicy: Always + terminationGracePeriodSeconds: 30 + dnsPolicy: ClusterFirst + securityContext: {} + schedulerName: default-scheduler + strategy: + type: RollingUpdate + rollingUpdate: + maxUnavailable: 0 + maxSurge: "100%" + revisionHistoryLimit: 10 + progressDeadlineSeconds: 1200 + diff --git a/deploy/components/vllm-sim/deployments.yaml b/deploy/components/vllm-sim/deployments.yaml index 4673a99c..34b742c2 100644 --- a/deploy/components/vllm-sim/deployments.yaml +++ b/deploy/components/vllm-sim/deployments.yaml @@ -3,16 +3,16 @@ kind: 
Deployment metadata: name: vllm-sim labels: - app: vllm-llama3-8b-instruct + app: ${POOL_NAME} spec: replicas: 1 selector: matchLabels: - app: vllm-llama3-8b-instruct + app: ${POOL_NAME} template: metadata: labels: - app: vllm-llama3-8b-instruct + app: ${POOL_NAME} ai-aware-router-pod: "true" spec: containers: diff --git a/deploy/components/vllm/configmap.yaml b/deploy/components/vllm/configmap.yaml new file mode 100644 index 00000000..03019ce1 --- /dev/null +++ b/deploy/components/vllm/configmap.yaml @@ -0,0 +1,14 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: lora-adapters +data: + configmap.yaml: | + vLLMLoRAConfig: + name: lora-adapters + port: 8000 + defaultBaseModel: ${MODEL_NAME} + ensureExist: + models: + - id: food-review-1 + source: Kawon/llama3.1-food-finetune_v14_r8 diff --git a/deploy/components/vllm/deployments.yaml b/deploy/components/vllm/deployments.yaml new file mode 100644 index 00000000..71eaa72c --- /dev/null +++ b/deploy/components/vllm/deployments.yaml @@ -0,0 +1,133 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: ${VLLM_DEPLOYMENT_NAME} +spec: + replicas: ${VLLM_REPLICA_COUNT} + selector: + matchLabels: + app: ${POOL_NAME} + template: + metadata: + labels: + app: ${POOL_NAME} + spec: + securityContext: + runAsUser: ${PROXY_UID} + runAsNonRoot: true + seccompProfile: + type: RuntimeDefault + containers: + - name: vllm + image: "vllm/vllm-openai:latest" + imagePullPolicy: IfNotPresent + command: ["python3", "-m", "vllm.entrypoints.openai.api_server"] + args: + - "--model" + - "${MODEL_NAME}" + - "--tensor-parallel-size" + - "1" + - "--port" + - "8000" + - "--max-num-seq" + - "1024" + - "--compilation-config" + - "3" + - "--enable-lora" + - "--max-loras" + - "2" + - "--max-lora-rank" + - "8" + - "--max-cpu-loras" + - "12" + env: + - name: VLLM_USE_V1 + value: "1" + - name: PORT + value: "8000" + - name: HUGGING_FACE_HUB_TOKEN + valueFrom: + secretKeyRef: + name: ${HF_SECRET_NAME} + key: ${HF_SECRET_KEY} + - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING + value: "true" + - name: XDG_CACHE_HOME + value: /cache + - name: HF_HOME + value: /cache/huggingface + - name: FLASHINFER_CACHE_DIR + value: /cache/flashinfer + ports: + - containerPort: 8000 + name: http + protocol: TCP + lifecycle: + preStop: + sleep: + seconds: 30 + livenessProbe: + httpGet: + path: /health + port: http + scheme: HTTP + periodSeconds: 1 + successThreshold: 1 + failureThreshold: 5 + timeoutSeconds: 1 + readinessProbe: + httpGet: + path: /health + port: http + scheme: HTTP + periodSeconds: 1 + successThreshold: 1 + failureThreshold: 1 + timeoutSeconds: 1 + startupProbe: + httpGet: + path: /health + port: http + scheme: HTTP + failureThreshold: 600 + initialDelaySeconds: 2 + periodSeconds: 1 + resources: + limits: + nvidia.com/gpu: 1 + requests: + nvidia.com/gpu: 1 + volumeMounts: + - mountPath: /cache + name: hf-cache + - mountPath: /dev/shm + name: shm + - mountPath: /adapters + name: adapters + initContainers: + - name: lora-adapter-syncer + tty: true + stdin: true + image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main + restartPolicy: Always + imagePullPolicy: Always + env: + - name: DYNAMIC_LORA_ROLLOUT_CONFIG + value: "/config/configmap.yaml" + volumeMounts: + - name: config-volume + mountPath: /config + restartPolicy: Always + enableServiceLinks: false + terminationGracePeriodSeconds: 130 + volumes: + - name: hf-cache + emptyDir: {} + - name: shm + emptyDir: + medium: Memory + - name: adapters + emptyDir: {} + - name: config-volume + 
configMap: + name: lora-adapters diff --git a/deploy/components/vllm/kustomization.yaml b/deploy/components/vllm/kustomization.yaml new file mode 100644 index 00000000..6e0da28b --- /dev/null +++ b/deploy/components/vllm/kustomization.yaml @@ -0,0 +1,36 @@ +# ------------------------------------------------------------------------------ +# vLLM Deployment +# +# This deploys the full vLLM model server, capable of serving real models such +# as Llama 3.1-8B-Instruct via the OpenAI-compatible API. It is intended for +# environments with GPU resources and where full inference capabilities are +# required. +# +# The deployment can be customized using environment variables to set: +# - The container image and tag (VLLM_IMAGE, VLLM_TAG) +# - The model to load (MODEL_NAME) +# +# This setup is suitable for testing on Kubernetes (including +# GPU-enabled nodes or clusters with scheduling for `nvidia.com/gpu`). +# ----------------------------------------------------------------------------- +kind: Kustomization + +resources: +- deployments.yaml +- secret.yaml +- configmap.yaml + + +images: +- name: vllm/vllm-openai + newName: ${VLLM_IMAGE} + newTag: ${VLLM_TAG} + +- name: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer + newName: ${LORA_ADAPTER_SYNCER_IMAGE} + newTag: ${LORA_ADAPTER_SYNCER_TAG} + +configMapGenerator: +- name: vllm-model-config + literals: + - MODEL_NAME=${MODEL_NAME} diff --git a/deploy/components/vllm/secret.yaml b/deploy/components/vllm/secret.yaml new file mode 100644 index 00000000..23fe9473 --- /dev/null +++ b/deploy/components/vllm/secret.yaml @@ -0,0 +1,10 @@ +apiVersion: v1 +kind: Secret +metadata: + name: ${HF_SECRET_NAME} + labels: + app.kubernetes.io/name: vllm + app.kubernetes.io/component: secret +type: Opaque +data: + ${HF_SECRET_KEY}: ${HF_TOKEN} diff --git a/deploy/environments/dev/kind-istio/patch-deployments.yaml b/deploy/environments/dev/kind-istio/patch-deployments.yaml index 874b287c..7ab6e3ad 100644 --- a/deploy/environments/dev/kind-istio/patch-deployments.yaml +++ b/deploy/environments/dev/kind-istio/patch-deployments.yaml @@ -9,7 +9,7 @@ spec: - name: epp args: - -poolName - - "vllm-llama3-8b-instruct" + - ${POOL_NAME} - -poolNamespace - "default" - -v diff --git a/deploy/environments/dev/kind-kgateway/patch-deployments.yaml b/deploy/environments/dev/kind-kgateway/patch-deployments.yaml index 874b287c..7ab6e3ad 100644 --- a/deploy/environments/dev/kind-kgateway/patch-deployments.yaml +++ b/deploy/environments/dev/kind-kgateway/patch-deployments.yaml @@ -9,7 +9,7 @@ spec: - name: epp args: - -poolName - - "vllm-llama3-8b-instruct" + - ${POOL_NAME} - -poolNamespace - "default" - -v diff --git a/deploy/environments/dev/kubernetes-istio/patch-deployments.yaml b/deploy/environments/dev/kubernetes-istio/patch-deployments.yaml index 20a17d53..a5a721b8 100644 --- a/deploy/environments/dev/kubernetes-istio/patch-deployments.yaml +++ b/deploy/environments/dev/kubernetes-istio/patch-deployments.yaml @@ -11,7 +11,7 @@ spec: - name: epp args: - -poolName - - "vllm-llama3-8b-instruct" + - ${POOL_NAME} - -poolNamespace - ${NAMESPACE} - -v diff --git a/deploy/environments/dev/kubernetes-kgateway/gateway-parameters.yaml b/deploy/environments/dev/kubernetes-kgateway/gateway-parameters.yaml index 3461a596..da2d91d2 100644 --- a/deploy/environments/dev/kubernetes-kgateway/gateway-parameters.yaml +++ b/deploy/environments/dev/kubernetes-kgateway/gateway-parameters.yaml @@ -3,7 +3,7 @@ kind: GatewayParameters metadata: name: 
custom-gw-params spec: - kube: + kube: envoyContainer: securityContext: allowPrivilegeEscalation: false @@ -11,12 +11,12 @@ spec: runAsNonRoot: true runAsUser: "${PROXY_UID}" service: - type: NodePort + type: ${GATEWAY_SERVICE_TYPE} extraLabels: gateway: custom podTemplate: extraLabels: gateway: custom - securityContext: + securityContext: seccompProfile: type: RuntimeDefault diff --git a/deploy/environments/dev/kubernetes-kgateway/kustomization.yaml b/deploy/environments/dev/kubernetes-kgateway/kustomization.yaml index 0b7e1ed8..293119e2 100644 --- a/deploy/environments/dev/kubernetes-kgateway/kustomization.yaml +++ b/deploy/environments/dev/kubernetes-kgateway/kustomization.yaml @@ -4,14 +4,11 @@ kind: Kustomization namespace: ${NAMESPACE} resources: -- ../../../components/vllm-sim/ +- secret.yaml - ../../../components/inference-gateway/ - gateway-parameters.yaml images: -- name: quay.io/vllm-d/vllm-sim - newName: ${VLLM_SIM_IMAGE} - newTag: ${VLLM_SIM_TAG} - name: quay.io/vllm-d/gateway-api-inference-extension/epp newName: ${EPP_IMAGE} newTag: ${EPP_TAG} diff --git a/deploy/environments/dev/kubernetes-kgateway/patch-deployments.yaml b/deploy/environments/dev/kubernetes-kgateway/patch-deployments.yaml index 20a17d53..00c87fbb 100644 --- a/deploy/environments/dev/kubernetes-kgateway/patch-deployments.yaml +++ b/deploy/environments/dev/kubernetes-kgateway/patch-deployments.yaml @@ -11,7 +11,7 @@ spec: - name: epp args: - -poolName - - "vllm-llama3-8b-instruct" + - ${POOL_NAME} - -poolNamespace - ${NAMESPACE} - -v @@ -22,13 +22,11 @@ spec: - "9002" - -grpcHealthPort - "9003" ---- -apiVersion: apps/v1 -kind: Deployment -metadata: - name: vllm-sim -spec: - template: - spec: - imagePullSecrets: - - name: ${REGISTRY_SECRET} + env: + - name: KVCACHE_INDEXER_REDIS_ADDR + value: ${REDIS_HOST}:${REDIS_PORT} + - name: HF_TOKEN + valueFrom: + secretKeyRef: + name: hf-token + key: ${HF_SECRET_KEY} \ No newline at end of file diff --git a/deploy/environments/dev/kubernetes-kgateway/secret.yaml b/deploy/environments/dev/kubernetes-kgateway/secret.yaml new file mode 100644 index 00000000..23fe9473 --- /dev/null +++ b/deploy/environments/dev/kubernetes-kgateway/secret.yaml @@ -0,0 +1,10 @@ +apiVersion: v1 +kind: Secret +metadata: + name: ${HF_SECRET_NAME} + labels: + app.kubernetes.io/name: vllm + app.kubernetes.io/component: secret +type: Opaque +data: + ${HF_SECRET_KEY}: ${HF_TOKEN} diff --git a/deploy/environments/dev/kubernetes-vllm/vllm-p2p/kustomization.yaml b/deploy/environments/dev/kubernetes-vllm/vllm-p2p/kustomization.yaml new file mode 100644 index 00000000..2d378312 --- /dev/null +++ b/deploy/environments/dev/kubernetes-vllm/vllm-p2p/kustomization.yaml @@ -0,0 +1,13 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +resources: +- ../../../../components/vllm-p2p/ + +images: +- name: quay.io/vllm-d/vllm-d-dev:0.0.2 + newName: ${VLLM_IMAGE} + newTag: ${VLLM_TAG} + +patches: + - path: patch-deployments.yaml diff --git a/deploy/environments/dev/kubernetes-vllm/vllm-p2p/patch-deployments.yaml b/deploy/environments/dev/kubernetes-vllm/vllm-p2p/patch-deployments.yaml new file mode 100644 index 00000000..b1afb13e --- /dev/null +++ b/deploy/environments/dev/kubernetes-vllm/vllm-p2p/patch-deployments.yaml @@ -0,0 +1,9 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: ${VLLM_DEPLOYMENT_NAME} +spec: + template: + spec: + imagePullSecrets: + - name: ${REGISTRY_SECRET} diff --git a/deploy/environments/dev/kubernetes-vllm/vllm-sim/kustomization.yaml 
b/deploy/environments/dev/kubernetes-vllm/vllm-sim/kustomization.yaml new file mode 100644 index 00000000..a45ae271 --- /dev/null +++ b/deploy/environments/dev/kubernetes-vllm/vllm-sim/kustomization.yaml @@ -0,0 +1,13 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +resources: +- ../../../../components/vllm-sim/ + +images: +- name: quay.io/vllm-d/vllm-sim + newTag: ${VLLM_SIM_TAG} + +patches: + - path: patch-deployments.yaml + diff --git a/deploy/environments/dev/kubernetes-vllm/vllm-sim/patch-deployments.yaml b/deploy/environments/dev/kubernetes-vllm/vllm-sim/patch-deployments.yaml new file mode 100644 index 00000000..dbb99b17 --- /dev/null +++ b/deploy/environments/dev/kubernetes-vllm/vllm-sim/patch-deployments.yaml @@ -0,0 +1,10 @@ + +apiVersion: apps/v1 +kind: Deployment +metadata: + name: vllm-sim +spec: + template: + spec: + imagePullSecrets: + - name: ${REGISTRY_SECRET} diff --git a/deploy/environments/dev/kubernetes-vllm/vllm/kustomization.yaml b/deploy/environments/dev/kubernetes-vllm/vllm/kustomization.yaml new file mode 100644 index 00000000..e512ee89 --- /dev/null +++ b/deploy/environments/dev/kubernetes-vllm/vllm/kustomization.yaml @@ -0,0 +1,17 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +resources: +- ../../../../components/vllm/ + +images: +- name: quay.io/vllm-d/vllm-d-dev + newName: ${VLLM_IMAGE} + newTag: ${VLLM_TAG} + +- name: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer + newName: ${LORA_ADAPTER_SYNCER_IMAGE} + newTag: ${LORA_ADAPTER_SYNCER_TAG} + +patches: + - path: patch-deployments.yaml diff --git a/deploy/environments/dev/kubernetes-vllm/vllm/patch-deployments.yaml b/deploy/environments/dev/kubernetes-vllm/vllm/patch-deployments.yaml new file mode 100644 index 00000000..b1afb13e --- /dev/null +++ b/deploy/environments/dev/kubernetes-vllm/vllm/patch-deployments.yaml @@ -0,0 +1,9 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: ${VLLM_DEPLOYMENT_NAME} +spec: + template: + spec: + imagePullSecrets: + - name: ${REGISTRY_SECRET} diff --git a/scripts/kind-dev-env.sh b/scripts/kind-dev-env.sh index e40847e0..85cd988e 100755 --- a/scripts/kind-dev-env.sh +++ b/scripts/kind-dev-env.sh @@ -25,6 +25,11 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" # Set the host port to map to the Gateway's inbound port (30080) : "${GATEWAY_HOST_PORT:=30080}" +# Set the inference pool name for the deployment +export POOL_NAME="${POOL_NAME:-vllm-llama3-8b-instruct}" + +# Set the model name to deploy +export MODEL_NAME="${MODEL_NAME:-meta-llama/Llama-3.1-8B-Instruct}" # ------------------------------------------------------------------------------ # Setup & Requirement Checks # ------------------------------------------------------------------------------ @@ -113,7 +118,7 @@ kustomize build --enable-helm deploy/components/crds-kgateway | # Deploy the environment to the "default" namespace kustomize build --enable-helm deploy/environments/dev/kind-kgateway \ - | sed "s/REPLACE_NAMESPACE/${PROJECT_NAMESPACE}/gI" \ + | envsubst | sed "s/REPLACE_NAMESPACE/${PROJECT_NAMESPACE}/gI" \ | kubectl --context ${KUBE_CONTEXT} apply -f - # Wait for all control-plane pods to be ready diff --git a/scripts/kubernetes-dev-env.sh b/scripts/kubernetes-dev-env.sh index 28b84409..62027c69 100755 --- a/scripts/kubernetes-dev-env.sh +++ b/scripts/kubernetes-dev-env.sh @@ -12,18 +12,81 @@ set -eux # ------------------------------------------------------------------------------ SCRIPT_DIR="$(cd 
"$(dirname "${BASH_SOURCE[0]}")" && pwd)" - -# Set a default VLLM_SIM_IMAGE if not provided -: "${VLLM_SIM_IMAGE:=quay.io/vllm-d/vllm-sim}" - -# Set a default VLLM_SIM_TAG if not provided -: "${VLLM_SIM_TAG:=0.0.2}" - -# Set a default EPP_IMAGE if not provided -: "${EPP_IMAGE:=us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp}" - -# Set a default EPP_TAG if not provided -: "${EPP_TAG:=main}" +export CLEAN="${CLEAN:-false}" + +# Validate required inputs +if [[ -z "${NAMESPACE:-}" ]]; then + echo "ERROR: NAMESPACE environment variable is not set." + exit 1 +fi + + +# GIE Configuration +export POOL_NAME="${POOL_NAME:-vllm-llama3-8b-instruct}" +export MODEL_NAME="${MODEL_NAME:-meta-llama/Llama-3.1-8B-Instruct}" +export GATEWAY_SERVICE_TYPE="${GATEWAY_SERVICE_TYPE:-NodePort}" + +## EPP ENV VARs — currently added to all EPPs, regardless of the VLLM mode or whether they are actually needed +export REDIS_DEPLOYMENT_NAME="${REDIS_DEPLOYMENT_NAME:-lookup-server-service}" +export REDIS_SVC_NAME="${REDIS_SVC_NAME:-${REDIS_DEPLOYMENT_NAME}}" +export REDIS_HOST="${REDIS_HOST:-${REDIS_SVC_NAME}.${NAMESPACE}.svc.cluster.local}" +export REDIS_PORT="${REDIS_PORT:-8100}" +export HF_TOKEN=$(echo -n "${HF_TOKEN}" | base64 | tr -d '\n') +export HF_SECRET_NAME="${HF_SECRET_NAME:-hf-token}" +export HF_SECRET_KEY="${HF_SECRET_KEY:-token}" +# vLLM Specific Configuration node +export VLLM_MODE="${VLLM_MODE:-vllm-sim}" + +case "${VLLM_MODE}" in + vllm-sim) + export VLLM_SIM_IMAGE="${VLLM_SIM_IMAGE:-quay.io/vllm-d/vllm-sim}" + export VLLM_SIM_TAG="${VLLM_SIM_TAG:-0.0.2}" + export EPP_IMAGE="${EPP_IMAGE:-us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp}" + export EPP_TAG="${EPP_TAG:-main}" + export HF_TOKEN=$(echo -n "dummy-token" | base64 | tr -d '\n') + ;; + vllm | vllm-p2p) + # Shared across both full model modes - // TODO - make more env variables similar + # TODO: Consider unifying more environment variables for consistency and reuse + + export VOLUME_MOUNT_PATH="${VOLUME_MOUNT_PATH:-/data}" + export VLLM_REPLICA_COUNT="${VLLM_REPLICA_COUNT:-3}" + export MODEL_LABEL="${MODEL_LABEL:-llama3-8b}" + export VLLM_DEPLOYMENT_NAME="${VLLM_DEPLOYMENT_NAME:-vllm-${MODEL_LABEL}}" + + if [[ "$VLLM_MODE" == "vllm" ]]; then + export VLLM_IMAGE="${VLLM_IMAGE:-quay.io/vllm-d/vllm-d-dev}" + export VLLM_TAG="${VLLM_TAG:-0.0.2}" + export EPP_IMAGE="${EPP_IMAGE:-quay.io/vllm-d/gateway-api-inference-extension-dev}" + export EPP_TAG="${EPP_TAG:-0.0.4}" + export MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}" + export PVC_NAME="${PVC_NAME:-vllm-storage-claim}" + export LORA_ADAPTER_SYNCER_IMAGE="${LORA_ADAPTER_SYNCER_IMAGE:-us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer}" + export LORA_ADAPTER_SYNCER_TAG="${LORA_ADAPTER_SYNCER_TAG:-v20250425-ddc3d69}" + + elif [[ "$VLLM_MODE" == "vllm-p2p" ]]; then + export VLLM_IMAGE="${VLLM_IMAGE:-lmcache/vllm-openai}" + export VLLM_TAG="${VLLM_TAG:-2025-03-10}" + export EPP_IMAGE="${EPP_IMAGE:- quay.io/vmaroon/gateway-api-inference-extension/epp}" + export EPP_TAG="${EPP_TAG:-kv-aware}" + export MAX_MODEL_LEN="${MAX_MODEL_LEN:-32768}" + export PVC_NAME="${PVC_NAME:-vllm-p2p-storage-claim}" + export PVC_ACCESS_MODE="${PVC_ACCESS_MODE:-ReadWriteOnce}" + export PVC_SIZE="${PVC_SIZE:-10Gi}" + export PVC_STORAGE_CLASS="${PVC_STORAGE_CLASS:-standard}" + export REDIS_IMAGE="${REDIS_IMAGE:-redis}" + export REDIS_TAG="${REDIS_TAG:-7.2.3}" + export VLLM_CPU_RESOURCES="${VLLM_CPU_RESOURCES:-10}" + export 
POD_IP="POD_IP" + export REDIS_TARGET_PORT="${REDIS_TARGET_PORT:-6379}" + export REDIS_SERVICE_TYPE="${REDIS_SERVICE_TYPE:-ClusterIP}" + fi + ;; + *) + echo "ERROR: Unsupported VLLM_MODE: ${VLLM_MODE}. Must be one of: vllm-sim, vllm, vllm-p2p" + exit 1 + ;; +esac # ------------------------------------------------------------------------------ # Deployment @@ -32,18 +95,39 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" kubectl create namespace ${NAMESPACE} 2>/dev/null || true # Hack to deal with KGateways broken OpenShift support -export PROXY_UID=$(kubectl get namespace ${NAMESPACE} -o json | jq -e -r '.metadata.annotations["openshift.io/sa.scc.uid-range"]' | perl -F'/' -lane 'print $F[0]+1'); +export PROXY_UID=$(kubectl get namespace ${NAMESPACE} -o json | jq -e -r '.metadata.annotations["openshift.io/sa.scc.uid-range"]' | perl -F'/' -lane 'print $F[0]+1'); set -o pipefail -echo "INFO: Deploying Development Environment in namespace ${NAMESPACE}" - -kustomize build deploy/environments/dev/kubernetes-kgateway | envsubst | kubectl -n ${NAMESPACE} apply -f - - -echo "INFO: Waiting for resources in namespace ${NAMESPACE} to become ready" +if [[ "$CLEAN" == "true" ]]; then + echo "INFO: ${CLEAN^^}ING environment in namespace ${NAMESPACE} for mode ${VLLM_MODE}" + kustomize build deploy/environments/dev/kubernetes-kgateway | envsubst | kubectl -n "${NAMESPACE}" delete --ignore-not-found=true -f - + kustomize build deploy/environments/dev/kubernetes-vllm/${VLLM_MODE} | envsubst | kubectl -n "${NAMESPACE}" delete --ignore-not-found=true -f - +else + echo "INFO: Deploying vLLM Environment in namespace ${NAMESPACE}" + oc adm policy add-scc-to-user anyuid -z default -n ${NAMESPACE} + kustomize build deploy/environments/dev/kubernetes-vllm/${VLLM_MODE} | envsubst | kubectl -n "${NAMESPACE}" apply -f - + + echo "INFO: Deploying Gateway Environment in namespace ${NAMESPACE}" + kustomize build deploy/environments/dev/kubernetes-kgateway | envsubst | kubectl -n "${NAMESPACE}" apply -f - + + echo "INFO: Waiting for resources in namespace ${NAMESPACE} to become ready" + kubectl -n "${NAMESPACE}" wait deployment/endpoint-picker --for=condition=Available --timeout=60s + kubectl -n "${NAMESPACE}" wait gateway/inference-gateway --for=condition=Programmed --timeout=60s + kubectl -n "${NAMESPACE}" wait deployment/inference-gateway --for=condition=Available --timeout=60s + # Mode-specific wait + case "${VLLM_MODE}" in + vllm-sim) + kubectl -n "${NAMESPACE}" wait deployment/vllm-sim --for=condition=Available --timeout=60s + ;; + vllm) + kubectl -n "${NAMESPACE}" wait deployment/${VLLM_DEPLOYMENT_NAME} --for=condition=Available --timeout=500s + ;; + vllm-p2p) + kubectl -n "${NAMESPACE}" wait deployment/${VLLM_DEPLOYMENT_NAME} --for=condition=Available --timeout=180s + kubectl -n "${NAMESPACE}" wait deployment/${REDIS_SVC_NAME} --for=condition=Available --timeout=60s + ;; + esac +fi -kubectl -n ${NAMESPACE} wait deployment/endpoint-picker --for=condition=Available --timeout=60s -kubectl -n ${NAMESPACE} wait deployment/vllm-sim --for=condition=Available --timeout=60s -kubectl -n ${NAMESPACE} wait gateway/inference-gateway --for=condition=Programmed --timeout=60s -kubectl -n ${NAMESPACE} wait deployment/inference-gateway --for=condition=Available --timeout=60s