
feat: Add scripts for kubernetes dev env using vLLM and vLLM-p2p #60


Open
wants to merge 3 commits into base: dev

Conversation

kfirtoledo

Add support for Kubernetes environment development using GIE with KGateway and vLLM
This PR introduces support for the vllm mode, enabling integration testing of GIE with vLLM.
It also adds support for the vllm-p2p mode, which includes:

  1. Deployment of Redis and LMCache alongside the vLLM image
  2. Peer-to-peer (P2P) communication between vLLM instances
  3. Use of the EPP image to enable kv-cache-aware routing
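For orientation, here is a rough, illustrative-only sketch of what the vllm-p2p mode pulls together; the file names are assumptions for illustration, not the actual layout in this PR.

```yaml
# Illustrative kustomization for the vllm-p2p mode (file names are hypothetical):
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - redis.yaml      # Redis instance backing LMCache lookups
  - vllm-p2p.yaml   # vLLM image with LMCache enabled for P2P KV-cache sharing
  - epp.yaml        # EPP image providing kv-cache-aware routing
```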

@kfirtoledo kfirtoledo added the help wanted and WIP labels on Apr 25, 2025
@kfirtoledo kfirtoledo changed the title feat: add scripts for kubernetes dev env using vLLM and vLLM-p2p feat: Add scripts for kubernetes dev env using vLLM and vLLM-p2p Apr 25, 2025
Collaborator

@shaneutt shaneutt left a comment

Looking great. Most of my comments are smaller, but I do have some questions for other folks as to what effect this will have.

Also, cc @elevran @shmuelk who I think should take a look.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: mistarli
```
Collaborator

Suggested change
name: mistarli
name: mistral

Author

Done

Author

Actually, I removed it and now just use the base model.

Comment on lines 51 to 55
```yaml
# securityContext:
#   allowPrivilegeEscalation: false
#   capabilities:
#     drop:
#       - ALL
```
Collaborator

If it's helpful for future readers to see this commented out bit, we should add a comment explaining why. Otherwise we should probably just remove it.

Author

Done

Comment on lines 25 to 29
```yaml
# securityContext:
#   runAsUser: ${PROXY_UID}
#   runAsNonRoot: true
#   seccompProfile:
#     type: RuntimeDefault
```
Collaborator

Similar to above: if this is an important breadcrumb for future readers, let's add a comment explaining why; otherwise, if it's just leftovers, let's remove it.

Author

Done

```yaml
- name: lora-adapter-syncer
  tty: true
  stdin: true
  image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main
```
Collaborator

I noticed this is a hardcoded image on the main tag, and not something we patch from the kustomization.yaml. Do we want this to be customizable? Or at the very least, do we want to pin to a specific SHA so that updates are more deliberate?
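As a sketch of the suggestion, the image could be patched from kustomization.yaml the same way the other images are; LORA_SYNCER_IMAGE and LORA_SYNCER_TAG are hypothetical variable names, and the tag could just as well be a pinned digest.

```yaml
# Hypothetical kustomization.yaml override (not from this PR): make the
# lora-syncer image customizable via environment variables, mirroring the
# existing image overrides, so updates are deliberate rather than tracking `main`.
images:
  - name: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer
    newName: ${LORA_SYNCER_IMAGE}
    newTag: ${LORA_SYNCER_TAG}
```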

Author

Done

Comment on lines +1 to +15
```yaml
# ------------------------------------------------------------------------------
# vLLM Deployment
#
# This deploys the full vLLM model server, capable of serving real models such
# as Llama 3.1-8B-Instruct via the OpenAI-compatible API. It is intended for
# environments with GPU resources and where full inference capabilities are
# required.
#
# The deployment can be customized using environment variables to set:
# - The container image and tag (VLLM_IMAGE, VLLM_TAG)
# - The model to load (MODEL_NAME)
#
# This setup is suitable for testing and production with Kubernetes (including
# GPU-enabled nodes or clusters with scheduling for `nvidia.com/gpu`).
# -----------------------------------------------------------------------------
```
Collaborator

👍

@@ -0,0 +1,29 @@
apiVersion: kustomize.config.k8s.io/v1beta1
Collaborator

I noticed we added some documentation on the kustomization.yaml for the vllm component, but not for this one. Perhaps we should add a little comment here explaining how this one differs from the standard one.
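One possible shape for such a header, mirroring the vllm component's block; the wording below is only a suggestion drawn from the PR description.

```yaml
# ------------------------------------------------------------------------------
# vLLM-P2P Deployment (differs from the standard vllm component)
#
# In addition to the vLLM model server, this variant deploys Redis and enables
# LMCache so that vLLM instances can share KV-cache peer-to-peer, and it is
# paired with the EPP image for kv-cache-aware routing.
# ------------------------------------------------------------------------------
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
```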

Author

Done

```diff
-  type: NodePort
+  type: LoadBalancer
```
Collaborator

The reason NodePort was chosen originally is that it's more universal, and in theory this deployment is meant to work on Kubernetes clusters generally. Perhaps we can make this a variable, defaulting to NodePort and allowing folks to opt in to LoadBalancer? LMKWYT? 🤔
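A minimal sketch of that opt-in, assuming the manifests keep being rendered with envsubst; GATEWAY_SERVICE_TYPE is a hypothetical variable the deploy script would default to NodePort before rendering (envsubst itself has no defaulting).

```yaml
# Hypothetical Service snippet: the type comes from an environment variable so
# users can opt in to LoadBalancer where the cluster supports it.
apiVersion: v1
kind: Service
metadata:
  name: inference-gateway   # name assumed for illustration
spec:
  type: ${GATEWAY_SERVICE_TYPE}   # deploy script would export NodePort as the default
```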

Author

We can. I agree that on Kind we should use NodePort, but this environment is not for Kind; for a development environment, I think we should use a LoadBalancer. This is also how GIE does it in their example.

Collaborator

Is there an LB in the test clusters?

Comment on lines -7 to -14
```yaml
  - ../../../components/vllm-sim/
  - ../../../components/inference-gateway/
  - gateway-parameters.yaml

images:
  - name: quay.io/vllm-d/vllm-sim
    newName: ${VLLM_SIM_IMAGE}
    newTag: ${VLLM_SIM_TAG}
```
Collaborator

So the idea with the deployments under deploy/environments is that they use the deploy/components to provide some level of a working environment. With this change, this environment won't really have anything that works OOTB, so maybe what we should do is leave the simulator in this one and rename it deploy/environments/dev/kubernetes-kgateway-vllm-sim?

Author

The main reason I separated vLLM from the gateway is that I believe it will be similar for Istio. I didn't want to duplicate the files, because the vLLM YAMLs shouldn't depend on which gateway we use.
Another option: we could add a third level of abstraction, components -> vllm -> gateway.
It would be cleaner, but possibly also more complicated to understand.
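For illustration, the split being described would let an environment overlay compose one vLLM variant with one gateway component; the paths below follow the ones already quoted in this review.

```yaml
# Sketch of an environment overlay composing independent components: the vLLM
# component can be swapped (vllm-sim / vllm / vllm-p2p) without touching the
# gateway side, and an Istio-based gateway component could slot in instead.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../../components/vllm-sim/
  - ../../../components/inference-gateway/
```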

@@ -0,0 +1,11 @@
apiVersion: kustomize.config.k8s.io/v1beta1
Collaborator

Oh ok, I see what you're doing with the naming now. The difference now is that any one of these deployments is deploying only a working VLLM stack, and then you have to deploy your inference-gateway stack separately.

cc @tumido @Gregory-Pereira @vMaroon just wanting to check with you on how this will work with your Helm chart?

echo "ERROR: NAMESPACE environment variable is not set."
exit 1
fi
if [[ -z "${VLLM_MODE:-}" ]]; then
Collaborator

Consider whether we should just default this to vllm-sim given that we know it's functional on practically any cluster.

Author

We can, but I want people to set it explicitly so they are sure which environment they are testing and running.

Collaborator

+1 for using the vLLM simulator. If you want to steer people towards using real vLLM, you can add a print message when it is set to the simulator (e.g., ** INFO: running simulated vLLM instances **).

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${REDIS_SVC_NAME}
```
Member

I think this should be renamed to REDIS_DEPLOYMENT_NAME or something like that, with the Service CR getting a -service suffix. Otherwise the Deployment's name will be weird.
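A small sketch of the suggested naming, using the variable name proposed above:

```yaml
# Separate the Deployment name from the Service name so the rendered objects
# read naturally.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${REDIS_DEPLOYMENT_NAME}
---
apiVersion: v1
kind: Service
metadata:
  name: ${REDIS_DEPLOYMENT_NAME}-service
```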

Author

Done

- "-c"
args:
- |
export LMCACHE_DISTRIBUTED_URL=$${${POD_IP}}:80 && \


Is this syntax correct? Shouldn't it be:

Suggested change
export LMCACHE_DISTRIBUTED_URL=$${${POD_IP}}:80 && \
export LMCACHE_DISTRIBUTED_URL=${POD_IP}:80 && \

Author

In theory you are right; the end goal is that it will be ${POD_IP}.
But the issue is that I use envsubst, so I need to write it as $${${POD_IP}} and export POD_IP="POD_IP",
so that after I run envsubst I end up with ${POD_IP} (it took me a lot of time to figure this out).
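For context, the runtime expansion described above typically relies on the Pod's IP being injected into the container environment via the downward API, along these lines (a sketch, not quoted from this PR):

```yaml
# Downward API injection assumed by the ${POD_IP} expansion at container runtime:
env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
```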

@kfirtoledo kfirtoledo force-pushed the dev branch 2 times, most recently from 390c50a to b189362 on April 26, 2025 at 23:52
@kfirtoledo
Author

@shaneutt, PTAL.

@kfirtoledo kfirtoledo removed the help wanted and WIP labels on Apr 27, 2025

The same thing will need to be done for the EPP:
- `vllm-sim`: Lightweight simulator for simple environments
- `vllm`: Full vLLM model server for real inference
Collaborator

Suggested change
- `vllm`: Full vLLM model server for real inference
- `vllm`: Full vLLM model server, using GPU/CPU for inferencing

```

- Set hugging face token variable:
Collaborator

Suggested change
- Set hugging face token variable:
- Set Hugging Face token variable:

@@ -152,6 +152,13 @@ Create the namespace:
```console
kubectl create namespace ${NAMESPACE}
```
Set the default namespace for kubectl commands

```console
Collaborator

nit: Please be consistent with using bash or console for command line/shell snippets


```console
export VLLM_SIM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
Collaborator

Are the images being set elsewhere, to match simulator/vLLM/vLLM-p2p?

@@ -202,16 +208,72 @@ make environment.dev.kubernetes
This will deploy the entire stack to whatever namespace you chose. You can test
by exposing the inference `Gateway` via port-forward:

```console
kubectl -n ${NAMESPACE} port-forward service/inference-gateway-istio 8080:80
```bash
Collaborator

nit: console vs bash?

```yaml
# - The container image and tag (VLLM_IMAGE, VLLM_TAG)
# - The model to load (MODEL_NAME)
#
# This setup is suitable for testing and production with Kubernetes (including
```
Collaborator

Do you want to stand behind the "production" qualification, as opposed to just testing on real hardware?
I imagine there are ways to configure vLLM optimally for specific hardware that go beyond what's done in this PR.

```yaml
configMapGenerator:
  - name: vllm-model-config
    literals:
      - MODEL_NAME=${MODEL_NAME}
```
Collaborator

nit: missing newline at end of file


```yaml
  - name: quay.io/vllm-d/gateway-api-inference-extension/epp
    newName: ${EPP_IMAGE}
    newTag: ${EPP_TAG}

patches:
  - path: patch-deployments.yaml
  - path: patch-gateways.yaml
  - path: patch-gateways.yaml
```
Collaborator

nit: missing newline at end of file

echo "ERROR: NAMESPACE environment variable is not set."
exit 1
fi
if [[ -z "${VLLM_MODE:-}" ]]; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for using vLLM simulator. If you think you want to steer people towards using real vLLM, you can add a print message when set to simulator (e.g., ** INFO: running simulated vLLM instances **
