feat: Add scripts for kubernetes dev env using vLLM and vLLM-p2p #60
base: dev
Conversation
…tup for kvcache-aware) Signed-off-by: Kfir Toledo <[email protected]>
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: mistarli
Suggested change: `name: mistarli` → `name: mistral`
Done
Actually, I removed it and just use the base model.
# securityContext:
#   allowPrivilegeEscalation: false
#   capabilities:
#     drop:
#     - ALL
If it's helpful for future readers to see this commented-out bit, we should add a comment explaining why. Otherwise we should probably just remove it.
Done
# securityContext:
#   runAsUser: ${PROXY_UID}
#   runAsNonRoot: true
#   seccompProfile:
#     type: RuntimeDefault
Similar to above: if this is an important breadcrumb for future readers, let's add a comment explaining why; otherwise, if it's just leftovers, let's remove it.
Done
- name: lora-adapter-syncer
  tty: true
  stdin: true
  image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main
I noticed this is a hardcoded image on the `main` tag, and not something we patch from the `kustomization.yaml`. Do we want this to be customizable? Or at the very least, do we want to pin to a specific SHA so that updates are more deliberate?
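For illustration, the pin could live in the component's kustomization rather than in the Deployment; a minimal sketch using the kustomize CLI, with a hypothetical path and variable names:

```bash
# Sketch only: override the hardcoded lora-syncer image from the kustomization.
# LORA_SYNCER_IMAGE / LORA_SYNCER_TAG are hypothetical; a digest pin could be used
# instead of a tag if the kustomize version in use supports it.
cd deploy/components/vllm   # assumed location of the Deployment's kustomization
kustomize edit set image \
  us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer=${LORA_SYNCER_IMAGE}:${LORA_SYNCER_TAG}
```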
Done
# ------------------------------------------------------------------------------
# vLLM Deployment
#
# This deploys the full vLLM model server, capable of serving real models such
# as Llama 3.1-8B-Instruct via the OpenAI-compatible API. It is intended for
# environments with GPU resources and where full inference capabilities are
# required.
#
# The deployment can be customized using environment variables to set:
#   - The container image and tag (VLLM_IMAGE, VLLM_TAG)
#   - The model to load (MODEL_NAME)
#
# This setup is suitable for testing and production with Kubernetes (including
# GPU-enabled nodes or clusters with scheduling for `nvidia.com/gpu`).
# -----------------------------------------------------------------------------
👍
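As a quick usage note for the header above, customizing the deployment would look roughly like this (all values are placeholders, not taken from the PR):

```bash
# Placeholder values; point these at an image and model you actually have access to.
export VLLM_IMAGE="<YOUR_REGISTRY>/<YOUR_VLLM_IMAGE>"
export VLLM_TAG="<YOUR_TAG>"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"   # example model named in the header
```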
@@ -0,0 +1,29 @@
apiVersion: kustomize.config.k8s.io/v1beta1
I noticed we added some documentation on the `kustomization.yaml` for the `vllm` component, but not for this one. Perhaps we should add a little comment here explaining how this one differs from the standard one.
Done
- type: NodePort
+ type: LoadBalancer
The reason NodePort was chosen originally is because it's more universal, and in theory this deployment is meant to work generally on Kubernetes clusters. Perhaps we can make this variable, defaulting to `NodePort` and allowing folks to opt in to `LoadBalancer`? LMKWYT? 🤔
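A rough sketch of that opt-in idea, assuming the manifests keep being rendered with envsubst; the variable name `GATEWAY_SERVICE_TYPE` and the file name are hypothetical:

```bash
# Default to NodePort; users opt in with `export GATEWAY_SERVICE_TYPE=LoadBalancer`.
# The Service manifest would then carry `type: ${GATEWAY_SERVICE_TYPE}`.
export GATEWAY_SERVICE_TYPE="${GATEWAY_SERVICE_TYPE:-NodePort}"
envsubst < gateway-service.yaml.in | kubectl apply -n "${NAMESPACE}" -f -
```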
We can. I agree that on Kind we should use NodePort, but this is not for Kind; in a development environment, I think we should use a LoadBalancer. This is also how GIE does it in their example.
Is there an LB in the test clusters?
- ../../../components/vllm-sim/
- ../../../components/inference-gateway/
- gateway-parameters.yaml

images:
- name: quay.io/vllm-d/vllm-sim
  newName: ${VLLM_SIM_IMAGE}
  newTag: ${VLLM_SIM_TAG}
So the idea with the deployments under `deploy/environments` is that these use the `deploy/components` to provide some level of a working environment. With this change, this environment won't really have anything that works OOTB, so maybe what we should do is leave the simulator in this one and rename it `deploy/environments/dev/kubernetes-kgateway-vllm-sim`?
The main reason I separated vLLM from the gateway is that I believe it will be similar in Istio. I didn't want to duplicate the files, because the vLLM YAMLs shouldn't depend on which gateway we use.
Another option: we could add a third level of abstraction: components -> vllm -> gateway. It would be cleaner, but possibly also more complicated to understand.
@@ -0,0 +1,11 @@
apiVersion: kustomize.config.k8s.io/v1beta1
Oh ok, I see what you're doing with the naming now. The difference now is that any one of these deployments deploys only a working vLLM stack, and then you have to deploy your inference-gateway stack separately.
cc @tumido @Gregory-Pereira @vMaroon, just wanting to check with you on how this will work with your Helm chart?
echo "ERROR: NAMESPACE environment variable is not set." | ||
exit 1 | ||
fi | ||
if [[ -z "${VLLM_MODE:-}" ]]; then |
Consider whether we should just default this to `vllm-sim`, given that we know it's functional on practically any cluster.
We can, but I want people to set it explicitly so they are sure which environment they are testing and running.
+1 for using the vLLM simulator. If you want to steer people towards using real vLLM, you can add a print message when set to simulator (e.g., ** INFO: running simulated vLLM instances **).
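If the defaulting route is taken, a minimal sketch along the lines of that suggestion (the variable name follows the script above; the message text is illustrative):

```bash
# Fall back to the simulator when VLLM_MODE is unset, but be loud about it.
if [[ -z "${VLLM_MODE:-}" ]]; then
  VLLM_MODE="vllm-sim"
  echo "** INFO: VLLM_MODE not set, running simulated vLLM instances (vllm-sim) **"
fi
```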
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${REDIS_SVC_NAME}
I think this should be renamed to `REDIS_DEPLOYMENT_NAME` or something like that, and in the Service CR add a `-service` extension. Otherwise the deployment's name will be weird.
Done
Signed-off-by: Kfir Toledo <[email protected]>
- "-c" | ||
args: | ||
- | | ||
export LMCACHE_DISTRIBUTED_URL=$${${POD_IP}}:80 && \ |
Is this syntax correct? Shouldn't it be:
Suggested change: `export LMCACHE_DISTRIBUTED_URL=$${${POD_IP}}:80 && \` → `export LMCACHE_DISTRIBUTED_URL=${POD_IP}:80 && \`
In theory you are right; the end goal is that it will be `${POD_IP}`. But the issue is that I use envsubst, so I need to write it as `$${${POD_IP}}` and `export POD_IP="POD_IP"`, so that after I trigger envsubst I end up with `${POD_IP}` (it took me a lot of time to figure this out).
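A possible alternative worth noting here, as a minimal sketch assuming GNU envsubst (which accepts a SHELL-FORMAT argument restricting which variables get substituted) and an illustrative file name:

```bash
# Only NAMESPACE, VLLM_IMAGE and VLLM_TAG are substituted at render time;
# ${POD_IP} passes through untouched for the pod's shell to expand at runtime.
envsubst '${NAMESPACE} ${VLLM_IMAGE} ${VLLM_TAG}' < vllm-p2p-deployment.yaml.in \
  | kubectl apply -n "${NAMESPACE}" -f -
```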
390c50a to b189362
@shaneutt, PTAL.
The same thing will need to be done for the EPP:
- `vllm-sim`: Lightweight simulator for simple environments
- `vllm`: Full vLLM model server for real inference
Suggested change: "Full vLLM model server for real inference" → "Full vLLM model server, using GPU/CPU for inferencing"
- Set hugging face token variable:
Suggested change: "Set hugging face token variable" → "Set Hugging Face token variable"
@@ -152,6 +152,13 @@ Create the namespace:
```console
kubectl create namespace ${NAMESPACE}
```
Set the default namespace for kubectl commands

```console
nit: Please be consistent with using `bash` or `console` for command line/shell snippets.
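Side note: the new snippet is cut off in the quote above; presumably it is the standard context update, something along these lines (a guess, not a line from the diff):

```bash
# Hypothetical reconstruction of the truncated snippet: make ${NAMESPACE} the
# default namespace for subsequent kubectl commands.
kubectl config set-context --current --namespace "${NAMESPACE}"
```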
```console
export VLLM_SIM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
Are the images being set elsewhere, to match simulator/vLLM/vLLM-p2p?
@@ -202,16 +208,72 @@ make environment.dev.kubernetes
This will deploy the entire stack to whatever namespace you chose. You can test
by exposing the inference `Gateway` via port-forward:

```console
kubectl -n ${NAMESPACE} port-forward service/inference-gateway-istio 8080:80
```bash
nit: `console` vs `bash`?
#   - The container image and tag (VLLM_IMAGE, VLLM_TAG)
#   - The model to load (MODEL_NAME)
#
# This setup is suitable for testing and production with Kubernetes (including
Do you want to stand behind the `production` qualification? As opposed to `testing on real hardware`. I imagine there are ways to configure vLLM optimally for hardware that go beyond what's done in this PR.
configMapGenerator:
- name: vllm-model-config
  literals:
  - MODEL_NAME=${MODEL_NAME}
nit: missing newline at end of file
- type: NodePort
+ type: LoadBalancer
Is there an LB in the test clusters?
- name: quay.io/vllm-d/gateway-api-inference-extension/epp
  newName: ${EPP_IMAGE}
  newTag: ${EPP_TAG}

patches:
- path: patch-deployments.yaml
- path: patch-gateways.yaml
- path: patch-gateways.yaml
nit: missing newline at end of file
echo "ERROR: NAMESPACE environment variable is not set." | ||
exit 1 | ||
fi | ||
if [[ -z "${VLLM_MODE:-}" ]]; then |
+1 for using the vLLM simulator. If you want to steer people towards using real vLLM, you can add a print message when set to simulator (e.g., ** INFO: running simulated vLLM instances **).
Signed-off-by: Kfir Toledo <[email protected]>
Add support for Kubernetes environment development using GIE with KGateway and vLLM
This PR introduces support for the `vllm` mode, enabling integration testing of GIE with vLLM. It also adds support for the `vllm-p2p` mode, which includes: