Hello all!
I'm from IBM Research, and our team helps drive product adoption of vLLM for LLM serving workloads. We are looking for ways that a product can enable users to tune and deploy their own LoRA adapters in a multi-tenant SaaS context. I was pointed here after writing a (pretty simple) RFC in vLLM asking whether it should be responsible for managing LoRA adapters across a cluster: vllm-project/vllm#12174. I'll detail a bit more about our expected requirements below.
Our primary concern here is scale: we know from experience that we may quickly end up with a deployment of vLLM for a given model that has tens of replicas and tens of thousands of LoRA adapters registered against it. We also expect inference usage of these adapters to be very unbalanced: most users will tune and deploy an adapter, send a few requests, then forget about it until they review their cloud charges months later and delete it, while a smaller set of users will tune adapters that are critical to their business and route all of their production traffic through them.
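For concreteness, here's a minimal sketch of how an adapter gets registered against a single vLLM replica at runtime today, assuming the server was started with `--enable-lora` and `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True` so the runtime LoRA endpoints are exposed (the URL, adapter name, and path here are illustrative). At the scale described above, some cluster-level component would have to drive these per-replica calls and track which replicas hold which adapters, which is essentially what the RFC is asking about.

```python
# Sketch only: per-replica LoRA registration against a vLLM OpenAI-compatible server.
# Assumes --enable-lora and VLLM_ALLOW_RUNTIME_LORA_UPDATING=True on the replica;
# the base URL and adapter name/path below are placeholders.
import requests

VLLM_URL = "http://localhost:8000"  # one replica; a real deployment has tens of these

def load_adapter(name: str, path: str) -> None:
    """Register a LoRA adapter under `name`, served from `path`."""
    resp = requests.post(
        f"{VLLM_URL}/v1/load_lora_adapter",
        json={"lora_name": name, "lora_path": path},
        timeout=30,
    )
    resp.raise_for_status()

def unload_adapter(name: str) -> None:
    """Remove an adapter that is no longer needed (e.g. the forgotten demo adapter)."""
    resp = requests.post(
        f"{VLLM_URL}/v1/unload_lora_adapter",
        json={"lora_name": name},
        timeout=30,
    )
    resp.raise_for_status()

def generate(adapter_name: str, prompt: str) -> str:
    """Inference targets the adapter by passing its name as the model."""
    resp = requests.post(
        f"{VLLM_URL}/v1/completions",
        json={"model": adapter_name, "prompt": prompt, "max_tokens": 64},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

if __name__ == "__main__":
    load_adapter("demo-adapter-1234", "/mnt/adapters/demo-adapter-1234")
    print(generate("demo-adapter-1234", "Summarize: ..."))
```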
Another concern is the flexibility and pluggability of routing and scaling logic for individual adapters. We expect products to come with functional requirements around separate SLO tiers: for some LoRA adapters it may be fine to queue for minutes and only ever run on a single replica with no availability guarantees, while other adapters may need to stay resident across many replicas, even when idle, to meet a strict latency SLO.
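To make the tiering idea concrete, here is a purely hypothetical sketch of the kind of per-adapter policy a router/autoscaler could consume; none of these names (`Tier`, `AdapterPolicy`, `placement_targets`) come from vLLM or this project, they only illustrate the shape of the logic we'd want to plug in.

```python
# Hypothetical per-adapter placement policy, for illustration only.
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    BEST_EFFORT = "best-effort"  # may queue for minutes, scale-to-zero is fine
    GUARANTEED = "guaranteed"    # keep warm on several replicas to meet a latency SLO

@dataclass
class AdapterPolicy:
    adapter_id: str
    tier: Tier
    min_replicas: int         # replicas that must keep this adapter resident
    max_queue_seconds: float  # acceptable queueing delay before scaling out

POLICIES = {
    "acme-prod-summarizer": AdapterPolicy("acme-prod-summarizer", Tier.GUARANTEED, 3, 1.0),
    "demo-adapter-1234": AdapterPolicy("demo-adapter-1234", Tier.BEST_EFFORT, 0, 300.0),
}

def placement_targets(adapter_id: str,
                      replicas_with_adapter: list[str],
                      all_replicas: list[str]) -> list[str]:
    """Decide which replicas should (pre)load this adapter, based on its tier."""
    policy = POLICIES.get(adapter_id)
    if policy is None or policy.tier is Tier.BEST_EFFORT:
        # Best effort: load lazily on whichever replica the router happens to pick.
        return replicas_with_adapter
    # Guaranteed: keep the adapter resident on at least min_replicas replicas.
    missing = policy.min_replicas - len(replicas_with_adapter)
    candidates = [r for r in all_replicas if r not in replicas_with_adapter]
    return replicas_with_adapter + candidates[:max(missing, 0)]
```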
A final concern I'll mention is ease of integration and operation. Our product teams operate with a hybrid-cloud-first mentality: anything that runs as a SaaS service also needs to be ready for a customer to install and operate in their own cloud of choice. This often leads teams to prefer well-documented solutions with cloud-native APIs; for example, we've had success in the past with products adopting kserve/modelmesh to manage resource-efficient deployments of traditional AI models.
Looking forward to learning more about this project, and to seeing what's already being built to handle LoRA serving!