Hi All,
I was curious what optimizations vLLM applies to improve LLM inference performance compared to a native PyTorch call.
From reading the docs, I saw paged attention and prefix caching as the two primary optimizations, but there were others as well, such as disaggregated prefilling.
Assuming I simply load a model and generate:
from vllm import LLM, SamplingParams
sampling_params = SamplingParams(temperature=0.8, max_tokens=256)
model = LLM(model=model_name, tensor_parallel_size=8)
outputs = model.generate(prompt, sampling_params)
What optimizations are automatically applied? I'm assuming paged attention and prefix caching, but is there documentation discussing the other optimizations? Furthermore, I noticed that a fair amount of computation/time happens when the model is initially loaded, before generate is ever called, and it doesn't seem to be just loading weights. Is there some precomputation of the KV cache or something similar done here? Would appreciate any insight.
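For context, here is roughly how I imagined toggling these features explicitly. The enable_prefix_caching and gpu_memory_utilization arguments are my guesses from skimming the engine arguments docs, so apologies if I've misread what they actually do:

from vllm import LLM, SamplingParams

# Assumed engine arguments (my reading of the docs, may be off):
# enable_prefix_caching should let shared prompt prefixes reuse KV-cache blocks,
# gpu_memory_utilization should control how much VRAM is reserved up front for the KV cache.
model = LLM(
    model=model_name,
    tensor_parallel_size=8,
    enable_prefix_caching=True,
    gpu_memory_utilization=0.9,
)
outputs = model.generate(prompt, SamplingParams(temperature=0.8, max_tokens=256))

Is this kind of explicit configuration necessary, or are these already on by default in the call above?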
Thanks!