Hi All,
I was curious what optimizations vLLM applies to improve LLM inference performance compared to a native PyTorch call.
From reading the docs, I saw paged attention and prefix caching as the two primary optimizations, but there were others as well, such as disaggregated prefilling.
Assuming I simply load a model and generate:
from vllm import LLM, SamplingParams
sampling_params = SamplingParams(temperature=0.8, max_tokens=256)
model = LLM(model=model_name, tensor_parallel_size=8)
outputs = model.generate(prompt, sampling_params)
What optimizations are automatically applied? I'm assuming paged attention and prefix caching, but is there documentation discussing the other optimizations? Furthermore, I noticed that a fair amount of computation/time happens when the model is initially loaded, before generate is ever called, and it doesn't seem to be just loading weights. Is there some precomputation of the KV cache or something similar done here? Would appreciate any insight.
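For context, here is roughly how I imagined toggling these features explicitly. The enable_prefix_caching and gpu_memory_utilization arguments are my guesses from skimming the engine arguments docs, so apologies if I've misread what they actually do:

from vllm import LLM, SamplingParams

# Assumed engine arguments (my reading of the docs, may be off):
# enable_prefix_caching should let shared prompt prefixes reuse KV-cache blocks,
# gpu_memory_utilization should control how much VRAM is reserved up front for the KV cache.
model = LLM(
    model=model_name,
    tensor_parallel_size=8,
    enable_prefix_caching=True,
    gpu_memory_utilization=0.9,
)
outputs = model.generate(prompt, SamplingParams(temperature=0.8, max_tokens=256))

Is this kind of explicit configuration necessary, or are these already on by default in the call above?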
Thanks!