# v0.4.3

## Highlights

### Model Support

#### LLM
- Added support for Falcon (#5069)
- Added support for IBM Granite Code models (#4636)
- Added blocksparse flash attention kernel and Phi-3-Small model (#4799)
- Added Snowflake arctic model implementation (#4652, #4889, #4690)
- Added support for dynamic RoPE scaling (#4638)
- Added support for long-context LoRA (#4787)
#### Embedding Models
- Initial support for Embedding API with e5-mistral-7b-instruct (#3734); a usage sketch follows this list
- Cross-attention KV caching and memory-management towards encoder-decoder model support (#4837)
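
A minimal offline-embedding sketch for the new Embedding API (#3734). The `LLM.encode()` entry point and the `EmbeddingRequestOutput` attribute names are assumptions based on this release, not a definitive reference:

```python
from vllm import LLM

# Assumed usage of the Embedding API (#3734) with e5-mistral-7b-instruct.
llm = LLM(model="intfloat/e5-mistral-7b-instruct", enforce_eager=True)

prompts = [
    "query: how do I bake bread?",
    "passage: Preheat the oven to 230 C and bake for 30 minutes.",
]

# encode() is assumed to return one EmbeddingRequestOutput per prompt.
outputs = llm.encode(prompts)
for output in outputs:
    embedding = output.outputs.embedding  # list[float]; 4096-dim for Mistral-7B
    print(len(embedding))
```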
#### Vision Language Model
- Add base class for vision-language models (#4809)
- Consolidate prompt arguments to LLM engines (#4328)
- LLaVA model refactor (#4910)
### Hardware Support

#### AMD
- Add fused_moe Triton configs (#4951)
- Add support for Punica kernels (#3140)
- Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797)
### Production Engine

#### Batch API
- Support OpenAI batch file format (#4794)
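
The batch input follows the OpenAI batch JSONL format, one request per line. The sketch below writes a two-request file and shows the batch-runner invocation; the `vllm.entrypoints.openai.run_batch` module path and the model name are assumptions for illustration:

```python
import json

# Each line is one request in the OpenAI batch file format (#4794):
# a custom_id, an HTTP method, a target URL, and the request body.
requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "messages": [{"role": "user", "content": "Hello world!"}],
        },
    },
    {
        "custom_id": "request-2",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "messages": [{"role": "user", "content": "Summarize vLLM in one line."}],
        },
    },
]

with open("batch_input.jsonl", "w") as f:
    for request in requests:
        f.write(json.dumps(request) + "\n")

# Then run the batch (module path assumed):
#   python -m vllm.entrypoints.openai.run_batch \
#       -i batch_input.jsonl -o batch_output.jsonl \
#       --model meta-llama/Meta-Llama-3-8B-Instruct
```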
#### Making Ray Optional
- Add `MultiprocessingGPUExecutor` (#4539)
- Eliminate parallel worker per-step task scheduling overhead (#4894)
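
With `MultiprocessingGPUExecutor` (#4539), single-node tensor parallelism no longer requires Ray. A sketch assuming the `distributed_executor_backend` engine argument and its `"mp"` value in this release:

```python
from vllm import LLM, SamplingParams

# Single-node TP=2 without Ray, via the multiprocessing executor (#4539).
# The "mp" backend value is an assumption based on this release's engine args.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=2,
    distributed_executor_backend="mp",
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```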
#### Automatic Prefix Caching
- Accelerating the hashing function by avoiding deep copies (#4696)
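
Automatic prefix caching is opt-in. A sketch assuming the `enable_prefix_caching` engine flag, where requests sharing a long prefix reuse cached KV blocks:

```python
from vllm import LLM, SamplingParams

# Requests that share the same long prefix can reuse cached KV blocks.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_prefix_caching=True)

system = "You are a meticulous legal assistant. " * 50  # long shared prefix
questions = ["Summarize clause 4.", "List the parties involved."]

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate([system + q for q in questions], params)
for out in outputs:
    print(out.outputs[0].text)
```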
#### Speculative Decoding
- CUDA graph support (#4295)
- Enable TP>1 speculative decoding (#4840)
- Improve n-gram efficiency (#4724)
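
A sketch of draft-model speculative decoding combined with TP=2 (#4840). The `speculative_model`, `num_speculative_tokens`, and `use_v2_block_manager` arguments are assumed to match this release's engine args, and the draft/target pairing is illustrative:

```python
from vllm import LLM, SamplingParams

# Target model sharded over 2 GPUs, with a small same-vocabulary draft model
# proposing 5 tokens per step (TP>1 speculative decoding, #4840).
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=2,
    speculative_model="JackFram/llama-68m",
    num_speculative_tokens=5,
    use_v2_block_manager=True,  # spec decode is assumed to require block manager v2 here
)

out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```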
### Performance Optimization

#### Quantization
- Add GPTQ Marlin 2:4 sparse structured support (#4790)
- Initial Activation Quantization Support (#4525)
- Marlin 24 prefill performance improvement (about 25% better on average) (#4983)
- Automatically Detect SparseML models (#5119)
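
For the Marlin-based GPTQ path, loading a 4-bit GPTQ checkpoint is enough. A sketch assuming the `gptq_marlin` quantization method name and a public GPTQ checkpoint; if the argument is omitted, vLLM is expected to pick a compatible method from the checkpoint config:

```python
from vllm import LLM, SamplingParams

# 4-bit GPTQ checkpoint served through the Marlin kernels.
llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-GPTQ",
    quantization="gptq_marlin",  # method name assumed for this release
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```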
#### Better Attention Kernel
- Use flash-attn for decoding (#3648)
#### FP8
- Improve FP8 linear layer performance (#4691)
- Add w8a8 CUTLASS kernels (#4749)
- Support for CUTLASS kernels in CUDA graphs (#4954)
- Load FP8 kv-cache scaling factors from checkpoints (#4893)
- Make static FP8 scaling more robust (#4570)
- Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535)
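
The FP8 KV-cache work above is exercised through the `kv_cache_dtype` engine argument. A sketch assuming the `"fp8"` value; checkpoint scaling factors (#4893) are used when present, otherwise default scales apply:

```python
from vllm import LLM, SamplingParams

# Store the KV cache in FP8 (float8_e4m3 path, #4535); scaling factors are
# loaded from the checkpoint when available (#4893).
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    kv_cache_dtype="fp8",
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```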
#### Optimize Distributed Communication
- change python dict to pytorch tensor (#4607)
- change python dict to pytorch tensor for blocks to swap (#4659)
- improve p2p access check (#4992)
- remove vllm-nccl (#5091)
- support both cpu and device tensor in broadcast tensor dict (#4660)
### Extensible Architecture

#### Pipeline Parallelism
- refactor custom allreduce to support multiple tp groups (#4754)
- refactor pynccl to hold multiple communicators (#4591)
- Support PP PyNCCL Groups (#4988)
## What's Changed
- Disable cuda version check in vllm-openai image by @zhaoyang-star in #4530
- [Bugfix] Fix `asyncio.Task` not being subscriptable by @DarkLight1337 in #4623
- [CI] use ccache actions properly in release workflow by @simon-mo in #4629
- [CI] Add retry for agent lost by @cadedaniel in #4633
- Update lm-format-enforcer to 0.10.1 by @noamgat in #4631
- [Kernel] Make static FP8 scaling more robust by @pcmoritz in #4570
- [Core][Optimization] change python dict to pytorch tensor by @youkaichao in #4607
- [Build/CI] Fixing 'docker run' to re-enable AMD CI tests. by @Alexei-V-Ivanov-AMD in #4642
- [Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora by @FurtherAI in #4609
- [Core][Optimization] change copy-on-write from dict[int, list] to list by @youkaichao in #4648
- [Bug fix][Core] fixup ngram not setup correctly by @leiwen83 in #4551
- [Core][Distributed] support both cpu and device tensor in broadcast tensor dict by @youkaichao in #4660
- [Core] Optimize sampler get_logprobs by @rkooo567 in #4594
- [CI] Make mistral tests pass by @rkooo567 in #4596
- [Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi by @DefTruth in #4573
- [Misc] Add `get_name` method to attention backends by @WoosukKwon in #4685
- [Core] Faster startup for LoRA enabled models by @Yard1 in #4634
- [Core][Optimization] change python dict to pytorch tensor for blocks to swap by @youkaichao in #4659
- [CI/Test] fix swap test for multi gpu by @youkaichao in #4689
- [Misc] Use vllm-flash-attn instead of flash-attn by @WoosukKwon in #4686
- [Dynamic Spec Decoding] Auto-disable by the running queue size by @comaniac in #4592
- [Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs by @cadedaniel in #4672
- [Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin by @alexm-neuralmagic in #4626
- [Frontend] add tok/s speed metric to llm class when using tqdm by @MahmoudAshraf97 in #4400
- [Frontend] Move async logic outside of constructor by @DarkLight1337 in #4674
- [Misc] Remove unnecessary ModelRunner imports by @WoosukKwon in #4703
- [Misc] Set block size at initialization & Fix test_model_runner by @WoosukKwon in #4705
- [ROCm] Add support for Punica kernels on AMD GPUs by @kliuae in #3140
- [Bugfix] Fix CLI arguments in OpenAI server docs by @DarkLight1337 in #4709
- [Bugfix] Update grafana.json by @robertgshaw2-neuralmagic in #4711
- [Bugfix] Add logs for all model dtype casting by @mgoin in #4717
- [Model] Snowflake arctic model implementation by @sfc-gh-hazhang in #4652
- [Kernel] [FP8] Improve FP8 linear layer performance by @pcmoritz in #4691
- [Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support by @comaniac in #4535
- [Core][Distributed] refactor pynccl to hold multiple communicators by @youkaichao in #4591
- [Misc] Keep only one implementation of the create_dummy_prompt function. by @AllenDou in #4716
- chunked-prefill-doc-syntax by @simon-mo in #4603
- [Core] fix type annotation for `swap_blocks` by @jikunshang in #4726
- [Misc] Apply a couple g++ cleanups by @stevegrubb in #4719
- [Core] Fix circular reference which leaked llm instance in local dev env by @rkooo567 in #4737
- [Bugfix] Fix CLI arguments in OpenAI server docs by @AllenDou in #4729
- [Speculative decoding] CUDA graph support by @heeju-kim2 in #4295
- [CI] Nits for bad initialization of SeqGroup in testing by @robertgshaw2-neuralmagic in #4748
- [Core][Test] fix function name typo in custom allreduce by @youkaichao in #4750
- [Model][Misc] Add e5-mistral-7b-instruct and Embedding API by @CatherineSue in #3734
- [Model] Add support for IBM Granite Code models by @yikangshen in #4636
- [CI/Build] Tweak Marlin Nondeterminism Issues In CI by @robertgshaw2-neuralmagic in #4713
- [CORE] Improvement in ranks code by @SwapnilDreams100 in #4718
- [Core][Distributed] refactor custom allreduce to support multiple tp groups by @youkaichao in #4754
- [CI/Build] Move `test_utils.py` to `tests/utils.py` by @DarkLight1337 in #4425
- [Scheduler] Warning upon preemption and Swapping by @rkooo567 in #4647
- [Misc] Enhance attention selector by @WoosukKwon in #4751
- [Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 by @sangstar in #4208
- [Speculative decoding] Improve n-gram efficiency by @comaniac in #4724
- [Kernel] Use flash-attn for decoding by @skrider in #3648
- [Bugfix] Fix dynamic FP8 quantization for Mixtral by @pcmoritz in #4793
- [Doc] Shorten README by removing supported model list by @zhuohan123 in #4796
- [Doc] Add API reference for offline inference by @DarkLight1337 in #4710
- [Doc] Add meetups to the doc by @zhuohan123 in #4798
- [Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies by @KuntaiDu in #4696
- [Bugfix][Doc] Fix CI failure in docs by @DarkLight1337 in #4804
- [Core] Add MultiprocessingGPUExecutor by @njhill in #4539
- Add 4th meetup announcement to readme by @simon-mo in #4817
- Revert "[Kernel] Use flash-attn for decoding (#3648)" by @rkooo567 in #4820
- [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API by @rkooo567 in #4681
- [CI/Build] Further decouple HuggingFace implementation from ours during tests by @DarkLight1337 in #4166
- [Bugfix] Properly set distributed_executor_backend in ParallelConfig by @zifeitong in #4816
- [Doc] Highlight the fourth meetup in the README by @zhuohan123 in #4842
- [Frontend] Re-enable custom roles in Chat Completions API by @DarkLight1337 in #4758
- [Frontend] Support OpenAI batch file format by @wuisawesome in #4794
- [Core] Implement sharded state loader by @aurickq in #4690
- [Speculative decoding][Re-take] Enable TP>1 speculative decoding by @comaniac in #4840
- Add marlin unit tests and marlin benchmark script by @alexm-neuralmagic in #4815
- [Kernel] add bfloat16 support for gptq marlin kernel by @jinzhen-lin in #4788
- [docs] Fix typo in examples filename openi -> openai by @wuisawesome in #4864
- [Frontend] Separate OpenAI Batch Runner usage from API Server by @wuisawesome in #4851
- [Bugfix] Bypass authorization API token for preflight requests by @dulacp in #4862
- Add GPTQ Marlin 2:4 sparse structured support by @alexm-neuralmagic in #4790
- Add JSON output support for benchmark_latency and benchmark_throughput by @simon-mo in #4848
- [ROCm][AMD][Bugfix] adding a missing triton autotune config by @hongxiayang in #4845
- [Core][Distributed] remove graph mode function by @youkaichao in #4818
- [Misc] remove old comments by @youkaichao in #4866
- [Kernel] Add punica dimension for Qwen1.5-32B LoRA by @Silencioo in #4850
- [Kernel] Add w8a8 CUTLASS kernels by @tlrmchlsmth in #4749
- [Bugfix] Fix FP8 KV cache support by @WoosukKwon in #4869
- Support to serve vLLM on Kubernetes with LWS by @kerthcet in #4829
- [Frontend] OpenAI API server: Do not add bos token by default when encoding by @bofenghuang in #4688
- [Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests by @Alexei-V-Ivanov-AMD in #4797
- [Bugfix] fix rope error when load models with different dtypes by @jinzhen-lin in #4835
- Sync huggingface modifications of qwen Moe model by @eigen2017 in #4774
- [Doc] Update Ray Data distributed offline inference example by @Yard1 in #4871
- [Bugfix] Relax tiktoken to >= 0.6.0 by @mgoin in #4890
- [ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if Triton is not used by @alexeykondrat in #4658
- [Lora] Support long context lora by @rkooo567 in #4787
- [Bugfix][Model] Add base class for vision-language models by @DarkLight1337 in #4809
- [Kernel] Add marlin_24 unit tests by @alexm-neuralmagic in #4901
- [Kernel] Add flash-attn back by @WoosukKwon in #4907
- [Model] LLaVA model refactor by @DarkLight1337 in #4910
- Remove marlin warning by @alexm-neuralmagic in #4918
- [Bugfix]: Fix communication Timeout error in safety-constrained distributed System by @ZwwWayne in #4914
- [Build/CI] Enabling AMD Entrypoints Test by @Alexei-V-Ivanov-AMD in #4834
- [Bugfix] Fix dummy weight for fp8 by @mzusman in #4916
- [Core] Sharded State Loader download from HF by @aurickq in #4889
- [Doc]Add documentation to benchmarking script when running TGI by @KuntaiDu in #4920
- [Core] Fix scheduler considering "no LoRA" as "LoRA" by @Yard1 in #4897
- [Model] add rope_scaling support for qwen2 by @hzhwcmhf in #4930
- [Model] Add Phi-2 LoRA support by @Isotr0py in #4886
- [Docs] Add acknowledgment for sponsors by @simon-mo in #4925
- [CI/Build] Codespell ignore `build/` directory by @mgoin in #4945
- [Bugfix] Fix flag name for `max_seq_len_to_capture` by @kerthcet in #4935
- [Bugfix][Kernel] Add head size check for attention backend selection by @Isotr0py in #4944
- [Frontend] Dynamic RoPE scaling by @sasha0552 in #4638
- [CI/Build] Enforce style for C++ and CUDA code with `clang-format` by @mgoin in #4722
- [misc] remove comments that were supposed to be removed by @rkooo567 in #4977
- [Kernel] Fixup for CUTLASS kernels in CUDA graphs by @tlrmchlsmth in #4954
- [Misc] Load FP8 kv-cache scaling factors from checkpoints by @comaniac in #4893
- [Model] LoRA gptbigcode implementation by @raywanb in #3949
- [Core] Eliminate parallel worker per-step task scheduling overhead by @njhill in #4894
- [Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig by @pcmoritz in #4991
- [Misc] Take user preference in attention selector by @comaniac in #4960
- Marlin 24 prefill performance improvement (about 25% better on average) by @alexm-neuralmagic in #4983
- [Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is not defined by @LetianLee in #5009
- [Core][1/N] Support PP PyNCCL Groups by @andoorve in #4988
- [Kernel] Initial Activation Quantization Support by @dsikka in #4525
- [Core]: Option To Use Prompt Token Ids Inside Logits Processor by @kezouke in #4985
- [Doc] add ccache guide in doc by @youkaichao in #5012
- [Bugfix] Fix Mistral v0.3 Weight Loading by @robertgshaw2-neuralmagic in #5005
- [Core][Bugfix]: fix prefix caching for blockv2 by @leiwen83 in #4764
- [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model by @linxihui in #4799
- [Misc] add logging level env var by @youkaichao in #5045
- [Dynamic Spec Decoding] Minor fix for disabling speculative decoding by @LiuXiaoxuanPKU in #5000
- [Misc] Make Serving Benchmark More User-friendly by @ywang96 in #5044
- [Bugfix / Core] Prefix Caching Guards (merged with main) by @zhuohan123 in #4846
- [Core] Allow AQLM on Pascal by @sasha0552 in #5058
- [Model] Add support for falcon-11B by @Isotr0py in #5069
- [Core] Sliding window for block manager v2 by @mmoskal in #4545
- [BugFix] Fix Embedding Models with TP>1 by @robertgshaw2-neuralmagic in #5075
- [Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X by @divakar-amd in #4951
- [Docs] Add Dropbox as sponsors by @simon-mo in #5089
- [Core] Consolidate prompt arguments to LLM engines by @DarkLight1337 in #4328
- [Bugfix] Remove the last EOS token unless explicitly specified by @jsato8094 in #5077
- [Misc] add gpu_memory_utilization arg by @pandyamarut in #5079
- [Core][Optimization] remove vllm-nccl by @youkaichao in #5091
- [Bugfix] Fix arguments passed to `Sequence` in stop checker test by @DarkLight1337 in #5092
- [Core][Distributed] improve p2p access check by @youkaichao in #4992
- [Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) by @afeldman-nm in #4837
- [Doc]Replace deprecated flag in readme by @ronensc in #4526
- [Bugfix][CI/Build] Fix test and improve code for `merge_async_iterators` by @DarkLight1337 in #5096
- [Bugfix][CI/Build] Fix codespell failing to skip files in `git diff` by @DarkLight1337 in #5097
- [Core] Avoid the need to pass `None` values to `Sequence.inputs` by @DarkLight1337 in #5099
- [Bugfix] logprobs is not compatible with the OpenAI spec #4795 by @Etelis in #5031
- [Doc][Build] update after removing vllm-nccl by @youkaichao in #5103
- [Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter by @alexm-neuralmagic in #5108
- [CI/Build] Docker cleanup functionality for amd servers by @okakarpa in #5112
- [BUGFIX] [FRONTEND] Correct chat logprobs by @br3no in #5029
- [Bugfix] Automatically Detect SparseML models by @robertgshaw2-neuralmagic in #5119
- [CI/Build] increase wheel size limit to 200 MB by @youkaichao in #5130
- [Misc] remove duplicate definition of `seq_lens_tensor` in model_runner.py by @ita9naiwa in #5129
- [Doc] Use intersphinx and update entrypoints docs by @DarkLight1337 in #5125
- add doc about serving option on dstack by @deep-diver in #3074
- Bump version to v0.4.3 by @simon-mo in #5046
- [Build] Disable sm_90a in cu11 by @simon-mo in #5141
- [Bugfix] Avoid Warnings in SparseML Activation Quantization by @robertgshaw2-neuralmagic in #5120
- [Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) by @alexm-neuralmagic in #5136
- [Model] Support MAP-NEO model by @xingweiqu in #5081
- Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" by @simon-mo in #5149
- [Misc]: optimize eager mode host time by @functionxu123 in #4196
- [Model] Enable FP8 QKV in MoE and refine kernel tuning script by @comaniac in #5039
- [Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support by @njhill in #5171
- [Build] Guard against older CUDA versions when building CUTLASS 3.x kernels by @tlrmchlsmth in #5168
## New Contributors
- @MahmoudAshraf97 made their first contribution in #4400
- @sfc-gh-hazhang made their first contribution in #4652
- @stevegrubb made their first contribution in #4719
- @heeju-kim2 made their first contribution in #4295
- @yikangshen made their first contribution in #4636
- @KuntaiDu made their first contribution in #4696
- @wuisawesome made their first contribution in #4794
- @aurickq made their first contribution in #4690
- @jinzhen-lin made their first contribution in #4788
- @dulacp made their first contribution in #4862
- @Silencioo made their first contribution in #4850
- @tlrmchlsmth made their first contribution in #4749
- @kerthcet made their first contribution in #4829
- @bofenghuang made their first contribution in #4688
- @eigen2017 made their first contribution in #4774
- @alexeykondrat made their first contribution in #4658
- @ZwwWayne made their first contribution in #4914
- @mzusman made their first contribution in #4916
- @hzhwcmhf made their first contribution in #4930
- @raywanb made their first contribution in #3949
- @LetianLee made their first contribution in #5009
- @dsikka made their first contribution in #4525
- @kezouke made their first contribution in #4985
- @linxihui made their first contribution in #4799
- @divakar-amd made their first contribution in #4951
- @pandyamarut made their first contribution in #5079
- @afeldman-nm made their first contribution in #4837
- @Etelis made their first contribution in #5031
- @okakarpa made their first contribution in #5112
- @deep-diver made their first contribution in #3074
- @xingweiqu made their first contribution in #5081
- @functionxu123 made their first contribution in #4196
**Full Changelog**: v0.4.2...v0.4.3