v0.4.0
Major changes
Models
- New models: Command-R (#3433), Qwen2 MoE (#3346), DBRX (#3660), XVerse (#3610), Jais (#3183).
- New vision language model: LLaVA (#3042)
Production features
- Automatic prefix caching (#2762, #3703), which allows long system prompts to be cached and reused across requests. Use the flag `--enable-prefix-caching` to turn it on (see the sketch after this list).
- Support for `json_object` in the OpenAI server for arbitrary JSON output, a `--use-delay` flag to improve time to first token across many requests, and a `min_tokens` sampling parameter that suppresses EOS until a minimum number of tokens has been generated.
- Progress on the chunked prefill scheduler (#3236, #3538) and speculative decoding (#3103).
- The custom all-reduce kernel has been re-enabled after further robustness fixes.
- Replaced the cupy dependency due to its bugs.
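As a rough illustration of how these options fit together, here is a minimal sketch using the offline `LLM` API; the model name and prompts are placeholder assumptions, not part of the release:

```python
# Minimal sketch: automatic prefix caching (#2762) plus the min_tokens
# sampling parameter (#3124). The model name below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model
    enable_prefix_caching=True,  # same effect as --enable-prefix-caching
)

shared_prefix = "You are a helpful assistant. " * 100  # long system prompt
params = SamplingParams(
    max_tokens=256,
    min_tokens=16,  # suppress EOS until at least 16 tokens are generated
)

# The KV cache for the shared prefix is computed once and reused across
# the requests below.
outputs = llm.generate(
    [shared_prefix + "Question one?", shared_prefix + "Question two?"],
    params,
)
```

And a sketch of requesting arbitrary JSON from the OpenAI-compatible server (#3211), assuming a vLLM server is already running on localhost:8000:

```python
# Minimal sketch: constrain the OpenAI-compatible server to JSON output.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model name
    messages=[{"role": "user", "content": "List three colors as JSON."}],
    response_format={"type": "json_object"},  # arbitrary JSON, no schema
)
print(resp.choices[0].message.content)
```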
Hardware
- Improved Neuron support for AWS Inferentia.
- CMake-based build system for extensibility.
What's Changed
- Allow user to choose log level via --log-level instead of fixed 'info' by @AllenDou in #3109
- Reorder kv dtype check to avoid nvcc not found error on AMD platform by @cloudhan in #3104
- Add Automatic Prefix Caching by @SageMoore in #2762
- Add vLLM version info to logs and openai API server by @jasonacox in #3161
- [FIX] Fix styles in automatic prefix caching & add an automatic prefix caching benchmark by @zhuohan123 in #3158
- Make it easy to profile workers with nsight by @pcmoritz in #3162
- [DOC] add setup document to support neuron backend by @liangfu in #2777
- [Minor Fix] Remove unused code in benchmark_prefix_caching.py by @gty111 in #3171
- Add document for vllm paged attention kernel. by @pian13131 in #2978
- enable --gpu-memory-utilization in benchmark_throughput.py by @AllenDou in #3175
- [Minor fix] The domain dns.google may cause a socket.gaierror exception by @ttbachyinsda in #3176
- Push logprob generation to LLMEngine by @Yard1 in #3065
- Add health check, make async Engine more robust by @Yard1 in #3015
- Fix the openai benchmarking requests to work with latest OpenAI apis by @wangchen615 in #2992
- [ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs by @hongxiayang in #3123
- Store `eos_token_id` in `Sequence` for easy access by @njhill in #3166
- [Fix] Avoid pickling entire LLMEngine for Ray workers by @njhill in #3207
- [Tests] Add block manager and scheduler tests by @rkooo567 in #3108
- [Testing] Fix core tests by @cadedaniel in #3224
- A simple addition of `dynamic_ncols=True` by @chujiezheng in #3242
- Add GPTQ support for Gemma by @TechxGenus in #3200
- Update requirements-dev.txt to include package for benchmarking scripts. by @wangchen615 in #3181
- Separate attention backends by @WoosukKwon in #3005
- Measure model memory usage by @mgoin in #3120
- Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) by @jacobthebanana in #3263
- Fix auto prefix bug by @ElizaWszola in #3239
- Connect engine healthcheck to openai server by @njhill in #3260
- Feature add lora support for Qwen2 by @whyiug in #3177
- [Minor Fix] Fix comments in benchmark_serving by @gty111 in #3252
- [Docs] Fix Unmocked Imports by @ywang96 in #3275
- [FIX] Make `flash_attn` optional by @WoosukKwon in #3269
- Move model filelocks from `/tmp/` to `~/.cache/vllm/locks/` dir by @mgoin in #3241
- [FIX] Fix prefix test error on main by @zhuohan123 in #3286
- [Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling by @cadedaniel in #3103
- Enhance lora tests with more layer and rank variations by @tterrysun in #3243
- [ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA by @dllehr-amd in #3262
- [BugFix] Fix get tokenizer when using ray by @esmeetu in #3301
- [Fix] Fix best_of behavior when n=1 by @njhill in #3298
- Re-enable the 80 char line width limit by @zhuohan123 in #3305
- [docs] Add LoRA support information for models by @pcmoritz in #3299
- Add distributed model executor abstraction by @zhuohan123 in #3191
- [ROCm] Fix warp and lane calculation in blockReduceSum by @kliuae in #3321
- Support Mistral Model Inference with transformers-neuronx by @DAIZHENWEI in #3153
- docs: Add BentoML deployment doc by @Sherlock113 in #3336
- Fixes #1556 double free by @br3no in #3347
- Add kernel for GeGLU with approximate GELU by @WoosukKwon in #3337
- [Fix] fix quantization arg when using marlin by @DreamTeamWangbowen in #3319
- add hf_transfer to requirements.txt by @RonanKMcGovern in #3031
- Fix ambiguous `if bias` check by @hliuca in #3259
- [Minor Fix] Use cupy-cuda11x in CUDA 11.8 build by @chenxu2048 in #3256
- Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. by @orsharir in #3350
- Add batched RoPE kernel by @tterrysun in #3095
- Fix lint by @Yard1 in #3388
- [FIX] Simpler fix for async engine running on ray by @zhuohan123 in #3371
- [Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion by @simon-mo in #3383
- Allow user to choose which vLLM metrics to display in Grafana by @AllenDou in #3393
- [Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 by @youkaichao in #3389
- Install `flash_attn` in Docker image by @tdoublep in #3396
- Add args for mTLS support by @declark1 in #3410
- [issue templates] add some issue templates by @youkaichao in #3412
- Fix assertion failure in Qwen 1.5 with prefix caching enabled by @chenxu2048 in #3373
- fix marlin config repr by @qeternity in #3414
- Feature: dynamic shared mem moe_align_block_size_kernel by @akhoroshev in #3376
- [Misc] add HOST_IP env var by @youkaichao in #3419
- Add chat templates for Falcon by @Dinghow in #3420
- Add chat templates for ChatGLM by @Dinghow in #3418
- Fix `dist.broadcast` stall without group argument by @GindaChen in #3408
- Fix tie_word_embeddings for Qwen2 by @fyabc in #3344
- [Fix] Add args for mTLS support by @declark1 in #3430
- Fixes the misuse of time.time()/time.monotonic() by @sighingnow in #3220
- [Misc] add error message in non linux platform by @youkaichao in #3438
- Fix issue templates by @hmellor in #3436
- fix document error for value and v_vec illustration by @laneeeee in #3421
- Asynchronous tokenization by @Yard1 in #2879
- Removed Extraneous Print Message From OAI Server by @robertgshaw2-neuralmagic in #3440
- [Misc] PR templates by @youkaichao in #3413
- Fixes the incorrect argument in the prefix-prefill test cases by @sighingnow in #3246
- Replace `lstrip()` with `removeprefix()` to fix Ruff linter warning by @ronensc in #2958
- Fix Baichuan chat template by @Dinghow in #3340
- [Misc] fix line length for entire codebase by @simon-mo in #3444
- Support arbitrary json_object in OpenAI and Context Free Grammar by @simon-mo in #3211
- Fix setup.py neuron-ls issue by @simon-mo in #2671
- [Misc] Define from_dict and to_dict in InputMetadata by @WoosukKwon in #3452
- [CI] Shard tests for LoRA and Kernels to speed up by @simon-mo in #3445
- [Bugfix] Make moe_align_block_size AMD-compatible by @WoosukKwon in #3470
- CI: Add ROCm Docker Build by @simon-mo in #2886
- [Testing] Add test_config.py to CI by @cadedaniel in #3437
- [CI/Build] Fix Bad Import In Test by @robertgshaw2-neuralmagic in #3473
- [Misc] Fix PR Template by @zhuohan123 in #3478
- Cmake based build system by @bnellnm in #2830
- [Core] Zero-copy asdict for InputMetadata by @Yard1 in #3475
- [Misc] Update README for the Third vLLM Meetup by @zhuohan123 in #3479
- [Core] Cache some utils by @Yard1 in #3474
- [Core] print error before deadlock by @youkaichao in #3459
- [Doc] Add docs about OpenAI compatible server by @simon-mo in #3288
- [BugFix] Avoid initializing CUDA too early by @njhill in #3487
- Update dockerfile with ModelScope support by @ifsheldon in #3429
- [Doc] minor fix to neuron-installation.rst by @jimburtoft in #3505
- Revert "[Core] Cache some utils" by @simon-mo in #3507
- [Doc] minor fix of spelling in amd-installation.rst by @jimburtoft in #3506
- Use lru_cache for some environment detection utils by @simon-mo in #3508
- [PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator performance when automatic prefix caching is disabled by @ElizaWszola in #3357
- [Core] Add generic typing to `LRUCache` by @njhill in #3511
- [Misc] Remove cache stream and cache events by @WoosukKwon in #3461
- Abort when nvcc command is not found in the PATH by @AllenDou in #3527
- Check for _is_cuda() in compute_num_jobs by @bnellnm in #3481
- [Bugfix] Fix ROCm support in CMakeLists.txt by @jamestwhedbee in #3534
- [1/n] Triton sampling kernel by @Yard1 in #3186
- [1/n][Chunked Prefill] Refactor input query shapes by @rkooo567 in #3236
- Migrate `logits` computation and gather to `model_runner` by @esmeetu in #3233
- [BugFix] Hot fix in setup.py for neuron build by @zhuohan123 in #3537
- [PREFIX CACHING FOLLOW UP] OrderedDict-based evictor by @ElizaWszola in #3431
- Fix 1D query issue from `_prune_hidden_states` by @rkooo567 in #3539
- [🚀 Ready to be merged] Added support for Jais models by @grandiose-pizza in #3183
- [Misc][Log] Add log for tokenizer length not equal to vocabulary size by @esmeetu in #3500
- [Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config by @WoosukKwon in #3551
- [BugFix] gemma loading after quantization or LoRA. by @taeminlee in #3553
- [Bugfix][Model] Fix Qwen2 by @esmeetu in #3554
- [Hardware][Neuron] Refactor neuron support by @zhuohan123 in #3471
- Some fixes for custom allreduce kernels by @hanzhi713 in #2760
- Dynamic scheduler delay to improve ITL performance by @tdoublep in #3279
- [Core] Improve detokenization performance for prefill by @Yard1 in #3469
- [Bugfix] use SoftLockFile instead of LockFile by @kota-iizuka in #3578
- [Misc] Fix BLOOM copyright notice by @WoosukKwon in #3591
- [Misc] Bump transformers version by @ywang96 in #3592
- [BugFix] Fix Falcon tied embeddings by @WoosukKwon in #3590
- [BugFix] 1D query fix for MoE models by @njhill in #3597
- [CI] typo fix: is_hip --> is_hip() by @youkaichao in #3595
- [CI/Build] respect the common environment variable MAX_JOBS by @youkaichao in #3600
- [CI/Build] fix flaky test by @youkaichao in #3602
- [BugFix] minor fix: method typo in `rotary_embedding.py` file, get_device() -> device by @jikunshang in #3604
- [Bugfix] Revert "[Bugfix] use SoftLockFile instead of LockFile (#3578)" by @WoosukKwon in #3599
- [Model] Add starcoder2 awq support by @shaonianyr in #3569
- [Core] Refactor Attention Take 2 by @WoosukKwon in #3462
- [Bugfix] fix automatic prefix args and add log info by @gty111 in #3608
- [CI] Try introducing isort. by @rkooo567 in #3495
- [Core] Adding token ranks along with logprobs by @SwapnilDreams100 in #3516
- feat: implement the min_tokens sampling parameter by @tjohnson31415 in #3124
- [Bugfix] API stream returning two stops by @dylanwhawk in #3450
- hotfix isort on logprobs ranks pr by @simon-mo in #3622
- [Feature] Add vision language model support. by @xwjiang2010 in #3042
- Optimize `_get_ranks` in Sampler by @Yard1 in #3623
- [Misc] Include matched stop string/token in responses by @njhill in #2976
- Enable more models to inference based on LoRA by @jeejeelee in #3382
- [Bugfix] Fix ipv6 address parsing bug by @liiliiliil in #3641
- [BugFix] Fix ipv4 address parsing regression by @njhill in #3645
- [Kernel] support non-zero cuda devices in punica kernels by @jeejeelee in #3636
- [Doc]add lora support by @jeejeelee in #3649
- [Misc] Minor fix in KVCache type by @WoosukKwon in #3652
- [Core] remove cupy dependency by @youkaichao in #3625
- [Bugfix] More faithful implementation of Gemma by @WoosukKwon in #3653
- [Bugfix] [Hotfix] fix nccl library name by @youkaichao in #3661
- [Model] Add support for DBRX by @megha95 in #3660
- [Misc] add the "download-dir" option to the latency/throughput benchmarks by @AmadeusChan in #3621
- feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark by @ywang96 in #3277
- Add support for Cohere's Command-R model by @zeppombal in #3433
- [Docs] Add Command-R to supported models by @WoosukKwon in #3669
- [Model] Fix and clean commandr by @esmeetu in #3671
- [Model] Add support for xverse by @hxer7963 in #3610
- [CI/Build] update default number of jobs and nvcc threads to avoid overloading the system by @youkaichao in #3675
- [Kernel] Add Triton MoE kernel configs for DBRX + A100 by @WoosukKwon in #3679
- [Core] [Bugfix] Refactor block manager subsystem for better testability by @cadedaniel in #3492
- [Model] Add support for Qwen2MoeModel by @wenyujin333 in #3346
- [Kernel] DBRX Triton MoE kernel H100 by @ywang96 in #3692
- [2/N] Chunked prefill data update by @rkooo567 in #3538
- [Bugfix] Update neuron_executor.py to add optional vision_language_config. by @adamrb in #3695
- fix benchmark format reporting in buildkite by @simon-mo in #3693
- [CI] Add test case to run examples scripts by @simon-mo in #3638
- [Core] Support multi-node inference (eager and CUDA graph) by @esmeetu in #3686
- [Kernel] Add MoE Triton kernel configs for A100 40GB by @WoosukKwon in #3700
- [Bugfix] Set enable_prefix_caching=True in prefix caching example by @WoosukKwon in #3703
- fix logging msg for block manager by @simon-mo in #3701
- [Core] fix del of communicator by @youkaichao in #3702
- [Benchmark] Change mii to use persistent deployment and support tensor parallel by @IKACE in #3628
- bump version to v0.4.0 by @simon-mo in #3705
- Revert "bump version to v0.4.0" by @youkaichao in #3708
- [Test] Make model tests run again and remove --forked from pytest by @rkooo567 in #3631
- [Misc] Minor type annotation fix by @WoosukKwon in #3716
- [Core][Test] move local_rank to the last arg with default value to keep api compatible by @youkaichao in #3711
- add ccache to docker build image by @simon-mo in #3704
- Usage Stats Collection by @yhu422 in #2852
- [BugFix] Fix tokenizer out of vocab size by @esmeetu in #3685
- [BugFix][Frontend] Fix completion logprobs=0 error by @esmeetu in #3731
- [Bugfix] Command-R Max Model Length by @ywang96 in #3727
- bump version to v0.4.0 by @simon-mo in #3712
- [ROCm][Bugfix] Fixed several bugs related to rccl path and attention selector logic by @hongxiayang in #3699
- usage lib get version another way by @simon-mo in #3735
- [BugFix] Use consistent logger everywhere by @njhill in #3738
- [Core][Bugfix] cache len of tokenizer by @youkaichao in #3741
- Fix build when nvtools is missing by @bnellnm in #3698
- CMake build elf without PTX by @simon-mo in #3739
New Contributors
- @cloudhan made their first contribution in #3104
- @SageMoore made their first contribution in #2762
- @jasonacox made their first contribution in #3161
- @gty111 made their first contribution in #3171
- @pian13131 made their first contribution in #2978
- @ttbachyinsda made their first contribution in #3176
- @wangchen615 made their first contribution in #2992
- @chujiezheng made their first contribution in #3242
- @TechxGenus made their first contribution in #3200
- @mgoin made their first contribution in #3120
- @jacobthebanana made their first contribution in #3263
- @ElizaWszola made their first contribution in #3239
- @DAIZHENWEI made their first contribution in #3153
- @Sherlock113 made their first contribution in #3336
- @br3no made their first contribution in #3347
- @DreamTeamWangbowen made their first contribution in #3319
- @RonanKMcGovern made their first contribution in #3031
- @hliuca made their first contribution in #3259
- @orsharir made their first contribution in #3350
- @youkaichao made their first contribution in #3389
- @tdoublep made their first contribution in #3396
- @declark1 made their first contribution in #3410
- @qeternity made their first contribution in #3414
- @akhoroshev made their first contribution in #3376
- @Dinghow made their first contribution in #3420
- @fyabc made their first contribution in #3344
- @laneeeee made their first contribution in #3421
- @bnellnm made their first contribution in #2830
- @ifsheldon made their first contribution in #3429
- @jimburtoft made their first contribution in #3505
- @grandiose-pizza made their first contribution in #3183
- @taeminlee made their first contribution in #3553
- @kota-iizuka made their first contribution in #3578
- @shaonianyr made their first contribution in #3569
- @SwapnilDreams100 made their first contribution in #3516
- @tjohnson31415 made their first contribution in #3124
- @xwjiang2010 made their first contribution in #3042
- @liiliiliil made their first contribution in #3641
- @AmadeusChan made their first contribution in #3621
- @zeppombal made their first contribution in #3433
- @hxer7963 made their first contribution in #3610
- @wenyujin333 made their first contribution in #3346
- @adamrb made their first contribution in #3695
- @IKACE made their first contribution in #3628
- @yhu422 made their first contribution in #2852
Full Changelog: v0.3.3...v0.4.0