
Releases: vllm-project/vllm

v0.5.3

23 Jul 07:01
bb2fc08

Highlights

Model Support

  • vLLM now supports Meta Llama 3.1! Please check out our blog for initial details on running the model.
    • Please check out this thread for any known issues related to the model.
    • The model runs on a single 8xH100 or 8xA100 node using FP8 quantization (#6606, #6547, #6487, #6593, #6511, #6515, #6552). A minimal launch sketch follows this list.
    • The BF16 version of the model should run on multiple nodes using pipeline parallelism (docs). If you have a fast network interconnect, you might want to consider full tensor parallelism as well. (#6599, #6598, #6529, #6569)
    • To support long context, a new RoPE extension method has been added, and chunked prefill is now turned on by default for the Meta Llama 3.1 series of models. (#6666, #6553, #6673)
  • Support Mistral-Nemo (#6548)
  • Support Chameleon (#6633, #5770)
  • Pipeline parallel support for Mixtral (#6516)
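A minimal offline-inference sketch of the single-node FP8 path described above (the checkpoint id, context length, and sampling settings are illustrative assumptions, not taken from this release):

```python
# Sketch: Llama 3.1 405B FP8 on a single 8-GPU (H100/A100) node.
# The checkpoint id and settings below are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",  # assumed FP8 checkpoint id
    tensor_parallel_size=8,   # shard the weights across the 8 GPUs in the node
    max_model_len=8192,       # cap the context so the KV cache fits comfortably
)

outputs = llm.generate(
    ["Summarize the key features of vLLM in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```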

Hardware Support

Performance Enhancements

  • Add AWQ support to the Marlin kernel. This brings significant (1.5-2x) performance improvements to existing AWQ models! (#6612)
  • Progress towards refactoring for SPMD worker execution. (#6032)
  • Progress in improving the prepare-inputs procedure. (#6164, #6338, #6596)
  • Memory optimization for pipeline parallelism. (#6455)

Production Engine

  • Correctness testing for pipeline parallel and CPU offloading (#6410, #6549)
  • Support dynamically loading LoRA adapters from Hugging Face (#6234)
  • Pipeline Parallel using stdlib multiprocessing module (#6130)

Others

  • A CPU offloading implementation: you can now use --cpu-offload-gb to control how much CPU RAM to use to "extend" GPU memory. (#6496) See the sketch after this list.
  • The new vllm CLI is now ready for testing. It comes with three commands: serve, complete, and chat. Feedback and improvements are greatly welcomed! (#6431)
  • The wheels now build on Ubuntu 20.04 instead of 22.04. (#6517)
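A hedged sketch of the CPU offloading knob through the offline Python API, where cpu_offload_gb mirrors the --cpu-offload-gb flag (the model id and the 4 GiB budget are illustrative):

```python
# Sketch: "extend" GPU memory with CPU RAM by offloading part of the weights.
# Model id and offload budget are illustrative assumptions.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    cpu_offload_gb=4,   # allow up to ~4 GiB of weights to live in CPU RAM
)
print(llm.generate(["Hello, my name is"])[0].outputs[0].text)
```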

What's Changed


v0.5.2

15 Jul 18:01
4cf256a

Major Changes

  • ❗Planned breaking change❗: we plan to remove beam search (see more in #6226) in the next few releases. This release comes with a warning when beam search is enabled for a request. Please voice your concerns in the RFC if you have a valid use case for beam search in vLLM.
  • The release has moved to a Python-version-agnostic wheel (#6394). A single wheel can be installed on all Python versions vLLM supports.

Highlights

Model Support

Hardware

  • AMD: unify CUDA_VISIBLE_DEVICES usage (#6352)

Performance

  • ZeroMQ fallback for broadcasting large objects (#6183)
  • Simplify code to support pipeline parallel (#6406)
  • Turn off CUTLASS scaled_mm for Ada Lovelace (#6384)
  • Use CUTLASS kernels for the FP8 layers with Bias (#6270)

Features

  • Enable bonus tokens in speculative decoding for KV-cache-based models (#5765). See the configuration sketch after this list.
  • Medusa implementation with Top-1 proposer (#4978)
  • An experimental vLLM CLI for serving and querying an OpenAI-compatible server (#5090)
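For context, a sketch of configuring draft-model speculative decoding through the offline API in this release series; the target/draft pair and the token count are illustrative assumptions:

```python
# Sketch: draft-model speculative decoding (argument names as used in this release series).
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-6.7b",              # target model (illustrative)
    speculative_model="facebook/opt-125m",  # small draft model (illustrative)
    num_speculative_tokens=5,               # tokens the draft proposes per step
    use_v2_block_manager=True,              # speculative decoding runs on the v2 block manager
)
out = llm.generate(["The future of AI is"], SamplingParams(temperature=0.0, max_tokens=64))
print(out[0].outputs[0].text)
```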

Others

  • Add support for multi-node on CI (#5955)
  • Benchmark: add H100 suite (#6047)
  • [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy (#5362)
  • Build some nightly wheels (#6380)

What's Changed


v0.5.1

05 Jul 19:47
79d406e

Highlights

  • vLLM now has pipeline parallelism! (#4412, #5408, #6115, #6120). You can now run the API server with --pipeline-parallel-size. This feature is at an early stage; please let us know your feedback.
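As a quick sketch: start the server with, e.g., python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct --pipeline-parallel-size 2 (the model and size are illustrative), then query it as usual with the OpenAI client:

```python
# Sketch: querying a vLLM server started with --pipeline-parallel-size 2.
# Assumes the `openai` Python client and the default port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # must match the served model
    prompt="Pipeline parallelism splits a model's layers across",
    max_tokens=32,
)
print(resp.choices[0].text)
```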

Model Support

  • Support Gemma 2 (#5908, #6051). Please note that, for correctness, Gemma 2 should run with the FlashInfer backend, which supports logits soft cap. The wheels for FlashInfer can be downloaded here. See the sketch after this list.
  • Support Jamba (#4115). This is vLLM's first state space model!
  • Support Deepseek-V2 (#4650). Please note that MLA (Multi-head Latent Attention) is not implemented, and we are looking for contributions!
  • Vision language models: added support for Phi-3-Vision, dynamic image size, and a registry for processing model inputs (#4986, #5276, #5214)
    • Notably, this includes a breaking change: all VLM-specific arguments are now removed from the engine APIs, so you no longer need to set them globally via the CLI. You now only need to pass <image> into the prompt instead of using complicated prompt formatting. See more here.
    • There is also a new guide on adding VLMs! We would love your contribution for new models!
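A hedged sketch of selecting the FlashInfer backend for Gemma 2 via the VLLM_ATTENTION_BACKEND environment variable (assumes the flashinfer wheel is installed; the model id is illustrative):

```python
# Sketch: force the FlashInfer attention backend, which supports Gemma 2's logits soft cap.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # set before vLLM picks a backend

from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-9b-it")
out = llm.generate(["Why is the sky blue?"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```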

Hardware Support

Production Service

  • Support for sharded tensorized models (#4990)
  • Continuous streaming of OpenAI response token stats (#5742)

Performance

  • Enhancement in distributed communication via shared memory (#5399)
  • Latency enhancement in block manager (#5584)
  • Enhancements to compressed-tensors supporting Marlin, W4A16 (#5435, #5385)
  • Faster FP8 quantize kernel (#5396), FP8 on Ampere (#5975)
  • Option to use FlashInfer for prefill, decode, and CUDA Graph for decode (#4628)
  • Speculative Decoding
    • Draft Model Runner (#5799)

Development Productivity

  • Post merge benchmark is now available at perf.vllm.ai!
  • Addition of A100 in CI environment (#5658)
  • Step towards nightly wheel publication (#5610)

What's Changed


v0.5.0.post1

14 Jun 02:43
50eed24

Highlights

  • Add initial TPU integration (#5292)
  • Fix crashes when using FlashAttention backend (#5478)
  • Fix issues when using num_devices < num_available_devices (#5473)

What's Changed

New Contributors

Full Changelog: v0.5.0...v0.5.0.post1

v0.5.0

11 Jun 18:16
8f89d72

Highlights

Production Features

Hardware Support

  • Improvements to the Intel CPU CI (#4113, #5241)
  • Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047)

Others

  • Debugging tips documentation (#5409, #5430)
  • Dynamic Per-Token Activation Quantization (#5037)
  • Customizable RoPE theta (#5197)
  • Enable passing multiple LoRA adapters at once to generate() (#5300). See the sketch after this list.
  • OpenAI tools support named functions (#5032)
  • Support stream_options for OpenAI protocol (#5319, #5135)
  • Update Outlines Integration from FSM to Guide (#4109)
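A sketch of passing one LoRA adapter per prompt in a single generate() call; the base model and adapter paths are illustrative assumptions:

```python
# Sketch: per-prompt LoRA adapters in one generate() call via LoRARequest.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True, max_loras=2)

prompts = ["Translate to French: Hello", "Translate to German: Hello"]
loras = [
    LoRARequest("french-adapter", 1, "/path/to/french-lora"),  # illustrative path
    LoRARequest("german-adapter", 2, "/path/to/german-lora"),  # illustrative path
]

outputs = llm.generate(prompts, SamplingParams(max_tokens=32), lora_request=loras)
for o in outputs:
    print(o.outputs[0].text)
```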

What's Changed

  • [CI/Build] CMakeLists: build all extensions' cmake targets at the same time by @dtrifiro in #5034
  • [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU by @tlrmchlsmth in #5137
  • [Kernel] Update Cutlass fp8 configs by @varun-sundar-rabindranath in #5144
  • [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py by @dashanji in #5151
  • [Bugfix] Fix call to init_logger in openai server by @NadavShmayo in #4765
  • [Feature][Kernel] Support bitsandbytes quantization and QLoRA by @chenqianfzh in #4776
  • [Bugfix] Remove deprecated @abstractproperty by @zhuohan123 in #5174
  • [Bugfix]: Fix issues related to prefix caching example (#5177) by @Delviet in #5180
  • [BugFix] Prevent LLM.encode for non-generation Models by @robertgshaw2-neuralmagic in #5184
  • Update test_ignore_eos by @simon-mo in #4898
  • [Frontend][OpenAI] Support for returning max_model_len on /v1/models response by @Avinash-Raj in #4643
  • [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer by @divakar-amd in #4927
  • [Misc] Simplify code and fix type annotations in conftest.py by @DarkLight1337 in #5118
  • [Core] Support image processor by @DarkLight1337 in #4197
  • [Core] Remove unnecessary copies in flash attn backend by @Yard1 in #5138
  • [Kernel] Pass a device pointer into the quantize kernel for the scales by @tlrmchlsmth in #5159
  • [CI/BUILD] enable intel queue for longer CPU tests by @zhouyuan in #4113
  • [Misc]: Implement CPU/GPU swapping in BlockManagerV2 by @Kaiyang-Chen in #3834
  • New CI template on AWS stack by @khluu in #5110
  • [FRONTEND] OpenAI tools support named functions by @br3no in #5032
  • [Bugfix] Support prompt_logprobs==0 by @toslunar in #5217
  • [Bugfix] Add warmup for prefix caching example by @zhuohan123 in #5235
  • [Kernel] Enhance MoE benchmarking & tuning script by @WoosukKwon in #4921
  • [Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend by @afeldman-nm in #5210
  • [Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor by @zifeitong in #5229
  • [CI/Build] Add inputs tests by @DarkLight1337 in #5215
  • [Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU backend by @DamonFool in #5249
  • [Kernel] Add back batch size 1536 and 3072 to MoE tuning by @WoosukKwon in #5242
  • [CI/Build] Simplify model loading for HfRunner by @DarkLight1337 in #5251
  • [CI/Build] Reducing CPU CI execution time by @bigPYJ1151 in #5241
  • [CI] mark AMD test as softfail to prevent blockage by @simon-mo in #5256
  • [Misc] Add transformers version to collect_env.py by @mgoin in #5259
  • [Misc] update collect env by @youkaichao in #5261
  • [Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True by @zifeitong in #5226
  • [Misc] Add CustomOp interface for device portability by @WoosukKwon in #5255
  • [Misc] Fix docstring of get_attn_backend by @WoosukKwon in #5271
  • [Frontend] OpenAI API server: Add add_special_tokens to ChatCompletionRequest (default False) by @tomeras91 in #5278
  • [CI] Add nightly benchmarks by @simon-mo in #5260
  • [misc] benchmark_serving.py -- add ITL results and tweak TPOT results by @tlrmchlsmth in #5263
  • [Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size by @tlrmchlsmth in #5157
  • [Model] Correct Mixtral FP8 checkpoint loading by @comaniac in #5231
  • [BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM by @DriverSong in #5207
  • [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 by @pcmoritz in #5238
  • [Docs] Add Sequoia as sponsors by @simon-mo in #5287
  • [Speculative Decoding] Add ProposerWorkerBase abstract class by @njhill in #5252
  • [BugFix] Fix log message about default max model length by @njhill in #5284
  • [Bugfix] Make EngineArgs use named arguments for config construction by @mgoin in #5285
  • [Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. by @wuisawesome in #5290
  • [Misc] Skip for logits_scale == 1.0 by @WoosukKwon in #5291
  • [Docs] Add Ray Summit CFP by @simon-mo in #5295
  • [CI] Disable flash_attn backend for spec decode by @simon-mo in #5286
  • [Frontend][Core] Update Outlines Integration from FSM to Guide by @br3no in #4109
  • [CI/Build] Update vision tests by @DarkLight1337 in #5307
  • Bugfix: fix broken of download models from modelscope by @liuyhwangyh in #5233
  • [Kernel] Retune Mixtral 8x22b configs for FP8 on H100 by @pcmoritz in #5294
  • [Frontend] enable passing multiple LoRA adapters at once to generate() by @mgoldey in #5300
  • [Core] Avoid copying prompt/output tokens if no penalties are used by @Yard1 in #5289
  • [Core] Change LoRA embedding sharding to support loading methods by @Yard1 in #5038
  • [Misc] Missing error message for custom ops import by @DamonFool in #5282
  • [Feature][Frontend]: Add support for stream_options in ChatCompletionRequest by @Etelis in #5135
  • [Misc][Utils] allow get_open_port to be called for multiple times by @youkaichao in #5333
  • [Kernel] Switch fp8 layers to use the CUTLASS kernels by @tlrmchlsmth in #5183
  • Remove Ray health check by @Yard1 in #4693
  • Addition of lacked ignored_seq_groups in _schedule_chunked_prefill by @JamesLim-sy in #5296
  • [Kernel] Dynamic Per-Token Activation Quantization by @dsikka in #5037
  • [Frontend] Add OpenAI Vision API Support by @ywang96 in #5237
  • [Misc] Remove unused cuda_utils.h in CPU backend by @DamonFool in #5345
  • fix DbrxFusedNormAttention missing cache_config by @Calvinnncy97 in https://github.com/vllm-...

v0.4.3

01 Jun 00:25
1197e02

Highlights

Model Support

LLM

  • Added support for Falcon (#5069)
  • Added support for IBM Granite Code models (#4636)
  • Added blocksparse flash attention kernel and Phi-3-Small model (#4799)
  • Added Snowflake arctic model implementation (#4652, #4889, #4690)
  • Supported Dynamic RoPE scaling (#4638)
  • Supported long context LoRA (#4787)

Embedding Models

  • Initial support for the Embedding API with e5-mistral-7b-instruct (#3734). See the sketch after this list.
  • Cross-attention KV caching and memory-management towards encoder-decoder model support (#4837)
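A sketch of the Embedding API through the offline interface (output field names follow this release series and may differ in later versions):

```python
# Sketch: offline embeddings with e5-mistral-7b-instruct via LLM.encode().
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", enforce_eager=True)
outputs = llm.encode(["query: how does vLLM manage the KV cache?"])
embedding = outputs[0].outputs.embedding  # list of floats (4096-dim for this model)
print(len(embedding))
```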

Vision Language Model

  • Add base class for vision-language models (#4809)
  • Consolidate prompt arguments to LLM engines (#4328)
  • LLaVA model refactor (#4910)

Hardware Support

AMD

  • Add fused_moe Triton configs (#4951)
  • Add support for Punica kernels (#3140)
  • Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797)

Production Engine

Batch API

  • Support OpenAI batch file format (#4794)

Making Ray Optional

  • Add MultiprocessingGPUExecutor (#4539)
  • Eliminate parallel worker per-step task scheduling overhead (#4894)

Automatic Prefix Caching

  • Accelerating the hashing function by avoiding deep copies (#4696)

Speculative Decoding

  • CUDA graph support (#4295)
  • Enable TP>1 speculative decoding (#4840)
  • Improve n-gram efficiency (#4724)

Performance Optimization

Quantization

  • Add GPTQ Marlin 2:4 sparse structured support (#4790)
  • Initial Activation Quantization Support (#4525)
  • Marlin prefill performance improvement (#4983)
  • Automatically Detect SparseML models (#5119)

Better Attention Kernel

  • Use flash-attn for decoding (#3648)

FP8

  • Improve FP8 linear layer performance (#4691)
  • Add w8a8 CUTLASS kernels (#4749)
  • Support for CUTLASS kernels in CUDA graphs (#4954)
  • Load FP8 kv-cache scaling factors from checkpoints (#4893)
  • Make static FP8 scaling more robust (#4570)
  • Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535)

Optimize Distributed Communication

  • change python dict to pytorch tensor (#4607)
  • change python dict to pytorch tensor for blocks to swap (#4659)
  • improve p2p access check (#4992)
  • remove vllm-nccl (#5091)
  • support both cpu and device tensor in broadcast tensor dict (#4660)

Extensible Architecture

Pipeline Parallelism

  • refactor custom allreduce to support multiple tp groups (#4754)
  • refactor pynccl to hold multiple communicators (#4591)
  • Support PP PyNCCL Groups (#4988)

What's Changed


v0.4.2

05 May 04:31
c7f2cf2

Highlights

Features

Models and Enhancements

Dependency Upgrade

  • Upgrade to torch==2.3.0 (#4454)
  • Upgrade to tensorizer==2.9.0 (#4467)
  • Expansion of AMD test suite (#4267)

Progress and Dev Experience

What's Changed


v0.4.1

24 Apr 02:28
468d761

Highlights

Features

  • Support and enhance CommandR+ (#3829), MiniCPM (#3893), Meta Llama 3 (#4175, #4182), Mixtral 8x22b (#4073, #4002)
  • Support private model registration and update our model support policy (#3871, #3948)
  • Support PyTorch 2.2.1 and Triton 2.2.0 (#4061, #4079, #3805, #3904, #4271)
  • Add option to use LM Format Enforcer for guided decoding (#3868). See the sketch after this list.
  • Add option to optionally initialize the tokenizer and detokenizer (#3748)
  • Add option to load models using tensorizer (#3476)
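A hedged sketch of guided (JSON-constrained) decoding against a vLLM OpenAI-compatible server; guided_json is a vLLM-specific extension passed via extra_body, and the server is assumed to have been started with --guided-decoding-backend lm-format-enforcer (the model id is illustrative):

```python
# Sketch: request schema-constrained JSON output from a vLLM server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Give me a fictional person as JSON."}],
    extra_body={"guided_json": schema},  # vLLM extension for guided decoding
)
print(resp.choices[0].message.content)
```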

Enhancements

Hardware

  • Intel CPU inference backend is added (#3993, #3634)
  • AMD backend is enhanced with Triton kernel and e4m3fn KV cache (#3643, #3290)

What's Changed

  • [Kernel] Layernorm performance optimization by @mawong-amd in #3662
  • [Doc] Update installation doc for build from source and explain the dependency on torch/cuda version by @youkaichao in #3746
  • [CI/Build] Make Marlin Tests Green by @robertgshaw2-neuralmagic in #3753
  • [Misc] Minor fixes in requirements.txt by @WoosukKwon in #3769
  • [Misc] Some minor simplifications to detokenization logic by @njhill in #3670
  • [Misc] Fix Benchmark TTFT Calculation for Chat Completions by @ywang96 in #3768
  • [Speculative decoding 4/9] Lookahead scheduling for speculative decoding by @cadedaniel in #3250
  • [Misc] Add support for new autogptq checkpoint_format by @Qubitium in #3689
  • [Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup by @cadedaniel in #3783
  • [Hardware][Intel] Add CPU inference backend by @bigPYJ1151 in #3634
  • [HotFix] [CI/Build] Minor fix for CPU backend CI by @bigPYJ1151 in #3787
  • [Frontend][Bugfix] allow using the default middleware with a root path by @A-Mahla in #3788
  • [Doc] Fix vLLMEngine Doc Page by @ywang96 in #3791
  • [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build by @youkaichao in #3801
  • Fix crash when try torch.cuda.set_device in worker by @leiwen83 in #3770
  • [Bugfix] Add __init__.py files for vllm/core/block/ and vllm/spec_decode/ by @mgoin in #3798
  • [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary by @youkaichao in #3803
  • [Speculative decoding] Adding configuration object for speculative decoding by @cadedaniel in #3706
  • [BugFix] Use different mechanism to get vllm version in is_cpu() by @njhill in #3804
  • [Doc] Update README.md by @robertgshaw2-neuralmagic in #3806
  • [Doc] Update contribution guidelines for better onboarding by @michaelfeil in #3819
  • [3/N] Refactor scheduler for chunked prefill scheduling by @rkooo567 in #3550
  • Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) by @AdrianAbeyta in #3290
  • [Misc] Publish 3rd meetup slides by @WoosukKwon in #3835
  • Fixes the argument for local_tokenizer_group by @sighingnow in #3754
  • [Core] Enable hf_transfer by default if available by @michaelfeil in #3817
  • [Bugfix] Add kv_scale input parameter to CPU backend by @WoosukKwon in #3840
  • [Core] [Frontend] Make detokenization optional by @mgerstgrasser in #3749
  • [Bugfix] Fix args in benchmark_serving by @CatherineSue in #3836
  • [Benchmark] Refactor sample_requests in benchmark_throughput by @gty111 in #3613
  • [Core] manage nccl via a pypi package & upgrade to pt 2.2.1 by @youkaichao in #3805
  • [Hardware][CPU] Update cpu torch to match default of 2.2.1 by @mgoin in #3854
  • [Model] Cohere CommandR+ by @saurabhdash2512 in #3829
  • [Core] improve robustness of pynccl by @youkaichao in #3860
  • [Doc]Add asynchronous engine arguments to documentation. by @SeanGallen in #3810
  • [CI/Build] fix pip cache with vllm_nccl & refactor dockerfile to build wheels by @youkaichao in #3859
  • [Misc] Add pytest marker to opt-out of global test cleanup by @cadedaniel in #3863
  • [Misc] Fix linter issues in examples/fp8/quantizer/quantize.py by @cadedaniel in #3864
  • [Bugfix] Fixing requirements.txt by @noamgat in #3865
  • [Misc] Define common requirements by @WoosukKwon in #3841
  • Add option to completion API to truncate prompt tokens by @tdoublep in #3144
  • [Chunked Prefill][4/n] Chunked prefill scheduler. by @rkooo567 in #3853
  • [Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism by @Isotr0py in #3869
  • [CI/Benchmark] add more iteration and use multiple percentiles for robust latency benchmark by @youkaichao in #3889
  • [Core] enable out-of-tree model register by @youkaichao in #3871
  • [WIP][Core] latency optimization by @youkaichao in #3890
  • [Bugfix] Fix Llava inference with Tensor Parallelism. by @Isotr0py in #3883
  • [Model] add minicpm by @SUDA-HLT-ywfang in #3893
  • [Bugfix] Added Command-R GPTQ support by @egortolmachev in #3849
  • [Bugfix] Enable Proper attention_bias Usage in Llama Model Configuration by @Ki6an in #3767
  • [Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations by @mawong-amd in #3782
  • [BugFix][Model] Fix commandr RoPE max_position_embeddings by @esmeetu in #3919
  • [Core] separate distributed_init from worker by @youkaichao in #3904
  • [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" by @cadedaniel in #3837
  • [Bugfix] Fix KeyError on loading GPT-NeoX by @jsato8094 in #3925
  • [ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm by @jpvillam-amd in #3643
  • [Misc] Avoid loading incorrect LoRA config by @jeejeelee in #3777
  • [Benchmark] Add cpu options to bench scripts by @PZD-CHINA in #3915
  • [Bugfix] fix utils.py/merge_dict func TypeError: 'type' object is not subscriptable by @zhaotyer in #3955
  • [Bugfix] Fix logits processor when prompt_logprobs is not None by @huyiwen in #3899
  • [Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty by @tjohnson31415 in #3876
  • [Bugfix][ROCm] Add numba to Dockerfile.rocm by @WoosukKwon in #3962
  • [Model][AMD] ROCm support for 256 head dims for Gemma by @jamestwhedbee in #3972
  • [Doc] Add doc to state our model support policy by @youkaichao in #3948
  • [Bugfix] Remove key sorting for guided_json parameter in OpenAi compatible Server by @dmarasco in #3945
  • [Doc] Fix getting stared to use publicly available model by @fpaupier in #3963
  • [Bugfix] handle hf_config with architectures == None by @tjohnson31415 in #3982
  • [WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators by @youkaichao in #3950
  • [Core][5/N] Fully working chunked prefill e2e by @rkooo567 in #3884
  • [Core][Model] Use torch.compile to accelerate layernorm in commandr by @youkaichao in #3985
  • [Test] Add xformer and flash attn tests by @rkooo567 in #3961
  • [Misc] refactor ops and cache_ops layer by @jikunshang in #3913
  • [Doc][Installation] delete python setup.py develop by @youkaichao in #3989
  • [Ke...

v0.4.0.post1, restore sm70/75 support

02 Apr 20:01
a3c226e

Highlight

v0.4.0 lacks sm70/75 support. We did a hotfix for it.

What's Changed

  • [Kernel] Layernorm performance optimization by @mawong-amd in #3662
  • [Doc] Update installation doc for build from source and explain the dependency on torch/cuda version by @youkaichao in #3746
  • [CI/Build] Make Marlin Tests Green by @robertgshaw2-neuralmagic in #3753
  • [Misc] Minor fixes in requirements.txt by @WoosukKwon in #3769
  • [Misc] Some minor simplifications to detokenization logic by @njhill in #3670
  • [Misc] Fix Benchmark TTFT Calculation for Chat Completions by @ywang96 in #3768
  • [Speculative decoding 4/9] Lookahead scheduling for speculative decoding by @cadedaniel in #3250
  • [Misc] Add support for new autogptq checkpoint_format by @Qubitium in #3689
  • [Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup by @cadedaniel in #3783
  • [Hardware][Intel] Add CPU inference backend by @bigPYJ1151 in #3634
  • [HotFix] [CI/Build] Minor fix for CPU backend CI by @bigPYJ1151 in #3787
  • [Frontend][Bugfix] allow using the default middleware with a root path by @A-Mahla in #3788
  • [Doc] Fix vLLMEngine Doc Page by @ywang96 in #3791
  • [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build by @youkaichao in #3801
  • Fix crash when try torch.cuda.set_device in worker by @leiwen83 in #3770
  • [Bugfix] Add __init__.py files for vllm/core/block/ and vllm/spec_decode/ by @mgoin in #3798
  • [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary by @youkaichao in #3803

New Contributors

Full Changelog: v0.4.0...v0.4.0.post1

v0.4.0

30 Mar 01:54
51c31bc

Major changes

Models

Production features

  • Automatic prefix caching (#2762, #3703), allowing long system prompts to be automatically cached across requests. Use the flag --enable-prefix-caching to turn it on. See the sketch after this list.
  • Support for json_object in the OpenAI server for arbitrary JSON, a --use-delay flag to improve time to first token across many requests, and min_tokens to suppress EOS until a minimum number of tokens is generated.
  • Progress in the chunked prefill scheduler (#3236, #3538) and speculative decoding (#3103).
  • The custom all-reduce kernel has been re-enabled after more robustness fixes.
  • Replaced the cupy dependency due to its bugs.
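A sketch of automatic prefix caching through the offline API, where enable_prefix_caching mirrors the --enable-prefix-caching flag (the model id and prompts are illustrative):

```python
# Sketch: a long shared system prompt is computed once and reused across requests.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", enable_prefix_caching=True)

system = "You are a support agent for ExampleCorp. " * 50  # long shared prefix
questions = ["How do I reset my password?", "How do I cancel my order?"]
outputs = llm.generate([system + q for q in questions], SamplingParams(max_tokens=64))
for o in outputs:
    print(o.outputs[0].text)
```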

Hardware

  • Improved Neuron support for AWS Inferentia.
  • CMake based build system for extensibility.

Ecosystem

  • Extensive serving benchmark refactoring (#3277)
  • Usage statistics collection (#2852)

What's Changed
