v0.6.1
Highlights
Model Support
- Added support for Pixtral (`mistralai/Pixtral-12B-2409`) (#8377, #8168); see the sketch after this list
- Added support for Llava-Next-Video (#7559), Qwen-VL (#8029), Qwen2-VL (#7905)
- Multi-input support for LLaVA (#8238) and InternVL2 models (#8201)
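A minimal sketch of running Pixtral offline with `LLM.chat`, assuming placeholder prompt text and image URL; `tokenizer_mode="mistral"` is used because Pixtral ships in Mistral's native checkpoint format (#8168):

```python
from vllm import LLM, SamplingParams

# Pixtral checkpoints use Mistral's native format, so select the
# Mistral tokenizer mode explicitly.
llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")

# One text part plus one image part; the URL is a placeholder.
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image in one sentence."},
        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
    ],
}]

outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```

For multi-image models such as LLaVA and InternVL2, the number of images accepted per prompt is capped with the `limit_mm_per_prompt` engine argument (e.g. `LLM(..., limit_mm_per_prompt={"image": 2})`).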
Performance Enhancements
- Memory optimization for `awq_gemm` and `awq_dequantize`, yielding 2x throughput (#8248); a usage sketch follows below
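No code changes are needed to benefit from the optimized Triton AWQ kernels; loading any AWQ-quantized checkpoint exercises them. A minimal sketch, where the checkpoint name is only an example:

```python
from vllm import LLM, SamplingParams

# The model name is just an example; any AWQ checkpoint works.
# Passing quantization="awq" pins the AWQ kernel path explicitly.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

outputs = llm.generate("Hello, my name is", SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```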
Production Engine
- Support loading and unloading LoRA adapters in the API server (#6566); see the sketch after this list
- Add progress reporting to batch runner (#8060)
- Added support for NVIDIA ModelOpt static scaling checkpoints (#6112)
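A minimal sketch of driving the runtime LoRA endpoints from #6566; the server address, adapter name, and path are placeholders, and the server is assumed to have been started with LoRA enabled (`--enable-lora`):

```python
import requests

BASE_URL = "http://localhost:8000"  # placeholder server address

# Register a new adapter at runtime, without restarting the server.
requests.post(
    f"{BASE_URL}/v1/load_lora_adapter",
    json={"lora_name": "sql_adapter", "lora_path": "/path/to/sql_adapter"},
).raise_for_status()

# Completion requests can now target the adapter via model="sql_adapter".

# Unload the adapter once it is no longer needed.
requests.post(
    f"{BASE_URL}/v1/unload_lora_adapter",
    json={"lora_name": "sql_adapter"},
).raise_for_status()
```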
Others
- Updated the docker image to use Python 3.12 for a small performance bump (#8133)
- Added CODE_OF_CONDUCT.md (#8161)
What's Changed
- [Doc] [Misc] Create CODE_OF_CONDUCT.md by @mmcelaney in #8161
- [bugfix] Upgrade minimum OpenAI version by @SolitaryThinker in #8169
- [Misc] Clean up RoPE forward_native by @WoosukKwon in #8076
- [ci] Mark LoRA test as soft-fail by @khluu in #8160
- [Core/Bugfix] Add query dtype as per FlashInfer API requirements. by @elfiegg in #8173
- [Doc] Add multi-image input example and update supported models by @DarkLight1337 in #8181
- Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) by @Manikandan-Thangaraj-ZS0321 in #7860
- [MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) by @alex-jw-brooks in #8029
- Move verify_marlin_supported to GPTQMarlinLinearMethod by @mgoin in #8165
- [Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM by @sroy745 in #7962
- [Core] Support load and unload LoRA in api server by @Jeffwan in #6566
- [BugFix] Fix Granite model configuration by @njhill in #8216
- [Frontend] Add --logprobs argument to `benchmark_serving.py` by @afeldman-nm in #8191
- [Misc] Use ray[adag] dependency instead of cuda by @ruisearch42 in #7938
- [CI/Build] Increasing timeout for multiproc worker tests by @alexeykondrat in #8203
- [Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput by @rasmith in #8248
- [Misc] Remove `SqueezeLLM` by @dsikka in #8220
- [Model] Allow loading from original Mistral format by @patrickvonplaten in #8168
- [misc] [doc] [frontend] LLM torch profiler support by @SolitaryThinker in #7943
- [Bugfix] Fix Hermes tool call chat template bug by @K-Mistele in #8256
- [Model] Multi-input support for LLaVA and fix embedding inputs for multi-image models by @DarkLight1337 in #8238
- Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) by @wschin in #8241
- [tpu][misc] fix typo by @youkaichao in #8260
- [Bugfix] Fix broken OpenAI tensorizer test by @DarkLight1337 in #8258
- [Model][VLM] Support multi-images inputs for InternVL2 models by @Isotr0py in #8201
- [Model][VLM] Decouple weight loading logic for `Paligemma` by @Isotr0py in #8269
- ppc64le: Dockerfile fixed, and a script for buildkite by @sumitd2 in #8026
- [CI/Build] Use python 3.12 in cuda image by @joerunde in #8133
- [Bugfix] Fix async postprocessor in case of preemption by @alexm-neuralmagic in #8267
- [Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility by @K-Mistele in #8272
- [Frontend] Add progress reporting to run_batch.py by @alugowski in #8060
- [Bugfix] Correct adapter usage for cohere and jamba by @vladislavkruglikov in #8292
- [Misc] GPTQ Activation Ordering by @kylesayrs in #8135
- [Misc] Fused MoE Marlin support for GPTQ by @dsikka in #8217
- Add NVIDIA Meetup slides, announce AMD meetup, and add contact info by @simon-mo in #8319
- [Bugfix] Fix missing `post_layernorm` in CLIP by @DarkLight1337 in #8155
- [CI/Build] enable ccache/sccache for HIP builds by @dtrifiro in #8327
- [Frontend] Clean up type annotations for mistral tokenizer by @DarkLight1337 in #8314
- [CI/Build] Enabling kernels tests for AMD, ignoring some of them that fail by @alexeykondrat in #8130
- Fix ppc64le buildkite job by @sumitd2 in #8309
- [Spec Decode] Move ops.advance_step to flash attn advance_step by @kevin314 in #8224
- [Misc] remove peft as dependency for prompt models by @prashantgupta24 in #8162
- [MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled by @comaniac in #8342
- [Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture by @alexm-neuralmagic in #8340
- [Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers by @SolitaryThinker in #8172
- [CI/Build][Kernel] Update CUTLASS to 3.5.1 tag by @tlrmchlsmth in #8043
- [Misc] Skip loading extra bias for Qwen2-MOE GPTQ models by @jeejeelee in #8329
- [Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel by @Isotr0py in #8299
- [Hardware][NV] Add support for ModelOpt static scaling checkpoints. by @pavanimajety in #6112
- [model] Support for Llava-Next-Video model by @TKONIY in #7559
- [Frontend] Create ErrorResponse instead of raising exceptions in run_batch by @pooyadavoodi in #8347
- [Model][VLM] Add Qwen2-VL model support by @fyabc in #7905
- [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend by @bigPYJ1151 in #7257
- [CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation by @alexeykondrat in #8373
- [Bugfix] Add missing attributes in mistral tokenizer by @DarkLight1337 in #8364
- [Kernel][Misc] Add meta functions for ops to prevent graph breaks by @bnellnm in #6917
- [Misc] Move device options to a single place by @akx in #8322
- [Speculative Decoding] Test refactor by @LiuXiaoxuanPKU in #8317
- Pixtral by @patrickvonplaten in #8377
- Bump version to v0.6.1 by @simon-mo in #8379
New Contributors
- @mmcelaney made their first contribution in #8161
- @elfiegg made their first contribution in #8173
- @Manikandan-Thangaraj-ZS0321 made their first contribution in #7860
- @sumitd2 made their first contribution in #8026
- @alugowski made their first contribution in #8060
- @vladislavkruglikov made their first contribution in #8292
- @kevin314 made their first contribution in #8224
- @TKONIY made their first contribution in #7559
- @akx made their first contribution in #8322
Full Changelog: v0.6.0...v0.6.1