v0.4.3

Released by github-actions on 01 Jun 00:25 · commit 1197e02

Highlights

Model Support

LLM

  • Added support for Falcon (#5069)
  • Added support for IBM Granite Code models (#4636)
  • Added blocksparse flash attention kernel and Phi-3-Small model (#4799)
  • Added Snowflake Arctic model implementation (#4652, #4889, #4690)
  • Supported dynamic RoPE scaling (#4638); see the sketch after this list
  • Added support for long-context LoRA (#4787)
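
A minimal sketch of enabling dynamic RoPE scaling through the offline API, assuming the `rope_scaling` engine argument follows the Hugging Face config convention (the model and scaling factor are illustrative):

```python
from vllm import LLM, SamplingParams

# Dynamic NTK-aware RoPE scaling stretches the model's usable context
# window at load time. The rope_scaling dict mirrors the Hugging Face
# config convention; the factor here is illustrative.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    rope_scaling={"type": "dynamic", "factor": 2.0},
)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Summarize this very long document: ..."], params)
print(outputs[0].outputs[0].text)
```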

Embedding Models

  • Initial support for Embedding API with e5-mistral-7b-instruct (#3734); see the sketch after this list
  • Cross-attention KV caching and memory-management towards encoder-decoder model support (#4837)
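
A sketch of the new Embedding API through the offline entry point, assuming the `LLM.encode` method introduced for embedding models:

```python
from vllm import LLM

# Embedding models are served through an encode() path instead of
# generate(); each output carries a dense vector rather than text.
llm = LLM(model="intfloat/e5-mistral-7b-instruct")
outputs = llm.encode(["query: how does paged attention work?"])
embedding = outputs[0].outputs.embedding  # list[float]
print(f"embedding dimension: {len(embedding)}")
```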

Vision Language Model

  • Add base class for vision-language models (#4809)
  • Consolidate prompt arguments to LLM engines (#4328); see the sketch after this list
  • LLaVA model refactor (#4910)
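
A sketch of the consolidated prompt formats, assuming the dict-style inputs introduced by #4328 (the token IDs below are placeholders):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=16)

# The same generate() call now accepts both plain strings and
# dict-style inputs such as pre-tokenized prompts.
outputs = llm.generate(
    [
        "Hello, my name is",
        {"prompt_token_ids": [2, 31414, 6, 127]},  # placeholder token IDs
    ],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```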

Hardware Support

AMD

  • Add fused_moe Triton configs (#4951)
  • Add support for Punica kernels (#3140)
  • Extended the AMD test suite with Regression, Basic Correctness, Distributed, Engine, and LLaVA tests (#4797)

Production Engine

Batch API

  • Support OpenAI batch file format (#4794)
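
A sketch of the OpenAI-compatible batch input format: one JSON request per line (file name and model are placeholders):

```python
import json

# Each line of the batch file is a self-contained request in the
# OpenAI batch format: a custom_id, the target endpoint, and the body.
requests = [
    {
        "custom_id": "request-1",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "messages": [{"role": "user", "content": "Hello!"}],
        },
    },
]

with open("batch_input.jsonl", "w") as f:
    for request in requests:
        f.write(json.dumps(request) + "\n")
```

The file can then be processed offline with the batch runner from #4794, e.g. something like `python -m vllm.entrypoints.openai.run_batch -i batch_input.jsonl -o results.jsonl --model <model>`.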

Making Ray Optional

  • Add MultiprocessingGPUExecutor (#4539); see the sketch after this list
  • Eliminate parallel worker per-step task scheduling overhead (#4894)
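
A sketch of tensor-parallel inference without a Ray cluster, assuming the multiprocessing executor is selected via the `distributed_executor_backend` argument:

```python
from vllm import LLM

# With the multiprocessing-based executor, TP workers are spawned as
# local processes, so a single-node TP run no longer needs Ray.
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",
    tensor_parallel_size=2,
    distributed_executor_backend="mp",  # assumption: "mp" selects MultiprocessingGPUExecutor
)
```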

Automatic Prefix Caching

  • Accelerating the hashing function by avoiding deep copies (#4696)
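
A sketch of automatic prefix caching, which hashes the KV blocks of a shared prompt prefix so later requests reuse them (the faster hashing in #4696 speeds up exactly this step):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

# The shared prefix is hashed and its KV blocks cached once; the
# second request hits the cache instead of recomputing them.
prefix = "You are a concise assistant. Answer in one paragraph. "
params = SamplingParams(max_tokens=32)
outputs = llm.generate(
    [prefix + "Explain KV caching.", prefix + "Explain paged attention."],
    params,
)
```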

Speculative Decoding

  • CUDA graph support (#4295)
  • Enable TP>1 speculative decoding (#4840)
  • Improve n-gram efficiency (#4724)
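
A sketch combining the items above: n-gram (prompt-lookup) speculation under tensor parallelism, with argument names assumed from this release's speculative-decoding options:

```python
from vllm import LLM

# N-gram speculation proposes draft tokens by looking them up in the
# prompt itself, so no separate draft model is loaded; TP>1 is now
# supported for the scoring model.
llm = LLM(
    model="meta-llama/Llama-2-13b-hf",
    tensor_parallel_size=2,
    speculative_model="[ngram]",
    num_speculative_tokens=4,
    ngram_prompt_lookup_max=3,
    use_v2_block_manager=True,  # speculative decoding requires the v2 block manager here
)
```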

Performance Optimization

Quantization

  • Add GPTQ Marlin 2:4 sparse structured support (#4790); see the sketch after this list
  • Initial Activation Quantization Support (#4525)
  • Marlin prefill performance improvement (#4983)
  • Automatically Detect SparseML models (#5119)
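
A sketch of loading a Marlin-accelerated GPTQ checkpoint (the model name is a placeholder; a compatible checkpoint may also be upgraded to the Marlin path automatically):

```python
from vllm import LLM

# Requesting the gptq_marlin kernels explicitly; with quantization
# left unset, vLLM can auto-select Marlin when the hardware and
# checkpoint layout allow it.
llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",  # placeholder GPTQ checkpoint
    quantization="gptq_marlin",
)
```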

Better Attention Kernel

  • Use flash-attn for decoding (#3648)
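
The attention backend is normally auto-selected at startup; a sketch of pinning it explicitly via the backend environment variable:

```python
import os

# Force the flash-attn backend (now used for decoding as well); if
# unset, vLLM picks a backend based on hardware and model support.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-hf")
```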

FP8

  • Improve FP8 linear layer performance (#4691)
  • Add w8a8 CUTLASS kernels (#4749)
  • Support for CUTLASS kernels in CUDA graphs (#4954)
  • Load FP8 kv-cache scaling factors from checkpoints (#4893)
  • Make static FP8 scaling more robust (#4570)
  • Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535)
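
A sketch tying the FP8 items together: FP8 weights plus an FP8 KV cache, with scaling factors loaded from the checkpoint when it provides them (argument names per the engine options; the model is a placeholder):

```python
from vllm import LLM

# quantization="fp8" enables the FP8 linear layers (CUTLASS w8a8
# kernels where available); kv_cache_dtype="fp8" stores the KV cache
# in float8_e4m3, using checkpoint scaling factors when present.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder
    quantization="fp8",
    kv_cache_dtype="fp8",
)
```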

Optimize Distributed Communication

  • Change Python dict to PyTorch tensor (#4607); see the sketch after this list
  • Change Python dict to PyTorch tensor for blocks to swap (#4659)
  • Improve P2P access check (#4992)
  • Remove vllm-nccl (#5091)
  • Support both CPU and device tensors in broadcast_tensor_dict (#4660)
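
A sketch of the idea behind the dict-to-tensor changes: a Python dict must be pickled and broadcast object-by-object each step, while a packed tensor goes out as a single collective (values are illustrative):

```python
import torch

# Block-swap mapping as a Python dict: broadcasting it requires
# pickling, which costs time on every scheduler step.
blocks_to_swap = {3: 7, 5: 9}  # src block -> dst block

# Packed as an [N, 2] int64 tensor, the same mapping can be sent in
# one broadcast and read on the workers without deserialization.
packed = torch.tensor(sorted(blocks_to_swap.items()), dtype=torch.int64)
print(packed)  # tensor([[3, 7], [5, 9]])
```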

Extensible Architecture

Pipeline Parallelism

  • Refactor custom allreduce to support multiple TP groups (#4754)
  • Refactor pynccl to hold multiple communicators (#4591)
  • Support PP PyNCCL Groups (#4988)
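
A sketch of the grouping these refactors enable: with TP=2 and PP=2 over four ranks, each rank sits in one tensor-parallel group and one pipeline-parallel group, each backed by its own communicator (plain torch.distributed shown for illustration):

```python
import torch.distributed as dist

# Illustration only: run under a 4-rank launcher, e.g.
# torchrun --nproc-per-node=4 this_script.py
dist.init_process_group(backend="gloo")

# TP groups pair adjacent ranks; PP groups stride across them. Each
# group gets its own communicator, which is what holding multiple
# pynccl communicators makes possible inside vLLM.
tp_groups = [dist.new_group([0, 1]), dist.new_group([2, 3])]
pp_groups = [dist.new_group([0, 2]), dist.new_group([1, 3])]

rank = dist.get_rank()
my_tp = tp_groups[rank // 2]
my_pp = pp_groups[rank % 2]
```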

Full Changelog: v0.4.2...v0.4.3