Releases: vllm-project/vllm-spyre
v0.6.0
This release:
- 🎉 Supports embedding models on vLLM v1!
- 🔥 Removes all remaining support for vLLM v0
- ⚡ Contains performance and stability fixes for continuous batching
- ⚗️ Support for up to `--max-num-seqs 4 --max-model-len 8192 --tensor-parallel-size 4` has been tested on ibm-granite/granite-3.3-8b-instruct (see the sketch after this list)
- 📦 Officially supports vllm 0.9.2 and 0.10.0
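For reference, two minimal offline sketches of the features called out above, using vLLM's standard Python entrypoints. The flag values and granite model name come from the note; the embedding model, prompts, and everything else are illustrative assumptions, not a support guarantee.

```python
# Sketch of the configuration reported as tested above; the engine arguments
# mirror the CLI flags one-for-one. Example only, not a support guarantee.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ibm-granite/granite-3.3-8b-instruct",
    max_num_seqs=4,          # --max-num-seqs 4
    max_model_len=8192,      # --max-model-len 8192
    tensor_parallel_size=4,  # --tensor-parallel-size 4
)
outputs = llm.generate(["Hello from Spyre!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

```python
# Sketch of the newly supported embedding path on vLLM v1 via LLM.embed();
# the model name here is an arbitrary example.
from vllm import LLM

embedder = LLM(model="sentence-transformers/all-MiniLM-L6-v2", task="embed")
vectors = embedder.embed(["vLLM Spyre plugin"])
print(len(vectors[0].outputs.embedding))  # embedding dimension
```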
What's Changed
- [SB] relax constraint on min number of new tokens by @yannicks1 in #322
- [CB] bug fix: account for prefill token by @yannicks1 in #320
- Documents a bit CB script and tests by @sducouedic in #300
- 🧪 add long context test by @joerunde in #330
- [docs] Add install from PyPI to docs by @ckadner in #327
- ⬆️ bump base image by @joerunde in #328
- [ppc64le] Introduce ppc64le benchmarking scripts by @Daniel-Schenker in #311
- [CB] Override number of Spyre blocks: replace env var with top level argument by @yannicks1 in #331
- [CB] Add scheduling tests by @sducouedic in #329
- 🎨 add values in test asserts by @prashantgupta24 in #333
- [CB] Refactoring/Cleaning up prepare_prompt/decode by @yannicks1 in #335
- feat: enable FP8 quantized models loading by @rafvasq in #316
- ♻️ Compatibility with vllm main by @prashantgupta24 in #338
- V1 embeddings by @maxdebayser in #277
- feat: detect CPUs and configure threading sensibly by @tjohnson31415 in #291
- [CB] Support pseudo batch size 1 for decode, adjust warmup by @yannicks1 in #287
- fix introduced merge conflict on main by @yannicks1 in #345
- Add CB API tests on the correct use of max_tokens by @gmarinho2 in #339
- ♻️ fix vllm:main by @prashantgupta24 in #341
- [CB] Optimization: Reduce wastage in prefill compute and pad blocks in homogeneous continuous batching by @yannicks1 in #262
- [CI] Tests for graph comparison between vllm and AFTU by @wallashss in #286
- [CB] refactoring warmup for batch size 1 by @yannicks1 in #347
- [CB][Tests] Check output of scheduling tests on Spyre by @sducouedic in #337
- [v1] remove v0 code by @yannicks1 in #344
- ♻️ enable offline mode in GHA tests by @prashantgupta24 in #349
- ⬆️ bump base image with more CB fixes by @joerunde in #351
- Upstream compatibility tests by @maxdebayser in #343
- ⬆️ Bump locked vllm to 0.10.0 by @joerunde in #352
New Contributors
- @Daniel-Schenker made their first contribution in #311
Full Changelog: v0.5.3...v0.6.0
v0.5.3
This release contains test updates and fixes for continuous batching, and a small logging improvement
What's Changed
- make truncation of token lists optional in example script by @maxdebayser in #317
- [Fix][Tests] TP param used in tests unconditionally by @rafvasq in #315
- Print compile cache enablement along with warmup time by @sducouedic in #321
- ✅ add assertions for warmup mode context by @prashantgupta24 in #294
- fix off by one error by @maxdebayser in #324
- 🐛 fix cb online test by @joerunde in #326
- [CB] Update CB docs + Refactoring scheduling step-by-step inference tests by @sducouedic in #323
Full Changelog: v0.5.2...v0.5.3
v0.5.2
What's Changed
- add long context example by @maxdebayser in #304
- Integrate upstream logits processors by @maxdebayser in #290
- [Fix] Tests breaking with vLLM:Main by @sducouedic in #306
- [SB] fix order of warmup print by @yannicks1 in #309
- [Docs] Use mkdocstrings for CB tests by @rafvasq in #308
- [CB] Scheduling constraints regarding number of available blocks/pages by @yannicks1 in #261
- remove `VLLM_ENABLE_V1_MULTIPROCESSING` disabling by @sducouedic in #302
- 🐛 use right attention name by @joerunde in #310
- 🐛 Workaround ray issue in tests by @joerunde in #307
- 🐛 fix max tokens for continuous batching by @joerunde in #314
- ✅ add TP with CB test by @prashantgupta24 in #303
- removing legacy backward compatibility by @yannicks1 in #313
Full Changelog: v0.5.1...v0.5.2
v0.5.1
This release:
- Fixes Tensor parallel support for static batching
Known Issues
Tensor parallel support appears to still be broken for continuous batching
What's Changed
- [CB][Tests] Add CB online test and refactor multi tests by @rafvasq in #279
- [SB] parametrize offline examples by @yannicks1 in #298
- silence warning in pytest due to string conversion by @yannicks1 in #299
- 🐛 fix tensor parallel by @joerunde in #301
Full Changelog: v0.5.0...v0.5.1
v0.5.0
This release:
- Introduces breaking changes brought in by vllm upstream `0.9.2`
- Supports prompt logprobs with static batching
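As a rough illustration of the prompt logprobs feature, a minimal sketch using vLLM's standard SamplingParams; the model name, prompt, and values are placeholders.

```python
# Request log-probabilities for the prompt tokens under static batching.
from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-3.3-8b-instruct")
params = SamplingParams(max_tokens=16, prompt_logprobs=1)
out = llm.generate(["The capital of France is"], params)
print(out[0].prompt_logprobs)  # per-prompt-token logprob entries
```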
Known Issues
Tensor parallel support is broken; look for a bugfix release soon
What's Changed
- ✨ parameterize max-tokens by @joerunde in #282
- Respect vllm logging configs in vllm_spyre by @kazunoriogata in #281
- ✨ Support prompt logprobs with static batching by @joerunde in #274
- Duplicate the SamplingMetadata class by @maxdebayser in #278
- Update warmup log messages and comments by @tjohnson31415 in #284
- vllm main updates by @prashantgupta24 in #283
- [tests] add cb parameterization by @prashantgupta24 in #289
- 🐛 put decode back in warmup by @joerunde in #293
- Use VLLM_WORKER_MULTIPROC_METHOD=spawn instead of --forked for tests by @tjohnson31415 in #268
- Add get_max_output_tokens for class SpyrePlatform by @gmarinho2 in #179
- Small updates for cb tests by @joerunde in #285
- ⬆️ bump base image by @joerunde in #296
- 🐛 add pytest-forked dev dep back by @joerunde in #297
New Contributors
- @kazunoriogata made their first contribution in #281
Full Changelog: v0.4.1...v0.5.0
v0.4.1
This release:
- Includes a critical bugfix for batch handling with continuous batching
- Fixes a bug where the first prompt after warmup would take a long time with continuous batching
- Fixes a bug where canceling requests could crash the server
What's Changed
- [Priority merge] NewRequestData parameter introduced in vllm upstream by @sducouedic in #245
- Use hugging face as baseline to test CB output by @sducouedic in #240
- fix: avoid KeyError when cancelling requests that have not been processed by @tjohnson31415 in #233
- ✅ CB tests refactoring + adding batch test by @prashantgupta24 in #257
- [v0] replace current_platform with SpyrePlatform by @yannicks1 in #263
- 🍱 Swap tests to tiny granite by @joerunde in #264
- [CB] refactoring spyre model runner by @wallashss in #172
- [CB] remove reference to outdated fms feature branch by @yannicks1 in #269
- [CB] use used block ids for dummy batch size 2 by @yannicks1 in #259
- [CB] additional prefill in warmup to fix TTFT by @yannicks1 in #270
- [Docs] Remove xgrammar install step by @rafvasq in #275
- ⚗️ add more prompts and cpu validation by @joerunde in #276
Full Changelog: v0.4.0...v0.4.1
v0.4.0
This release:
- ➕ Adds support for ibm-fms 1.1.0
- ➕ Adds support for the latest compiler updates in the newest base image
- ❗ Removes v0 support for text generation
- ⚗️ Adds (very experimental) support for continuous batching mode on spyre hardware
This release is not compatible with `vllm==0.9.1`, read more details here
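A minimal sketch of opting into the experimental continuous batching mode described above. `VLLM_SPYRE_USE_CB` is assumed to be the plugin's opt-in switch here, so check the vllm-spyre docs for the current name; the model and limits are examples only.

```python
# Opt into experimental continuous batching before the engine is constructed.
# VLLM_SPYRE_USE_CB is an assumed switch name; consult the plugin docs.
import os

os.environ["VLLM_SPYRE_USE_CB"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-3.3-8b-instruct", max_model_len=2048)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```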
What's Changed
- [CI] Don't skip tests when `uv.lock` is updated by @ckadner in #221
- [CI] Use `uv` for type-check by @ckadner in #222
- ✨ add top-level spyre version by @prashantgupta24 in #224
- [CB] parametrize example script by @sducouedic in #228
- Clean up examples and PR template by @rafvasq in #227
- 🔥🔧 Remove environment variables specific to hardware conf by @gkumbhat in #229
- [CI] Only build docker image on source changes by @ckadner in #220
- [CB] remove VLLM_SPYRE_RM_PADDED_BLOCKS, enable the feature by default by @yannicks1 in #231
- [do not merge][CB] get number of blocks from compiler mock implementation by @yannicks1 in #205
- Exclude vllm v0.9.1 as an allowed version due to breaking bug by @tjohnson31415 in #232
- 🐛 add initialize_cache for v1 worker by @prashantgupta24 in #237
- [tests][CB][SB] minor refactoring of test by @yannicks1 in #239
- 📝 update deployment examples, add kserve by @joerunde in #226
- [Test] CB rejects requests longer than max length by @rafvasq in #236
- [FIX] lazy import of SpyreCausalLM to avoid issues with pytest-forked by @wallashss in #238
- [docs] add debugging docs by @prashantgupta24 in #235
- Support both paged and non-paged attention by @yannicks1 in #162
- [refact] Remove V0 tests by @wallashss in #241
- 🥅 disable v0 decoders by @joerunde in #242
- 🐛 fix runtime msg by @prashantgupta24 in #244
- 🐛 fixed static batch warmup by @joerunde in #246
- ⬆️ upgrade base image for release by @joerunde in #250
- Remove unused DT_OPT by @joerunde in #251
Full Changelog: v0.3.1...v0.4.0
v0.3.1
This bugfix release addresses two important issues:
- Fixes a configuration bug with tensor-parallel inference on the public `quay.io/ibm-aiu/vllm-spyre` image, causing 0.3.0 to fail
- Fixes a bug where full static batches of long prompts could not be scheduled
What's Changed
- 🔖 Add release trigger for docker build by @joerunde in #203
- Add PR template by @rafvasq in #201
- [CI] Skip tests on doc changes by @ckadner in #193
- [FIX] Suppression of stacktrace on a shutdown by @wallashss in #187
- Update README.md by @ckadner in #194
- Update vllm dependency to >=v0.9.0 by @sducouedic in #208
- Create CODEOWNERS file by @ckadner in #197
- [DOCS] replace calendar emoji in supported features by @wallashss in #207
- 🐛 fix for upstream change by @prashantgupta24 in #210
- ✅ add more TP sizes to tests by @joerunde in #209
- 🐛 fix static scheduling issues with long prompts by @joerunde in #206
- [CB] Test continuous batching through the scheduler by @sducouedic in #199
- Deprecate sendnn_decoder in favor of sendnn with warmup_mode by @tjohnson31415 in #186
- Don't allow warmup shapes that exceed the max sequence length by @maxdebayser in #185
- 🐛 add required ignore modules for tensor parallel by @joerunde in #212
- 🐛 fetch tags for versioning in docker build by @joerunde in #214
- ⬆️ Update locked packages by @joerunde in #213
- [CB] add min batch size of 2 in decode by @nikolaospapandreou in #182
- [CB] refactor left padding removal by @yannicks1 in #211
New Contributors
- @maxdebayser made their first contribution in #185
Full Changelog: v0.3.0...v0.3.1
v0.3.0
This release:
- Updates vLLM compatibility to 0.9.0.1
- Adds vllm profiler support
- Supports multi-spyre setups with tensor parallel out of the box
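A minimal sketch of the multi-spyre tensor parallel setup mentioned above, using the standard vLLM engine argument; the parallel size and model name are example values.

```python
# Shard the model across multiple Spyre cards with the standard vLLM argument.
from vllm import LLM

llm = LLM(
    model="ibm-granite/granite-3.3-8b-instruct",
    tensor_parallel_size=2,  # example value; match your number of cards
)
```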
What's Changed
- 🔥 remove supported models by @joerunde in #163
- [CB] test scheduler tkv by @sducouedic in #156
- ✨ add doc lint fix by @prashantgupta24 in #164
- [CI/CD] mark current tests that are failing when compile cache is enabled by @sducouedic in #171
- [Docs] Add overview and examples by @rafvasq in #159
- Disable compile cache in test_spyre_warmup_shapes by @sducouedic in #174
- [Docs] 🐛 fix lint command by @prashantgupta24 in #170
- ⬆️ bump fms to 1.0 by @joerunde in #169
- 🎨 use sync with inexact instead for lint fix? by @prashantgupta24 in #175
- 🐛 ignore shell script files within .venv for shellcheck command by @prashantgupta24 in #177
- Revert commit (disabling caching in test doesn't work) by @sducouedic in #180
- 📝 docs for local development on CPU by @prashantgupta24 in #161
- [Docs] Supported Features by @wallashss in #178
- 📝 Build and run docs by @joerunde in #160
- [CI] Minor cleanup and more consistent workflow names by @ckadner in #158
- [profiler] support PyTorch profiler enablement by @mcalman in #176
- [CI] Ignore docs for tests by @wallashss in #181
- [Docs] Update contributing docs by @rafvasq in #184
- 🐛 fix for upstream compatibility - use LLM.embed() instead for embeddings by @prashantgupta24 in #188
- Simplify spyre_setup.py and fix distributed setup by @tjohnson31415 in #190
- [Docs] Migrate from Sphinx to MkDocs by @rafvasq in #189
- 🐛 fix return type for update_from_output by @prashantgupta24 in #192
- 🔥 remove unused dd2 target by @joerunde in #196
- [Docs] Update main README.md by @rafvasq in #200
- Updates for release prep by @joerunde in #202
New Contributors
Full Changelog: v0.2.0...v0.3.0
v0.2.0
This release:
- Updates vllm compatibility to ~=0.8.5
- Adds support for sampling parameters for continuous batching
- Uses standard vllm config for continuous batching parameters
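To illustrate the sampling parameters now honored under continuous batching, a minimal sketch with vLLM's standard SamplingParams; the values are arbitrary.

```python
# Standard vLLM sampling parameters, now respected by continuous batching.
from vllm import SamplingParams

params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
```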
What's Changed
- Fixes to get things working with upstream main by @tjohnson31415 in #123
- [CB] proper cleanup after warmup by @yannicks1 in #130
- Refactor ModelInputForSpyre dataclass by @prashantgupta24 in #107
- 🔥 Bump vLLM and remove scheduler rejection logic by @joerunde in #132
- 📝 Add release docs by @joerunde in #124
- [CB] supporting prompts spanning multiple blocks by @yannicks1 in #128
- [CB] strip repeated left padding on batch level by @yannicks1 in #131
- [CB] Scheduler constraints by @sducouedic in #129
- [CB] remove unnecessary marked dynamic dimensions by @yannicks1 in #135
- ♻️ fix needed for vllm main by @prashantgupta24 in #138
- 🐛 Fix new token minimum requirement error message by @gkumbhat in #137
- [CB] 🔥 remove cb env vars by @prashantgupta24 in #114
- fix: Add validation and test for prompt len % 64 by @rafvasq in #139
- Fix local development for vllm==0.8.5 by @wallashss in #140
- [Docs] Publish documentation site by @rafvasq in #141
- Fix test pypi publications by @joerunde in #144
- [CB] Continuous batching support on spyre input batch by @wallashss in #126
- 🐛 set default value for tkv by @prashantgupta24 in #153
- [CB ] e2e continuous batching tests by @prashantgupta24 in #79
- [CB] Update spyre model runner for new spyre input batch by @wallashss in #127
- Add options to torch.compile by @tdoublep in #149
- Add OS-related docs by @rafvasq in #152
- 📝 Document plugin configuration by @joerunde in #157
New Contributors
Full Changelog: v0.1.0...v0.2.0