Releases: vllm-project/vllm-spyre
v0.6.0
This release:
- 🎉 Supports embedding models on vLLM v1!
- 🔥 Removes all remaining support for vLLM v0
- ⚡ Contains performance and stability fixes for continuous batching
- ⚗️ Support for up to `--max-num-seqs 4 --max-model-len 8192 --tensor-parallel-size 4` has been tested on ibm-granite/granite-3.3-8b-instruct (see the sketch after this list)
- 📦 Officially supports vllm 0.9.2 and 0.10.0
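For reference, two minimal offline sketches of the features called out above, using vLLM's standard Python entrypoints. The flag values and granite model name come from the note; the embedding model, prompts, and everything else are illustrative assumptions, not a support guarantee.

```python
# Sketch of the configuration reported as tested above; the engine arguments
# mirror the CLI flags one-for-one. Example only, not a support guarantee.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ibm-granite/granite-3.3-8b-instruct",
    max_num_seqs=4,          # --max-num-seqs 4
    max_model_len=8192,      # --max-model-len 8192
    tensor_parallel_size=4,  # --tensor-parallel-size 4
)
outputs = llm.generate(["Hello from Spyre!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

```python
# Sketch of the newly supported embedding path on vLLM v1 via LLM.embed();
# the model name here is an arbitrary example.
from vllm import LLM

embedder = LLM(model="sentence-transformers/all-MiniLM-L6-v2", task="embed")
vectors = embedder.embed(["vLLM Spyre plugin"])
print(len(vectors[0].outputs.embedding))  # embedding dimension
```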
What's Changed
- [SB] relax constraint on min number of new tokens by @yannicks1 in #322
- [CB] bug fix: account for prefill token by @yannicks1 in #320
- Documents a bit CB script and tests by @sducouedic in #300
- 🧪 add long context test by @joerunde in #330
- [docs] Add install from PyPI to docs by @ckadner in #327
- ⬆️ bump base image by @joerunde in #328
- [ppc64le] Introduce ppc64le benchmarking scripts by @Daniel-Schenker in #311
- [CB] Override number of Spyre blocks: replace env var with top level argument by @yannicks1 in #331
- [CB] Add scheduling tests by @sducouedic in #329
- 🎨 add values in test asserts by @prashantgupta24 in #333
- [CB] Refactoring/Cleaning up prepare_prompt/decode by @yannicks1 in #335
- feat: enable FP8 quantized models loading by @rafvasq in #316
- ♻️ Compatibility with vllm main by @prashantgupta24 in #338
- V1 embeddings by @maxdebayser in #277
- feat: detect CPUs and configure threading sensibly by @tjohnson31415 in #291
- [CB] Support pseudo batch size 1 for decode, adjust warmup by @yannicks1 in #287
- fix introduced merge conflict on main by @yannicks1 in #345
- Add CB API tests on the correct use of max_tokens by @gmarinho2 in #339
- ♻️ fix vllm:main by @prashantgupta24 in #341
- [CB] Optimization: Reduce wastage in prefill compute and pad blocks in homogeneous continuous batching by @yannicks1 in #262
- [CI] Tests for graph comparison between vllm and AFTU by @wallashss in #286
- [CB] refactoring warmup for batch size 1 by @yannicks1 in #347
- [CB][Tests] Check output of scheduling tests on Spyre by @sducouedic in #337
- [v1] remove v0 code by @yannicks1 in #344
- ♻️ enable offline mode in GHA tests by @prashantgupta24 in #349
- ⬆️ bump base image with more CB fixes by @joerunde in #351
- Upstream compatibility tests by @maxdebayser in #343
- ⬆️ Bump locked vllm to 0.10.0 by @joerunde in #352
New Contributors
- @Daniel-Schenker made their first contribution in #311
Full Changelog: v0.5.3...v0.6.0
v0.5.3
This release contains test updates and fixes for continuous batching, and a small logging improvement
What's Changed
- make truncation of token lists optional in example script by @maxdebayser in #317
- [Fix][Tests] TP param used in tests unconditionally by @rafvasq in #315
- Print compile cache enablement along with warmup time by @sducouedic in #321
- ✅ add assertions for warmup mode context by @prashantgupta24 in #294
- fix off by one error by @maxdebayser in #324
- 🐛 fix cb online test by @joerunde in #326
- [CB] Update CB docs + Refactoring scheduling step-by-step inference tests by @sducouedic in #323
Full Changelog: v0.5.2...v0.5.3
v0.5.2
What's Changed
- add long context example by @maxdebayser in #304
- Integrate upstream logits processors by @maxdebayser in #290
- [Fix] Tests breaking with vLLM:Main by @sducouedic in #306
- [SB] fix order of warmup print by @yannicks1 in #309
- [Docs] Use mkdocstrings for CB tests by @rafvasq in #308
- [CB] Scheduling constraints regarding number of available blocks/pages by @yannicks1 in #261
- remove `VLLM_ENABLE_V1_MULTIPROCESSING` disabling by @sducouedic in #302
- 🐛 use right attention name by @joerunde in #310
- 🐛 Workaround ray issue in tests by @joerunde in #307
- 🐛 fix max tokens for continuous batching by @joerunde in #314
- ✅ add TP with CB test by @prashantgupta24 in #303
- removing legacy backward compatibility by @yannicks1 in #313
Full Changelog: v0.5.1...v0.5.2
v0.5.1
This release:
- Fixes Tensor parallel support for static batching
Known Issues
Tensor parallel support appears to still be broken for continuous batching
What's Changed
- [CB][Tests] Add CB online test and refactor multi tests by @rafvasq in #279
- [SB] parametrize offline examples by @yannicks1 in #298
- silence warning in pytest due to string conversion by @yannicks1 in #299
- 🐛 fix tensor parallel by @joerunde in #301
Full Changelog: v0.5.0...v0.5.1
v0.5.0
This release:
- Introduces breaking changes brought in by vllm upstream `0.9.2`
- Supports prompt logprobs with static batching
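As a rough illustration of the prompt logprobs feature, a minimal sketch using vLLM's standard SamplingParams; the model name, prompt, and values are placeholders.

```python
# Request log-probabilities for the prompt tokens under static batching.
from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-3.3-8b-instruct")
params = SamplingParams(max_tokens=16, prompt_logprobs=1)
out = llm.generate(["The capital of France is"], params)
print(out[0].prompt_logprobs)  # per-prompt-token logprob entries
```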
Known Issues
Tensor parallel support is broken; look for a bugfix release soon
What's Changed
- ✨ parameterize max-tokens by @joerunde in #282
- Respect vllm logging configs in vllm_spyre by @kazunoriogata in #281
- ✨ Support prompt logprobs with static batching by @joerunde in #274
- Duplicate the SamplingMetadata class by @maxdebayser in #278
- Update warmup log messages and comments by @tjohnson31415 in #284
- vllm main updates by @prashantgupta24 in #283
- [tests] add cb parameterization by @prashantgupta24 in #289
- 🐛 put decode back in warmup by @joerunde in #293
- Use VLLM_WORKER_MULTIPROC_METHOD=spawn instead of --forked for tests by @tjohnson31415 in #268
- Add get_max_output_tokens for class SpyrePlatform by @gmarinho2 in #179
- Small updates for cb tests by @joerunde in #285
- ⬆️ bump base image by @joerunde in #296
- 🐛 add pytest-forked dev dep back by @joerunde in #297
New Contributors
- @kazunoriogata made their first contribution in #281
Full Changelog: v0.4.1...v0.5.0
v0.4.1
This release:
- Includes a critical bugfix for batch handling with continuous batching
- Fixes a bug where the first prompt after warmup would take a long time with continuous batching
- Fixes a bug where canceling requests could crash the server
What's Changed
- [Priority merge] NewRequestData parameter introduced in vllm upstream by @sducouedic in #245
- Use hugging face as baseline to test CB output by @sducouedic in #240
- fix: avoid KeyError when cancelling requests that have not been processed by @tjohnson31415 in #233
- ✅ CB tests refactoring + adding batch test by @prashantgupta24 in #257
- [v0] replace current_platform with SpyrePlatform by @yannicks1 in #263
- 🍱 Swap tests to tiny granite by @joerunde in #264
- [CB] refactoring spyre model runner by @wallashss in #172
- [CB] remove reference to outdated fms feature branch by @yannicks1 in #269
- [CB] use used block ids for dummy batch size 2 by @yannicks1 in #259
- [CB] additional prefill in warmup to fix TTFT by @yannicks1 in #270
- [Docs] Remove xgrammar install step by @rafvasq in #275
- ⚗️ add more prompts and cpu validation by @joerunde in #276
Full Changelog: v0.4.0...v0.4.1
v0.4.0
This release:
- ➕ Adds support for ibm-fms 1.1.0
- ➕ Adds support for the latest compiler updates in the newest base image
- ❗ Removes v0 support for text generation
- ⚗️ Adds (very experimental) support for continuous batching mode on spyre hardware
This release is not compatible with `vllm==0.9.1`, read more details here
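A minimal sketch of opting into the experimental continuous batching mode described above. `VLLM_SPYRE_USE_CB` is assumed to be the plugin's opt-in switch here, so check the vllm-spyre docs for the current name; the model and limits are examples only.

```python
# Opt into experimental continuous batching before the engine is constructed.
# VLLM_SPYRE_USE_CB is an assumed switch name; consult the plugin docs.
import os

os.environ["VLLM_SPYRE_USE_CB"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-3.3-8b-instruct", max_model_len=2048)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```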
What's Changed
- [CI] Don't skip tests when `uv.lock` is updated by @ckadner in #221
- [CI] Use `uv` for type-check by @ckadner in #222
- ✨ add top-level spyre version by @prashantgupta24 in #224
- [CB] parametrize example script by @sducouedic in #228
- Clean up examples and PR template by @rafvasq in #227
- 🔥🔧 Remove environment variables specific to hardware conf by @gkumbhat in #229
- [CI] Only build docker image on source changes by @ckadner in #220
- [CB] remove VLLM_SPYRE_RM_PADDED_BLOCKS, enable the feature by default by @yannicks1 in #231
- [do not merge][CB] get number of blocks from compiler mock implementation by @yannicks1 in #205
- Exclude vllm v0.9.1 as an allowed version due to breaking bug by @tjohnson31415 in #232
- 🐛 add initialize_cache for v1 worker by @prashantgupta24 in #237
- [tests][CB][SB] minor refactoring of test by @yannicks1 in #239
- 📝 update deployment examples, add kserve by @joerunde in #226
- [Test] CB rejects requests longer than max length by @rafvasq in #236
- [FIX] lazy import of SpyreCausalLM to avoid issues with pytest-forked by @wallashss in #238
- [docs] add debugging docs by @prashantgupta24 in #235
- Support both paged and non-paged attention by @yannicks1 in #162
- [refact] Remove V0 tests by @wallashss in #241
- 🥅 disable v0 decoders by @joerunde in #242
- 🐛 fix runtime msg by @prashantgupta24 in #244
- 🐛 fixed static batch warmup by @joerunde in #246
- ⬆️ upgrade base image for release by @joerunde in #250
- Remove unused DT_OPT by @joerunde in #251
Full Changelog: v0.3.1...v0.4.0
v0.3.1
This bugfix release addresses two important issues:
- Fixes a configuration bug with tensor-parallel inference on the public `quay.io/ibm-aiu/vllm-spyre` image, causing 0.3.0 to fail
- Fixes a bug where full static batches of long prompts could not be scheduled
What's Changed
- 🔖 Add release trigger for docker build by @joerunde in #203
- Add PR template by @rafvasq in #201
- [CI] Skip tests on doc changes by @ckadner in #193
- [FIX] Suppression of stacktrace on a shutdown by @wallashss in #187
- Update README.md by @ckadner in #194
- Update vllm dependency to >=v0.9.0 by @sducouedic in #208
- Create CODEOWNERS file by @ckadner in #197
- [DOCS] replace calendar emoji in supported features by @wallashss in #207
- 🐛 fix for upstream change by @prashantgupta24 in #210
- ✅ add more TP sizes to tests by @joerunde in #209
- 🐛 fix static scheduling issues with long prompts by @joerunde in #206
- [CB] Test continuous batching through the scheduler by @sducouedic in #199
- Deprecate sendnn_decoder in favor of sendnn with warmup_mode by @tjohnson31415 in #186
- Don't allow warmup shapes that exceed the max sequence length by @maxdebayser in #185
- 🐛 add required ignore modules for tensor parallel by @joerunde in #212
- 🐛 fetch tags for versioning in docker build by @joerunde in #214
- ⬆️ Update locked packages by @joerunde in #213
- [CB] add min batch size of 2 in decode by @nikolaospapandreou in #182
- [CB] refactor left padding removal by @yannicks1 in #211
New Contributors
- @maxdebayser made their first contribution in #185
Full Changelog: v0.3.0...v0.3.1
v0.3.0
This release:
- Updates vLLM compatibility to 0.9.0.1
- Adds vllm profiler support
- Supports multi-spyre setups with tensor parallel out of the box
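A minimal sketch of the multi-spyre tensor parallel setup mentioned above, using the standard vLLM engine argument; the parallel size and model name are example values.

```python
# Shard the model across multiple Spyre cards with the standard vLLM argument.
from vllm import LLM

llm = LLM(
    model="ibm-granite/granite-3.3-8b-instruct",
    tensor_parallel_size=2,  # example value; match your number of cards
)
```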
What's Changed
- 🔥 remove supported models by @joerunde in #163
- [CB] test scheduler tkv by @sducouedic in #156
- ✨ add doc lint fix by @prashantgupta24 in #164
- [CI/CD] mark current tests that are failing when compile cache is enabled by @sducouedic in #171
- [Docs] Add overview and examples by @rafvasq in #159
- Disable compile cache in test_spyre_warmup_shapes by @sducouedic in #174
- [Docs] 🐛 fix lint command by @prashantgupta24 in #170
- ⬆️ bump fms to 1.0 by @joerunde in #169
- 🎨 use sync with inexact instead for lint fix? by @prashantgupta24 in #175
- 🐛 ignore shell script files within .venv for shellcheck command by @prashantgupta24 in #177
- Revert commit (disabling caching in test doesn't work) by @sducouedic in #180
- 📝 docs for local development on CPU by @prashantgupta24 in #161
- [Docs] Supported Features by @wallashss in #178
- 📝 Build and run docs by @joerunde in #160
- [CI] Minor cleanup and more consistent workflow names by @ckadner in #158
- [profiler] support PyTorch profiler enablement by @mcalman in #176
- [CI] Ignore docs for tests by @wallashss in #181
- [Docs] Update contributing docs by @rafvasq in #184
- 🐛 fix for upstream compatibility - use LLM.embed() instead for embeddings by @prashantgupta24 in #188
- Simplify spyre_setup.py and fix distributed setup by @tjohnson31415 in #190
- [Docs] Migrate from Sphinx to MkDocs by @rafvasq in #189
- 🐛 fix return type for update_from_output by @prashantgupta24 in #192
- 🔥 remove unused dd2 target by @joerunde in #196
- [Docs] Update main README.md by @rafvasq in #200
- Updates for release prep by @joerunde in #202
New Contributors
Full Changelog: v0.2.0...v0.3.0
v0.2.0
This release:
- Updates vllm compatibility to ~=0.8.5
- Adds support for sampling parameters for continuous batching
- Uses standard vllm config for continuous batching parameters
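To illustrate the sampling parameters now honored under continuous batching, a minimal sketch with vLLM's standard SamplingParams; the values are arbitrary.

```python
# Standard vLLM sampling parameters, now respected by continuous batching.
from vllm import SamplingParams

params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
```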
What's Changed
- Fixes to get things working with upstream main by @tjohnson31415 in #123
- [CB] proper cleanup after warmup by @yannicks1 in #130
- Refactor ModelInputForSpyre dataclass by @prashantgupta24 in #107
- 🔥 Bump vLLM and remove scheduler rejection logic by @joerunde in #132
- 📝 Add release docs by @joerunde in #124
- [CB] supporting prompts spanning multiple blocks by @yannicks1 in #128
- [CB] strip repeated left padding on batch level by @yannicks1 in #131
- [CB] Scheduler constraints by @sducouedic in #129
- [CB] remove unnecessary marked dynamic dimensions by @yannicks1 in #135
- ♻️ fix needed for vllm main by @prashantgupta24 in #138
- 🐛 Fix new token minimum requirement error message by @gkumbhat in #137
- [CB] 🔥 remove cb env vars by @prashantgupta24 in #114
- fix: Add validation and test for prompt len % 64 by @rafvasq in #139
- Fix local development for vllm==0.8.5 by @wallashss in #140
- [Docs] Publish documentation site by @rafvasq in #141
- Fix test pypi publications by @joerunde in #144
- [CB] Continuous batching support on spyre input batch by @wallashss in #126
- 🐛 set default value for tkv by @prashantgupta24 in #153
- [CB ] e2e continuous batching tests by @prashantgupta24 in #79
- [CB] Update spyre model runner for new spyre input batch by @wallashss in #127
- Add options to torch.compile by @tdoublep in #149
- Add OS-related docs by @rafvasq in #152
- 📝 Document plugin configuration by @joerunde in #157
New Contributors
Full Changelog: v0.1.0...v0.2.0