Update on the development branch #1599
kaiyux announced in Announcements
Hi,
The TensorRT-LLM team is pleased to announce that we are pushing an update to the development branch (and the Triton backend) on May 14, 2024.
This update includes:
- Weight stripping: `trtllm-refit` is added, see `examples/sample_weight_stripping/README.md`.
- Weight streaming: see `docs/source/advanced/weight-streaming.md`.
- Enhanced `ModelRunnerCpp` so that it runs with the `executor` API for IFB-compatible models (a sketch follows this list).
- Unified the `SchedulerPolicy` with the same name in `batch_scheduler` and `executor`, and renamed it to `CapacitySchedulerPolicy`.
- Expanded the scheduling configuration from `SchedulerPolicy` to `SchedulerConfig` to enhance extensibility. The latter also introduces a chunk-based configuration called `ContextChunkingPolicy` (a sketch follows this list).
- Removed the `use_context_fmha_for_generation` argument from the `trtllm-build` command since it is not used anymore.
- The input prompt is no longer included in the output of the `generate()` and `generate_async()` APIs. For example, given a prompt `A B`, the generation result used to be `<s>A B C D E`, where only `C D E` is the actual output; the result is now `C D E` (a sketch follows this list).
- Switched the default `add_special_token` in the TensorRT-LLM backend to `True`, so that the `add_special_tokens`/`skip_special_tokens` defaults align with the Hugging Face setting (triton-inference-server/tensorrtllm_backend#446, thanks to the contribution from @XiaobingSuper). The changes are integrated in triton-inference-server/tensorrtllm_backend#454.
- `GptSession` and `TrtGptModelV1` are marked as deprecated.
- Set the default `tokens_per_block` argument of the `trtllm-build` command to 64 for better performance.
- The `multiple_profiles` argument of the `trtllm-build` command now builds more optimization profiles for better performance.
- Added documentation for KV cache reuse, see `docs/source/kv_cache_reuse.md`.
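For readers updating their scripts, here is a minimal sketch of driving an IFB-compatible engine through `ModelRunnerCpp`. The engine directory and tokenizer name are placeholders, and the exact keyword arguments may differ between versions, so treat this as an outline rather than a definitive recipe:

```python
# Sketch: running an IFB-compatible engine via ModelRunnerCpp, which now
# executes on top of the executor API. Paths and the model name below are
# placeholders; check your installed version for the exact signatures.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunnerCpp

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder model
runner = ModelRunnerCpp.from_dir(engine_dir="./llama_engine")          # placeholder path

# Batch of tokenized prompts as a list of 1-D int32 tensors.
batch_input_ids = [
    torch.tensor(tokenizer.encode("Hello, my name is"), dtype=torch.int32)
]

with torch.no_grad():
    outputs = runner.generate(
        batch_input_ids,
        max_new_tokens=32,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    )

# outputs has shape [batch, num_beams, seq_len]; decode the first beam.
print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))
```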
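As a rough illustration of the renamed scheduling types, the sketch below configures an executor with a `SchedulerConfig`. The constructor arguments and enum members shown here are assumptions inferred from the names above and may vary by version:

```python
# Sketch of the refactored scheduling configuration in the executor API.
# The keyword arguments (capacity_scheduler_policy, context_chunking_policy)
# and enum members are assumptions; verify them against your version.
import tensorrt_llm.bindings.executor as trtllm_exec

scheduler_config = trtllm_exec.SchedulerConfig(
    capacity_scheduler_policy=trtllm_exec.CapacitySchedulerPolicy.GUARANTEED_NO_EVICT,
    context_chunking_policy=trtllm_exec.ContextChunkingPolicy.FIRST_COME_FIRST_SERVED,
)

executor_config = trtllm_exec.ExecutorConfig(scheduler_config=scheduler_config)
executor = trtllm_exec.Executor(
    "./llama_engine",                      # placeholder engine path
    trtllm_exec.ModelType.DECODER_ONLY,
    executor_config,
)
```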
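To make the `generate()` output change concrete, here is a runnable toy illustration using literal strings rather than actual API calls; only the before/after shape of the result is the point:

```python
# Purely illustrative: the shape of results returned by generate()
# before and after this update, using literal strings.
prompt = "A B"

old_style_result = "<s>A B C D E"   # prompt (and BOS token) echoed back
new_style_result = "C D E"          # only the newly generated tokens

# Code that previously stripped the echoed prompt by hand, e.g.:
stripped = old_style_result.split(prompt, 1)[-1].strip()
assert stripped == new_style_result  # such stripping is no longer needed
```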
Thanks,
The TensorRT-LLM Engineering Team