Update on the development branch #1599
kaiyux announced in Announcements
Hi,
The TensorRT-LLM team is pleased to announce that we are pushing an update to the development branch (and the Triton backend) on May 14, 2024.
This update includes:
- Weight stripping: `trtllm-refit` is added, see `examples/sample_weight_stripping/README.md`.
- Weight streaming: see `docs/source/advanced/weight-streaming.md`.
- Enhanced `ModelRunnerCpp` so that it runs with the `executor` API for IFB-compatible models (a sketch follows this list).
- Unified the `SchedulerPolicy` with the same name in `batch_scheduler` and `executor`, and renamed it to `CapacitySchedulerPolicy`.
- Expanded the scheduling configuration from `SchedulerPolicy` to `SchedulerConfig` to enhance extensibility. The latter also introduces a chunk-based configuration called `ContextChunkingPolicy` (a sketch follows this list).
- Removed the `use_context_fmha_for_generation` argument from the `trtllm-build` command since it is not used anymore.
- The input prompt is no longer included in the output of the `generate()` and `generate_async()` APIs. For example, given a prompt `A B`, the generation result used to be `<s>A B C D E`, where only `C D E` is the actual output; the result is now `C D E` (a sketch follows this list).
- Switched the default `add_special_token` in the TensorRT-LLM backend to `True`, so that the `add_special_tokens`/`skip_special_tokens` defaults align with the Hugging Face setting (triton-inference-server/tensorrtllm_backend#446, thanks to the contribution from @XiaobingSuper). The changes are integrated in triton-inference-server/tensorrtllm_backend#454.
- `GptSession` and `TrtGptModelV1` are marked as deprecated.
- Set the default `tokens_per_block` argument of the `trtllm-build` command to 64 for better performance.
- The `multiple_profiles` argument of the `trtllm-build` command now builds more optimization profiles for better performance.
- Added documentation for KV cache reuse, see `docs/source/kv_cache_reuse.md`.
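For readers updating their scripts, here is a minimal sketch of driving an IFB-compatible engine through `ModelRunnerCpp`. The engine directory and tokenizer name are placeholders, and the exact keyword arguments may differ between versions, so treat this as an outline rather than a definitive recipe:

```python
# Sketch: running an IFB-compatible engine via ModelRunnerCpp, which now
# executes on top of the executor API. Paths and the model name below are
# placeholders; check your installed version for the exact signatures.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunnerCpp

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder model
runner = ModelRunnerCpp.from_dir(engine_dir="./llama_engine")          # placeholder path

# Batch of tokenized prompts as a list of 1-D int32 tensors.
batch_input_ids = [
    torch.tensor(tokenizer.encode("Hello, my name is"), dtype=torch.int32)
]

with torch.no_grad():
    outputs = runner.generate(
        batch_input_ids,
        max_new_tokens=32,
        end_id=tokenizer.eos_token_id,
        pad_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
    )

# outputs has shape [batch, num_beams, seq_len]; decode the first beam.
print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))
```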
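As a rough illustration of the renamed scheduling types, the sketch below configures an executor with a `SchedulerConfig`. The constructor arguments and enum members shown here are assumptions inferred from the names above and may vary by version:

```python
# Sketch of the refactored scheduling configuration in the executor API.
# The keyword arguments (capacity_scheduler_policy, context_chunking_policy)
# and enum members are assumptions; verify them against your version.
import tensorrt_llm.bindings.executor as trtllm_exec

scheduler_config = trtllm_exec.SchedulerConfig(
    capacity_scheduler_policy=trtllm_exec.CapacitySchedulerPolicy.GUARANTEED_NO_EVICT,
    context_chunking_policy=trtllm_exec.ContextChunkingPolicy.FIRST_COME_FIRST_SERVED,
)

executor_config = trtllm_exec.ExecutorConfig(scheduler_config=scheduler_config)
executor = trtllm_exec.Executor(
    "./llama_engine",                      # placeholder engine path
    trtllm_exec.ModelType.DECODER_ONLY,
    executor_config,
)
```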
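To make the `generate()` output change concrete, here is a runnable toy illustration using literal strings rather than actual API calls; only the before/after shape of the result is the point:

```python
# Purely illustrative: the shape of results returned by generate()
# before and after this update, using literal strings.
prompt = "A B"

old_style_result = "<s>A B C D E"   # prompt (and BOS token) echoed back
new_style_result = "C D E"          # only the newly generated tokens

# Code that previously stripped the echoed prompt by hand, e.g.:
stripped = old_style_result.split(prompt, 1)[-1].strip()
assert stripped == new_style_result  # such stripping is no longer needed
```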
Thanks,
The TensorRT-LLM Engineering Team