TensorRT-LLM 0.9.0 Release #1451
kaiyux announced in Announcements
Hi,
We are very pleased to announce the 0.9.0 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.
This update includes:
Key features and enhancements:

- Added support for multimodal models; see `examples/multimodal`
- Support for `early_stopping=False` in beam search for the C++ Runtime
- Added support for the HuggingFace `transformers` Gemma implementation (#1147)
- Support for running `GptSession` without OpenMPI (#1220)
- Added the `executor` API:
  - Python bindings for the `executor` C++ API; see `examples/bindings/README.md`
  - Documentation for the `executor` API; see `docs/source/executor.md`
- Refactored the high-level API (see `examples/high-level-api/README.md` for guidance; a hedged usage sketch also follows this list):
  - Reuses the `QuantConfig` used in the `trtllm-build` tool, supporting broader quantization features
  - Supports the `LLM()` API accepting engines built by the `trtllm-build` command
  - Added `SamplingConfig`, used in the `LLM.generate` and `LLM.generate_async` APIs, with support for beam search, a variety of penalties, and more features
  - Added StreamingLLM support, enabled via `LLM(streaming_llm=...)`

API changes:

- Updated the Qwen example; see `examples/qwen/README.md` for the latest commands
- Updated the GPT example; see `examples/gpt/README.md` for the latest commands
- Moved several flags into the `trtllm-build` command, to generalize the feature better to more models
- The prompt embedding table size is now configured at build time; use `trtllm-build --max_prompt_embedding_table_size` instead
- Renamed the `trtllm-build --world_size` flag to `--auto_parallel`; the option is used for the auto parallel planner only
- `AsyncLLMEngine` is removed; the `tensorrt_llm.GenerationExecutor` class is refactored to work both when launched explicitly with `mpirun` at the application level and when given an MPI communicator created by `mpi4py` (see the second sketch after this list)
- `examples/server` is removed; see `examples/app` instead
- Removed the `model` parameter from `gptManagerBenchmark` and `gptSessionBenchmark`

Bug fixes:

- Fixed a bug when `encoder_input_len_range` is not 0, thanks to the contribution from @Eddie-Wang1120 in "Fix enc_dec bug and make several improvements to whisper" (#992)
- Fixed the `end_id` issue for Qwen, where generation could not stop at the right position (#987)
- Fixed wrong `head_size` when importing the Gemma model from the HuggingFace Hub, thanks to the contribution from @mfuntowicz in "Specify the head_size from the config when importing Gemma from Hugging Face" (#1148)
- Fixed `SamplingConfig` tensors in `ModelRunnerCpp` (#1183)
- Fixed the issue that `examples/run.py` only loads one line from `--input_file`
- Fixed `ModelRunnerCpp` not transferring `SamplingConfig` tensor fields correctly (#1183)

Benchmarks and performance:

- Updates to `gptManagerBenchmark`; see `benchmarks/cpp/README.md` for the latest `gptManagerBenchmark` usage
- Optimized `gptDecoderBatch` to support batched sampling

Infrastructure:

- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.02-py3`
- The base Docker image for the TensorRT-LLM backend is updated to `nvcr.io/nvidia/tritonserver:24.02-py3`
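For readers of the high-level API changes above, here is a minimal, non-authoritative sketch of the flow they describe. The import path, the `ModelConfig` helper, the `SamplingConfig` field, and the `sampling_config` parameter name are all assumptions on our part; `examples/high-level-api/README.md` remains the source of truth.

```python
# Sketch only: every name marked "assumed" below may differ from the
# shipped API; consult examples/high-level-api/README.md for working code.
from tensorrt_llm.hlapi.llm import LLM, ModelConfig, SamplingConfig  # assumed import path

# The LLM() API now also accepts engine directories built by `trtllm-build`.
config = ModelConfig(model_dir="./llama-7b-engine")  # hypothetical engine path
llm = LLM(config)

# SamplingConfig carries beam search, penalties, and related settings
# for LLM.generate / LLM.generate_async.
sampling = SamplingConfig(beam_width=2)  # assumed field name

for output in llm.generate(["What does TensorRT-LLM do?"], sampling_config=sampling):  # assumed parameter name
    print(output)
```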
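And a hedged sketch of the two `GenerationExecutor` launch modes mentioned in the API changes. Only the `mpi4py` calls are known-good API here; the executor construction is deliberately left as a commented placeholder.

```python
# Launch mode 1: start the whole application with `mpirun -n <N> python app.py`;
# every rank runs this script and the executor coordinates across ranks.
# Launch mode 2: build an MPI communicator in-process with mpi4py and hand it
# to the executor, e.g. by splitting COMM_WORLD so only some ranks serve inference.
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
color = 0 if rank < 2 else 1  # ranks 0-1 form the inference group (illustrative)
sub_comm = MPI.COMM_WORLD.Split(color=color, key=rank)

# Placeholder: the real constructor lives in tensorrt_llm and its signature
# is documented there, e.g. something like:
# executor = tensorrt_llm.GenerationExecutor(..., comm=sub_comm)  # hypothetical signature
print(f"rank {rank}: sub-communicator size {sub_comm.Get_size()}")
```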
Currently, there are two key branches in the project: the main branch and the stable branch. We are updating the main branch regularly with new features, bug fixes and performance optimizations. The stable branch will be updated less frequently, and the exact frequencies depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team
This discussion was created from the release TensorRT-LLM 0.9.0 Release.