Problem Description
Torch-TensorRT compilation for large models (such as LLMs and diffusion models) can consume excessive CPU and GPU memory. This often leads to freezes, CUDA OOM errors, TensorRT compilation failures, or the operating system killing the process. The default behavior may use up to 5× the model size in CPU memory and 2× the model size in GPU memory, which is too high for many environments.
Solution
Provide compilation options that reduce redundant model copies in CPU and GPU memory. Specifically (a combined usage sketch follows this list):

- Enable the memory-trimming mechanism (`export TRIM_CPU_MEMORY=1`) to reduce redundant CPU copies.
- Provide CPU offloading (`offload_module_to_cpu=True`) to move the original copy of the model to CPU and save GPU memory.
- Provide lazy engine initialization (`lazy_engine_init=True`) to save GPU memory for subsequent subgraph compilations when there are graph breaks.
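A minimal sketch of how these options might be combined, assuming the `dynamo` frontend; the argument names follow this issue's description and may differ across Torch-TensorRT versions, and the small `Sequential` model is a stand-in for an actual LLM or diffusion model:

```python
import os

# Enable the CPU memory-trimming path before Torch-TensorRT is used.
os.environ["TRIM_CPU_MEMORY"] = "1"

import torch
import torch_tensorrt

# Stand-in for a large model (LLM, diffusion model, etc.).
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).eval().cuda()

inputs = [torch.randn(8, 1024, device="cuda")]

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    offload_module_to_cpu=True,  # keep the original module on CPU during compilation
    lazy_engine_init=True,       # defer engine setup until all subgraphs are compiled
)
```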
| Setting | Effect | Approx. Memory Ratio |
|---|---|---|
| Default | Baseline behavior | CPU: 5×, GPU: 2× |
| `export TRIM_CPU_MEMORY=1` | Reduces redundant CPU copies | CPU: ~3× |
| `offload_module_to_cpu=False` | Further reduces CPU copies | CPU: ~2× |
| `offload_module_to_cpu=True` | Reduces GPU usage, increases CPU usage | GPU: ~1×, CPU: +1× |
| `lazy_engine_init=True` | Reduces GPU usage when there are multiple subgraphs | Lower GPU memory |
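To check which ratio a given configuration actually achieves on the GPU side, peak usage can be measured with standard PyTorch utilities (a generic measurement sketch, not a Torch-TensorRT API):

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run torch_tensorrt.compile(...) here with the settings under test ...

peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(f"Peak GPU memory during compilation: {peak_gib:.2f} GiB")
```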
Proper configuration ensures efficient resource use, stable compilation, and predictable performance for large-scale models.