Problem Description
Torch-TensorRT compilation for large models (such as LLMs and diffusion models) can consume excessive CPU and GPU memory. This often leads to freezes, CUDA OOM errors, TensorRT compilation failures, or the operating system killing the process. The default behavior may use up to 5× the model size in CPU memory and 2× the model size in GPU memory, which is too high for many environments.
Solution
Provide compilation options that reduce redundant model copies in CPU and GPU memory. Specifically (a combined usage sketch follows this list):

- Enable the memory-trimming mechanism (`export TRIM_CPU_MEMORY=1`) to reduce redundant CPU copies.
- Provide CPU offloading (`offload_module_to_cpu=True`) to move the original copy of the model to CPU and save GPU memory.
- Provide lazy engine initialization (`lazy_engine_init=True`) to save GPU memory for subsequent subgraph compilations when there are graph breaks.
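A minimal sketch of how these options might be combined, assuming the `dynamo` frontend; the argument names follow this issue's description and may differ across Torch-TensorRT versions, and the small `Sequential` model is a stand-in for an actual LLM or diffusion model:

```python
import os

# Enable the CPU memory-trimming path before Torch-TensorRT is used.
os.environ["TRIM_CPU_MEMORY"] = "1"

import torch
import torch_tensorrt

# Stand-in for a large model (LLM, diffusion model, etc.).
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).eval().cuda()

inputs = [torch.randn(8, 1024, device="cuda")]

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    offload_module_to_cpu=True,  # keep the original module on CPU during compilation
    lazy_engine_init=True,       # defer engine setup until all subgraphs are compiled
)
```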
| Setting | Effect | Approx. Memory Ratio |
|---|---|---|
| Default | Baseline behavior | CPU: 5×, GPU: 2× |
| `export TRIM_CPU_MEMORY=1` | Reduces redundant CPU copies | CPU: ~3× |
| `offload_module_to_cpu=False` | Further reduces CPU copies | CPU: ~2× |
| `offload_module_to_cpu=True` | Reduces GPU usage, increases CPU usage | GPU: ~1×, CPU: +1× |
| `lazy_engine_init=True` | Reduces GPU usage when there are multiple subgraphs | Lower GPU memory |
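To check which ratio a given configuration actually achieves on the GPU side, peak usage can be measured with standard PyTorch utilities (a generic measurement sketch, not a Torch-TensorRT API):

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run torch_tensorrt.compile(...) here with the settings under test ...

peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(f"Peak GPU memory during compilation: {peak_gib:.2f} GiB")
```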
Proper configuration ensures efficient resource use, stable compilation, and predictable performance for large-scale models.