Autocast and Weak Typing #3869
narendasan started this conversation in RFCs
Replies: 1 comment
- For phase 2, should we let users give us a calibration dataloader?
TL;DR
Weak typing behavior in TensorRT is deprecated. However, it is a good way to maximize performance, so we want to create a similar PyTorch-native system for Torch-TensorRT that recovers some of this behavior.
Goal(s)
An automatic pass that users can opt into, which applies autocasting according to TRT's preferred ruleset.
Usecases
Proposed APIs / UX
We use the combination of the args use_explicit_typing and enable_autocast to represent three modes. In this feature we are focusing on the Autocast mode, i.e., setting use_explicit_typing=True and enable_autocast=True. Users can also set low_precision_type, nodes_to_exclude, targets_to_exclude, data_max, and max_depth_of_reduction to specify which ops should run in fp32 and which should run in low precision. We aim to stay consistent with NVIDIA ModelOpt Autocast, so the naming of the args is similar; please refer to the ModelOpt Autocast doc for details.
Example Workflow
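A minimal sketch of what this workflow could look like; the torch_tensorrt.compile entry point, the toy model, and the way the new args are passed through are assumptions based on the proposal above, not a finalized API:

```python
import torch
import torch_tensorrt


class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 16, 3)
        self.fc = torch.nn.Linear(16 * 30 * 30, 10)

    def forward(self, x):
        x = self.conv(x)
        # Per the description below, ops inside this context manager
        # should remain in fp32.
        with torch.autocast(device_type="cuda", dtype=torch.float32):
            x = torch.relu(x)
        return self.fc(torch.flatten(x, 1))


model = MyModel().eval().cuda()
inputs = [torch.randn(1, 3, 32, 32, device="cuda")]

trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    use_explicit_typing=True,          # strong typing in TensorRT
    enable_autocast=True,              # opt into the proposed Autocast pass
    low_precision_type=torch.float16,  # cast "normal" ops to fp16
    nodes_to_exclude=["^conv2d$"],     # regex: keep conv2d nodes in fp32
)
```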
In this example, low_precision_type=torch.float16 denotes that Autocast should cast normal ops to fp16; nodes_to_exclude denotes that ops matching the regex pattern ^conv2d$ should remain in fp32, etc. torch.autocast(fp32) is used in the forward function so that any ops within the context manager stay in fp32.
Limitations
Internal Implementation
To implement Autocast in Torch-TRT, we need to 1) determine which ops should be cast to a low precision type, like fp16 or bf16, and which ops should be kept in fp32, and 2) modify the FX graph so that every op runs at the right precision.
1) Rule-based Node Classifier
Similar to ModelOpt Autocast, we use a rule-based node classifier to determine the precision of each op. Based on our predefined ruleset, if any rule is met, the op stays in fp32; otherwise it is cast to fp16. Nodes whose target is torch.ops.higher_order.wrap_with_autocast or operator.getitem are skipped directly.
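For illustration only, a minimal sketch of how such a rule-based classifier over the FX graph might be structured; the concrete rules, force-fp32 op list, and the handling of nodes_to_exclude are assumptions here, not the final Torch-TensorRT implementation:

```python
import operator
import re

import torch
from torch.fx import GraphModule, Node

# Illustrative rule: ops assumed to be numerically sensitive stay in fp32.
_FORCE_FP32_TARGETS = {
    torch.ops.aten._softmax.default,
    torch.ops.aten._log_softmax.default,
}
# Nodes with these targets are skipped entirely.
_SKIPPED_TARGETS = {torch.ops.higher_order.wrap_with_autocast, operator.getitem}


def classify_nodes(
    gm: GraphModule, nodes_to_exclude: list[str]
) -> dict[Node, torch.dtype]:
    """Return the target dtype for every call_function node in the graph."""
    decisions: dict[Node, torch.dtype] = {}
    for node in gm.graph.nodes:
        if node.op != "call_function" or node.target in _SKIPPED_TARGETS:
            continue  # skipped nodes keep whatever precision they already have
        keep_fp32 = (
            node.target in _FORCE_FP32_TARGETS
            or any(re.match(pattern, node.name) for pattern in nodes_to_exclude)
        )
        decisions[node] = torch.float32 if keep_fp32 else torch.float16
    return decisions
```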
Taking the demo above as an example, the node classifier's decisions are as follows:
2) Modify FX Graph
From step 1 we have determined the precision of each op. Then we add a pre_lowering pass that inserts a Cast op before each op. Nodes whose target is torch.ops.higher_order.wrap_with_autocast or operator.getitem are skipped directly.
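A minimal sketch of such a cast-insertion pass, assuming the decisions from step 1 are available as a node-to-dtype mapping and that aten.to.dtype is used as the Cast op; names and details are illustrative only:

```python
import operator

import torch
from torch.fx import GraphModule, Node

_SKIPPED_TARGETS = {torch.ops.higher_order.wrap_with_autocast, operator.getitem}


def insert_casts(gm: GraphModule, decisions: dict[Node, torch.dtype]) -> GraphModule:
    """Insert a Cast (aten.to.dtype) before each classified node so that its
    tensor inputs arrive in the precision chosen in step 1."""
    for node, dtype in decisions.items():
        if node.target in _SKIPPED_TARGETS:
            continue
        with gm.graph.inserting_before(node):
            for arg in node.all_input_nodes:
                cast = gm.graph.call_function(
                    torch.ops.aten.to.dtype, args=(arg, dtype)
                )
                node.replace_input_with(arg, cast)
    gm.graph.lint()
    gm.recompile()
    return gm
```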
The modified graph of the demo above is as follows:
Implementation Phases
Prototype - #3878
MVP (<TARGET RELEASE VERSION>)
Extension Phase 1 (<TARGET RELEASE VERSION>)
Extension Phase 2 (<TARGET RELEASE VERSION>)