feat: Add Inductor backend configs #688

Merged
JewelRoam merged 3 commits into PaddlePaddle:develop from JewelRoam:bench
Apr 23, 2026

@JewelRoam JewelRoam commented Apr 15, 2026

Overview

This PR introduces configuration templates for the PyTorch Inductor backend,
allowing users to select predefined templates that each set a group of
torch._inductor.config overrides. This extends PyTorch's official "mode"
concept while remaining fully compatible with the existing
test_compiler.py framework.

Motivation

Previously, InductorBackend accepted only basic config parameters through
individual inductor_config dictionary entries. Users could not easily enable
common combinations of Inductor options such as:

  • CUTLASS-based GEMM kernels - For optimal GEMM performance on modern GPUs
  • CUDA Graphs - To reduce kernel launch overhead for small batch inference
  • Model freezing - To inline weights as constants for deployment optimization
  • TMA (Tensor Memory Accelerator) - For H100+ GPUs with hardware acceleration

This PR addresses these limitations by introducing config templates - pre-defined,
well-tested combinations of torch._inductor.config options that users can select
by name. Templates, mode, and options are mutually exclusive, providing clear
separation of concerns.

Changes Summary

1. Inductor Backend Configuration Templates

File: graph_net_bench/torch/backend/inductor_backend.py

Features

  • _INDUCTOR_CONFIG_TEMPLATES dictionary with 6 predefined templates
  • Parameter: graph_net_inductor_config_template - select template by name
  • Mutual exclusivity: exactly one of template/mode/options can be specified
  • No global state mutation: uses torch.compile's options parameter directly

Supported Templates

| Template | Description | Config Overrides |
| --- | --- | --- |
| triton | Default Triton backend | cpp_wrapper: False |
| cpp_wrapper | C++ wrapper for kernels | cpp_wrapper: True |
| cudagraphs | CUDA Graphs | triton.cudagraphs: True |
| max_autotune | Comprehensive autotuning | 4 autotune options |
| freezing | Model freezing | freezing: True |
| tma | TMA persistent matmul | triton.enable_persistent_tma_matmul: True |
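
As an illustration of how a template name could be turned into torch.compile keyword arguments, here is a minimal sketch. It is not the code from inductor_backend.py: the helper name build_compile_kwargs and the TEMPLATES dictionary are hypothetical, the overrides are copied from the table above, and max_autotune is omitted because its four options are not listed in this description. Treating an empty config as "use Inductor defaults" is also an assumption.

```python
from typing import Any

# Overrides copied from the table above. max_autotune is left out because
# its four options are not spelled out in this PR description.
TEMPLATES: dict[str, dict[str, Any]] = {
    "triton": {"cpp_wrapper": False},
    "cpp_wrapper": {"cpp_wrapper": True},
    "cudagraphs": {"triton.cudagraphs": True},
    "freezing": {"freezing": True},
    "tma": {"triton.enable_persistent_tma_matmul": True},
}

TEMPLATE_KEY = "graph_net_inductor_config_template"


def build_compile_kwargs(config: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: map a backend config to torch.compile kwargs."""
    selectors = [k for k in (TEMPLATE_KEY, "mode", "options") if k in config]
    if len(selectors) > 1:
        raise ValueError(f"mutually exclusive keys given together: {selectors}")

    if TEMPLATE_KEY in config:
        name = config[TEMPLATE_KEY]
        if name not in TEMPLATES:
            raise ValueError(f"unknown template {name!r}; known: {sorted(TEMPLATES)}")
        # Copy so callers cannot mutate the shared template table.
        return {"backend": "inductor", "options": dict(TEMPLATES[name])}
    if "mode" in config:
        return {"backend": "inductor", "mode": config["mode"]}
    if "options" in config:
        return {"backend": "inductor", "options": dict(config["options"])}
    # Assumption: no selector means plain Inductor defaults.
    return {"backend": "inductor"}
```

The returned dictionary is meant to be splatted into the call, for example torch.compile(model, **build_compile_kwargs(config)). Because the overrides travel through the options argument, nothing is written to the global torch._inductor.config.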

Note that the TMA template works universally across GPU architectures:

  • H100+ (CC >= 9.0): Enables TMA persistent kernels
  • A100 or other GPUs: Enables non-TMA persistent kernels as fallback

No runtime error occurs on GPUs without TMA support.
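
To make the compute-capability threshold above concrete, the following sketch classifies a device by its (major, minor) capability pair. The function name and the returned labels are hypothetical and only mirror the two bullets; the actual kernel selection happens inside Inductor and is not shown in this PR.

```python
def persistent_matmul_variant(compute_capability: tuple[int, int]) -> str:
    """Hypothetical label for the kernel family the tma template would enable."""
    major, _minor = compute_capability
    # CC >= 9.0 (H100 and newer) has TMA hardware; older GPUs fall back.
    return "tma_persistent" if major >= 9 else "non_tma_persistent"


# On a machine with PyTorch and a GPU, the pair can be obtained with
# torch.cuda.get_device_capability(), which returns (major, minor):
#   persistent_matmul_variant(torch.cuda.get_device_capability())
```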

2. CUDA Graphs Compatibility Fix

File: graph_net_bench/torch/test_compiler.py

When CUDA Graphs are enabled, output tensors point into buffers owned by the CUDA Graph.
Subsequent model calls overwrite these buffers, so accessing the compiled output after
the eager run raises an error. The fix is to clone outputs immediately:

if isinstance(outs, torch.Tensor):
    outs = outs.clone()
elif isinstance(outs, tuple):
    outs = tuple(t.clone() if isinstance(t, torch.Tensor) else t for t in outs)

Note: eval_backend_perf.py and eval_backend_diff.py are unaffected
(torch.save/torch.load creates independent copies).
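
The snippet above handles a single tensor or a flat tuple. If a model returns nested lists or dictionaries, the same idea could be applied recursively. The helper below is a sketch of that generalization, not part of this PR: clone_outputs is a hypothetical name, and it duck-types on a callable .clone() method (which torch.Tensor provides) so that it runs without importing torch.

```python
from typing import Any


def clone_outputs(outs: Any) -> Any:
    """Recursively clone tensor-like leaves (anything with a callable .clone())."""
    if callable(getattr(outs, "clone", None)):
        return outs.clone()
    if isinstance(outs, tuple):
        # Note: named tuples come back as plain tuples here.
        return tuple(clone_outputs(o) for o in outs)
    if isinstance(outs, list):
        return [clone_outputs(o) for o in outs]
    if isinstance(outs, dict):
        return {k: clone_outputs(v) for k, v in outs.items()}
    # Non-tensor leaves (None, ints, strings) are returned unchanged.
    return outs
```

Calling outs = clone_outputs(compiled_model(*inputs)) right after the compiled call would detach the results from the CUDA Graph buffers before any later run overwrites them.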

3. Tests

File: test/inductor_backend_test.py (new file, ~290 lines)

Test Coverage:

  • Template validation: 11 tests
  • Parameter handling: 5 tests
  • Config validation: 9 tests
  • Integration: 1 test
  • Total: 26 tests

Usage

# Schema: exactly one of template/mode/options can be specified
--config '{"graph_net_inductor_config_template": "<template_name>"}'    # template
--config '{"mode": "<mode_name>"}'                                    # mode
--config '{"options": {...}}'                                          # custom

# Example: use max-autotune template
python -m graph_net_bench.torch.test_compiler \
    --compiler inductor \
    --model-path samples/torchvision/alexnet \
    --config "$(echo '{"graph_net_inductor_config_template": "max_autotune"}' | base64 -w0)" \
    --trials 5 --warmup 3
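
For scripted runs, the base64 string passed to --config can also be produced in Python. This is a sketch that mirrors the shell pipeline above; encode_config and decode_config are hypothetical helper names, and how test_compiler.py decodes the argument is assumed here to be base64 followed by JSON parsing.

```python
import base64
import json


def encode_config(config: dict) -> str:
    """Serialize a config dict to JSON and base64-encode it on one line."""
    raw = json.dumps(config).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_config(encoded: str) -> dict:
    """Inverse of encode_config (assumed decoding scheme)."""
    return json.loads(base64.b64decode(encoded).decode("utf-8"))


if __name__ == "__main__":
    print(encode_config({"graph_net_inductor_config_template": "max_autotune"}))
```

The encoded string will not be byte-identical to the output of the echo pipeline, because echo appends a trailing newline and json.dumps adds a space after the colon, but both decode to the same JSON object.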

Documentation References

All configuration keys verified against PyTorch source code:

## Overview

This PR introduces a flexible configuration system for PyTorch Inductor backend
with 8 predefined config templates, CUDA Graphs compatibility fix,
and comprehensive unit tests (28 tests total).

## Changes

- Inductor backend with 8 config templates (triton, cpp_wrapper, cutlass,
  aten, cudagraphs, max_autotune, freezing, tma)
- CUDA Graphs output buffer overwrite fix in test_compiler.py
- 28 unit tests in test/inductor_backend_test.py

## Testing

- All config keys verified against PyTorch 2.7.1 source code
- All templates tested with actual model compilation
- Unit tests pass: 28/28 OK
- TMA config gracefully falls back on non-TMA GPUs (A100)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

paddle-bot Bot commented Apr 15, 2026

Thanks for your contribution!

@JewelRoam JewelRoam changed the title from "feat: Add Inductor backend config templates and comprehensive test suite" to "feat: Add Inductor backend config templates" on Apr 15, 2026
…integration

- Rename  to  for clarity
- Remove redundant  and  templates (merge into )
- Add mutual exclusion check: exactly one of template/mode/options can be specified
- Remove global config modification; use torch.compile's options parameter directly
- Clean up  mapping (template no longer affects mode)
- Update tests to reflect template exclusivity and parameter changes
- Simplify __call__ with inline conditional kwargs expansion

Templates now exclusively control torch._inductor.config options,
while mode is passed directly to torch.compile without interference.
@JewelRoam JewelRoam changed the title from "feat: Add Inductor backend config templates" to "feat: Add Inductor backend configs" on Apr 16, 2026

@Xreki Xreki left a comment

LGTM

@JewelRoam JewelRoam merged commit ac35ce2 into PaddlePaddle:develop Apr 23, 2026
3 checks passed