Improve rms_norm perf #727

oulgen · 2025-09-30T20:48:05Z

Stacked PRs:

-> Improve rms_norm perf #727

Improve rms_norm perf

Fixes #660

On my local 4080 laptop GPU, perf improved from 3.37x to 6.12x. I will
run the CI benchmarks on B200 to validate.

stack-info: PR: #727, branch: oulgen/stack/108

xuanzhang816 · 2025-10-01T14:56:07Z

examples/rms_norm.py

-        normalized = x_tile * inv_rms_tile[:, None]
-        out[tile_m, :] = (normalized * weight[:].to(torch.float32)).to(out.dtype)
-        inv_rms[tile_m] = inv_rms_tile.to(out.dtype)
+    for tile_m in hl.tile(m):
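
For readers without the full file open, the removed lines above are the tail of a row-wise RMS norm: scale each row by its inverse RMS, then by `weight`. Below is a minimal plain-Python sketch of that math, not the Helion kernel itself. The function name `rms_norm_reference`, the nested-list inputs, and the `eps` default are assumptions made for illustration.

```python
import math


def rms_norm_reference(x, weight, eps=1e-5):
    """Row-wise RMS norm on nested lists; returns (out, inv_rms).

    Hypothetical helper for illustration only; eps=1e-5 is an assumed default.
    """
    out, inv_rms = [], []
    for row in x:
        # Mean of squares over the row (the reduction over the last dimension).
        mean_sq = sum(v * v for v in row) / len(row)
        r = 1.0 / math.sqrt(mean_sq + eps)
        inv_rms.append(r)
        # normalized = x * inv_rms, then scaled element-wise by weight.
        out.append([v * r * w for v, w in zip(row, weight)])
    return out, inv_rms
```

The kernel in the diff does the same per tile of rows (`tile_m`), computing in float32 and casting back to `out.dtype`; this sketch ignores dtypes entirely.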


That’s a smart way to address the performance issue!

Just to call out: longer term, it might be ideal for the autotuner to handle eviction policies and reduction-loop setup automatically, rather than having users specify them directly.

oulgen · 2025-10-01

Yep, I agree. I was planning to implement that next, but first wanted to check whether this is doable at all.
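
To make the "loop reduction" part of this exchange concrete, here is a rough plain-Python sketch of a looped (chunked) reduction for one row. The full diff is not shown in this thread, so this illustrates the general technique only, not the PR's actual kernel. The name `rms_norm_row_looped` and the `block_n` and `eps` parameters are hypothetical.

```python
import math


def rms_norm_row_looped(row, weight, block_n=4, eps=1e-5):
    """RMS-normalize one row with a chunked two-pass loop.

    Hypothetical illustration; block_n stands in for a reduction tile size.
    """
    n = len(row)

    # Pass 1: accumulate the sum of squares one chunk at a time,
    # so the whole row never has to be held at once.
    sum_sq = 0.0
    for start in range(0, n, block_n):
        chunk = row[start:start + block_n]
        sum_sq += sum(v * v for v in chunk)

    inv_rms = 1.0 / math.sqrt(sum_sq / n + eps)

    # Pass 2: revisit each chunk and apply the scale and weight.
    out = []
    for start in range(0, n, block_n):
        stop = start + block_n
        for v, w in zip(row[start:stop], weight[start:stop]):
            out.append(v * inv_rms * w)
    return out, inv_rms
```

Each row is read twice here. On a GPU, data loaded in the first pass is worth keeping in cache for the second, while second-pass loads are not reused, which is presumably where per-load eviction policies come in. Choosing `block_n` and those policies is the kind of setup the comment above suggests the autotuner could own.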

Improve rms_norm perf

8bb79f4

oulgen force-pushed the oulgen/stack/108 branch from 270db82 to 8bb79f4 Compare September 30, 2025 20:48

meta-cla bot added the CLA Signed label Sep 30, 2025

xuanzhang816 reviewed Oct 1, 2025

Merge branch 'main' into oulgen/stack/108

f67e25b
