@oulgen oulgen commented Sep 30, 2025

Improve rms_norm perf

Fixes #660

On my local 4080 laptop GPU perf improved from 3.37x to 6.12x. Will run
CI benchmarks on B200 to validate.

stack-info: PR: #727, branch: oulgen/stack/108
@meta-cla meta-cla bot added the CLA Signed label Sep 30, 2025
normalized = x_tile * inv_rms_tile[:, None]
out[tile_m, :] = (normalized * weight[:].to(torch.float32)).to(out.dtype)
inv_rms[tile_m] = inv_rms_tile.to(out.dtype)
for tile_m in hl.tile(m):

Contributor

That’s a smart way to address the performance issue!

Just to call out: longer term, it might be ideal for the autotuner to handle eviction policies and loop reduction setup automatically, rather than having users specify them directly.

Contributor Author

Yep, I agree. I was planning to implement that next, but first wanted to check whether this is doable at all.
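
For readers following the hunk quoted above, here is a minimal plain-Python sketch of the row-wise computation that its last three lines finish. It is not the kernel from this PR. The function name `rms_norm_reference`, the `eps` default, and the way `inv_rms` is derived (one over the square root of the mean of squares plus `eps`) are assumptions based on the usual RMS norm definition; the hunk itself only shows the final scaling and stores.

```python
import math


def rms_norm_reference(x, weight, eps=1e-5):
    """Hypothetical reference: row-wise RMS norm over a list of rows.

    Returns (out, inv_rms), mirroring the two outputs stored in the hunk.
    """
    out, inv_rms = [], []
    for row in x:
        # Reduction over the row: mean of squared elements.
        mean_sq = sum(v * v for v in row) / len(row)
        # Assumed definition of the inverse RMS (eps guards against zero rows).
        inv = 1.0 / math.sqrt(mean_sq + eps)
        # normalized = x * inv_rms, then scaled elementwise by the weight.
        out.append([v * inv * w for v, w in zip(row, weight)])
        inv_rms.append(inv)
    return out, inv_rms


out, inv_rms = rms_norm_reference([[3.0, 4.0]], [1.0, 2.0], eps=0.0)
```

Each row is read once for the reduction and once more for the scaling. When the reduction is looped over column tiles, that reuse is presumably why the eviction policy on the loads matters for performance.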

Development

Successfully merging this pull request may close these issues.

[feature request] set eviction_policy for tl.load

2 participants