Cherry picking keep_fp8_weight_transpose_cache flag refactor and fsdp2 fp8 autocast all gather commits #389
base: release_v2.4_rocm
Conversation
* Initial commit
* Removed rocm_utils
* Added comment and bug fixes
* Grouped IS_HIP_EXTENSION with the property assignment
* Reverted transpose.cpp, removed keep_fp8_transpose_cache flag from grouped_linear, removed manual clearing of tensors in modules
* Aligning grouped_linear module with upstream
* Reverted tests to use _test_granular_accuracy_with_fp8 multiple times as needed
* Added comments back
* Moved comment to the test

Co-authored-by: sudhu2k <[email protected]>
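The first cherry-picked change removes the keep_fp8_weight_transpose_cache handling from grouped_linear, so callers use GroupedLinear exactly as upstream does. Below is a minimal usage sketch, not the PR's test code; the API names follow recent Transformer Engine releases and are assumptions about what is available on this branch.

```python
# Minimal sketch (assumed upstream-style API): GroupedLinear is called without
# any keep_fp8_weight_transpose_cache flag after this refactor.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

num_gemms, in_features, out_features = 4, 256, 256
layer = te.GroupedLinear(num_gemms, in_features, out_features, bias=True)

# Rows of the input are partitioned across the grouped GEMMs;
# m_splits must sum to the leading dimension of the input.
inp = torch.randn(128, in_features, device="cuda")
m_splits = [32, 32, 32, 32]

recipe = DelayedScaling(fp8_format=Format.HYBRID)
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    out = layer(inp, m_splits)
out.sum().backward()
```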
* Initial commit
* Removed Print statements, added keep_fp8_transpose cache integration with fsdp2
* Added use_fsdp flag to Linear module, added profile code, added test code, added all reduce for amax
* Fixed unit test
* Removing all reduce code for amax since by default TE does all reduce when torch.distributed is initialized.
* reverting case where out is already present
* Added unit test with regular single-GPU training
* Modified unit test to compare FSDP2 with DDP
* bug fixes
* Code cleaning up
* Initial commit to add MXFP8
* Added fp8 current scaling.
* Added MXFP8, Modified unit test to run based on recipes
* Extended use_fsdp to layernorm linear and layernorm mlp
* Moved amax reduce from forward to backward for fsdp2
* Added automatic detection of use fsdp from base module
* Use SKIP_FP8_REDUCTION_FOR_FSDP2 in backward to check if the forward reduce is needed
* Added memory profile code, added a check before setting SKIP_FP8_REDUCTION_FOR_FSDP2
* Fix for fused optimizer, changed _elem to _data, code clean up
* Fixed layernorm mlp
* Code cleanup and added test to pytorch.sh
* Removed whitespaces
* Fixed comments and license
* Added guards
* Added reduce for forward in cuda graph backward, added code to remove test artifacts, reverted upstream test file

Co-authored-by: sudhu2k <[email protected]>
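The second set of commits wires FP8 autocast and amax reduction into FSDP2 training and adds tests that compare FSDP2 against a DDP baseline across recipes. The sketch below illustrates such a setup; the fully_shard import path, the recipe, the single-layer model, and the launch command are assumptions for illustration and may not match the tests in this PR.

```python
# Illustrative sketch only: training a TE Linear layer under FSDP2 with FP8 autocast.
import os
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# FSDP2's fully_shard has moved between modules across PyTorch releases.
try:
    from torch.distributed.fsdp import fully_shard              # newer PyTorch
except ImportError:
    from torch.distributed._composable.fsdp import fully_shard  # PyTorch ~2.4

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A single TE layer stands in for the model; the PR's tests compare FSDP2
    # against DDP across recipes (delayed scaling, current scaling, MXFP8).
    model = te.Linear(1024, 1024, bias=True)
    fully_shard(model)  # shard parameters with FSDP2

    recipe = DelayedScaling(fp8_format=Format.HYBRID)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):
        inp = torch.randn(16, 1024, device="cuda")
        with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
            out = model(inp)
        # Per the commits above, the FP8 amax reduction for FSDP2 is triggered
        # from the backward pass rather than the forward pass.
        out.float().sum().backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Such a script would be launched with something like `torchrun --nproc_per_node=2 test_fsdp2_fp8.py` (script name hypothetical).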
@ipanfilo, @wangye805 The PR is ready for review.
AMD copyright is needed
But nothing significant was added from our side; this PR actually removes the code that was added for keep_fp8_weight_transpose_cache, which means technically we are reverting to the upstream code for grouped_linear.py.
Does it match upstream now?
It does, yes.
ipanfilo left a comment
Let's wait for CI to run and pass.
Description
This PR cherry-picks #349 and #328 into the release_v2.4 branch.
Type of change
Checklist: