Use newer version of mma_atom and copy_atom in 00_bmg_gemm #540
Conversation
…d_copy_*, and move tensor/copy initialization to host-side params in to_underlying_arguments
Approving with the minor changes suggested above.
Edit -- there is a bug in the TiledCopy handling that needs fixing, described below.
Theoretical bf16 peak performance for BMG is 116 TF/s, so the new reported performance is too high. Either there's a problem in the kernel (it isn't doing the full computation) or something's wrong with the performance computation.
```cpp
CopyOpA copy_a;
CopyOpB copy_b;
```
Hi @anamikac-intel -- the TiledCopy operations need to be device-initialized (see the docs) before they can be used. That is why you're seeing page faults/anomalously high performance.
One option is to copy the Arguments into Params and create copy_a/b only inside the kernel.
Another option is to re-init the tiled_copy instances inside the kernel. Let me make some helpers for you if you want to choose this path, since it requires less code modification.
I put a suggested patch for option 2 below. The main thing here is the addition of the device_init at the beginning of the kernel. I would like to remove the need for device_init, but we're waiting on the necessary support from IGC.
```diff
diff --git a/include/cutlass/gemm/collective/xe_mma.hpp b/include/cutlass/gemm/collective/xe_mma.hpp
index 34e8af98..d8925af9 100644
--- a/include/cutlass/gemm/collective/xe_mma.hpp
+++ b/include/cutlass/gemm/collective/xe_mma.hpp
@@ -221,6 +221,9 @@ struct CollectiveMma<MainloopXeL1Staged<Stages, Schedule>, TileShape_, ElementA_
     static_assert(is_rmem<FrgTensorD>::value, "D tensor must be rmem resident.");
     static_assert(is_rmem<FrgTensorC>::value, "C tensor must be rmem resident.");

+    mainloop.copy_a.device_init();
+    mainloop.copy_b.device_init();
+
     auto thr_copy_a = mainloop.copy_a.get_slice(thread_idx);
     auto thr_copy_b = mainloop.copy_b.get_slice(thread_idx);

diff --git a/include/cute/atom/copy_traits_xe_2d.hpp b/include/cute/atom/copy_traits_xe_2d.hpp
index 04db9364..3a576022 100644
--- a/include/cute/atom/copy_traits_xe_2d.hpp
+++ b/include/cute/atom/copy_traits_xe_2d.hpp
@@ -125,7 +125,7 @@ struct Xe2DTraitsBase
     assert((height <= 0xFFFFFF) && "CuTe runtime error: block 2D tensor height exceeds 2^24");
     assert((pitch <= 0xFFFFFF) && "CuTe runtime error: block 2D tensor pitch exceeds 2^24");
 #endif
-    init_payload();
+    device_init();
   }

   template <class Op2, typename ValType2>
@@ -134,7 +134,7 @@ struct Xe2DTraitsBase
     : base_ptr(other.base_ptr), width(other.width), height(other.height), pitch(other.pitch),
       tiled_strides(other.tiled_strides)
   {
-    init_payload();
+    device_init();
   }

   // Initialize a previously-uninitialized atom.
@@ -145,7 +145,7 @@ struct Xe2DTraitsBase
   }

   CUTE_DEVICE
-  void init_payload() {
+  void device_init() const {
 #ifdef __SYCL_DEVICE_ONLY__
     payload = __builtin_IB_subgroup_createBlock2DAddressPayload(
       base_ptr,
```
@sanchitintel - are you getting 25.1 TFLOPs/s with igc 1.85 and 26.3 TFLOPs/s with igc 2.20 without print enabled? I am getting 97.259 TFlop/s for the same config (5120x4096x4096x1) on igc 2.20 (without print):
```
Disposition: Passed
Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [97.259]TFlop/s (1.7664)ms
```
Hi @anamikac-intel, I didn't use print while computing throughput.
But I was wrong about using libigc 2.20. It's actually 2.18.5-1188~25.04 on the system I used. That must be what's causing the poor perf.
Just checked with igc 2.20 on @tdeng5's machine. I'm indeed seeing 98 TFLOPs/s for those input shapes, and there's no register spill!
Thank you!
Fixes a compilation failure found in #540 when >2D tensors are passed to one of the `make_block_2d_copy_*` functions.
…ck 2D Copy Utilities
LGTM


Modify 00_bmg_gemm to include new mma and copy atoms (#477).
00_bmg_gemm combines two parts: the mma and the epilogue. To add the new atom changes, we need to update both parts, since they currently use the old atoms. As a starting point, we will:
Old Atom:
```
Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [96.448]TFlop/s (1.7813)ms
```
New Atom:
```
Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [97.259]TFlop/s (1.7664)ms
```
This also depends on the new copy_c/copy_d APIs for load/store (#572).