@anamikac-intel

@anamikac-intel anamikac-intel commented Sep 29, 2025

Modify 00_bmg_gemm to include new mma and copy atoms (#477).
00_bmg_gemm combines two parts: the mma and the epilogue. Adopting the new atoms requires updating both parts, since both currently use the old atoms. As a first step, we will:

Keep CollectiveEpilogue unchanged for now
Only modify CollectiveMma first

Old Atom:

Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [96.448]TFlop/s (1.7813)ms

New Atom:

Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [97.259]TFlop/s (1.7664)ms
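
For reference, the TFlop/s figures above can be reproduced from the problem size and the measured time. This is a minimal sketch, assuming the usual convention of 2*M*N*K*L floating-point operations per GEMM; `gemm_tflops` is a hypothetical helper written for this note, not a CUTLASS function.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of CUTLASS): GEMM throughput in TFLOP/s.
// Assumes one multiply and one add per (m, n, k) triple, i.e. 2*M*N*K*L flops.
inline double gemm_tflops(std::int64_t m, std::int64_t n, std::int64_t k,
                          std::int64_t l, double milliseconds) {
  assert(milliseconds > 0.0);
  const double flops = 2.0 * static_cast<double>(m) * static_cast<double>(n) *
                       static_cast<double>(k) * static_cast<double>(l);
  const double seconds = milliseconds * 1e-3;
  return flops / seconds / 1e12;
}
```

Calling `gemm_tflops(5120, 4096, 4096, 1, 1.7664)` should land close to the reported new-atom figure; small differences are expected because the printed time is rounded.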

This also depends on the new copy_c/copy_d APIs for load/store (#572).

@anamikac-intel anamikac-intel marked this pull request as ready for review September 29, 2025 08:11
@anamikac-intel anamikac-intel changed the title Use newer version on mma_atom and copy_atom in 00_bmg_gemm Use newer version of mma_atom and copy_atom in 00_bmg_gemm Sep 30, 2025
…d_copy_*, and move tensor/copy initialization to host-side params in to_underlying_arguments

@petercad petercad left a comment

Approving with the minor changes suggested above.

Edit -- there is a bug in the TiledCopy handling that needs fixing, described below.

@sanchitintel

This comment was marked as outdated.

@petercad

> With New Atom perf increase by 2x

Theoretical bf16 peak perf for BMG is 116 TF/s, so the new performance is too high. Either there's a problem in the kernel (not doing the full computation) or something's wrong with the performance computation.
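
The sanity check described here can be sketched as a lower bound on runtime: at a given peak throughput, a GEMM cannot finish faster than its flop count divided by that peak. The helpers below are hypothetical (not CUTLASS APIs), assume 2*M*N*K*L flops per GEMM, and take the 116 TF/s peak quoted above as an input rather than as a verified hardware figure.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: shortest possible runtime (in ms) for a GEMM
// if the device ran at exactly `peak_tflops` for the whole kernel.
inline double min_gemm_ms(std::int64_t m, std::int64_t n, std::int64_t k,
                          std::int64_t l, double peak_tflops) {
  assert(peak_tflops > 0.0);
  const double flops = 2.0 * static_cast<double>(m) * static_cast<double>(n) *
                       static_cast<double>(k) * static_cast<double>(l);
  return flops / (peak_tflops * 1e12) * 1e3;
}

// A measured time below the bound means either the kernel skipped work
// or the timing/throughput computation is wrong.
inline bool is_plausible(double measured_ms, double lower_bound_ms) {
  return measured_ms >= lower_bound_ms;
}
```

For the 5120x4096x4096x1 problem and a 116 TF/s peak, the bound is roughly 1.48 ms, so the 1.7664 ms measurement is plausible while anything well under 1.48 ms would not be.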

@sanchitintel

This comment was marked as outdated.

@tdeng5

tdeng5 commented Oct 16, 2025

We checked the performance of several shapes:
[image: performance results for the checked shapes]

Comment on lines 143 to 144
CopyOpA copy_a;
CopyOpB copy_b;

Hi @anamikac-intel -- the TiledCopy operations need to be device-initialized (see the docs) before they can be used. That is why you're seeing page faults/anomalously high performance.

One option is that you copy the Arguments as Params, and only create copy_a/b in the kernel.

Another option is to re-init the tiled_copy instances inside the kernel. Let me make some helpers for you, if you want to choose this path (less code modification).
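
The two options can be illustrated with a host-only mock. Everything in this sketch is hypothetical: `MockArguments`, `MockParams`, `MockTiledCopy` and the `mock_kernel_*` functions are stand-ins written for this note and are not CuTe or CUTLASS types. The mock only models the idea that a copy object built on the host lacks a device-only descriptor until `device_init()` runs inside the kernel.

```cpp
#include <cassert>
#include <cstdint>

// Mock types for illustration only; none of these are CUTLASS/CuTe APIs.
struct MockArguments {
  const void* base;
  int width;
  int height;
  int pitch;
};

struct MockTiledCopy {
  MockArguments desc;
  std::uint64_t payload = 0;  // stand-in for a device-only hardware descriptor
  bool ready = false;

  // Host side: only record the tensor description.
  explicit MockTiledCopy(const MockArguments& a) : desc(a) {}

  // "Device" side: build the descriptor; must run before any copy is issued.
  void device_init() {
    assert(desc.width > 0 && desc.height > 0);
    payload = static_cast<std::uint64_t>(desc.width) *
              static_cast<std::uint64_t>(desc.height);
    ready = true;
  }

  bool can_copy() const { return ready; }
};

// Option 1: Params is a plain copy of Arguments; the copy object is
// created (and initialized) only inside the kernel.
struct MockParams {
  MockArguments args;
};

inline MockParams to_params(const MockArguments& a) { return MockParams{a}; }

inline bool mock_kernel_option1(const MockParams& p) {
  MockTiledCopy copy_a(p.args);
  copy_a.device_init();
  return copy_a.can_copy();
}

// Option 2: the host builds the copy object; the kernel receives it by
// value and re-initializes it before first use.
inline bool mock_kernel_option2(MockTiledCopy copy_a) {
  copy_a.device_init();
  return copy_a.can_copy();
}
```

In this mock, option 1 moves more code into the kernel, while option 2 keeps the host-side structure and adds a single call at kernel entry, which matches the "less code modification" trade-off mentioned above.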

@petercad petercad Oct 16, 2025

I put a suggested patch for option 2 below. The main thing here is the addition of the device_init at the beginning of the kernel. I would like to remove the need for device_init, but we're waiting on the necessary support from IGC.

diff --git a/include/cutlass/gemm/collective/xe_mma.hpp b/include/cutlass/gemm/collective/xe_mma.hpp
index 34e8af98..d8925af9 100644
--- a/include/cutlass/gemm/collective/xe_mma.hpp
+++ b/include/cutlass/gemm/collective/xe_mma.hpp
@@ -221,6 +221,9 @@ struct CollectiveMma<MainloopXeL1Staged<Stages, Schedule>, TileShape_, ElementA_
     static_assert(is_rmem<FrgTensorD>::value, "D tensor must be rmem resident.");
     static_assert(is_rmem<FrgTensorC>::value, "C tensor must be rmem resident.");

+    mainloop.copy_a.device_init();
+    mainloop.copy_b.device_init();
+
     auto thr_copy_a = mainloop.copy_a.get_slice(thread_idx);
     auto thr_copy_b = mainloop.copy_b.get_slice(thread_idx);

diff --git a/include/cute/atom/copy_traits_xe_2d.hpp b/include/cute/atom/copy_traits_xe_2d.hpp
index 04db9364..3a576022 100644
--- a/include/cute/atom/copy_traits_xe_2d.hpp
+++ b/include/cute/atom/copy_traits_xe_2d.hpp
@@ -125,7 +125,7 @@ struct Xe2DTraitsBase
     assert((height <= 0xFFFFFF) && "CuTe runtime error: block 2D tensor height exceeds 2^24");
     assert((pitch <= 0xFFFFFF) && "CuTe runtime error: block 2D tensor pitch exceeds 2^24");
 #endif
-    init_payload();
+    device_init();
   }

   template <class Op2, typename ValType2>
@@ -134,7 +134,7 @@ struct Xe2DTraitsBase
     : base_ptr(other.base_ptr), width(other.width), height(other.height), pitch(other.pitch),
       tiled_strides(other.tiled_strides)
   {
-    init_payload();
+    device_init();
   }

   // Initialize a previously-uninitialized atom.
@@ -145,7 +145,7 @@ struct Xe2DTraitsBase
   }

   CUTE_DEVICE
-  void init_payload() {
+  void device_init() const {
 #ifdef __SYCL_DEVICE_ONLY__
     payload = __builtin_IB_subgroup_createBlock2DAddressPayload(
       base_ptr,

This comment was marked as resolved.

Author

@anamikac-intel anamikac-intel Oct 17, 2025

@sanchitintel - are you getting 25.1 TFLOPs/s with igc 1.85 and 26.3 TFLOPs/s with igc 2.20 with printing disabled? I am getting 97.259 TFlop/s for the same config (5120x4096x4096x1) on igc 2.20, also without printing.

Disposition: Passed
Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [97.259]TFlop/s (1.7664)ms

Hi @anamikac-intel, I didn't use print while computing throughput.
But I was wrong about using libigc 2.20. It's 2.18.5-1188~25.04 on the system I used.
That must be what's causing poor perf.

@sanchitintel sanchitintel Oct 17, 2025

Just checked with igc 2.20 on @tdeng5's machine. I'm indeed seeing 98 TFLOPs/s for those input shapes, and there's no register spill!

Thank you!

tdeng5 pushed a commit that referenced this pull request Oct 17, 2025
Fixes a compilation failure found in #540 when >2D tensors are passed to
one of the `make_block_2d_copy_*` functions.
@Antonyvance Antonyvance added the urgent PR requires a urgent attention (for release or blocking another PR) label Oct 17, 2025
@Antonyvance Antonyvance added this to the 0.6 milestone Oct 17, 2025
@anamikac-intel
Author

anamikac-intel commented Oct 19, 2025

Performance results: new vs legacy implementation on different problem sizes (Tested on IGC 2.20)

[image: performance comparison, new vs legacy implementation]

@jiyang1011 jiyang1011 left a comment

LGTM

@rolandschulz rolandschulz merged commit 7feb377 into intel:main Oct 22, 2025
2 of 12 checks passed
rolandschulz pushed a commit that referenced this pull request Oct 23, 2025
#540 landed first.
#572 had an older base commit than #540's commit when it was merged.
Resolving conflict
