Use newer version of mma_atom and copy_atom in 00_bmg_gemm #540
Conversation
…d_copy_*, and move tensor/copy initialization to host-side params in to_underlying_arguments
Approving with the minor changes suggested above.
Edit -- there is a bug in the TiledCopy handling that needs fixing, described below.
Theoretical bf16 peak performance for BMG is 116 TF/s, so the new reported performance is too high. Either there's a problem in the kernel (it isn't doing the full computation) or something's wrong with the performance computation.
```cpp
CopyOpA copy_a;
CopyOpB copy_b;
```
Hi @anamikac-intel -- the TiledCopy operations need to be device-initialized (see the docs) before they can be used. That is why you're seeing page faults/anomalously high performance.
One option is to copy the Arguments into Params and create copy_a/b only inside the kernel.
Another option is to re-init the tiled_copy instances inside the kernel. Let me make some helpers for you if you want to choose this path, since it requires less code modification.
I put a suggested patch for option 2 below. The main thing here is the addition of the device_init at the beginning of the kernel. I would like to remove the need for device_init, but we're waiting on the necessary support from IGC.
```diff
diff --git a/include/cutlass/gemm/collective/xe_mma.hpp b/include/cutlass/gemm/collective/xe_mma.hpp
index 34e8af98..d8925af9 100644
--- a/include/cutlass/gemm/collective/xe_mma.hpp
+++ b/include/cutlass/gemm/collective/xe_mma.hpp
@@ -221,6 +221,9 @@ struct CollectiveMma<MainloopXeL1Staged<Stages, Schedule>, TileShape_, ElementA_
     static_assert(is_rmem<FrgTensorD>::value, "D tensor must be rmem resident.");
     static_assert(is_rmem<FrgTensorC>::value, "C tensor must be rmem resident.");

+    mainloop.copy_a.device_init();
+    mainloop.copy_b.device_init();
+
     auto thr_copy_a = mainloop.copy_a.get_slice(thread_idx);
     auto thr_copy_b = mainloop.copy_b.get_slice(thread_idx);

diff --git a/include/cute/atom/copy_traits_xe_2d.hpp b/include/cute/atom/copy_traits_xe_2d.hpp
index 04db9364..3a576022 100644
--- a/include/cute/atom/copy_traits_xe_2d.hpp
+++ b/include/cute/atom/copy_traits_xe_2d.hpp
@@ -125,7 +125,7 @@ struct Xe2DTraitsBase
     assert((height <= 0xFFFFFF) && "CuTe runtime error: block 2D tensor height exceeds 2^24");
     assert((pitch <= 0xFFFFFF) && "CuTe runtime error: block 2D tensor pitch exceeds 2^24");
 #endif
-    init_payload();
+    device_init();
   }

   template <class Op2, typename ValType2>
@@ -134,7 +134,7 @@ struct Xe2DTraitsBase
     : base_ptr(other.base_ptr), width(other.width), height(other.height), pitch(other.pitch),
       tiled_strides(other.tiled_strides)
   {
-    init_payload();
+    device_init();
   }

   // Initialize a previously-uninitialized atom.
@@ -145,7 +145,7 @@ struct Xe2DTraitsBase
   }

   CUTE_DEVICE
-  void init_payload() {
+  void device_init() const {
 #ifdef __SYCL_DEVICE_ONLY__
     payload = __builtin_IB_subgroup_createBlock2DAddressPayload(
       base_ptr,
```
@sanchitintel - are you getting 25.1 TFLOPs/s with igc 1.85 and 26.3 TFLOPs/s with igc 2.20 without print enabled? I am getting 97.259 TFlop/s for the same config (5120x4096x4096x1) on igc 2.20 (without print):
```
Disposition: Passed
Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [97.259]TFlop/s (1.7664)ms
```
Hi @anamikac-intel, I didn't use print while computing throughput.
But I was wrong about using libigc 2.20. It's actually 2.18.5-1188~25.04 on the system I used. That must be what's causing the poor perf.
Just checked with igc 2.20 on @tdeng5's machine. I'm indeed seeing 98 TFLOPs/s for those input shapes, and there's no register spill!
Thank you!
Fixes a compilation failure found in #540 when >2D tensors are passed to one of the `make_block_2d_copy_*` functions.
…ck 2D Copy Utilities
LGTM


Modify 00_bmg_gemm to include new mma and copy atoms (#477).
00_bmg_gemm combines two parts: the mma and the epilogue. To add the new atom changes, we need to update both parts, since they currently use the old atoms. As a starting point, we will:
Old Atom:
```
Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [96.448]TFlop/s (1.7813)ms
```
New Atom:
```
Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [97.259]TFlop/s (1.7664)ms
```
This also depends on the new copy_c/copy_d APIs for load/store (#572).