Skip to content

[SYCL][NVPTX] Optimize ID queries when they fit in int #18999

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

frasercrmck
Copy link
Contributor

The NVPTX target was unable to properly optimize the global ID query, despite the user specifying the -fsycl-id-queries-fit-in-int flag.

This is because, once linked, the compiler sees the global ID builtin as (i64 add (mul (i64 zext i32 A), (i64 zext i32 B), (i64 zext i32 C))). Despite knowing that each of A, B and C are 32-bit values, and the final result fits in a 32-bit value, it is not legal to replace this sequence with (i64 zext (add i32 (mul i32 A, B), C)), which is the ideal code here.

The solution to this problem is a new opt-in 'reflection' in the NVPTX implementation of the global ID builtin, which selects a more optimal version. The driver enables this reflection only when the user passes -fsycl-id-queries-fit-in-int.

The NVPTX target was unable to properly optimize the global ID query, despite
the user specifying the -fsycl-id-queries-fit-in-int flag.

This is because, once linked, the compiler sees the global ID builtin as
(i64 add (mul (i64 zext i32 A), (i64 zext i32 B), (i64 zext i32 C))).
Despite knowing that each of A, B and C are 32-bit values, and the final
result fits in a 32-bit value, it is not legal to replace this sequence
with (i64 zext (add i32 (mul i32 A, B), C)), which is the ideal code
here.

The solution to this problem is a new opt-in 'reflection' in the NVPTX
implementation of the global ID builtin, which selects a more optimal
version. The driver enables this reflection only when the user passes
-fsycl-id-queries-fit-in-int.
@frasercrmck
Copy link
Contributor Author

With this patch, a SYCL kernel such as

  q.submit([&](sycl::handler& cgh){
    cgh.parallel_for<GetGlobalId>(sycl::nd_range<1>{N, 1024}, [=](sycl::nd_item<1> item){
      unsigned int globalId{static_cast<unsigned int>(item.get_global_id()[0])};
      A[globalId] = globalId;
    });
  });

will be optimized to

.weak .entry _ZTS11GetGlobalId(
.param .u64 .ptr .global .align 4 _ZTS11GetGlobalId_param_0
)
{
.reg .b32 %r<5>;
.reg .b64 %rd<4>;

//
ld.param.b64 %rd1, [_ZTS11GetGlobalId_param_0];
mov.u32 %r1, %ctaid.x;
mov.u32 %r2, %ntid.x;
mov.u32 %r3, %tid.x;
mad.lo.s32 %r4, %r2, %r1, %r3;
mul.wide.u32 %rd2, %r4, 4;
add.s64 %rd3, %rd1, %rd2;
st.global.b32 [%rd3], %r4;
ret;
//
}

as opposed to the bad old days (before this patch) where we got

.weak .entry _ZTS11GetGlobalId(
.param .u64 .ptr .global .align 4 _ZTS11GetGlobalId_param_0
)
{
.reg .b32 %r<4>;
.reg .b64 %rd<8>;

//
ld.param.b64 %rd1, [_ZTS11GetGlobalId_param_0];
mov.u32 %r1, %ctaid.x;
mov.u32 %r2, %ntid.x;
mul.wide.u32 %rd2, %r2, %r1;
mov.u32 %r3, %tid.x;
cvt.u64.u32 %rd3, %r3;
add.s64 %rd4, %rd2, %rd3;
and.b64 %rd5, %rd4, 4294967295;
shl.b64 %rd6, %rd5, 2;
add.s64 %rd7, %rd1, %rd6;
st.global.b32 [%rd7], %rd4;
ret;
//
}

Copy link
Contributor

@ldrumm ldrumm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here I was, happy in the state of having forgotten about nvvm_reflect...

Copy link
Contributor

@mdtoguchi mdtoguchi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK for driver

@steffenlarsen steffenlarsen merged commit 2a58675 into intel:sycl Jun 17, 2025
28 of 29 checks passed
@frasercrmck frasercrmck deleted the sycl-nvptx-id-queries-fit-in-int branch June 17, 2025 08:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants