[MISC] Switch dedupe contact sort to use Quadrants bitonic sort#2853
Draft
hughperkins wants to merge 6 commits into
Draft
[MISC] Switch dedupe contact sort to use Quadrants bitonic sort#2853hughperkins wants to merge 6 commits into
hughperkins wants to merge 6 commits into
Conversation
…rt_kv_tiled
Replaces the inlined 15-stage bitonic compare-exchange schedule in
``func_clamp_prune_and_sort_contacts_coop`` (phase 1a) with a one-line
call to the new Quadrants subgroup primitive:
my_key, my_idx = qd.simt.subgroup.bitonic_sort_kv_tiled(my_key, my_idx, 5)
The primitive (added in quadrants hp/bitonic-sort-kv) is a @qd.func that
inlines at compile time and unrolls the same 15 compare-exchange stages
this code used to write inline, so the generated kernel IR is bit-identical
to today on CUDA. Net change: ~30 lines of hand-rolled bitonic code removed,
the sentinel load + write-back wrapper is unchanged, and the rest of the
kernel (clamp + key init + bucket walk + phase 2 + phase 3) is untouched.
``log2_size = 5`` pins the sort to a 32-lane tile, matching the kernel's
hard-coded ``block_dim = _K = 32``. Using the tiled form rather than the
bare ``bitonic_sort_kv(...)`` wrapper keeps the sort width fixed at 32 even
on AMDGPU wave64, where the bare wrapper would otherwise sort across all
64 lanes and mix in garbage from the inactive upper half.
Requires the matching quadrants change to be installed (the public symbol
``qd.simt.subgroup.bitonic_sort_kv_tiled`` is added by that PR).
7 lines of prose -> 3. Same intent: explain why we use the tiled form with ``log2_size = 5`` rather than the bare ``bitonic_sort_kv`` wrapper.
Pair ``_K = qd.static(32)`` with ``_LOG2_K = qd.static(5)`` at the top of the kernel and pass ``_LOG2_K`` into ``bitonic_sort_kv_tiled``. The relationship between the sort width and ``_K`` is now visible at the binding site instead of being a magic 5 sitting next to ``_K = 32``.
With ``_K`` and ``_LOG2_K`` defined together at the top of the kernel
and ``_LOG2_K`` flowing straight into ``bitonic_sort_kv_tiled``, the
explainer ("pins the tiled sort to _K lanes on every backend ...")
just restates what the names already convey.
``qd.static(32)`` is just the int ``32`` at compile time, so ``_K.bit_length() - 1`` evaluates to ``5`` and keeps _K and _LOG2_K in sync if _K is ever retuned.
``qd.static()`` is a no-op on Python int literals -- it evaluates its argument at compile time, and a plain ``32`` is already a Python compile-time int. Several other Genesis solver files wrap kernel-scope ``BLOCK_DIM`` / ``WARP_SIZE`` / ``_K`` constants this way as a defensive marker, but it doesn't change codegen and the bare ints read more directly.
duburcqa
reviewed
Jun 1, 2026
| if not their_lt_mine and (their_key != my_key or their_idx != my_idx): | ||
| my_key = their_key | ||
| my_idx = their_idx | ||
| my_key, my_idx = qd.simt.subgroup.bitonic_sort_kv_tiled(my_key, my_idx, _LOG2_K) |
Collaborator
There was a problem hiding this comment.
Could we have an API that does not require specifying _LOG2_K? Like this could be the default value, because I guess block_dim can be queried from context no? if not, then we should probably expose it. It should not be that hard, since it is already possible to set it from python.
duburcqa
approved these changes
Jun 1, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Replaces the inlined 15-stage bitonic compare-exchange schedule in
func_clamp_prune_and_sort_contacts_coop(phase 1a) with a one-line call to the new Quadrants subgroup primitive:The primitive (added in quadrants hp/bitonic-sort-kv) is a @qd.func that inlines at compile time and unrolls the same 15 compare-exchange stages this code used to write inline, so the generated kernel IR is bit-identical to today on CUDA. Net change: ~30 lines of hand-rolled bitonic code removed, the sentinel load + write-back wrapper is unchanged, and the rest of the kernel (clamp + key init + bucket walk + phase 2 + phase 3) is untouched.
log2_size = 5pins the sort to a 32-lane tile, matching the kernel's hard-codedblock_dim = _K = 32. Using the tiled form rather than the barebitonic_sort_kv(...)wrapper keeps the sort width fixed at 32 even on AMDGPU wave64, where the bare wrapper would otherwise sort across all 64 lanes and mix in garbage from the inactive upper half.Requires the matching quadrants change to be installed (the public symbol
qd.simt.subgroup.bitonic_sort_kv_tiledis added by that PR).Description
Related Issue
Resolves Genesis-Embodied-AI/Genesis#
Motivation and Context
How Has This Been / Can This Be Tested?
Screenshots (if appropriate):
Checklist:
Submitting Code Changessection of CONTRIBUTING document.