Skip to content

[MISC] Switch dedupe contact sort to use Quadrants bitonic sort#2853

Draft
hughperkins wants to merge 6 commits into
Genesis-Embodied-AI:mainfrom
hughperkins:hp/use-bitonic-sort-kv
Draft

[MISC] Switch dedupe contact sort to use Quadrants bitonic sort#2853
hughperkins wants to merge 6 commits into
Genesis-Embodied-AI:mainfrom
hughperkins:hp/use-bitonic-sort-kv

Conversation

@hughperkins
Copy link
Copy Markdown
Collaborator

Replaces the inlined 15-stage bitonic compare-exchange schedule in func_clamp_prune_and_sort_contacts_coop (phase 1a) with a one-line call to the new Quadrants subgroup primitive:

my_key, my_idx = qd.simt.subgroup.bitonic_sort_kv_tiled(my_key, my_idx, 5)

The primitive (added in quadrants hp/bitonic-sort-kv) is a @qd.func that inlines at compile time and unrolls the same 15 compare-exchange stages this code used to write inline, so the generated kernel IR is bit-identical to today on CUDA. Net change: ~30 lines of hand-rolled bitonic code removed, the sentinel load + write-back wrapper is unchanged, and the rest of the kernel (clamp + key init + bucket walk + phase 2 + phase 3) is untouched.

log2_size = 5 pins the sort to a 32-lane tile, matching the kernel's hard-coded block_dim = _K = 32. Using the tiled form rather than the bare bitonic_sort_kv(...) wrapper keeps the sort width fixed at 32 even on AMDGPU wave64, where the bare wrapper would otherwise sort across all 64 lanes and mix in garbage from the inactive upper half.

Requires the matching quadrants change to be installed (the public symbol qd.simt.subgroup.bitonic_sort_kv_tiled is added by that PR).

Description

Related Issue

Resolves Genesis-Embodied-AI/Genesis#

Motivation and Context

How Has This Been / Can This Be Tested?

Screenshots (if appropriate):

Checklist:

  • I read the CONTRIBUTING document.
  • I followed the Submitting Code Changes section of CONTRIBUTING document.
  • I tagged the title correctly (including BUG FIX/FEATURE/MISC/BREAKING)
  • I updated the documentation accordingly or no change is needed.
  • I tested my changes and added instructions on how to test it for reviewers.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

…rt_kv_tiled

Replaces the inlined 15-stage bitonic compare-exchange schedule in
``func_clamp_prune_and_sort_contacts_coop`` (phase 1a) with a one-line
call to the new Quadrants subgroup primitive:

    my_key, my_idx = qd.simt.subgroup.bitonic_sort_kv_tiled(my_key, my_idx, 5)

The primitive (added in quadrants hp/bitonic-sort-kv) is a @qd.func that
inlines at compile time and unrolls the same 15 compare-exchange stages
this code used to write inline, so the generated kernel IR is bit-identical
to today on CUDA. Net change: ~30 lines of hand-rolled bitonic code removed,
the sentinel load + write-back wrapper is unchanged, and the rest of the
kernel (clamp + key init + bucket walk + phase 2 + phase 3) is untouched.

``log2_size = 5`` pins the sort to a 32-lane tile, matching the kernel's
hard-coded ``block_dim = _K = 32``. Using the tiled form rather than the
bare ``bitonic_sort_kv(...)`` wrapper keeps the sort width fixed at 32 even
on AMDGPU wave64, where the bare wrapper would otherwise sort across all
64 lanes and mix in garbage from the inactive upper half.

Requires the matching quadrants change to be installed (the public symbol
``qd.simt.subgroup.bitonic_sort_kv_tiled`` is added by that PR).
7 lines of prose -> 3.  Same intent: explain why we use the tiled form with
``log2_size = 5`` rather than the bare ``bitonic_sort_kv`` wrapper.
Pair ``_K = qd.static(32)`` with ``_LOG2_K = qd.static(5)`` at the top
of the kernel and pass ``_LOG2_K`` into ``bitonic_sort_kv_tiled``.  The
relationship between the sort width and ``_K`` is now visible at the
binding site instead of being a magic 5 sitting next to ``_K = 32``.
With ``_K`` and ``_LOG2_K`` defined together at the top of the kernel
and ``_LOG2_K`` flowing straight into ``bitonic_sort_kv_tiled``, the
explainer ("pins the tiled sort to _K lanes on every backend ...")
just restates what the names already convey.
``qd.static(32)`` is just the int ``32`` at compile time, so
``_K.bit_length() - 1`` evaluates to ``5`` and keeps _K and _LOG2_K
in sync if _K is ever retuned.
``qd.static()`` is a no-op on Python int literals -- it evaluates its
argument at compile time, and a plain ``32`` is already a Python
compile-time int.  Several other Genesis solver files wrap kernel-scope
``BLOCK_DIM`` / ``WARP_SIZE`` / ``_K`` constants this way as a defensive
marker, but it doesn't change codegen and the bare ints read more
directly.
if not their_lt_mine and (their_key != my_key or their_idx != my_idx):
my_key = their_key
my_idx = their_idx
my_key, my_idx = qd.simt.subgroup.bitonic_sort_kv_tiled(my_key, my_idx, _LOG2_K)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have an API that does not require specifying _LOG2_K? Like this could be the default value, because I guess block_dim can be queried from context no? if not, then we should probably expose it. It should not be that hard, since it is already possible to set it from python.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants