
Conversation

@publixsubfan (Contributor)

Summary

Adds a method FlatMap::create() for constructing a flat hash map from a batch of corresponding key-value pairs.

  • A template parameter ExecSpace is used to specify whether the batched construction happens on the CPU, GPU, or over multiple threads via OpenMP. The passed-in allocator argument must be accessible from the specified execution space.
  • If two equivalent keys are inserted, the key-value pair with the higher index is selected.

Also adds a benchmark example for FlatMap, which tests the performance of both batched insertion and lookups.

@publixsubfan added the Core (Issues related to Axom's 'core' component) and GPU (Issues related to GPU development) labels Jul 9, 2025
@kennyweiss requested a review from jcs15c July 9, 2025 23:05
@kennyweiss (Member) left a comment:

Thanks @publixsubfan !

Could you please post some performance results?

Please don't forget to update the RELEASE-NOTES.

bucket_count = axom::utilities::max(minBuckets, bucket_count);
// Get the smallest power-of-two number of groups satisfying:
// N * GroupSize - 1 >= minBuckets
// TODO: we should add a countl_zero overload for 64-bit integers
Member:

👍

{
  std::int32_t numGroups = std::ceil((bucket_count + 1) / (double)BucketsPerGroup);
- m_numGroups2 = 31 - (axom::utilities::countl_zero(numGroups));
+ m_numGroups2 = 32 - (axom::utilities::countl_zero(numGroups));
Member:

This change seems subtle/hard won. Do we have a unit test targeting this line?


AXOM_HOST_DEVICE bool tryLock()
{
int still_locked = 0;
Member:

Any chance the axom atomics can be used/updated to handle/help with this logic?
(Mostly b/c that could harden the axom atomics. If you think this is a one-off and not useful elsewhere, it's fine as is)

@publixsubfan (Contributor Author):

I think adding this to the axom atomics would be dependent on support from within RAJA for atomics with memory ordering. Otherwise the logic to implement that might get a little nasty.

Member:

IIRC, RAJA default atomics don't support memory ordering. RAJA can be configured to use desul atomics, which do support memory ordering. Unfortunately, we only support using those through the original RAJA atomic interface and so we only provide a default we define: https://github.com/LLNL/RAJA/blob/develop/include/RAJA/policy/desul/atomic.hpp#L22.

We should revisit whether we want to switch to desul atomics by default in RAJA. I think the last time we discussed this, there were still some cases where RAJA atomics were faster than desul. If we did switch to desul by default (which is what Kokkos uses), then we could support the full desul interface.

@publixsubfan let me know if you think we should go this route.

@publixsubfan (Contributor Author):

Maybe we could play around with a partial desul default? Something like "default for ordered atomics, but use the original backend for unordered"

@publixsubfan (Contributor Author):

I did have a PR for the ordered atomics here: llnl/RAJA#1616, if we wanted to try and clean that up.

Member:

Thanks -- since this is somewhat of a one-off and it's not super easy to consolidate it into axom::atomics, I think it's fine as is.

for(int i = 0; i < NUM_ELEMS; i++)
{
auto key = this->getKey(i);
auto value = this->getValue(i * 10.0 + 5.0);
Member:

Would it make sense to have a test that has repeated value entries?

I'd expect it to be handled properly, but might be good to add a test for it anyway.

Member:

Nice example!

@publixsubfan (Contributor Author)

@kennyweiss -- here are some performance graphs for construction and querying. These were "scaled" to be node-to-node comparisons, meaning we multiplied the ATS-2/ATS-4 numbers for each run by 4 to account for the 4 sockets. For CTS-2, we ran 2 MPI ranks with --cpu-bind=socket and 112 threads each, and summed the results for each run.

[Three performance graphs: batched construction and query performance, node-to-node comparisons]

@BradWhitlock (Member) left a comment:

Nice addition.

@publixsubfan force-pushed the feature/yang39/flatmap-gpu-insert branch from e574bfd to f83228f July 24, 2025 04:01
@publixsubfan merged commit daba41b into develop Jul 24, 2025
15 checks passed
@kennyweiss deleted the feature/yang39/flatmap-gpu-insert branch July 25, 2025 18:31