
Conversation

@publixsubfan (Contributor)

Summary

Adds a method FlatMap::create() for constructing a flat hash map from a batch of corresponding key-value pairs.

  • A template parameter ExecSpace is used to specify whether the batched construction happens on the CPU, GPU, or over multiple threads via OpenMP. The passed-in allocator argument must be accessible from the specified execution space.
  • If two equivalent keys are inserted, the key-value pair with the higher index is selected.

Also adds a benchmark example for FlatMap, which tests the performance of both batched insertion and lookups.

@publixsubfan added the Core (Issues related to Axom's 'core' component) and GPU (Issues related to GPU development) labels Jul 9, 2025
@kennyweiss requested a review from jcs15c July 9, 2025 23:05
@kennyweiss (Member) left a comment:

Thanks @publixsubfan !

Could you please post some performance results?

Please don't forget to update the RELEASE-NOTES.

bucket_count = axom::utilities::max(minBuckets, bucket_count);
// Get the smallest power-of-two number of groups satisfying:
// N * GroupSize - 1 >= minBuckets
// TODO: we should add a countl_zero overload for 64-bit integers
Member:

👍

{
  std::int32_t numGroups = std::ceil((bucket_count + 1) / (double)BucketsPerGroup);
- m_numGroups2 = 31 - (axom::utilities::countl_zero(numGroups));
+ m_numGroups2 = 32 - (axom::utilities::countl_zero(numGroups));
Member:

This change seems subtle/hard won. Do we have a unit test targeting this line?


AXOM_HOST_DEVICE bool tryLock()
{
int still_locked = 0;
Member:

Any chance the axom atomics can be used/updated to handle/help with this logic?
(Mostly b/c that could harden the axom atomics. If you think this is a one-off and not useful elsewhere, it's fine as is)

@publixsubfan (Contributor Author):

I think adding this to the axom atomics would be dependent on support from within RAJA for atomics with memory ordering. Otherwise the logic to implement that might get a little nasty.

Member:

IIRC, RAJA default atomics don't support memory ordering. RAJA can be configured to use desul atomics, which do support memory ordering. Unfortunately, we only support using those through the original RAJA atomic interface and so we only provide a default we define: https://github.com/LLNL/RAJA/blob/develop/include/RAJA/policy/desul/atomic.hpp#L22.

We should revisit whether we want to switch to desul atomics by default in RAJA. I think the last time we discussed this, there were still some cases where RAJA atomics were faster than desul. If we did switch to desul by default (which is what Kokkos uses), then we could support the full desul interface.

@publixsubfan let me know if you think we should go this route.

@publixsubfan (Contributor Author):

Maybe we could play around with a partial desul default? Something like "default for ordered atomics, but use the original backend for unordered"

@publixsubfan (Contributor Author):

I did have a PR for the ordered atomics here: llnl/RAJA#1616, if we wanted to try and clean that up.

Member:

Thanks -- since this is somewhat of a one-off and it's not super easy to consolidate it into axom::atomics, I think it's fine as is.

for(int i = 0; i < NUM_ELEMS; i++)
{
auto key = this->getKey(i);
auto value = this->getValue(i * 10.0 + 5.0);
Member:

Would it make sense to have a test that has repeated value entries?

I'd expect it to be handled properly, but might be good to add a test for it anyway.

Member:

Nice example!

@publixsubfan (Contributor Author)

@kennyweiss -- here are some performance graphs for construction and querying. These were "scaled" to be node-to-node comparisons, meaning we multiplied the ATS-2/ATS-4 numbers for each run by 4 to account for the 4 sockets. For CTS-2, we ran 2 MPI ranks with --cpu-bind=socket and 112 threads each, and summed the results for each run.

[Three performance graphs: batched construction and query performance, node-to-node comparisons]

@BradWhitlock (Member) left a comment:

Nice addition.

@publixsubfan force-pushed the feature/yang39/flatmap-gpu-insert branch from e574bfd to f83228f July 24, 2025 04:01
@publixsubfan merged commit daba41b into develop Jul 24, 2025
15 checks passed
@kennyweiss deleted the feature/yang39/flatmap-gpu-insert branch July 25, 2025 18:31