[libcu++] Fix the default device pool getter by pciolkosz · Pull Request #9351 · NVIDIA/cccl

pciolkosz · 2026-06-10T00:04:01Z

We cache default mempools in function-local statics. While this is fine for managed and pinned memory, where there is only one pool, it is wrong for device memory. Whichever device is passed to the first call of this function is used to get a mempool, and that mempool is then returned no matter which device is specified later. This PR fixes that by adding a per-device cache.

We unfortunately need to cache the mempool, because we return a device_memory_pool_ref& to play nicer with resource_ref. For now I made a localized change in the mempool header to make it easier to back-port. Long term we can think about moving the storage to the physical device class, but that has its own set of problems: header dependencies or a need for type-erased storage.
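To make the failure mode concrete, here is a minimal, CUDA-free sketch of the two caching patterns. `fake_pool`, `query_default_pool`, and `assumed_device_count` are hypothetical stand-ins rather than libcu++ names, and the per-device variant fills its cache eagerly for brevity, which is not how the PR does it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a pool handle; the real code deals with cudaMemPool_t.
struct fake_pool
{
  int device_id;
};

// Assumed device count for this sketch only.
inline constexpr int assumed_device_count = 4;

// Hypothetical stand-in for the driver query: one distinct pool per device.
inline fake_pool query_default_pool(int device_id)
{
  return fake_pool{device_id};
}

// Buggy pattern: the static is initialized by the first call only, so the
// argument of every later call is ignored.
inline fake_pool& buggy_default_pool(int device_id)
{
  static fake_pool pool = query_default_pool(device_id);
  return pool;
}

// Per-device cache: one slot per device, so each device gets its own pool.
inline fake_pool& per_device_default_pool(int device_id)
{
  assert(device_id >= 0 && device_id < assumed_device_count);
  static std::vector<fake_pool> pools = [] {
    std::vector<fake_pool> result;
    for (int id = 0; id < assumed_device_count; ++id)
    {
      result.push_back(query_default_pool(id));
    }
    return result;
  }();
  return pools[static_cast<std::size_t>(device_id)];
}
```

Calling `buggy_default_pool(0)` and then `buggy_default_pool(1)` hands back the same object both times, while `per_device_default_pool` returns a distinct slot for each device index.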

coderabbitai · 2026-06-10T00:12:33Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

This PR refactors CUDA memory pool attribute handling and implements per-device default pool caching. It extracts attribute machinery and device-capability checks into a dedicated attributes.h header, removes duplicated code from memory_pool_base.h, adds per-device caching infrastructure to physical_device, reimplements device_default_memory_pool with per-device storage, and updates downstream headers with the new include dependency.

Changes

Memory pool attributes & per-device default pool

| Layer / File(s) | Summary |
| --- | --- |
| Add memory-pool attributes and helpers header `libcudacxx/include/cuda/__memory_pool/attributes.h` | New header exposing `__pool_attr` template machinery for reading and conditionally setting `cudaMemPoolAttr` values, `memory_pool_attributes` typed aliases and constexpr objects for common attributes, device capability checks (`__verify_device_supports_stream_ordered_allocations`, `__verify_device_supports_export_handle_type`), and `__get_default_memory_pool` resolver with CUDA-version-gated behavior for Toolkit >= 12.9 and >= 13.0. |
| Extract attributes from memory_pool_base.h `libcudacxx/include/cuda/__memory_pool/memory_pool_base.h` | Remove internal attribute templates, `memory_pool_attributes` namespace, and device-support helper functions; include the new attributes header instead to consolidate machinery. |
| physical_device: per-device default pool cache and getter `libcudacxx/include/cuda/__device/physical_device.h` | Add hosted-only `std::once_flag` and `cudaMemPool_t` members; implement `__get_default_memory_pool()` method using `std::call_once` in hosted builds to lazily retrieve and cache the device's default memory pool via `cuda::__get_default_memory_pool` with `CUmemLocation` targeting the current device and pinned allocation type. |
| Per-device default pool cache in device_memory_pool.h `libcudacxx/include/cuda/__memory_pool/device_memory_pool.h` | Add C++ standard utility includes; reimplement `device_default_memory_pool(device_ref)` to maintain a static array of `optional` entries (one per physical device) lazily initialized on first access, replacing the single process-wide static pool and delegating to the device's `__get_default_memory_pool()`. |
| Propagate attributes include to pool headers `libcudacxx/include/cuda/__memory_pool/managed_memory_pool.h`, `pinned_memory_pool.h`, `shared_device_memory_pool.h`, `shared_memory_pool_base.h`, `shared_pinned_memory_pool.h` | Add conditional includes of `cuda/__memory_pool/attributes.h` under appropriate CTK version and toolkit availability guards. |
| Test per-device default pool isolation `libcudacxx/test/libcudacxx/cuda/memory_resource/resources/device_memory_resource.cu` | Add `cuda/devices` include; extend "device_memory_pool comparison" test with conditional assertions (when `cuda::devices.size() > 1`) validating that default pools differ across physical devices via pointer inequality and `==`/`!=` operator semantics. |

Suggested reviewers

Jacobfaib
davebayer

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 087c594 and 1d369a1.

📒 Files selected for processing (2)

libcudacxx/include/cuda/__memory_pool/device_memory_pool.h
libcudacxx/test/libcudacxx/cuda/memory_resource/resources/device_memory_resource.cu

miscco

I believe this is an overly complicated approach. We already have extensive machinery in place for devices, and I believe we should just store a `cudaMemPool_t` there too, guarded by a `std::once_flag`.

Then we only need to fetch it when it is actually needed.

We need to create a default memory pool per device. The best place to do that is `physical_device`, because it already has most of the machinery for one-time initialization.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (3)

libcudacxx/include/cuda/__memory_pool/attributes.h (2)
136-168: ⚡ Quick win

suggestion: Namespace-scope constexpr variables should use inline.

Per coding guidelines: "All constexpr variables at namespace/global scope must use inline, including template variables."
-static constexpr release_threshold_t release_threshold{};
+inline constexpr release_threshold_t release_threshold{};
Apply similarly to all other static constexpr declarations in this namespace.

Source: Coding guidelines
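As a small illustration of the guideline, the sketch below declares a namespace-scope object and a variable template with `inline constexpr` (C++17). The namespace `attributes_sketch` and the trait `is_attribute_v` are hypothetical names chosen for this example; only `release_threshold_t` and `release_threshold` echo the diff above.

```cpp
// Hypothetical tag type mirroring the shape of the attribute objects.
struct release_threshold_t
{};

namespace attributes_sketch
{
// With `static constexpr`, every translation unit that includes the header
// would get its own private object. `inline constexpr` declares a single
// object that all translation units share.
inline constexpr release_threshold_t release_threshold{};

// The same rule applies to variable templates and their specializations.
template <class T>
inline constexpr bool is_attribute_v = false;

template <>
inline constexpr bool is_attribute_v<release_threshold_t> = true;
} // namespace attributes_sketch

// Because the variable is inline, this inline function refers to the same
// object in every translation unit, which keeps its definitions identical.
inline const release_threshold_t* release_threshold_address()
{
  return &attributes_sketch::release_threshold;
}

static_assert(attributes_sketch::is_attribute_v<release_threshold_t>, "tag type is recognized");
```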

61-72: ⚡ Quick win

suggestion: Multiple functions in attributes.h are missing _CCCL_HOST_API annotations.

Per coding guidelines, all functions must be marked with _CCCL_HOST_API, _CCCL_DEVICE_API, or _CCCL_API. The following need annotations:

__pool_attr_impl::set (line 61)

__set_attribute_non_zero_only (line 104)

__pool_attr<::cudaMemPoolAttrReservedMemHigh>::set (line 117)

__pool_attr<::cudaMemPoolAttrUsedMemHigh>::set (line 127)

__is_host_memory_pool_supported (line 171)

__verify_device_supports_stream_ordered_allocations (line 189)

__verify_device_supports_export_handle_type (line 219)

Source: Coding guidelines
libcudacxx/include/cuda/__memory_pool/device_memory_pool.h (1)
98-102: 💤 Low value

suggestion: Use ::cuda::std::size_t instead of plain size_t.

Per coding guidelines, standard integer type aliases should be fully qualified from cuda::std.
-    const size_t __device_count = ::cuda::__physical_devices().size();
+    const ::cuda::std::size_t __device_count = ::cuda::__physical_devices().size();
     ::cuda::std::unique_ptr<::cuda::std::optional<device_memory_pool_ref>[]> __pools{
       static_cast<::cuda::std::optional<device_memory_pool_ref>*>(
         ::operator new[](sizeof(::cuda::std::optional<device_memory_pool_ref>) * __device_count))};
-    for (size_t __device = 0; __device < __device_count; ++__device)
+    for (::cuda::std::size_t __device = 0; __device < __device_count; ++__device)
Source: Coding guidelines

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 920a05a and cd1f763.

📒 Files selected for processing (10)

libcudacxx/include/cuda/__device/physical_device.h
libcudacxx/include/cuda/__memory_pool/attributes.h
libcudacxx/include/cuda/__memory_pool/device_memory_pool.h
libcudacxx/include/cuda/__memory_pool/managed_memory_pool.h
libcudacxx/include/cuda/__memory_pool/memory_pool_base.h
libcudacxx/include/cuda/__memory_pool/pinned_memory_pool.h
libcudacxx/include/cuda/__memory_pool/shared_device_memory_pool.h
libcudacxx/include/cuda/__memory_pool/shared_managed_memory_pool.h
libcudacxx/include/cuda/__memory_pool/shared_memory_pool_base.h
libcudacxx/include/cuda/__memory_pool/shared_pinned_memory_pool.h

✅ Files skipped from review due to trivial changes (5)

libcudacxx/include/cuda/__memory_pool/shared_memory_pool_base.h
libcudacxx/include/cuda/__memory_pool/shared_pinned_memory_pool.h
libcudacxx/include/cuda/__memory_pool/managed_memory_pool.h
libcudacxx/include/cuda/__memory_pool/shared_device_memory_pool.h
libcudacxx/include/cuda/__memory_pool/pinned_memory_pool.h

davebayer · 2026-06-10T11:16:23Z

+  static ::cuda::std::unique_ptr<::cuda::std::optional<device_memory_pool_ref>[]> __pools_ = []() {
+    const size_t __device_count = ::cuda::__physical_devices().size();
+    ::cuda::std::unique_ptr<::cuda::std::optional<device_memory_pool_ref>[]> __pools{
+      static_cast<::cuda::std::optional<device_memory_pool_ref>*>(
+        ::operator new[](sizeof(::cuda::std::optional<device_memory_pool_ref>) * __device_count))};
+    for (size_t __device = 0; __device < __device_count; ++__device)
+    {
+      ::cuda::std::__construct_at(__pools.get() + __device, ::cuda::std::nullopt);
+    }
+    return __pools;
+  }();
+
+  auto& __pool = __pools_[__device.get()];
+  if (!__pool.has_value())
+  {
+    __pool.emplace(::cuda::__physical_devices()[__device.get()].__get_default_memory_pool());
+  }
+  return *__pool;


I believe we should still have an array of once_flag to make sure we initialize each optional only once, even when 2 threads execute this code at the same time.

I disagree; we only ever write the same bit pattern, so there is no observable race here.

The worst-case scenario is that 2 threads write ptr, true at the same time.

> The worst-case scenario is that 2 threads write ptr, true at the same time.

Strong disagree. We should still be semantically correct. If users are using TSAN then this will (rightly) report a race.

Not to mention the rare (but possible!) case of a thread writing the true but being suspended before it can write ptr. Or writing both true and ptr but these straddle a cache-line, and only the cache-line with true getting broadcast to other cores in time.
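One way to picture the "array of once_flag" suggestion is the CUDA-free sketch below. It is not the code that was merged; `fake_pool`, `pool_slot`, `slot_init_count`, and the fixed device count are assumptions made for illustration. Each slot pairs its own `std::once_flag` with the cached value, so concurrent first calls for the same device run the initializer exactly once and later callers see a fully written slot.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the cached pool reference.
struct fake_pool
{
  int device_id;
};

// Counts initializer runs, only so the behavior can be observed.
inline std::atomic<int> slot_init_count{0};

// One flag per device: std::call_once synchronizes the write of `pool`
// with every later read that goes through the same flag.
struct pool_slot
{
  std::once_flag once;
  fake_pool pool{-1};
};

inline fake_pool& default_pool_once(int device_id)
{
  constexpr int device_count = 2; // assumed for this sketch
  assert(device_id >= 0 && device_id < device_count);

  static std::unique_ptr<pool_slot[]> slots = std::make_unique<pool_slot[]>(device_count);

  pool_slot& slot = slots[static_cast<std::size_t>(device_id)];
  std::call_once(slot.once, [&slot, device_id] {
    slot.pool = fake_pool{device_id}; // stands in for querying the real pool
    ++slot_init_count;
  });
  return slot.pool;
}
```

Repeated calls for the same device return the same object and leave `slot_init_count` unchanged after the first one.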

Jacobfaib · 2026-06-10T13:35:05Z

+        ::operator new[](sizeof(::cuda::std::optional<device_memory_pool_ref>) * __device_count))};
+    for (size_t __device = 0; __device < __device_count; ++__device)
+    {
+      ::cuda::std::__construct_at(__pools.get() + __device, ::cuda::std::nullopt);
+    }


Why can't we just use regular new[__device_count] here?

Because we do not want to create pools on all devices when no one asked for them.

We only create the ones that are actually asked for.

Yeah, but it's an optional now. We can default-construct those (they'll just be nullopt).
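The point about default construction can be checked with a short sketch. It uses `std::optional<int>` as a stand-in for `::cuda::std::optional<device_memory_pool_ref>` (an assumption for the sake of a CUDA-free example), and `count_engaged_after_allocation` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <memory>
#include <optional>

// Allocates `count` optionals the ordinary way and reports how many of them
// hold a value right after allocation.
inline std::size_t count_engaged_after_allocation(std::size_t count)
{
  // make_unique<T[]>(n) value-initializes every element. For optional that
  // means "disengaged": no contained object is constructed.
  std::unique_ptr<std::optional<int>[]> slots = std::make_unique<std::optional<int>[]>(count);

  std::size_t engaged = 0;
  for (std::size_t i = 0; i < count; ++i)
  {
    if (slots[i].has_value())
    {
      ++engaged;
    }
  }
  return engaged;
}
```

Since every element starts out empty, nothing is created until a slot is explicitly emplaced, so the raw `operator new[]` plus `__construct_at` loop is not needed to keep initialization lazy.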

pciolkosz · 2026-06-10T16:41:44Z

    return ::cuda::std::span<const device_ref>{__peers_.get(), __num_peers_};
  }
+
+  [[nodiscard]] _CCCL_HOST_API ::cudaMemPool_t __get_default_memory_pool()


Why do we need to cache it twice, once as cudaMemPool_t and once as device_memory_pool_ref? I would move the init_once to the mempool getter and remove this, so we end up caching only once and we fix the race. Win-win.

We can do that, which would also make the others happy. I do not really care too much honestly

Jacobfaib · 2026-06-10T20:22:25Z

+    ::std::call_once(__once_, [this, __device]() {
+      this->__init(__device);
+    });


Do we really need to go through this whole shebang? I feel like a unique_ptr<optional<cudaMemPool_t>[]> with a small mutex zone is far more readable/maintainable than this multi-step process. Something like

static unique_ptr<optional<::cudaMemPool_t>[]> __pools_ = ::new optional<::cudaMemPool_t>[::cuda::__physical_devices_count()];
static ::std::mutex mut;
const auto _ = std::lock_guard{mut};
auto& __p_opt = __pools_[__device.get()];
if (!__p_opt.has_value())
{
  __p_opt.emplace(/* create mempool */);
}
return device_memory_pool_ref{*__p_opt};

I realize this changes the signature to returning a device_memory_pool_ref by value, but these are trivially cheap to construct anyhow. I don't think anyone is relying on the fact that it's exactly an lvalue ref.

If we return it by value, it's a footgun with resource_ref, like:

auto ref = cuda::mr::resource_ref{cuda::device_default_memory_pool(dev0)};

This will leave a dangling reference. I think there is value in returning it by reference, and it's not like the current implementation is crazy complex.
Also, I wanted a union to avoid pulling in optional; I feel like it's overkill here.
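The lifetime concern can be sketched without CUDA. `pool_ref` and `resource_view` below are hypothetical types: `resource_view` merely stores the address of the object it is given, which is how the comment above describes `resource_ref` behaving. The sketch does not use or make claims about the real `cuda::mr::resource_ref`.

```cpp
// Hypothetical cheap handle wrapper, standing in for device_memory_pool_ref.
struct pool_ref
{
  int handle;
};

// Hypothetical non-owning wrapper that remembers the address of its target.
struct resource_view
{
  const pool_ref* target;

  explicit resource_view(const pool_ref& ref)
      : target(&ref)
  {}
};

// Getter that returns a reference to storage with static lifetime.
inline pool_ref& cached_pool()
{
  static pool_ref pool{42};
  return pool;
}

// Getter that returns a copy; the copy is a temporary at the call site.
inline pool_ref pool_by_value()
{
  return cached_pool();
}

// Safe: the view points at the static object, which outlives the view.
inline bool view_targets_cached_storage()
{
  resource_view view{cached_pool()};
  return view.target == &cached_pool();
}

// Unsafe, therefore left commented out: the temporary returned by
// pool_by_value() dies at the end of the full expression, so `view.target`
// would dangle on the next line.
//   resource_view view{pool_by_value()};
```

Returning a reference to cached storage keeps the wrapped address valid for the rest of the program, which is the property the by-reference signature preserves.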

OK you can still return a reference since we leak these mempool handles anyways:

static unique_ptr<optional<device_memory_pool_ref>[]> __pools_ = ::new optional<device_memory_pool_ref>[::cuda::__physical_devices_count()];
static ::std::mutex mut;
const auto _ = std::lock_guard{mut};
auto& __p_opt = __pools_[__device.get()];
if (!__p_opt.has_value())
{
  cudaMemPool_t pool = /* create mempool */;
  __p_opt.emplace(pool); // and leak it
}
return *__p_opt;

Up to you in the end, but the above is IMO much more readable.
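For comparison, here is a compilable, CUDA-free rendering of the mutex-guarded idea. It is a sketch under stated assumptions, not the merged code: `pool_handle`, `created_pools`, and the fixed device count are hypothetical, `std::optional` stands in for `::cuda::std::optional`, and the array is brace-initialized through `std::make_unique` because `unique_ptr`'s constructor from a raw pointer is explicit.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <optional>

// Hypothetical stand-in for the cached device_memory_pool_ref.
struct pool_handle
{
  int device_id;
};

// Number of slots that have been filled; only touched while the mutex is held.
inline int created_pools = 0;

inline pool_handle& default_pool_locked(int device_id)
{
  constexpr int device_count = 4; // assumed for this sketch
  assert(device_id >= 0 && device_id < device_count);

  static std::unique_ptr<std::optional<pool_handle>[]> pools{
    std::make_unique<std::optional<pool_handle>[]>(device_count)};
  static std::mutex mutex;

  // A single lock covers the check and the fill, so two threads can never
  // emplace the same slot concurrently.
  const std::lock_guard<std::mutex> lock{mutex};
  std::optional<pool_handle>& slot = pools[static_cast<std::size_t>(device_id)];
  if (!slot.has_value())
  {
    slot.emplace(pool_handle{device_id}); // stands in for creating the pool
    ++created_pools;
  }
  return *slot;
}
```

Compared with a per-slot `once_flag`, this takes the lock on every call, including the fast path after initialization. That is the trade-off for the shorter code.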

github-actions · 2026-06-11T18:39:24Z

🥳 CI Workflow Results

🟩 Finished in 20h 22m: Pass: 100%/118 | Total: 2d 15h | Max: 1h 23m | Hits: 66%/552319

See results here.

github-actions · 2026-06-11T18:46:36Z

Successfully created backport PR for branch/3.3.x:

[Backport branch/3.3.x] [libcu++] Fix the default device pool getter #9411

github-actions · 2026-06-11T18:46:42Z

Successfully created backport PR for branch/3.4.x:

[Backport branch/3.4.x] [libcu++] Fix the default device pool getter #9412

Fix the default device pool getter

1d369a1

pciolkosz requested a review from a team as a code owner June 10, 2026 00:04

pciolkosz requested a review from Jacobfaib June 10, 2026 00:04

github-project-automation Bot added this to CCCL Jun 10, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 10, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 10, 2026

fix format

561311a

pciolkosz added backport branch/3.4.x backport branch/3.3.x backport branch/3.2.x labels Jun 10, 2026

pciolkosz requested review from davebayer and miscco June 10, 2026 00:10

coderabbitai Bot reviewed Jun 10, 2026

Comment thread libcudacxx/include/cuda/__memory_pool/device_memory_pool.h Outdated

Fix clang-tidy

920a05a

davebayer reviewed Jun 10, 2026

Comment thread libcudacxx/include/cuda/__memory_pool/device_memory_pool.h Outdated

Comment thread libcudacxx/include/cuda/__memory_pool/device_memory_pool.h Outdated

davebayer reviewed Jun 10, 2026

Comment thread libcudacxx/include/cuda/__memory_pool/device_memory_pool.h Outdated

miscco reviewed Jun 10, 2026

Comment thread libcudacxx/include/cuda/__memory_pool/device_memory_pool.h Outdated

miscco reviewed Jun 10, 2026

miscco added 2 commits June 10, 2026 10:16

[libcu++] Move memory pool atrributes into their own file

0c178d1

[libcu++] Properly initialize the default memory pool

cd1f763

We need to create a default memory pool per device. The best place to do that is `physical_device`, because it already has most of the machinery for one-time initialization.

coderabbitai Bot reviewed Jun 10, 2026

Comment thread libcudacxx/include/cuda/__memory_pool/attributes.h Outdated

Comment thread libcudacxx/include/cuda/__memory_pool/shared_managed_memory_pool.h Outdated

miscco added 2 commits June 10, 2026 11:05

Fix header guards

b001c10

Drop include

d178b1a

davebayer reviewed Jun 10, 2026

Jacobfaib approved these changes Jun 10, 2026

pciolkosz commented Jun 10, 2026

Remove the physical device bits and use init_once in the pool getter

f63c146

davebayer approved these changes Jun 10, 2026

Jacobfaib reviewed Jun 10, 2026

pciolkosz removed the backport branch/3.2.x label Jun 10, 2026

Merge branch 'main' into fix_device_default_mempool_getter

cff2565

miscco approved these changes Jun 11, 2026

pciolkosz merged commit 1028921 into NVIDIA:main Jun 11, 2026
399 of 405 checks passed

github-project-automation Bot moved this from In Review to Done in CCCL Jun 11, 2026

github-actions Bot mentioned this pull request Jun 11, 2026

[Backport branch/3.3.x] [libcu++] Fix the default device pool getter #9411

Open

github-actions Bot mentioned this pull request Jun 11, 2026

[Backport branch/3.4.x] [libcu++] Fix the default device pool getter #9412

Open
