[libcu++] Fix the default device pool getter #9351

Merged
pciolkosz merged 9 commits into NVIDIA:main from pciolkosz:fix_device_default_mempool_getter
Jun 11, 2026

@pciolkosz

Contributor

We cache default mempools in function-local statics. While this is fine for managed and pinned, where there is only one pool, for device it's wrong. Whichever device is passed to the first call of this function is used to get a mempool, and that mempool is then returned no matter which device is specified later. This PR fixes that by adding a per-device cache.

We unfortunately need to cache the mempool, because we return a device_memory_pool_ref& to play nicer with resource_ref. For now I made a localized change in the mempool header to make it easier to back-port. Long term we can think about moving the storage to the physical device class, but that has its own set of problems: header dependencies or a need for type-erased storage.
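
To make the bug concrete, here is a minimal sketch in plain C++ with no CUDA calls. `fake_pool_ref`, `buggy_default_pool`, `fixed_default_pool`, and the hard-coded `device_count` are hypothetical stand-ins chosen for illustration, not the libcu++ types or the code in this PR. The fixed variant uses a mutex-guarded slot per device as one possible shape of a per-device cache.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <optional>

// Hypothetical stand-in for device_memory_pool_ref: it only remembers its device.
struct fake_pool_ref
{
  int device_id;
};

constexpr std::size_t device_count = 4; // assumption: fixed number of devices

// Buggy shape: the single function-local static is initialized by the first call,
// so the device passed to any later call is silently ignored.
inline fake_pool_ref& buggy_default_pool(int device)
{
  static fake_pool_ref pool{device};
  return pool;
}

// Fixed shape: one lazily filled slot per device, guarded by a mutex.
inline fake_pool_ref& fixed_default_pool(int device)
{
  static std::array<std::optional<fake_pool_ref>, device_count> pools;
  static std::mutex mtx;

  const std::lock_guard<std::mutex> lock{mtx};
  auto& slot = pools.at(static_cast<std::size_t>(device)); // throws on a bad index
  if (!slot.has_value())
  {
    slot.emplace(fake_pool_ref{device});
  }
  return *slot; // the slot lives for the whole program, so the reference stays valid
}
```

With this sketch, calling `buggy_default_pool(0)` and then `buggy_default_pool(1)` returns the same object both times, while `fixed_default_pool(0)` and `fixed_default_pool(1)` return distinct objects whose addresses are stable across calls.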

@pciolkosz pciolkosz requested a review from a team as a code owner June 10, 2026 00:04
@pciolkosz pciolkosz requested a review from Jacobfaib June 10, 2026 00:04
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 10, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 10, 2026
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Contributor

Walkthrough

This PR refactors CUDA memory pool attribute handling and implements per-device default pool caching. It extracts attribute machinery and device-capability checks into a dedicated attributes.h header, removes duplicated code from memory_pool_base.h, adds per-device caching infrastructure to physical_device, reimplements device_default_memory_pool with per-device storage, and updates downstream headers with the new include dependency.

Changes

Memory pool attributes & per-device default pool

Layer / File(s) | Summary

Add memory-pool attributes and helpers header
libcudacxx/include/cuda/__memory_pool/attributes.h
New header exposing __pool_attr template machinery for reading and conditionally setting cudaMemPoolAttr values, memory_pool_attributes typed aliases and constexpr objects for common attributes, device capability checks (__verify_device_supports_stream_ordered_allocations, __verify_device_supports_export_handle_type), and __get_default_memory_pool resolver with CUDA-version-gated behavior for Toolkit >= 12.9 and >= 13.0.

Extract attributes from memory_pool_base.h
libcudacxx/include/cuda/__memory_pool/memory_pool_base.h
Remove internal attribute templates, memory_pool_attributes namespace, and device-support helper functions; include the new attributes header instead to consolidate machinery.

physical_device: per-device default pool cache and getter
libcudacxx/include/cuda/__device/physical_device.h
Add hosted-only std::once_flag and cudaMemPool_t members; implement __get_default_memory_pool() method using std::call_once in hosted builds to lazily retrieve and cache the device's default memory pool via cuda::__get_default_memory_pool with CUmemLocation targeting the current device and pinned allocation type.

Per-device default pool cache in device_memory_pool.h
libcudacxx/include/cuda/__memory_pool/device_memory_pool.h
Add C++ standard utility includes; reimplement device_default_memory_pool(device_ref) to maintain a static array of optional entries (one per physical device) lazily initialized on first access, replacing the single process-wide static pool and delegating to the device's __get_default_memory_pool().

Propagate attributes include to pool headers
libcudacxx/include/cuda/__memory_pool/managed_memory_pool.h, pinned_memory_pool.h, shared_device_memory_pool.h, shared_memory_pool_base.h, shared_pinned_memory_pool.h
Add conditional includes of cuda/__memory_pool/attributes.h under appropriate CTK version and toolkit availability guards.

Test per-device default pool isolation
libcudacxx/test/libcudacxx/cuda/memory_resource/resources/device_memory_resource.cu
Add cuda/devices include; extend "device_memory_pool comparison" test with conditional assertions (when cuda::devices.size() > 1) validating that default pools differ across physical devices via pointer inequality and ==/!= operator semantics.

Suggested reviewers

  • Jacobfaib
  • davebayer

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1


ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between 087c594 and 1d369a1.

📒 Files selected for processing (2)
  • libcudacxx/include/cuda/__memory_pool/device_memory_pool.h
  • libcudacxx/test/libcudacxx/cuda/memory_resource/resources/device_memory_resource.cu

Comment thread libcudacxx/include/cuda/__memory_pool/device_memory_pool.h Outdated
Comment thread libcudacxx/include/cuda/__memory_pool/device_memory_pool.h Outdated
Comment thread libcudacxx/include/cuda/__memory_pool/device_memory_pool.h Outdated
Comment thread libcudacxx/include/cuda/__memory_pool/device_memory_pool.h Outdated
Comment thread libcudacxx/include/cuda/__memory_pool/device_memory_pool.h Outdated

@miscco miscco left a comment

Contributor

I believe this is an overly complicated approach. We already have extensive machinery in place for devices, and I believe we should just store a cudaMemPool_t there too, guarded by a std::once_flag.

Then we only need to pull it when we actually need it.

miscco added 2 commits June 10, 2026 10:16
We need to create a default memory pool per device

The best place is to create it in `physical_device` because there we already have a lot of the machinery to do it once in place

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🧹 Nitpick comments (3)
libcudacxx/include/cuda/__memory_pool/attributes.h (2)

136-168: ⚡ Quick win

suggestion: Namespace-scope constexpr variables should use inline.

Per coding guidelines: "All constexpr variables at namespace/global scope must use inline, including template variables."

-static constexpr release_threshold_t release_threshold{};
+inline constexpr release_threshold_t release_threshold{};

Apply similarly to all other static constexpr declarations in this namespace.

Source: Coding guidelines


61-72: ⚡ Quick win

suggestion: Multiple functions in attributes.h are missing _CCCL_HOST_API annotations.

Per coding guidelines, all functions must be marked with _CCCL_HOST_API, _CCCL_DEVICE_API, or _CCCL_API. The following need annotations:

  • __pool_attr_impl::set (line 61)
  • __set_attribute_non_zero_only (line 104)
  • __pool_attr<::cudaMemPoolAttrReservedMemHigh>::set (line 117)
  • __pool_attr<::cudaMemPoolAttrUsedMemHigh>::set (line 127)
  • __is_host_memory_pool_supported (line 171)
  • __verify_device_supports_stream_ordered_allocations (line 189)
  • __verify_device_supports_export_handle_type (line 219)

Source: Coding guidelines

libcudacxx/include/cuda/__memory_pool/device_memory_pool.h (1)

98-102: 💤 Low value

suggestion: Use ::cuda::std::size_t instead of plain size_t.

Per coding guidelines, standard integer type aliases should be fully qualified from cuda::std.

-    const size_t __device_count = ::cuda::__physical_devices().size();
+    const ::cuda::std::size_t __device_count = ::cuda::__physical_devices().size();
     ::cuda::std::unique_ptr<::cuda::std::optional<device_memory_pool_ref>[]> __pools{
       static_cast<::cuda::std::optional<device_memory_pool_ref>*>(
         ::operator new[](sizeof(::cuda::std::optional<device_memory_pool_ref>) * __device_count))};
-    for (size_t __device = 0; __device < __device_count; ++__device)
+    for (::cuda::std::size_t __device = 0; __device < __device_count; ++__device)

Source: Coding guidelines


ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between 920a05a and cd1f763.

📒 Files selected for processing (10)
  • libcudacxx/include/cuda/__device/physical_device.h
  • libcudacxx/include/cuda/__memory_pool/attributes.h
  • libcudacxx/include/cuda/__memory_pool/device_memory_pool.h
  • libcudacxx/include/cuda/__memory_pool/managed_memory_pool.h
  • libcudacxx/include/cuda/__memory_pool/memory_pool_base.h
  • libcudacxx/include/cuda/__memory_pool/pinned_memory_pool.h
  • libcudacxx/include/cuda/__memory_pool/shared_device_memory_pool.h
  • libcudacxx/include/cuda/__memory_pool/shared_managed_memory_pool.h
  • libcudacxx/include/cuda/__memory_pool/shared_memory_pool_base.h
  • libcudacxx/include/cuda/__memory_pool/shared_pinned_memory_pool.h
✅ Files skipped from review due to trivial changes (5)
  • libcudacxx/include/cuda/__memory_pool/shared_memory_pool_base.h
  • libcudacxx/include/cuda/__memory_pool/shared_pinned_memory_pool.h
  • libcudacxx/include/cuda/__memory_pool/managed_memory_pool.h
  • libcudacxx/include/cuda/__memory_pool/shared_device_memory_pool.h
  • libcudacxx/include/cuda/__memory_pool/pinned_memory_pool.h

Comment thread libcudacxx/include/cuda/__memory_pool/attributes.h Outdated
Comment thread libcudacxx/include/cuda/__memory_pool/shared_managed_memory_pool.h Outdated
Comment on lines +97 to +114
static ::cuda::std::unique_ptr<::cuda::std::optional<device_memory_pool_ref>[]> __pools_ = []() {
  const size_t __device_count = ::cuda::__physical_devices().size();
  ::cuda::std::unique_ptr<::cuda::std::optional<device_memory_pool_ref>[]> __pools{
    static_cast<::cuda::std::optional<device_memory_pool_ref>*>(
      ::operator new[](sizeof(::cuda::std::optional<device_memory_pool_ref>) * __device_count))};
  for (size_t __device = 0; __device < __device_count; ++__device)
  {
    ::cuda::std::__construct_at(__pools.get() + __device, ::cuda::std::nullopt);
  }
  return __pools;
}();

auto& __pool = __pools_[__device.get()];
if (!__pool.has_value())
{
  __pool.emplace(::cuda::__physical_devices()[__device.get()].__get_default_memory_pool());
}
return *__pool;

Contributor

I believe we should still have an array of once_flag to make sure we initialize each optional only once even when 2 threads would execute this code at the same time

Contributor

I disagree, we only ever write the same bit pattern, so there is no observable race here

Contributor

The worst case scenario is that 2 threads write ptr, true at the same time

Contributor

The worst case scenario is that 2 threads write ptr, true at the same time

Strong disagree. We should still be semantically correct. If users are using TSAN then this will (rightly) report a race.

Contributor

Not to mention the rare (but possible!) case of a thread writing the true but being suspended before it can write ptr. Or writing both true and ptr but these straddle a cache-line, and only the cache-line with true getting broadcast to other cores in time.
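
For reference, under the C++ memory model two unsynchronized writes to the same non-atomic object from different threads are a data race even when both write the same value, which is why tools such as TSAN flag them. The once_flag-per-slot idea suggested at the top of this thread could look roughly like the sketch below. It is plain C++ with hypothetical names (`fake_pool`, `per_device_cache`, `count_inits_under_contention`) and is not the code that was merged.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <optional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a pool handle; real code would store a pool reference.
struct fake_pool
{
  int device_id;
};

class per_device_cache
{
public:
  explicit per_device_cache(std::size_t device_count)
      : flags_(std::make_unique<std::once_flag[]>(device_count))
      , pools_(std::make_unique<std::optional<fake_pool>[]>(device_count))
  {}

  // Initializes the slot for `device` at most once, even under concurrent calls.
  // call_once also makes the written slot visible to every thread that returns from it.
  fake_pool& get(std::size_t device)
  {
    std::call_once(flags_[device], [&] {
      pools_[device].emplace(fake_pool{static_cast<int>(device)});
      init_count_.fetch_add(1, std::memory_order_relaxed);
    });
    return *pools_[device];
  }

  int init_count() const
  {
    return init_count_.load();
  }

private:
  std::unique_ptr<std::once_flag[]> flags_;
  std::unique_ptr<std::optional<fake_pool>[]> pools_;
  std::atomic<int> init_count_{0}; // bookkeeping for the demo only
};

// Hammers one slot from several threads and reports how often it was initialized.
inline int count_inits_under_contention(std::size_t device, int thread_count)
{
  per_device_cache cache{4}; // assumption: four devices
  std::vector<std::thread> threads;
  for (int i = 0; i < thread_count; ++i)
  {
    threads.emplace_back([&] { (void) cache.get(device); });
  }
  for (auto& t : threads)
  {
    t.join();
  }
  return cache.init_count();
}
```

Each slot has its own flag, so initializing the pool for one device never blocks callers asking for a different device.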

Comment on lines +101 to +105
    ::operator new[](sizeof(::cuda::std::optional<device_memory_pool_ref>) * __device_count))};
for (size_t __device = 0; __device < __device_count; ++__device)
{
  ::cuda::std::__construct_at(__pools.get() + __device, ::cuda::std::nullopt);
}

Contributor

Why can't we just use regular new[__device_count] here?

Contributor

because we do not want to create pools in all devices when no one asked for them

we only create the ones for those that are actually asked for

Contributor

Yeah, but it's an optional now. We can default-construct those (they'll just be nullopt).
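
The point about default construction can be checked with a small standalone sketch. It uses `std::optional` and `std::make_unique` rather than the `cuda::std` equivalents, and `counted_pool` and `make_empty_slots` are hypothetical names: allocating an array of optionals leaves every element empty and constructs no payload until `emplace` is called.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <optional>

// Hypothetical payload that counts how many instances were ever constructed.
struct counted_pool
{
  static inline int constructed = 0;
  int device_id;

  explicit counted_pool(int id)
      : device_id(id)
  {
    ++constructed;
  }
};

// Allocates `count` empty slots. Every optional is value-initialized to "no value",
// so no counted_pool is constructed here.
inline std::unique_ptr<std::optional<counted_pool>[]> make_empty_slots(std::size_t count)
{
  return std::make_unique<std::optional<counted_pool>[]>(count);
}
```

After `auto slots = make_empty_slots(3);` the counter is still zero, and it only increases when a specific slot is filled, for example with `slots[1].emplace(1)`.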

return ::cuda::std::span<const device_ref>{__peers_.get(), __num_peers_};
}

[[nodiscard]] _CCCL_HOST_API ::cudaMemPool_t __get_default_memory_pool()

Contributor Author

Why do we need to cache it twice, once as cudaMemPool_t and once as device_memory_pool_ref? I would move the init_once to the mempool getter and remove this, so we end up caching only once and we fix the race. Win-win.

Contributor

We can do that, which would also make the others happy. I do not really care too much honestly

Comment on lines +128 to +130
::std::call_once(__once_, [this, __device]() {
  this->__init(__device);
});

Contributor

Do we really need to go through this whole shebang? I feel like a unique_ptr<optional<cudaMemPool_t>[]> with a small mutex zone is far more readable/maintainable than this multi-step process. Something like

static unique_ptr<optional<::cudaMemPool_t>[]> __pools_ = ::new optional<::cudaMemPool_t>[::cuda::__physical_devices_count()];
static ::std::mutex mut;

const auto _  = std::lock_guard{mut};
auto& __p_opt = __pools_[__device.get()];

if (!__p_opt.has_value()) {
  __p_opt.emplace(/* create mempool */);
}
return device_memory_pool_ref{*__p_opt};

I realize this changes the signature to return a device_memory_pool_ref by value, but these are trivially cheap to construct anyhow. I don't think anyone is relying on the fact that it's exactly an lvalue ref.

@pciolkosz pciolkosz Jun 10, 2026

Contributor Author

If we return it by value, it's a footgun with resource_ref, like:

auto ref = cuda::mr::resource_ref{cuda::device_default_memory_pool(dev0)};

This will have a dangling reference. I think there is value in returning it by reference, and it's not like the current implementation is crazy complex.
Also, I wanted a union to avoid pulling in the optional; I feel like it's overkill here.
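
The lifetime concern can be illustrated without CUDA. The sketch below assumes, as the comment implies, that the wrapper keeps a pointer to the object it is constructed from. `pool_ref`, `non_owning_view`, `cached_pool_by_reference`, and `pool_by_value` are hypothetical names, not the libcu++ API.

```cpp
#include <cassert>

// Hypothetical stand-in for device_memory_pool_ref.
struct pool_ref
{
  int device_id;
};

// Hypothetical non-owning wrapper: it stores a pointer to the object it is given,
// so that object must outlive the wrapper.
struct non_owning_view
{
  const pool_ref* target;

  explicit non_owning_view(const pool_ref& r)
      : target(&r)
  {}
};

// Returning a reference to cached storage keeps the pointee alive for the whole program.
inline pool_ref& cached_pool_by_reference()
{
  static pool_ref pool{0};
  return pool;
}

// Returning by value yields a temporary. A view built from it would dangle once the
// full expression ends, so this function is shown only for contrast.
inline pool_ref pool_by_value()
{
  return pool_ref{0};
}
```

Here `non_owning_view view{cached_pool_by_reference()};` stays valid for as long as the program runs, whereas `non_owning_view view{pool_by_value()};` would compile but leave `view.target` pointing at a destroyed temporary.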

Contributor

OK you can still return a reference since we leak these mempool handles anyways:

static unique_ptr<optional<device_memory_pool_ref>[]> __pools_ = ::new optional<device_memory_pool_ref>[::cuda::__physical_devices_count()];
static ::std::mutex mut;

const auto _  = std::lock_guard{mut};
auto& __p_opt = __pools_[__device.get()];

if (!__p_opt.has_value()) {
  cudaMemPool_t pool = /* create mempool */;
  __p_opt.emplace(pool);
  // and leak it
}
return *__p_opt;

Up to you in the end, but the above is IMO much more readable.

@github-actions

Contributor

🥳 CI Workflow Results

🟩 Finished in 20h 22m: Pass: 100%/118 | Total: 2d 15h | Max: 1h 23m | Hits: 66%/552319

See results here.

@pciolkosz pciolkosz merged commit 1028921 into NVIDIA:main Jun 11, 2026
399 of 405 checks passed
@github-project-automation github-project-automation Bot moved this from In Review to Done in CCCL Jun 11, 2026
@github-actions

Contributor

Successfully created backport PR for branch/3.3.x:

@github-actions

Contributor

Successfully created backport PR for branch/3.4.x:


4 participants