[libcu++] Fix the default device pool getter#9351
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR refactors CUDA memory pool attribute handling and implements per-device default pool caching. It extracts attribute machinery and device-capability checks into a dedicated ChangesMemory pool attributes & per-device default pool
Suggested reviewers
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 69610bcd-20e3-4725-b32b-96efd3cd99e7
📒 Files selected for processing (2)
libcudacxx/include/cuda/__memory_pool/device_memory_pool.hlibcudacxx/test/libcudacxx/cuda/memory_resource/resources/device_memory_resource.cu
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
miscco
left a comment
There was a problem hiding this comment.
I believe this is an overly complicated approach. We already have extensive machinery in place for devices, and I believe we should just store a cuMemPool_t there too that is guarded by std::once_flag
Then we only ever need to pull that once we need it
We need to create a default memory pool per device The best place is to create it in `physical_device` because there we already have a lot of the machinery to do it once in place
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (3)
libcudacxx/include/cuda/__memory_pool/attributes.h (2)
136-168: ⚡ Quick winsuggestion: Namespace-scope
constexprvariables should useinline.Per coding guidelines: "All
constexprvariables at namespace/global scope must useinline, includingtemplatevariables."-static constexpr release_threshold_t release_threshold{}; +inline constexpr release_threshold_t release_threshold{};Apply similarly to all other
static constexprdeclarations in this namespace.Source: Coding guidelines
61-72: ⚡ Quick winsuggestion: Multiple functions in
attributes.hare missing_CCCL_HOST_APIannotations.Per coding guidelines, all functions must be marked with
_CCCL_HOST_API,_CCCL_DEVICE_API, or_CCCL_API. The following need annotations:
__pool_attr_impl::set(line 61)__set_attribute_non_zero_only(line 104)__pool_attr<::cudaMemPoolAttrReservedMemHigh>::set(line 117)__pool_attr<::cudaMemPoolAttrUsedMemHigh>::set(line 127)__is_host_memory_pool_supported(line 171)__verify_device_supports_stream_ordered_allocations(line 189)__verify_device_supports_export_handle_type(line 219)Source: Coding guidelines
libcudacxx/include/cuda/__memory_pool/device_memory_pool.h (1)
98-102: 💤 Low valuesuggestion: Use
::cuda::std::size_tinstead of plainsize_t.Per coding guidelines, standard integer type aliases should be fully qualified from
cuda::std.- const size_t __device_count = ::cuda::__physical_devices().size(); + const ::cuda::std::size_t __device_count = ::cuda::__physical_devices().size(); ::cuda::std::unique_ptr<::cuda::std::optional<device_memory_pool_ref>[]> __pools{ static_cast<::cuda::std::optional<device_memory_pool_ref>*>( ::operator new[](sizeof(::cuda::std::optional<device_memory_pool_ref>) * __device_count))}; - for (size_t __device = 0; __device < __device_count; ++__device) + for (::cuda::std::size_t __device = 0; __device < __device_count; ++__device)Source: Coding guidelines
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b8507be0-c926-4694-aa84-8309f1952215
📒 Files selected for processing (10)
libcudacxx/include/cuda/__device/physical_device.hlibcudacxx/include/cuda/__memory_pool/attributes.hlibcudacxx/include/cuda/__memory_pool/device_memory_pool.hlibcudacxx/include/cuda/__memory_pool/managed_memory_pool.hlibcudacxx/include/cuda/__memory_pool/memory_pool_base.hlibcudacxx/include/cuda/__memory_pool/pinned_memory_pool.hlibcudacxx/include/cuda/__memory_pool/shared_device_memory_pool.hlibcudacxx/include/cuda/__memory_pool/shared_managed_memory_pool.hlibcudacxx/include/cuda/__memory_pool/shared_memory_pool_base.hlibcudacxx/include/cuda/__memory_pool/shared_pinned_memory_pool.h
✅ Files skipped from review due to trivial changes (5)
- libcudacxx/include/cuda/__memory_pool/shared_memory_pool_base.h
- libcudacxx/include/cuda/__memory_pool/shared_pinned_memory_pool.h
- libcudacxx/include/cuda/__memory_pool/managed_memory_pool.h
- libcudacxx/include/cuda/__memory_pool/shared_device_memory_pool.h
- libcudacxx/include/cuda/__memory_pool/pinned_memory_pool.h
This comment has been minimized.
This comment has been minimized.
| static ::cuda::std::unique_ptr<::cuda::std::optional<device_memory_pool_ref>[]> __pools_ = []() { | ||
| const size_t __device_count = ::cuda::__physical_devices().size(); | ||
| ::cuda::std::unique_ptr<::cuda::std::optional<device_memory_pool_ref>[]> __pools{ | ||
| static_cast<::cuda::std::optional<device_memory_pool_ref>*>( | ||
| ::operator new[](sizeof(::cuda::std::optional<device_memory_pool_ref>) * __device_count))}; | ||
| for (size_t __device = 0; __device < __device_count; ++__device) | ||
| { | ||
| ::cuda::std::__construct_at(__pools.get() + __device, ::cuda::std::nullopt); | ||
| } | ||
| return __pools; | ||
| }(); | ||
|
|
||
| auto& __pool = __pools_[__device.get()]; | ||
| if (!__pool.has_value()) | ||
| { | ||
| __pool.emplace(::cuda::__physical_devices()[__device.get()].__get_default_memory_pool()); | ||
| } | ||
| return *__pool; |
There was a problem hiding this comment.
I believe we should still have an array of once_flag to make sure we initialize each optional only once even when 2 threads would execute this code at the same time
There was a problem hiding this comment.
I disagree, we only ever write the same bit patter, so there is no observable race here
There was a problem hiding this comment.
The worst cast scenario is that 2 threads write ptr, true at the same time
There was a problem hiding this comment.
The worst cast scenario is that 2 threads write ptr, true at the same time
Strong disagree. We should still be semantically correct. If users are using TSAN then this will (rightly) report a race.
| ::operator new[](sizeof(::cuda::std::optional<device_memory_pool_ref>) * __device_count))}; | ||
| for (size_t __device = 0; __device < __device_count; ++__device) | ||
| { | ||
| ::cuda::std::__construct_at(__pools.get() + __device, ::cuda::std::nullopt); | ||
| } |
There was a problem hiding this comment.
Why can't we just use regular new[__device_count] here?
There was a problem hiding this comment.
because we do not want to create pools in all devices when no one asked for them
we only create the ones for those that are actually asked for
There was a problem hiding this comment.
Yeah but its an optional now. We can default construct those (they'll just be nullopt)
| return ::cuda::std::span<const device_ref>{__peers_.get(), __num_peers_}; | ||
| } | ||
|
|
||
| [[nodiscard]] _CCCL_HOST_API ::cudaMemPool_t __get_default_memory_pool() |
There was a problem hiding this comment.
Why do we need to cache it twice, once as cudaMempool_t and once as device_memory_pool_ref? I would move the init_once to the mempool getter and remove this, so we end up caching only once and we fix the race, win win
There was a problem hiding this comment.
We can do that, which would also make the others happy. I do not really care too much honestly
This comment has been minimized.
This comment has been minimized.
| ::std::call_once(__once_, [this, __device]() { | ||
| this->__init(__device); | ||
| }); |
There was a problem hiding this comment.
Do we really need to go through this whole shebang? I feel like a unique_ptr<optional<cudaMemPool_t>[]> with a small mutex zone is far more readable/maintainable than this multi-step process. Something like
static unique_ptr<optional<::cudaMemPool_t>[]> __pools_ = ::new optional<::cudaMemPool_t>[::cuda::__physical_devices_count()];
static ::std::mutex mut;
const auto _ = std::lock_guard{mut};
auto& __p_opt = __pools_[__device.get()];
if (!__p_opt.has_value()) {
__p_opt.emplace(/* create mempool */);
}
return device_memory_pool_ref{*__p_opt};I realize this changes the signature to returning a device_memory_pool_ref by value, but these are trivially cheap to construct anyhow. I don't think anyone is relying on the fact its exactly a lvalue ref.
There was a problem hiding this comment.
If we return it by value its a footgun with resource_ref like:
auto ref = cuda::mr::resource_ref{cuda::device_default_memory_pool(dev0)};
Will have a dangling reference. I think there is value in returning it by reference and it's not like the current implementation is crazy complex.
Also I wanted an union to avoid pulling in the optional, I feel like its an overkill here
There was a problem hiding this comment.
OK you can still return a reference since we leak these mempool handles anyways:
static unique_ptr<optional<device_memory_pool_ref>[]> __pools_ = ::new optional<device_memory_pool_ref>[::cuda::__physical_devices_count()];
static ::std::mutex mut;
const auto _ = std::lock_guard{mut};
auto& __p_opt = __pools_[__device.get()];
if (!__p_opt.has_value()) {
cudaMemPool_t pool = /* create mempool */;
__p_opt.emplace(pool);
// and leak it
}
return *__p_opt;Up to you in the end, but the above is IMO much more readable.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
🥳 CI Workflow Results🟩 Finished in 20h 22m: Pass: 100%/118 | Total: 2d 15h | Max: 1h 23m | Hits: 66%/552319See results here. |
|
Successfully created backport PR for |
|
Successfully created backport PR for |

We cache default mempools in function local statics. While this is fine for managed and pinned, where there is only one pool, for device its wrong. Whatever device is specified to the first call to this function will be used to get a mempool that will be returned no matter what device is specified later. This PR fixes that by adding a per-device cache.
We unfortunately need to cache the mempool, because we return a
device_memory_pool_ref&to play nicer withresource_ref. For now I made a localized change in the mempool header to make it easier to back-port, but long term we can thing about moving the storage to the physical device class, but it has its own set of problems with header dependency or a need for type-erased storage.