Add ManagedAllocator support#1524
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds wp.ManagedAllocator, AllocationKind, array.allocation_kind, device managed-memory capability flags, native managed alloc/free and capture-aware handling, allocation-kind-based access classification and IPC rejection, tests, and documentation/changelog updates. ChangesExplicit CUDA Managed-Memory Allocation Support
Sequence Diagram: Managed allocation + launch access check sequenceDiagram
participant Python as wp.ManagedAllocator()
participant Context as WarpContext
participant Native as wp_alloc_device_managed
participant Tracker as AllocationTracker
participant Device as CUDA_Device
Python->>Context: allocate(bytes)
Context->>Native: wp_alloc_device_managed(context, size, tag)
Native->>Device: cudaMallocFromPoolAsync / cudaMallocManaged
Device-->>Native: ptr
Native->>Tracker: record(ptr -> owner_context, pool_or_direct)
Native-->>Context: ptr
Context-->>Python: array(ptr) with allocation_kind
Python->>Context: wp.can_access(device, array)
Context->>Tracker: _get_array_allocation_kind(array)
Context->>Device: query is_managed_memory_supported / is_concurrent_managed_access_supported
Context-->>Python: ACCESSIBLE / INACCESSIBLE (with allocation_kind detail)
🎯 4 (Complex) | ⏱️ ~75 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
2229a1b to
e428614
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
warp/_src/context.py (1)
8172-8213:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse
allocation_kindfor wrapped CUDA allocations too.This branch still relies on
isinstance(allocator, CudaMempoolAllocator/CudaDefaultAllocator). Wrapped arrays can carryCUDA_MEMPOOLorCUDA_DEVICEprovenance without a live allocator object, sowp.can_access()and checked launches will fall through toUNKNOWNinstead of using the right mempool/peer-access test.Suggested fix
- if isinstance(allocator, CudaMempoolAllocator): + if allocation_kind == _AllocationKind.CUDA_MEMPOOL or isinstance(allocator, CudaMempoolAllocator): if is_mempool_access_enabled(value_device, device): return _ArrayAccessStatus.ACCESSIBLE return _ArrayAccessStatus.INACCESSIBLE - if isinstance(allocator, CudaDefaultAllocator): + if allocation_kind == _AllocationKind.CUDA_DEVICE or isinstance(allocator, CudaDefaultAllocator): if is_peer_access_enabled(value_device, device): return _ArrayAccessStatus.ACCESSIBLE return _ArrayAccessStatus.INACCESSIBLE🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@warp/_src/context.py` around lines 8172 - 8213, The CUDA-CUDA branch currently uses isinstance(allocator, CudaMempoolAllocator/CudaDefaultAllocator) and misses wrapped arrays that lack a live allocator; change the checks to use allocation_kind instead (compare against _AllocationKind.CUDA_MEMPOOL and _AllocationKind.CUDA_DEVICE) and then call the same helpers (is_mempool_access_enabled(value_device, device) and is_peer_access_enabled(value_device, device)) to decide between _ArrayAccessStatus.ACCESSIBLE and _ArrayAccessStatus.INACCESSIBLE; keep the existing managed-memory handling and the UNKNOWN fallback for other allocation kinds.
🧹 Nitpick comments (1)
warp/native/warp.cu (1)
969-976: 💤 Low valueUnreachable code path during capture.
The
is_capturingcheck on line 970 is unreachable because:
- If capturing and mempool is not supported, we exit at line 966
- If capturing and mempool is supported but pool doesn't exist, we exit at line 937
- This code path is only reached when not capturing and pool allocation failed
This dead code doesn't cause bugs but adds confusion.
♻️ Suggested simplification
if (!ptr) { - if (is_capturing) { - wp::set_error_string( - "Warp error: cudaMallocManaged is not supported during CUDA graph capture; " - "allocate managed memory before capture or use a device with managed memory pool support" - ); - return NULL; - } - if (!check_cuda(cudaMallocManaged(&ptr, s, cudaMemAttachGlobal))) return NULL;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@warp/native/warp.cu` around lines 969 - 976, The branch checking is_capturing after a failed cudaMallocManaged allocation is unreachable and should be removed; update the code in the cudaMallocManaged failure path (the block that tests if (!ptr)) to drop the is_capturing conditional and simply set a single, clear error via wp::set_error_string and return NULL for the non-capture allocation failure case (leave any earlier capture-related exits intact — e.g., the exits at the mempool checks — and only modify the failing allocation block that references ptr and wp::set_error_string).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@CHANGELOG.md`:
- Around line 64-66: Update the changelog entry that describes CUDA 13
capture-time support to explicitly call out CUDA 12.x behavior: add a short
sentence after the CUDA 13 note stating that CUDA 12.x rejects managed
allocations during CUDA graph capture (i.e., allocation attempts are rejected),
so users know the feature is version-dependent; locate the paragraph referencing
"CUDA 13 capture-time support" (the existing lines mentioning CUDA 13 and
managed memory pools) and append the single-line clarification about CUDA 12.x
rejected managed allocation during capture.
In `@design/hardware-coherent-memory-access.md`:
- Around line 1064-1072: The code fence around the CUDA preprocessor block is
malformed causing markdown parsing issues; wrap the entire preprocessor region
(from `#if CUDART_VERSION >= 13000` through `#endif`) inside a language-tagged
fence (e.g., use ```cpp before `#if` and close with ``` after `#endif`) so the
declarations for `cudaMemPoolProps props = {};`, `props.allocType`,
`props.handleTypes`, `props.location`, and the `cudaMemPoolCreate(&managed_pool,
&props)` call stay inside a single properly tagged code block.
In `@warp/_src/context.py`:
- Around line 3751-3762: The deallocate wrapper raises "unknown provenance" for
null/falsy pointers before calling runtime.core.wp_free_device_managed; change
deallocate(self, ptr, size_in_bytes) to treat null/falsy ptr as a no-op by
returning immediately (i.e., if not ptr: return) so you don't touch
_contexts_by_ptr or raise for null pointers; keep the existing provenance
lookup/pop and rollback logic for non-null pointers and leave the
runtime.core.wp_free_device_managed call as-is for real pointers.
---
Outside diff comments:
In `@warp/_src/context.py`:
- Around line 8172-8213: The CUDA-CUDA branch currently uses
isinstance(allocator, CudaMempoolAllocator/CudaDefaultAllocator) and misses
wrapped arrays that lack a live allocator; change the checks to use
allocation_kind instead (compare against _AllocationKind.CUDA_MEMPOOL and
_AllocationKind.CUDA_DEVICE) and then call the same helpers
(is_mempool_access_enabled(value_device, device) and
is_peer_access_enabled(value_device, device)) to decide between
_ArrayAccessStatus.ACCESSIBLE and _ArrayAccessStatus.INACCESSIBLE; keep the
existing managed-memory handling and the UNKNOWN fallback for other allocation
kinds.
---
Nitpick comments:
In `@warp/native/warp.cu`:
- Around line 969-976: The branch checking is_capturing after a failed
cudaMallocManaged allocation is unreachable and should be removed; update the
code in the cudaMallocManaged failure path (the block that tests if (!ptr)) to
drop the is_capturing conditional and simply set a single, clear error via
wp::set_error_string and return NULL for the non-capture allocation failure case
(leave any earlier capture-related exits intact — e.g., the exits at the mempool
checks — and only modify the failing allocation block that references ptr and
wp::set_error_string).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Enterprise
Run ID: 0a0f5af7-3af6-4f35-9a8f-c9230f56d29c
📒 Files selected for processing (14)
CHANGELOG.mddesign/hardware-coherent-memory-access.mddocs/api_reference/warp.rstdocs/deep_dive/allocators.rstdocs/deep_dive/memory_access.rstwarp/__init__.pywarp/__init__.pyiwarp/_src/context.pywarp/_src/types.pywarp/native/warp.cppwarp/native/warp.cuwarp/native/warp.hwarp/tests/cuda/test_unified_memory.pywarp/tests/test_allocator.py
e428614 to
e233924
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@warp/native/warp.cu`:
- Around line 919-928: When capture is external the managed pool may never be
created and first managed allocation (wp_alloc_device_managed) will fail; fix
this by eagerly initializing the managed pool when an external capture is
registered (or else reject external-capture+managed allocations up front). In
practice, add a call to the same pool-creation/warm-up routine used by
wp_cuda_graph_begin_capture (e.g., invoke the create/get-managed-pool helper
that get_managed_pool relies on) during external-capture registration or at the
start of the external-capture branch, check for errors and call
wp::set_error_string with a clear message if creation fails, and ensure
wp_alloc_device_managed can rely on get_managed_pool returning non-NULL
thereafter.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Enterprise
Run ID: 069bd2a7-978c-4e97-99cd-b5c15dfc49ea
📒 Files selected for processing (14)
CHANGELOG.mddesign/hardware-coherent-memory-access.mddocs/api_reference/warp.rstdocs/deep_dive/allocators.rstdocs/deep_dive/memory_access.rstwarp/__init__.pywarp/__init__.pyiwarp/_src/context.pywarp/_src/types.pywarp/native/warp.cppwarp/native/warp.cuwarp/native/warp.hwarp/tests/cuda/test_unified_memory.pywarp/tests/test_allocator.py
✅ Files skipped from review due to trivial changes (2)
- CHANGELOG.md
- docs/deep_dive/memory_access.rst
🚧 Files skipped from review as they are similar to previous changes (6)
- warp/init.pyi
- docs/deep_dive/allocators.rst
- warp/_src/types.py
- warp/init.py
- warp/tests/test_allocator.py
- warp/_src/context.py
9dbd5bf to
c91dd08
Compare
Greptile SummaryThis PR introduces explicit CUDA managed-memory (UVM) allocation support via
Confidence Score: 5/5Safe to merge; the two-level managed-allocation tracking, context-destruction cleanup, and access-classification logic are all correct. The managed-allocation lifecycle — allocation, deallocation, context-destruction cleanup, and deferred-free during graph capture — is handled correctly at every path. The mark_context_managed_direct_allocs_released call precedes cuCtxDestroy_f, the g_graph_allocs erase-on-failure invariant is restored, the CUDA pointer-attribute queries for memory kind are accurate, and the access-classification logic for managed vs. device vs. mempool memory follows CUDA semantics. Test coverage spans the full feature surface including edge cases such as pre-allocated managed arrays in graph capture, cross-device managed access, and IPC rejection. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant PY as Python (ManagedAllocator)
participant RT as Runtime (context.py)
participant CU as warp.cu
participant CUDA as CUDA Driver
Note over PY,CUDA: Allocation path
PY->>RT: allocate(size_in_bytes)
RT->>RT: wp_cuda_context_get_current() → context
RT->>CU: wp_alloc_device_managed(context, size, tag)
CU->>CU: ContextGuard push(context)
CU->>CU: check is_capturing → reject if true
CU->>CUDA: cudaMallocManaged
CUDA-->>CU: ptr
CU->>CU: "g_managed_alloc_mutex: g_managed_direct_allocs[ptr] = owner_ctx"
CU-->>RT: ptr
RT->>PY: "_contexts_by_ptr[ptr] = context"
Note over PY,CUDA: Deallocation path
PY->>RT: deallocate(ptr, size)
RT->>RT: _contexts_by_ptr.pop(ptr) → context
RT->>CU: wp_free_device_managed(context, ptr)
CU->>CU: g_managed_alloc_mutex: verify and erase g_managed_direct_allocs[ptr]
alt g_captures.empty()
CU->>CUDA: cudaFree(ptr)
else capture active
CU->>CU: deferred_free(ptr, context, false)
end
CU-->>RT: true
Note over PY,CUDA: Context destruction path
CU->>CU: wp_cuda_context_destroy(ctx)
CU->>CU: mark_context_managed_direct_allocs_released(ctx)
CU->>CU: move ptrs to g_destroyed_context_managed_direct_allocs
CU->>CUDA: cuCtxDestroy_f(ctx)
Note over PY,CU: Later: Python GC calls deallocate(ptr)
PY->>CU: wp_free_device_managed(context, ptr)
CU->>CU: find ptr in g_destroyed_context_managed_direct_allocs → erase, return true
Reviews (26): Last reviewed commit: "Add ManagedAllocator support" | Re-trigger Greptile |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@design/hardware-coherent-memory-access.md`:
- Around line 1074-1078: Add a short note to the managed-pool behavior for
external CUDA graph capture: explain that when capture is started via
wp_cuda_graph_begin_capture(..., external=true) the managed-pool
warm-up/initialization is intentionally skipped (see warp/native/warp.cu), so
users must perform a managed allocation or explicitly initialize the managed
pool before starting external capture; add the same caveat at the other
referenced spot (around lines 1229-1230) so both locations warn that external
capture requires pre-capture managed-pool initialization.
In `@warp/_src/context.py`:
- Around line 3699-3715: Update the ManagedAllocator class docstring to
explicitly document the external-capture exception: state that while CUDA 13+
with a managed-memory pool allows allocations during internal graph capture,
captures started with external=True do NOT warm or initialize the managed pool
(see warp/native/warp.cu behavior), so users must pre-initialize or pre-allocate
managed memory (or call allocate() beforehand) before beginning an external
capture; reference ManagedAllocator and its allocate() behavior and the
requirement to allocate managed memory prior to external capture.
- Around line 8189-8197: The code incorrectly treats CUDA_MANAGED arrays as
inaccessible from the CPU when the device lacks concurrent/no-gpu-memory-access
flags; change the predicate in the device.is_cpu && value_device.is_cuda branch
so that AllocationKind.CUDA_MANAGED is considered ACCESSIBLE (migration-backed
access is allowed) instead of returning _ArrayAccessStatus.INACCESSIBLE when
value_device.is_concurrent_managed_access_supported and
value_device.is_gpu_memory_access_from_cpu_supported are false; update the logic
around AllocationKind.CUDA_MANAGED and the use of
value_device.is_concurrent_managed_access_supported /
value_device.is_gpu_memory_access_from_cpu_supported so managed memory falls
through to _ArrayAccessStatus.ACCESSIBLE (so wp.can_access("cpu", arr) and
CHECKED CPU launches succeed).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Enterprise
Run ID: e289cea2-9c13-47de-a5ea-8ee09c7d8c6e
📒 Files selected for processing (14)
CHANGELOG.mddesign/hardware-coherent-memory-access.mddocs/api_reference/warp.rstdocs/deep_dive/allocators.rstdocs/deep_dive/memory_access.rstwarp/__init__.pywarp/__init__.pyiwarp/_src/context.pywarp/_src/types.pywarp/native/warp.cppwarp/native/warp.cuwarp/native/warp.hwarp/tests/cuda/test_unified_memory.pywarp/tests/test_allocator.py
✅ Files skipped from review due to trivial changes (3)
- docs/api_reference/warp.rst
- docs/deep_dive/allocators.rst
- docs/deep_dive/memory_access.rst
🚧 Files skipped from review as they are similar to previous changes (6)
- warp/init.py
- warp/native/warp.h
- warp/native/warp.cpp
- warp/init.pyi
- warp/_src/types.py
- warp/native/warp.cu
4cb082c to
667dda8
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
design/hardware-coherent-memory-access.md (1)
1229-1230:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winDocument the external-capture managed-pool caveat explicitly.
The graph-capture behavior description should distinguish between internal (Warp-initiated) and external (user-initiated) CUDA graph capture. For external capture started via
wp_cuda_graph_begin_capture(..., external=true), managed-pool warm-up is intentionally skipped and must be performed by the user before starting capture (e.g., by performing a managed allocation or explicit pool initialization).📝 Suggested addition
Add a note after line 1230:
- Graph-capture behavior is documented: managed-pool allocations are graph-capturable on CUDA 13.0+ builds; direct `cudaMallocManaged` fallback allocations must happen before capture. + For external CUDA graph capture (user-initiated capture rather than Warp-initiated capture), managed-pool initialization is not performed automatically at capture registration time; users must perform a managed allocation or explicitly initialize the managed pool before starting external capture.The same caveat should be added around lines 1074-1078 where managed-pool creation is first described.
Based on learnings: in
warp/native/warp.cu, managed pool warm-up is intentionally skipped forwp_cuda_graph_begin_capture(..., external=true)and must be done pre-capture by the user.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@design/hardware-coherent-memory-access.md` around lines 1229 - 1230, Add an explicit note that for external CUDA graph capture (invoked via wp_cuda_graph_begin_capture(..., external=true)) the managed-pool warm-up is intentionally skipped and therefore users must perform managed-pool initialization or a managed allocation before starting external capture; update the graph-capture behavior paragraph to state this caveat and also add the same note where managed-pool creation is first described (the section that introduces managed-pool creation and warm-up). Reference the implementation detail in warp/native/warp.cu that causes this behavior (managed pool warm-up is skipped when external=true) so readers can correlate docs with the wp_cuda_graph_begin_capture behavior.
🧹 Nitpick comments (1)
warp/tests/test_allocator.py (1)
365-377: ⚡ Quick winMake the device-independent checks plain
unittestmethods.
test_managed_allocator_constructs_without_current_context()andtest_managed_allocator_deallocate_null_is_noop()do not need per-device execution, but they are registered throughadd_function_test()acrosscuda_test_devices. Converting them into regularTestCustomAllocatormethods will avoid redundant runs and keep them covered on builders without CUDA devices.As per coding guidelines, "Use standard
unittest.TestCasemethods when tests target a fixed device (e.g., CPU-only). Useadd_function_test()only when tests need to run across multiple devices viaget_test_devices()."Also applies to: 488-494, 564-604
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@warp/tests/test_allocator.py` around lines 365 - 377, Convert the device-independent tests into plain unittest methods: move test_managed_allocator_constructs_without_current_context and test_managed_allocator_deallocate_null_is_noop (and the other tests mentioned around those ranges) out of the add_function_test registrations and implement them as regular methods on TestCustomAllocator so they run once as standard unittest.TestCase methods; remove their registrations that call add_function_test(...) with cuda_test_devices and ensure the tests keep using the same helpers (e.g., wp.ManagedAllocator, _get_allocator_kind, warp_context.runtime.core.wp_cuda_context_set_current) but no longer rely on per-device fixtures.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@warp/tests/test_allocator.py`:
- Around line 383-385: Remove the precondition that skips the test when
device.is_managed_memory_supported is false so the test exercises
ManagedAllocator.allocate()'s check for an active CUDA context; specifically,
delete the conditional block that calls test.skipTest(f"{device} does not
support CUDA managed memory") (the check referencing
device.is_managed_memory_supported) in warp/tests/test_allocator.py so the
precondition test runs regardless of managed-memory support and allows
ManagedAllocator.allocate() to perform its own checks.
---
Duplicate comments:
In `@design/hardware-coherent-memory-access.md`:
- Around line 1229-1230: Add an explicit note that for external CUDA graph
capture (invoked via wp_cuda_graph_begin_capture(..., external=true)) the
managed-pool warm-up is intentionally skipped and therefore users must perform
managed-pool initialization or a managed allocation before starting external
capture; update the graph-capture behavior paragraph to state this caveat and
also add the same note where managed-pool creation is first described (the
section that introduces managed-pool creation and warm-up). Reference the
implementation detail in warp/native/warp.cu that causes this behavior (managed
pool warm-up is skipped when external=true) so readers can correlate docs with
the wp_cuda_graph_begin_capture behavior.
---
Nitpick comments:
In `@warp/tests/test_allocator.py`:
- Around line 365-377: Convert the device-independent tests into plain unittest
methods: move test_managed_allocator_constructs_without_current_context and
test_managed_allocator_deallocate_null_is_noop (and the other tests mentioned
around those ranges) out of the add_function_test registrations and implement
them as regular methods on TestCustomAllocator so they run once as standard
unittest.TestCase methods; remove their registrations that call
add_function_test(...) with cuda_test_devices and ensure the tests keep using
the same helpers (e.g., wp.ManagedAllocator, _get_allocator_kind,
warp_context.runtime.core.wp_cuda_context_set_current) but no longer rely on
per-device fixtures.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Enterprise
Run ID: 45d2d2d4-171b-40a9-b486-301f8bb8a45f
📒 Files selected for processing (14)
CHANGELOG.mddesign/hardware-coherent-memory-access.mddocs/api_reference/warp.rstdocs/deep_dive/allocators.rstdocs/deep_dive/memory_access.rstwarp/__init__.pywarp/__init__.pyiwarp/_src/context.pywarp/_src/types.pywarp/native/warp.cppwarp/native/warp.cuwarp/native/warp.hwarp/tests/cuda/test_unified_memory.pywarp/tests/test_allocator.py
✅ Files skipped from review due to trivial changes (3)
- warp/init.pyi
- warp/init.py
- docs/deep_dive/allocators.rst
🚧 Files skipped from review as they are similar to previous changes (7)
- CHANGELOG.md
- warp/_src/types.py
- warp/native/warp.h
- docs/deep_dive/memory_access.rst
- warp/tests/cuda/test_unified_memory.py
- warp/native/warp.cu
- warp/_src/context.py
b3e2f5e to
0b0a513
Compare
e4462aa to
9b93b57
Compare
b6b1aa3 to
9b31179
Compare
| static std::unordered_map<void*, CUcontext> g_managed_direct_allocs; | ||
| static std::mutex g_managed_alloc_mutex; |
There was a problem hiding this comment.
g_managed_direct_allocs not cleared on context destruction — double-free risk
wp_cuda_context_destroy calls cuCtxDestroy_f(ctx), which causes CUDA to implicitly free every cudaMallocManaged allocation owned by that context. However, the corresponding entries in g_managed_direct_allocs are never removed. If any live Python warp.array backed by managed memory from the destroyed context is later GC'd, its ManagedAllocator.deallocate() finalizer calls wp_free_device_managed(context, ptr), which finds the stale entry, passes the context check (direct_iter->second == owner_context still matches the destroyed — but non-NULL — CUcontext value), erases the entry, and then calls cudaFree(ptr) on memory CUDA already freed: undefined behavior / double-free.
This is the same class of bug fixed in the prior round for managed pool allocations via g_managed_pool_allocs. The fix follows the same pattern: inside wp_cuda_context_destroy, after acquiring g_managed_alloc_mutex, iterate g_managed_direct_allocs and erase every entry whose value equals ctx, marking those pointers as already released before cuCtxDestroy_f reclaims them.
Warp's CUDA allocation controls did not expose an explicit opt-in path for managed memory. Access checks also could not inspect the memory class of arrays received from another layer. Add ManagedAllocator and MemoryKind reporting on wp.array so managed arrays can be created intentionally and arrays can report the observed pointer memory class. Managed allocations now use cudaMallocManaged outside CUDA graph capture. Capture-time managed allocation is rejected clearly. The allocator uses the Runtime error-string helper so allocation and deallocation failures report native diagnostics through the same path. Access validation, IPC rejection, docs, and tests now use memory kind rather than allocator identity alone. Default CUDA allocation behavior is unchanged, and unowned memory-pool pointers remain conservative when Warp cannot prove the specific pool access state. Signed-off-by: Eric Shi <ershi@nvidia.com>
9b31179 to
04fef46
Compare
Description
Closes #1523.
This PR adds explicit CUDA managed-memory allocation support while keeping Warp's default CUDA allocation behavior unchanged. Users can opt into managed memory with
wp.ManagedAllocator()through the existing allocator configuration APIs. Warp also exposes memory-kind inspection throughwp.MemoryKindandarray.memory_kind, so managed arrays can be distinguished from host, pinned host, CUDA device, CUDA memory-pool, and unknown allocations.A major reason to opt in is CPU/GPU shared work: on systems where CUDA reports compatible managed-memory access, CPU kernels can directly read and write arrays allocated with
wp.ManagedAllocator()instead of maintaining a separate CPU copy. Ordinary Warp CUDA arrays remain non-managed and still require explicit CPU copies before CPU kernels access them.This shared-memory path covers direct loads and stores under normal Warp synchronization. It does not make overlapping CPU/GPU
wp.atomic_*updates safe; current Warp CPU atomics are not hardware atomics, so CPU/GPU interprocessor atomics remain follow-up work documented in the memory-access deep dive.Managed arrays now participate in
wp.can_access()and checked launch validation through CUDA managed-memory capabilities rather than peer-access or CUDA memory-pool access state. CUDA IPC export rejects managed arrays with a clear error because CUDA IPC handles do not represent managed allocations safely.Managed arrays can be used by CUDA kernels captured in graphs when the arrays are allocated before capture begins. This PR does not support allocating new managed arrays during CUDA graph capture:
wp.ManagedAllocator()usescudaMallocManaged()outside capture on all supported CUDA toolkit builds, including CUDA 13.x builds, and rejects capture-time managed allocation clearly. CUDA 13 managed memory pools and graph-capturable managed allocation are deferred follow-up work.Checklist
Test plan
I validated the scoped CUDA managed-memory path:
cudaMallocManaged()path compiles cleanly, allocates outside graph capture, and rejects managed allocation during graph capture.New feature / enhancement
This PR adds public memory-kind API through
wp.MemoryKindandwp.array.memory_kind.wp.MemoryKind.HOSTwp.MemoryKind.PINNED_HOSTwp.MemoryKind.CUDA_DEVICEwp.MemoryKind.CUDA_MEMPOOLwp.MemoryKind.CUDA_MANAGEDwp.ManagedAllocator()withcudaMallocManaged().wp.MemoryKind.UNKNOWNFor concrete
wp.arrayinstances,array.memory_kindreports the observed memory class of the backing pointer, not the current physical residency of CUDA managed memory and not whether a specific device can access the array. Usewp.can_access()for access checks. Array views report the memory kind of their owner array.wp.indexedarraydoes not expose a scalarmemory_kindbecause it is backed by multiple arrays: the data array plus one or more index arrays. Inspectindexed.data.memory_kindandindexed.indices[i].memory_kindfor constituent diagnostics, or usewp.can_access(device, indexed)for the access decision across all backing arrays.