Skip to content

Add ManagedAllocator support#1524

Draft
shi-eric wants to merge 1 commit into
NVIDIA:mainfrom
shi-eric:shi-eric/managed-allocator
Draft

Add ManagedAllocator support#1524
shi-eric wants to merge 1 commit into
NVIDIA:mainfrom
shi-eric:shi-eric/managed-allocator

Conversation

@shi-eric

@shi-eric shi-eric commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Description

Closes #1523.

This PR adds explicit CUDA managed-memory allocation support while keeping Warp's default CUDA allocation behavior unchanged. Users can opt into managed memory with wp.ManagedAllocator() through the existing allocator configuration APIs. Warp also exposes memory-kind inspection through wp.MemoryKind and array.memory_kind, so managed arrays can be distinguished from host, pinned host, CUDA device, CUDA memory-pool, and unknown allocations.

A major reason to opt in is CPU/GPU shared work: on systems where CUDA reports compatible managed-memory access, CPU kernels can directly read and write arrays allocated with wp.ManagedAllocator() instead of maintaining a separate CPU copy. Ordinary Warp CUDA arrays remain non-managed and still require explicit CPU copies before CPU kernels access them.

This shared-memory path covers direct loads and stores under normal Warp synchronization. It does not make overlapping CPU/GPU wp.atomic_* updates safe; current Warp CPU atomics are not hardware atomics, so CPU/GPU interprocessor atomics remain follow-up work documented in the memory-access deep dive.

Managed arrays now participate in wp.can_access() and checked launch validation through CUDA managed-memory capabilities rather than peer-access or CUDA memory-pool access state. CUDA IPC export rejects managed arrays with a clear error because CUDA IPC handles do not represent managed allocations safely.

Managed arrays can be used by CUDA kernels captured in graphs when the arrays are allocated before capture begins. This PR does not support allocating new managed arrays during CUDA graph capture: wp.ManagedAllocator() uses cudaMallocManaged() outside capture on all supported CUDA toolkit builds, including CUDA 13.x builds, and rejects capture-time managed allocation clearly. CUDA 13 managed memory pools and graph-capturable managed allocation are deferred follow-up work.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Test plan

I validated the scoped CUDA managed-memory path:

  • Built Warp with CUDA 12.9 and ran the new unified-memory tests to verify the cudaMallocManaged() path compiles cleanly, allocates outside graph capture, and rejects managed allocation during graph capture.
  • Rebuilt Warp with CUDA 13.0 and reran the unified-memory tests to verify the same scoped path on CUDA 13, including use of pre-allocated managed arrays inside CUDA graph capture and rejection of capture-time managed allocation.
  • Ran the allocator tests to check memory-kind reporting across CPU, pinned CPU, default CUDA, CUDA memory-pool, custom, external, and managed allocations.
  • Built the docs to verify the allocator and memory-access documentation changes render cleanly.

New feature / enhancement

This PR adds public memory-kind API through wp.MemoryKind and wp.array.memory_kind.

Value Meaning
wp.MemoryKind.HOST Default host memory.
wp.MemoryKind.PINNED_HOST Pinned or CUDA-registered host memory.
wp.MemoryKind.CUDA_DEVICE CUDA device memory that is not classified as managed or CUDA memory-pool memory.
wp.MemoryKind.CUDA_MEMPOOL CUDA memory-pool allocation managed by Warp.
wp.MemoryKind.CUDA_MANAGED CUDA managed-memory allocation created by wp.ManagedAllocator() with cudaMallocManaged().
wp.MemoryKind.UNKNOWN Memory kind Warp cannot classify.

For concrete wp.array instances, array.memory_kind reports the observed memory class of the backing pointer, not the current physical residency of CUDA managed memory and not whether a specific device can access the array. Use wp.can_access() for access checks. Array views report the memory kind of their owner array.

wp.indexedarray does not expose a scalar memory_kind because it is backed by multiple arrays: the data array plus one or more index arrays. Inspect indexed.data.memory_kind and indexed.indices[i].memory_kind for constituent diagnostics, or use wp.can_access(device, indexed) for the access decision across all backing arrays.

import warp as wp

managed = wp.ManagedAllocator()

with wp.ScopedAllocator("cuda:0", managed):
    a = wp.zeros(1024, dtype=wp.float32, device="cuda:0")

assert a.memory_kind is wp.MemoryKind.CUDA_MANAGED

if wp.can_access("cpu", a):
    wp.launch(cpu_kernel, dim=a.size, inputs=[a], device="cpu")
else:
    a_cpu = a.to("cpu")

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds wp.ManagedAllocator, AllocationKind, array.allocation_kind, device managed-memory capability flags, native managed alloc/free and capture-aware handling, allocation-kind-based access classification and IPC rejection, tests, and documentation/changelog updates.

Changes

Explicit CUDA Managed-Memory Allocation Support

Layer / File(s) Summary
Public API & allocation provenance contracts
warp/__init__.py, warp/__init__.pyi, warp/_src/context.py, warp/_src/types.py, CHANGELOG.md
Expose AllocationKind and ManagedAllocator; add AllocationKind enum and allocator.kind tags; initialize and expose array.allocation_kind; add changelog entry.
ManagedAllocator implementation & runtime wiring
warp/_src/context.py
Implement ManagedAllocator with pointer→context provenance tracking, add ctypes bindings for managed alloc/free and device queries, add Device managed-memory capability fields, and add helpers to derive allocation provenance.
Native managed allocation & graph-capture flow
warp/native/warp.cu, warp/native/warp.cpp, warp/native/warp.h
Add managed-memory capability caching and per-context mempool lifecycle, wp_alloc_device_managed/wp_free_device_managed, pool vs direct tracking, stream-ordered graph alloc bookkeeping, and exported capability getters and no-CUDA stubs.
Native free & capture failure hardening
warp/native/warp.cu
Route frees by allocation origin, harden graph-free logic with early returns on failure, deferred-free bookkeeping, and ensure managed-pool teardown and capture-begin pool creation.
Access classification, diagnostics & IPC restrictions
warp/_src/context.py, warp/_src/types.py
Derive array allocation_kind by following view references; classify access using AllocationKind + device capability flags; include allocation_kind details in strict/checked launch messages; reject IPC for managed/external/view arrays with explicit errors.
Tests: allocator protocol, managed-memory access & capture cases
warp/tests/cuda/test_unified_memory.py, warp/tests/test_allocator.py
Add/extend tests for device capability flags, wp.can_access with managed arrays, ipc_handle() rejection, checked-launch CPU/GPU cases, CUDA graph-capture usage and allocation-during-capture behavior, ManagedAllocator protocol and provenance tests, and test registration.
Design & user documentation
design/hardware-coherent-memory-access.md, docs/api_reference/warp.rst, docs/deep_dive/allocators.rst, docs/deep_dive/memory_access.rst
Document ManagedAllocator semantics, allocation provenance, device capability attributes, wp.can_access rules, CUDA 12.x vs 13.0+ graph-capture constraints, and update compatibility/test guidance.

Sequence Diagram: Managed allocation + launch access check

sequenceDiagram
  participant Python as wp.ManagedAllocator()
  participant Context as WarpContext
  participant Native as wp_alloc_device_managed
  participant Tracker as AllocationTracker
  participant Device as CUDA_Device

  Python->>Context: allocate(bytes)
  Context->>Native: wp_alloc_device_managed(context, size, tag)
  Native->>Device: cudaMallocFromPoolAsync / cudaMallocManaged
  Device-->>Native: ptr
  Native->>Tracker: record(ptr -> owner_context, pool_or_direct)
  Native-->>Context: ptr
  Context-->>Python: array(ptr) with allocation_kind
  Python->>Context: wp.can_access(device, array)
  Context->>Tracker: _get_array_allocation_kind(array)
  Context->>Device: query is_managed_memory_supported / is_concurrent_managed_access_supported
  Context-->>Python: ACCESSIBLE / INACCESSIBLE (with allocation_kind detail)
Loading

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 35.96% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Add ManagedAllocator support' directly and clearly summarizes the primary change—introducing a new ManagedAllocator class for explicit CUDA managed-memory allocation.
Linked Issues check ✅ Passed The pull request comprehensively addresses all coding requirements from issue #1523: implements ManagedAllocator, records allocation provenance via AllocationKind, updates wp.can_access() and checked launch validation, rejects CUDA IPC for managed arrays, supports graph capture for pre-allocated managed arrays, and handles CUDA 12/13 compatibility correctly.
Out of Scope Changes check ✅ Passed All changes are scoped to implementing the ManagedAllocator feature: adding native allocation/free APIs, device capability queries, allocation provenance tracking, access validation updates, documentation, and tests—no unrelated functionality modifications detected.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@shi-eric shi-eric force-pushed the shi-eric/managed-allocator branch 2 times, most recently from 2229a1b to e428614 Compare June 4, 2026 07:49

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
warp/_src/context.py (1)

8172-8213: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use allocation_kind for wrapped CUDA allocations too.

This branch still relies on isinstance(allocator, CudaMempoolAllocator/CudaDefaultAllocator). Wrapped arrays can carry CUDA_MEMPOOL or CUDA_DEVICE provenance without a live allocator object, so wp.can_access() and checked launches will fall through to UNKNOWN instead of using the right mempool/peer-access test.

Suggested fix
-        if isinstance(allocator, CudaMempoolAllocator):
+        if allocation_kind == _AllocationKind.CUDA_MEMPOOL or isinstance(allocator, CudaMempoolAllocator):
             if is_mempool_access_enabled(value_device, device):
                 return _ArrayAccessStatus.ACCESSIBLE
             return _ArrayAccessStatus.INACCESSIBLE
-        if isinstance(allocator, CudaDefaultAllocator):
+        if allocation_kind == _AllocationKind.CUDA_DEVICE or isinstance(allocator, CudaDefaultAllocator):
             if is_peer_access_enabled(value_device, device):
                 return _ArrayAccessStatus.ACCESSIBLE
             return _ArrayAccessStatus.INACCESSIBLE
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@warp/_src/context.py` around lines 8172 - 8213, The CUDA-CUDA branch
currently uses isinstance(allocator, CudaMempoolAllocator/CudaDefaultAllocator)
and misses wrapped arrays that lack a live allocator; change the checks to use
allocation_kind instead (compare against _AllocationKind.CUDA_MEMPOOL and
_AllocationKind.CUDA_DEVICE) and then call the same helpers
(is_mempool_access_enabled(value_device, device) and
is_peer_access_enabled(value_device, device)) to decide between
_ArrayAccessStatus.ACCESSIBLE and _ArrayAccessStatus.INACCESSIBLE; keep the
existing managed-memory handling and the UNKNOWN fallback for other allocation
kinds.
🧹 Nitpick comments (1)
warp/native/warp.cu (1)

969-976: 💤 Low value

Unreachable code path during capture.

The is_capturing check on line 970 is unreachable because:

  1. If capturing and mempool is not supported, we exit at line 966
  2. If capturing and mempool is supported but pool doesn't exist, we exit at line 937
  3. This code path is only reached when not capturing and pool allocation failed

This dead code doesn't cause bugs but adds confusion.

♻️ Suggested simplification
     if (!ptr) {
-        if (is_capturing) {
-            wp::set_error_string(
-                "Warp error: cudaMallocManaged is not supported during CUDA graph capture; "
-                "allocate managed memory before capture or use a device with managed memory pool support"
-            );
-            return NULL;
-        }
-
         if (!check_cuda(cudaMallocManaged(&ptr, s, cudaMemAttachGlobal)))
             return NULL;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@warp/native/warp.cu` around lines 969 - 976, The branch checking is_capturing
after a failed cudaMallocManaged allocation is unreachable and should be
removed; update the code in the cudaMallocManaged failure path (the block that
tests if (!ptr)) to drop the is_capturing conditional and simply set a single,
clear error via wp::set_error_string and return NULL for the non-capture
allocation failure case (leave any earlier capture-related exits intact — e.g.,
the exits at the mempool checks — and only modify the failing allocation block
that references ptr and wp::set_error_string).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@CHANGELOG.md`:
- Around line 64-66: Update the changelog entry that describes CUDA 13
capture-time support to explicitly call out CUDA 12.x behavior: add a short
sentence after the CUDA 13 note stating that CUDA 12.x rejects managed
allocations during CUDA graph capture (i.e., allocation attempts are rejected),
so users know the feature is version-dependent; locate the paragraph referencing
"CUDA 13 capture-time support" (the existing lines mentioning CUDA 13 and
managed memory pools) and append the single-line clarification about CUDA 12.x
rejected managed allocation during capture.

In `@design/hardware-coherent-memory-access.md`:
- Around line 1064-1072: The code fence around the CUDA preprocessor block is
malformed causing markdown parsing issues; wrap the entire preprocessor region
(from `#if CUDART_VERSION >= 13000` through `#endif`) inside a language-tagged
fence (e.g., use ```cpp before `#if` and close with ``` after `#endif`) so the
declarations for `cudaMemPoolProps props = {};`, `props.allocType`,
`props.handleTypes`, `props.location`, and the `cudaMemPoolCreate(&managed_pool,
&props)` call stay inside a single properly tagged code block.

In `@warp/_src/context.py`:
- Around line 3751-3762: The deallocate wrapper raises "unknown provenance" for
null/falsy pointers before calling runtime.core.wp_free_device_managed; change
deallocate(self, ptr, size_in_bytes) to treat null/falsy ptr as a no-op by
returning immediately (i.e., if not ptr: return) so you don't touch
_contexts_by_ptr or raise for null pointers; keep the existing provenance
lookup/pop and rollback logic for non-null pointers and leave the
runtime.core.wp_free_device_managed call as-is for real pointers.

---

Outside diff comments:
In `@warp/_src/context.py`:
- Around line 8172-8213: The CUDA-CUDA branch currently uses
isinstance(allocator, CudaMempoolAllocator/CudaDefaultAllocator) and misses
wrapped arrays that lack a live allocator; change the checks to use
allocation_kind instead (compare against _AllocationKind.CUDA_MEMPOOL and
_AllocationKind.CUDA_DEVICE) and then call the same helpers
(is_mempool_access_enabled(value_device, device) and
is_peer_access_enabled(value_device, device)) to decide between
_ArrayAccessStatus.ACCESSIBLE and _ArrayAccessStatus.INACCESSIBLE; keep the
existing managed-memory handling and the UNKNOWN fallback for other allocation
kinds.

---

Nitpick comments:
In `@warp/native/warp.cu`:
- Around line 969-976: The branch checking is_capturing after a failed
cudaMallocManaged allocation is unreachable and should be removed; update the
code in the cudaMallocManaged failure path (the block that tests if (!ptr)) to
drop the is_capturing conditional and simply set a single, clear error via
wp::set_error_string and return NULL for the non-capture allocation failure case
(leave any earlier capture-related exits intact — e.g., the exits at the mempool
checks — and only modify the failing allocation block that references ptr and
wp::set_error_string).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 0a0f5af7-3af6-4f35-9a8f-c9230f56d29c

📥 Commits

Reviewing files that changed from the base of the PR and between 23170c6 and 92451ab.

📒 Files selected for processing (14)
  • CHANGELOG.md
  • design/hardware-coherent-memory-access.md
  • docs/api_reference/warp.rst
  • docs/deep_dive/allocators.rst
  • docs/deep_dive/memory_access.rst
  • warp/__init__.py
  • warp/__init__.pyi
  • warp/_src/context.py
  • warp/_src/types.py
  • warp/native/warp.cpp
  • warp/native/warp.cu
  • warp/native/warp.h
  • warp/tests/cuda/test_unified_memory.py
  • warp/tests/test_allocator.py

Comment thread CHANGELOG.md Outdated
Comment thread design/hardware-coherent-memory-access.md Outdated
Comment thread warp/_src/context.py
@shi-eric shi-eric force-pushed the shi-eric/managed-allocator branch from e428614 to e233924 Compare June 4, 2026 07:59

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@warp/native/warp.cu`:
- Around line 919-928: When capture is external the managed pool may never be
created and first managed allocation (wp_alloc_device_managed) will fail; fix
this by eagerly initializing the managed pool when an external capture is
registered (or else reject external-capture+managed allocations up front). In
practice, add a call to the same pool-creation/warm-up routine used by
wp_cuda_graph_begin_capture (e.g., invoke the create/get-managed-pool helper
that get_managed_pool relies on) during external-capture registration or at the
start of the external-capture branch, check for errors and call
wp::set_error_string with a clear message if creation fails, and ensure
wp_alloc_device_managed can rely on get_managed_pool returning non-NULL
thereafter.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 069bd2a7-978c-4e97-99cd-b5c15dfc49ea

📥 Commits

Reviewing files that changed from the base of the PR and between 92451ab and e428614.

📒 Files selected for processing (14)
  • CHANGELOG.md
  • design/hardware-coherent-memory-access.md
  • docs/api_reference/warp.rst
  • docs/deep_dive/allocators.rst
  • docs/deep_dive/memory_access.rst
  • warp/__init__.py
  • warp/__init__.pyi
  • warp/_src/context.py
  • warp/_src/types.py
  • warp/native/warp.cpp
  • warp/native/warp.cu
  • warp/native/warp.h
  • warp/tests/cuda/test_unified_memory.py
  • warp/tests/test_allocator.py
✅ Files skipped from review due to trivial changes (2)
  • CHANGELOG.md
  • docs/deep_dive/memory_access.rst
🚧 Files skipped from review as they are similar to previous changes (6)
  • warp/init.pyi
  • docs/deep_dive/allocators.rst
  • warp/_src/types.py
  • warp/init.py
  • warp/tests/test_allocator.py
  • warp/_src/context.py

Comment thread warp/native/warp.cu Outdated
@shi-eric shi-eric force-pushed the shi-eric/managed-allocator branch 4 times, most recently from 9dbd5bf to c91dd08 Compare June 4, 2026 08:10
@greptile-apps

greptile-apps Bot commented Jun 4, 2026

Copy link
Copy Markdown

Greptile Summary

This PR introduces explicit CUDA managed-memory (UVM) allocation support via wp.ManagedAllocator(), and adds a public wp.MemoryKind enum with array.memory_kind for classifying backing memory. The can_access() and kernel-launch validation paths are updated to use observed memory kind (via CUDA pointer attributes) rather than allocator identity, and a two-level tracking scheme (g_managed_direct_allocs + Python _contexts_by_ptr) ensures safe deallocation through the recorded owner context.

  • wp.ManagedAllocator allocates with cudaMallocManaged(); the C++ layer tracks each pointer's owner CUcontext under g_managed_alloc_mutex, and wp_cuda_context_destroy calls mark_context_managed_direct_allocs_released before cuCtxDestroy_f to prevent double-frees from GC finalizers on destroyed-context pointers.
  • wp.MemoryKind and array.memory_kind use cuPointerGetAttribute (IS_MANAGED, MEMORY_TYPE, MEMPOOL_HANDLE) at query time to classify each pointer, and access-classification in _classify_single_array_access_from_device now branches on memory kind rather than allocator type.
  • ipc_handle() now rejects managed, view, and externally-wrapped arrays with specific errors; CudaMempoolAllocator.deallocate now checks and surfaces the wp_free_device_async return value.

Confidence Score: 5/5

Safe to merge; the two-level managed-allocation tracking, context-destruction cleanup, and access-classification logic are all correct.

The managed-allocation lifecycle — allocation, deallocation, context-destruction cleanup, and deferred-free during graph capture — is handled correctly at every path. The mark_context_managed_direct_allocs_released call precedes cuCtxDestroy_f, the g_graph_allocs erase-on-failure invariant is restored, the CUDA pointer-attribute queries for memory kind are accurate, and the access-classification logic for managed vs. device vs. mempool memory follows CUDA semantics. Test coverage spans the full feature surface including edge cases such as pre-allocated managed arrays in graph capture, cross-device managed access, and IPC rejection.

No files require special attention.

Important Files Changed

Filename Overview
warp/native/warp.cu Core managed-allocation C++ layer: adds g_managed_direct_allocs tracking, wp_alloc_device_managed, wp_free_device_managed, mark_context_managed_direct_allocs_released (called before cuCtxDestroy_f), wp_cuda_pointer_get_memory_kind, and refactors wp_free_device_async to return bool and erase g_graph_allocs entries on every failure path.
warp/_src/context.py Adds ManagedAllocator class, MemoryKind enum, _get_array_memory_kind, and updates _classify_single_array_access_from_device to branch on memory kind rather than allocator type; all access paths for managed, mempool, device, and unknown kinds look correct.
warp/_src/types.py Adds array.memory_kind property, updates array.del to bypass context_guard for ManagedAllocator, and tightens ipc_handle() to reject managed, view, and externally-wrapped arrays with specific errors.
warp/native/warp.h Declares wp_memory_kind enum, wp_alloc_device_managed, wp_free_device_managed, wp_free_device_async (now bool), and new device-capability query functions; all declarations match implementations.
warp/tests/cuda/test_unified_memory.py Adds comprehensive managed-allocator tests: memory_kind queries, can_access semantics, IPC rejection cases, CPU-launch acceptance/rejection, cross-device graph capture with pre-allocated managed arrays, capture-time rejection, and unowned CUDA mempool pointer warning; test coverage looks thorough.
warp/tests/test_allocator.py Adds CPU/pinned/managed/zero-size memory_kind tests and ManagedAllocator allocation/deallocation-from-recorded-context tests; edge cases including zero-size, annotation arrays, and indexedarray are covered.
warp/native/warp.cpp Adds CPU-side stubs for the new managed-allocation and device-capability query functions; straightforward plumbing.
CHANGELOG.md Entry accurately documents ManagedAllocator, memory_kind, UVM CPU-kernel path, and capture-time allocation limitation.

Sequence Diagram

sequenceDiagram
    participant PY as Python (ManagedAllocator)
    participant RT as Runtime (context.py)
    participant CU as warp.cu
    participant CUDA as CUDA Driver

    Note over PY,CUDA: Allocation path
    PY->>RT: allocate(size_in_bytes)
    RT->>RT: wp_cuda_context_get_current() → context
    RT->>CU: wp_alloc_device_managed(context, size, tag)
    CU->>CU: ContextGuard push(context)
    CU->>CU: check is_capturing → reject if true
    CU->>CUDA: cudaMallocManaged
    CUDA-->>CU: ptr
    CU->>CU: "g_managed_alloc_mutex: g_managed_direct_allocs[ptr] = owner_ctx"
    CU-->>RT: ptr
    RT->>PY: "_contexts_by_ptr[ptr] = context"

    Note over PY,CUDA: Deallocation path
    PY->>RT: deallocate(ptr, size)
    RT->>RT: _contexts_by_ptr.pop(ptr) → context
    RT->>CU: wp_free_device_managed(context, ptr)
    CU->>CU: g_managed_alloc_mutex: verify and erase g_managed_direct_allocs[ptr]
    alt g_captures.empty()
        CU->>CUDA: cudaFree(ptr)
    else capture active
        CU->>CU: deferred_free(ptr, context, false)
    end
    CU-->>RT: true

    Note over PY,CUDA: Context destruction path
    CU->>CU: wp_cuda_context_destroy(ctx)
    CU->>CU: mark_context_managed_direct_allocs_released(ctx)
    CU->>CU: move ptrs to g_destroyed_context_managed_direct_allocs
    CU->>CUDA: cuCtxDestroy_f(ctx)
    Note over PY,CU: Later: Python GC calls deallocate(ptr)
    PY->>CU: wp_free_device_managed(context, ptr)
    CU->>CU: find ptr in g_destroyed_context_managed_direct_allocs → erase, return true
Loading

Reviews (26): Last reviewed commit: "Add ManagedAllocator support" | Re-trigger Greptile

Comment thread warp/native/warp.cu Outdated
Comment thread warp/native/warp.cu Outdated
Comment thread warp/_src/types.py Outdated
Comment thread warp/_src/context.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@design/hardware-coherent-memory-access.md`:
- Around line 1074-1078: Add a short note to the managed-pool behavior for
external CUDA graph capture: explain that when capture is started via
wp_cuda_graph_begin_capture(..., external=true) the managed-pool
warm-up/initialization is intentionally skipped (see warp/native/warp.cu), so
users must perform a managed allocation or explicitly initialize the managed
pool before starting external capture; add the same caveat at the other
referenced spot (around lines 1229-1230) so both locations warn that external
capture requires pre-capture managed-pool initialization.

In `@warp/_src/context.py`:
- Around line 3699-3715: Update the ManagedAllocator class docstring to
explicitly document the external-capture exception: state that while CUDA 13+
with a managed-memory pool allows allocations during internal graph capture,
captures started with external=True do NOT warm or initialize the managed pool
(see warp/native/warp.cu behavior), so users must pre-initialize or pre-allocate
managed memory (or call allocate() beforehand) before beginning an external
capture; reference ManagedAllocator and its allocate() behavior and the
requirement to allocate managed memory prior to external capture.
- Around line 8189-8197: The code incorrectly treats CUDA_MANAGED arrays as
inaccessible from the CPU when the device lacks concurrent/no-gpu-memory-access
flags; change the predicate in the device.is_cpu && value_device.is_cuda branch
so that AllocationKind.CUDA_MANAGED is considered ACCESSIBLE (migration-backed
access is allowed) instead of returning _ArrayAccessStatus.INACCESSIBLE when
value_device.is_concurrent_managed_access_supported and
value_device.is_gpu_memory_access_from_cpu_supported are false; update the logic
around AllocationKind.CUDA_MANAGED and the use of
value_device.is_concurrent_managed_access_supported /
value_device.is_gpu_memory_access_from_cpu_supported so managed memory falls
through to _ArrayAccessStatus.ACCESSIBLE (so wp.can_access("cpu", arr) and
CHECKED CPU launches succeed).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: e289cea2-9c13-47de-a5ea-8ee09c7d8c6e

📥 Commits

Reviewing files that changed from the base of the PR and between e428614 and c91dd08.

📒 Files selected for processing (14)
  • CHANGELOG.md
  • design/hardware-coherent-memory-access.md
  • docs/api_reference/warp.rst
  • docs/deep_dive/allocators.rst
  • docs/deep_dive/memory_access.rst
  • warp/__init__.py
  • warp/__init__.pyi
  • warp/_src/context.py
  • warp/_src/types.py
  • warp/native/warp.cpp
  • warp/native/warp.cu
  • warp/native/warp.h
  • warp/tests/cuda/test_unified_memory.py
  • warp/tests/test_allocator.py
✅ Files skipped from review due to trivial changes (3)
  • docs/api_reference/warp.rst
  • docs/deep_dive/allocators.rst
  • docs/deep_dive/memory_access.rst
🚧 Files skipped from review as they are similar to previous changes (6)
  • warp/init.py
  • warp/native/warp.h
  • warp/native/warp.cpp
  • warp/init.pyi
  • warp/_src/types.py
  • warp/native/warp.cu

Comment thread design/hardware-coherent-memory-access.md Outdated
Comment thread warp/_src/context.py
Comment thread warp/_src/context.py Outdated
@shi-eric shi-eric force-pushed the shi-eric/managed-allocator branch 2 times, most recently from 4cb082c to 667dda8 Compare June 4, 2026 15:54
Comment thread warp/native/warp.cu

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
design/hardware-coherent-memory-access.md (1)

1229-1230: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Document the external-capture managed-pool caveat explicitly.

The graph-capture behavior description should distinguish between internal (Warp-initiated) and external (user-initiated) CUDA graph capture. For external capture started via wp_cuda_graph_begin_capture(..., external=true), managed-pool warm-up is intentionally skipped and must be performed by the user before starting capture (e.g., by performing a managed allocation or explicit pool initialization).

📝 Suggested addition

Add a note after line 1230:

 - Graph-capture behavior is documented: managed-pool allocations are graph-capturable on CUDA 13.0+ builds; direct `cudaMallocManaged` fallback allocations must happen before capture.
+  For external CUDA graph capture (user-initiated capture rather than Warp-initiated capture), managed-pool initialization is not performed automatically at capture registration time; users must perform a managed allocation or explicitly initialize the managed pool before starting external capture.

The same caveat should be added around lines 1074-1078 where managed-pool creation is first described.

Based on learnings: in warp/native/warp.cu, managed pool warm-up is intentionally skipped for wp_cuda_graph_begin_capture(..., external=true) and must be done pre-capture by the user.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@design/hardware-coherent-memory-access.md` around lines 1229 - 1230, Add an
explicit note that for external CUDA graph capture (invoked via
wp_cuda_graph_begin_capture(..., external=true)) the managed-pool warm-up is
intentionally skipped and therefore users must perform managed-pool
initialization or a managed allocation before starting external capture; update
the graph-capture behavior paragraph to state this caveat and also add the same
note where managed-pool creation is first described (the section that introduces
managed-pool creation and warm-up). Reference the implementation detail in
warp/native/warp.cu that causes this behavior (managed pool warm-up is skipped
when external=true) so readers can correlate docs with the
wp_cuda_graph_begin_capture behavior.
🧹 Nitpick comments (1)
warp/tests/test_allocator.py (1)

365-377: ⚡ Quick win

Make the device-independent checks plain unittest methods.

test_managed_allocator_constructs_without_current_context() and test_managed_allocator_deallocate_null_is_noop() do not need per-device execution, but they are registered through add_function_test() across cuda_test_devices. Converting them into regular TestCustomAllocator methods will avoid redundant runs and keep them covered on builders without CUDA devices.

As per coding guidelines, "Use standard unittest.TestCase methods when tests target a fixed device (e.g., CPU-only). Use add_function_test() only when tests need to run across multiple devices via get_test_devices()."

Also applies to: 488-494, 564-604

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@warp/tests/test_allocator.py` around lines 365 - 377, Convert the
device-independent tests into plain unittest methods: move
test_managed_allocator_constructs_without_current_context and
test_managed_allocator_deallocate_null_is_noop (and the other tests mentioned
around those ranges) out of the add_function_test registrations and implement
them as regular methods on TestCustomAllocator so they run once as standard
unittest.TestCase methods; remove their registrations that call
add_function_test(...) with cuda_test_devices and ensure the tests keep using
the same helpers (e.g., wp.ManagedAllocator, _get_allocator_kind,
warp_context.runtime.core.wp_cuda_context_set_current) but no longer rely on
per-device fixtures.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@warp/tests/test_allocator.py`:
- Around line 383-385: Remove the precondition that skips the test when
device.is_managed_memory_supported is false so the test exercises
ManagedAllocator.allocate()'s check for an active CUDA context; specifically,
delete the conditional block that calls test.skipTest(f"{device} does not
support CUDA managed memory") (the check referencing
device.is_managed_memory_supported) in warp/tests/test_allocator.py so the
precondition test runs regardless of managed-memory support and allows
ManagedAllocator.allocate() to perform its own checks.

---

Duplicate comments:
In `@design/hardware-coherent-memory-access.md`:
- Around line 1229-1230: Add an explicit note that for external CUDA graph
capture (invoked via wp_cuda_graph_begin_capture(..., external=true)) the
managed-pool warm-up is intentionally skipped and therefore users must perform
managed-pool initialization or a managed allocation before starting external
capture; update the graph-capture behavior paragraph to state this caveat and
also add the same note where managed-pool creation is first described (the
section that introduces managed-pool creation and warm-up). Reference the
implementation detail in warp/native/warp.cu that causes this behavior (managed
pool warm-up is skipped when external=true) so readers can correlate docs with
the wp_cuda_graph_begin_capture behavior.

---

Nitpick comments:
In `@warp/tests/test_allocator.py`:
- Around line 365-377: Convert the device-independent tests into plain unittest
methods: move test_managed_allocator_constructs_without_current_context and
test_managed_allocator_deallocate_null_is_noop (and the other tests mentioned
around those ranges) out of the add_function_test registrations and implement
them as regular methods on TestCustomAllocator so they run once as standard
unittest.TestCase methods; remove their registrations that call
add_function_test(...) with cuda_test_devices and ensure the tests keep using
the same helpers (e.g., wp.ManagedAllocator, _get_allocator_kind,
warp_context.runtime.core.wp_cuda_context_set_current) but no longer rely on
per-device fixtures.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 45d2d2d4-171b-40a9-b486-301f8bb8a45f

📥 Commits

Reviewing files that changed from the base of the PR and between 4cb082c and 667dda8.

📒 Files selected for processing (14)
  • CHANGELOG.md
  • design/hardware-coherent-memory-access.md
  • docs/api_reference/warp.rst
  • docs/deep_dive/allocators.rst
  • docs/deep_dive/memory_access.rst
  • warp/__init__.py
  • warp/__init__.pyi
  • warp/_src/context.py
  • warp/_src/types.py
  • warp/native/warp.cpp
  • warp/native/warp.cu
  • warp/native/warp.h
  • warp/tests/cuda/test_unified_memory.py
  • warp/tests/test_allocator.py
✅ Files skipped from review due to trivial changes (3)
  • warp/init.pyi
  • warp/init.py
  • docs/deep_dive/allocators.rst
🚧 Files skipped from review as they are similar to previous changes (7)
  • CHANGELOG.md
  • warp/_src/types.py
  • warp/native/warp.h
  • docs/deep_dive/memory_access.rst
  • warp/tests/cuda/test_unified_memory.py
  • warp/native/warp.cu
  • warp/_src/context.py

Comment thread warp/tests/test_allocator.py Outdated
@shi-eric shi-eric force-pushed the shi-eric/managed-allocator branch 2 times, most recently from b3e2f5e to 0b0a513 Compare June 4, 2026 18:35
Comment thread warp/native/warp.cu Outdated
@shi-eric shi-eric force-pushed the shi-eric/managed-allocator branch 10 times, most recently from e4462aa to 9b93b57 Compare June 7, 2026 18:54
@shi-eric shi-eric force-pushed the shi-eric/managed-allocator branch 12 times, most recently from b6b1aa3 to 9b31179 Compare June 10, 2026 01:29
Comment thread warp/native/warp.cu
Comment on lines +250 to +251
static std::unordered_map<void*, CUcontext> g_managed_direct_allocs;
static std::mutex g_managed_alloc_mutex;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 g_managed_direct_allocs not cleared on context destruction — double-free risk

wp_cuda_context_destroy calls cuCtxDestroy_f(ctx), which causes CUDA to implicitly free every cudaMallocManaged allocation owned by that context. However, the corresponding entries in g_managed_direct_allocs are never removed. If any live Python warp.array backed by managed memory from the destroyed context is later GC'd, its ManagedAllocator.deallocate() finalizer calls wp_free_device_managed(context, ptr), which finds the stale entry, passes the context check (direct_iter->second == owner_context still matches the destroyed — but non-NULL — CUcontext value), erases the entry, and then calls cudaFree(ptr) on memory CUDA already freed: undefined behavior / double-free.

This is the same class of bug fixed in the prior round for managed pool allocations via g_managed_pool_allocs. The fix follows the same pattern: inside wp_cuda_context_destroy, after acquiring g_managed_alloc_mutex, iterate g_managed_direct_allocs and erase every entry whose value equals ctx, marking those pointers as already released before cuCtxDestroy_f reclaims them.

Warp's CUDA allocation controls did not expose an explicit opt-in path
for managed memory. Access checks also could not inspect the memory
class of arrays received from another layer. Add ManagedAllocator and
MemoryKind reporting on wp.array so managed arrays can be created
intentionally and arrays can report the observed pointer memory class.

Managed allocations now use cudaMallocManaged outside CUDA graph
capture. Capture-time managed allocation is rejected clearly. The
allocator uses the Runtime error-string helper so allocation and
deallocation failures report native diagnostics through the same path.

Access validation, IPC rejection, docs, and tests now use memory kind
rather than allocator identity alone. Default CUDA allocation behavior
is unchanged, and unowned memory-pool pointers remain conservative when
Warp cannot prove the specific pool access state.

Signed-off-by: Eric Shi <ershi@nvidia.com>
@shi-eric shi-eric force-pushed the shi-eric/managed-allocator branch from 9b31179 to 04fef46 Compare June 10, 2026 05:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support explicit CUDA managed-memory allocation

1 participant