NVIDIA · shi-eric · Jun 10, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -15,6 +15,12 @@
 - Extend `wp.utils.array_scan()` to 64-bit scalar and vector types, and extend `wp.utils.radix_sort_pairs()` to 32- and
   64-bit signed, unsigned, and floating-point keys with 4- or 8-byte values
   ([GH-1538](https://github.com/NVIDIA/warp/issues/1538)).
+- Add `wp.ManagedAllocator()` for explicit CUDA managed-memory arrays. CPU kernels can use managed arrays as an
+  opt-in path to read and write CUDA managed-memory allocations through Unified Memory on systems where CUDA reports
+  compatible managed-memory access, while Warp CUDA arrays backed by non-managed memory still need explicit CPU
+  copies. Use `array.memory_kind` to inspect whether an array is backed by host, pinned host, CUDA device, CUDA
+  mempool, or CUDA managed memory. Preallocated managed arrays work in CUDA graph captures, but allocating managed
+  arrays during capture is not currently supported ([GH-1523](https://github.com/NVIDIA/warp/issues/1523)).
 
 ### Removed
 

diff --git a/design/hardware-coherent-memory-access.md b/design/hardware-coherent-memory-access.md
diff --git a/design/pluggable-allocators.md b/design/pluggable-allocators.md
@@ -216,26 +216,31 @@ internals into the allocator surface.
 
 Current limitation: `wp.can_access(device, array)` and
 `warp.config.launch_array_access_mode = wp.config.LaunchArrayAccessMode.CHECKED`
-remain conservative for arrays allocated through custom allocators.
+remain conservative for arrays allocated through custom allocators when Warp
+cannot classify the pointer or prove the relevant access state.
 Same-device launches are accepted, but cross-device launches require Warp to
 know whether the allocation uses default CUDA memory, CUDA memory pools,
-pinned host memory, managed memory, or another memory type. The current custom
-allocator protocol only returns a pointer, so cross-device arrays backed by
-custom or externally wrapped allocators warn once per launch pattern in checked
+pinned host memory, managed memory, or another memory type. CUDA pointer
+attributes can classify externally wrapped managed and ordinary CUDA device
+pointers so Warp can use managed-memory or peer-access predicates. The current
+custom allocator protocol still only returns a pointer, so unclassified
+pointers and externally wrapped or custom memory-pool pointers whose specific
+pool access state cannot be proven warn once per launch pattern in checked
 mode and then proceed. Using `wp.config.LaunchArrayAccessMode.RELAXED` leaves
 access legality to the hardware without the diagnostic, matching the default
 launch path.
 
-Future solutions must provide enough allocation provenance for
+Future solutions must provide enough memory-kind and access metadata for
 `wp.can_access(device, array)` and `wp.config.LaunchArrayAccessMode.CHECKED` to
 make the same conservative decisions they make for Warp-owned allocations. At a
 minimum, Warp needs to distinguish the owning device and memory class for
 allocations that participate in cross-device launch verification, including
-default CUDA device memory, CUDA memory pools, managed memory, pinned host
-memory, and allocator-defined external memory.
+CUDA device memory that is neither managed nor memory-pool memory, CUDA
+memory pools, managed memory, pinned host memory, and allocator-defined
+external memory.
 
 Any future mechanism must remain backward compatible with simple custom
-allocators, preserve an "unknown" result when allocation provenance is
+allocators, preserve an "unknown" result when memory metadata is
 unavailable or unrecognized, and avoid exposing framework-specific internals as
 part of the basic allocator surface. It also needs to keep launch verification
 compatible with CUDA graph capture and use the same access predicates as

diff --git a/docs/api_reference/warp.rst b/docs/api_reference/warp.rst
@@ -385,6 +385,8 @@ CUDA Memory Management
    :toctree: _generated
 
    Allocator
+   ManagedAllocator
+   MemoryKind
    ScopedAllocator
    ScopedMempool
    ScopedMempoolAccess

diff --git a/docs/deep_dive/allocators.rst b/docs/deep_dive/allocators.rst
@@ -311,6 +311,132 @@ For temporary allocator changes, use the :class:`ScopedAllocator` context manage
         a = wp.zeros(1000, dtype=wp.float32, device="cuda:0")
     # Original allocator is restored here
 
+.. _managed_memory_allocation_options:
+
+Managed Memory Allocator
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Managed memory is CUDA-managed storage that can be addressed from CPU and GPU
+code. CUDA Unified Memory manages page placement and migration, so pages may move
+between CPU and GPU memory as different processors touch them. Pinned CPU
+memory, by contrast, stays in host memory that a GPU may access through a host
+mapping. The following table compares managed memory with the other allocation
+options:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 18 29 27 26
+
+   * - Allocation option
+     - Residency and migration
+     - CPU/GPU access
+     - Typical use
+   * - Default CUDA
+     - Device memory with no automatic CPU/GPU migration.
+     - CUDA kernels access it directly; CPU code uses explicit copies.
+     - General GPU arrays when CPU access is staged explicitly.
+   * - CUDA mempool
+     - Device memory from CUDA's stream-ordered pool, with no automatic CPU/GPU
+       migration.
+     - Same CPU/GPU access rules as default CUDA memory, with separate
+       memory-pool access controls for peer GPUs.
+     - Faster repeated CUDA allocations and graph-captured allocation when
+       supported.
+   * - Pinned CPU
+     - Host memory that does not migrate into device memory as an allocation.
+     - CPU code accesses it directly; CUDA devices with unified virtual
+       addressing can access it through a host mapping.
+     - Asynchronous CPU/GPU copies or zero-copy access to small host-resident
+       data.
+   * - CUDA managed
+     - CUDA Unified Memory whose pages may migrate between CPU and GPU memory.
+     - CPU and GPU access follow CUDA managed-memory support and synchronization
+       rules.
+     - Sharing data across CPU/GPU code when migration is preferable to manual
+       copies.
+
+:class:`ManagedAllocator` creates CUDA managed-memory arrays through Warp's
+allocator interface. Managed arrays keep their CUDA device metadata, but
+``wp.can_access()`` and checked launch validation use CUDA managed-memory access
+rules for them instead of peer-access or memory-pool-access rules.
+
+One major reason to choose this allocator is CPU/GPU shared work: on systems
+where CUDA reports compatible managed-memory access, CPU kernels can directly
+read and write managed CUDA arrays instead of maintaining a separate CPU copy.
+Standard Warp CUDA arrays remain non-managed and still require explicit copies
+before CPU code accesses them.
+
+The allocator object is not bound to one CUDA device and can be constructed
+before choosing a CUDA device. Warp invokes it under the target device's CUDA
+context, which must support CUDA managed memory, and records that context as
+the owner for each pointer:
+
+.. code:: python
+
+    managed = wp.ManagedAllocator()
+    device = wp.get_device("cuda:0")
+
+    with wp.ScopedAllocator(device, managed):
+        a = wp.zeros(1000, dtype=wp.float32, device=device)
+
+Constructing a :class:`ManagedAllocator` does not promise that pages initially
+reside in any device's physical memory, and it does not bypass the device's
+managed-memory capability check. The CUDA device used for each allocation
+identifies the owner context and array device metadata; CUDA Unified Memory
+manages physical placement and migration.
+
+Use :attr:`array.memory_kind <warp.array.memory_kind>` to inspect the observed
+memory class backing a concrete :class:`warp.array`:
+
+.. code:: python
+
+    if a.memory_kind is wp.MemoryKind.CUDA_MANAGED:
+        ...
+
+The memory kind describes the pointer's memory class as observed by Warp and,
+for CUDA arrays, as reported by CUDA pointer attributes. It does not describe the
+current physical residency of CUDA managed memory, and views report the memory
+kind of their owner array. Indexed arrays do not expose a single memory kind
+because their data and index arrays may have different backing allocations.
+
+To use managed memory as a persistent allocator for all CUDA devices, install one
+allocator instance with :func:`set_cuda_allocator`:
+
+.. code:: python
+
+    managed = wp.ManagedAllocator()
+    wp.set_cuda_allocator(managed)
+
+If only some CUDA devices should use managed memory, install the same allocator
+with :func:`set_device_allocator` on those devices. A single allocator instance
+can serve multiple CUDA devices, but allocation fails with an error on any
+target device that does not report CUDA managed-memory support.
+
+Direct calls to ``ManagedAllocator.allocate()`` require an active CUDA context.
+Array factory functions such as :func:`zeros` and :func:`empty` pass the target
+device context automatically and perform the same managed-memory support check.
+
+Managed allocations currently have a CUDA graph-capture limitation in Warp:
+:class:`ManagedAllocator` does not allocate a new array while CUDA graph capture
+is active. If you need managed arrays with CUDA graphs, allocate them before
+capture begins and reuse the existing arrays inside the captured work. This is
+an implementation limitation, not a restriction on using pre-existing managed
+arrays in captured work. Separately, arrays allocated by
+:class:`ManagedAllocator` cannot be exported with ``array.ipc_handle()``. If
+IPC is required, choose a different allocator for the shared data, or allocate
+and export device arrays before installing the managed allocator.
+
+CPU access to managed arrays is hardware-dependent. Use :func:`can_access` to
+check a specific managed array before CPU code reads or writes it directly:
+
+.. code:: python
+
+    if wp.can_access("cpu", a):
+        wp.launch(cpu_kernel, dim=a.size, inputs=[a], device="cpu")
+    else:
+        a_cpu = a.to("cpu")
+        wp.launch(cpu_kernel, dim=a_cpu.size, inputs=[a_cpu], device="cpu")
+
 Writing a Custom Allocator
 ~~~~~~~~~~~~~~~~~~~~~~~~~~