6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -15,6 +15,12 @@
- Extend `wp.utils.array_scan()` to 64-bit scalar and vector types, and extend `wp.utils.radix_sort_pairs()` to 32- and
64-bit signed, unsigned, and floating-point keys with 4- or 8-byte values
([GH-1538](https://github.com/NVIDIA/warp/issues/1538)).
- Add `wp.ManagedAllocator()` for explicit CUDA managed-memory arrays. CPU kernels can use managed arrays as an
opt-in path to read and write CUDA managed-memory allocations through Unified Memory on systems where CUDA reports
compatible managed-memory access, while Warp CUDA arrays backed by non-managed memory still need explicit CPU
copies. Use `array.memory_kind` to inspect whether an array is backed by host, pinned host, CUDA device, CUDA
mempool, or CUDA managed memory. Preallocated managed arrays work in CUDA graph captures, but capture-time allocation
is a current limitation ([GH-1523](https://github.com/NVIDIA/warp/issues/1523)).

### Removed

443 changes: 354 additions & 89 deletions design/hardware-coherent-memory-access.md

Large diffs are not rendered by default.

21 changes: 13 additions & 8 deletions design/pluggable-allocators.md
@@ -216,26 +216,31 @@ internals into the allocator surface.

Current limitation: `wp.can_access(device, array)` and
`warp.config.launch_array_access_mode = wp.config.LaunchArrayAccessMode.CHECKED`
-remain conservative for arrays allocated through custom allocators.
+remain conservative for arrays allocated through custom allocators when Warp
+cannot classify the pointer or prove the relevant access state.
Same-device launches are accepted, but cross-device launches require Warp to
know whether the allocation uses default CUDA memory, CUDA memory pools,
-pinned host memory, managed memory, or another memory type. The current custom
-allocator protocol only returns a pointer, so cross-device arrays backed by
-custom or externally wrapped allocators warn once per launch pattern in checked
+pinned host memory, managed memory, or another memory type. CUDA pointer
+attributes can classify externally wrapped managed and ordinary CUDA device
+pointers so Warp can use managed-memory or peer-access predicates. The current
+custom allocator protocol still only returns a pointer, so unclassified
+pointers and externally wrapped or custom memory-pool pointers whose specific
+pool access state cannot be proven warn once per launch pattern in checked
mode and then proceed. Using `wp.config.LaunchArrayAccessMode.RELAXED` leaves
access legality to the hardware without the diagnostic, matching the default
launch path.
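
The checked-mode behavior described in this paragraph can be modeled in a few lines of plain Python. This is an illustrative sketch, not Warp's implementation: the `Kind` enum, `check_access()`, its parameters, and the tuple used as the "launch pattern" key are all assumptions made for the example.

```python
import warnings
from enum import Enum, auto


class Kind(Enum):
    """Hypothetical memory classes, named after those listed in the text."""
    CUDA_DEVICE = auto()
    CUDA_MEMPOOL = auto()
    CUDA_MANAGED = auto()
    PINNED_HOST = auto()
    UNKNOWN = auto()


# Launch patterns that have already produced a diagnostic (assumed key shape).
_warned_patterns = set()


def check_access(kind, launch_device, array_device, checked,
                 peer_ok=False, managed_ok=False):
    """Return True or False when access is decidable, None when it is unknown.

    `peer_ok` and `managed_ok` stand in for the peer-access and
    managed-memory predicates that a real runtime would query.
    """
    if launch_device == array_device:
        return True  # same-device launches are accepted
    if kind is Kind.CUDA_MANAGED:
        return managed_ok  # classified pointer: managed-memory predicate
    if kind is Kind.CUDA_DEVICE:
        return peer_ok  # classified pointer: peer-access predicate
    # Unclassified pointer, or a pool whose access state cannot be proven.
    if checked:
        pattern = (launch_device, array_device, kind)
        if pattern not in _warned_patterns:
            _warned_patterns.add(pattern)
            warnings.warn(
                f"cannot verify access from {launch_device} to memory on "
                f"{array_device}; proceeding",
                stacklevel=2,
            )
    # Relaxed mode skips the diagnostic; either way the launch proceeds.
    return None
```

Returning `None` instead of `False` keeps "unknown" distinct from "denied", so a caller can let the launch proceed while still rejecting cases it can actually prove illegal.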

-Future solutions must provide enough allocation provenance for
+Future solutions must provide enough memory-kind and access metadata for
`wp.can_access(device, array)` and `wp.config.LaunchArrayAccessMode.CHECKED` to
make the same conservative decisions they make for Warp-owned allocations. At a
minimum, Warp needs to distinguish the owning device and memory class for
allocations that participate in cross-device launch verification, including
-default CUDA device memory, CUDA memory pools, managed memory, pinned host
-memory, and allocator-defined external memory.
+CUDA device memory that is neither managed nor memory-pool memory, CUDA
+memory pools, managed memory, pinned host memory, and allocator-defined
+external memory.
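
One way to picture the minimum metadata is a small record that pairs a pointer with its owning device and memory class. The sketch below is hypothetical and not a proposed Warp API: `MemoryClass`, `AllocationInfo`, `normalize()`, and `is_classified()` are names invented for illustration. It assumes an allocator may return either a bare integer pointer or a richer record, and that a bare pointer maps to an "unknown" classification.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Union


class MemoryClass(Enum):
    """Hypothetical memory classes taken from the list in the text."""
    CUDA_DEVICE = auto()
    CUDA_MEMPOOL = auto()
    CUDA_MANAGED = auto()
    PINNED_HOST = auto()
    EXTERNAL = auto()
    UNKNOWN = auto()


@dataclass(frozen=True)
class AllocationInfo:
    """Pointer plus the metadata needed for cross-device verification."""
    ptr: int
    owning_device: Optional[str] = None
    memory_class: MemoryClass = MemoryClass.UNKNOWN


def normalize(result: Union[int, AllocationInfo]) -> AllocationInfo:
    """Accept a bare pointer or a full record; bare pointers stay unknown."""
    if isinstance(result, AllocationInfo):
        return result
    return AllocationInfo(ptr=int(result))


def is_classified(info: AllocationInfo) -> bool:
    """True only when both the owning device and the memory class are known."""
    return (info.owning_device is not None
            and info.memory_class is not MemoryClass.UNKNOWN)
```

Under these assumptions, a pointer-only allocator keeps working unchanged and is simply treated as unclassified, while an allocator that supplies the record lets verification use the specific access predicate for that memory class.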

Any future mechanism must remain backward compatible with simple custom
-allocators, preserve an "unknown" result when allocation provenance is
+allocators, preserve an "unknown" result when memory metadata is
unavailable or unrecognized, and avoid exposing framework-specific internals as
part of the basic allocator surface. It also needs to keep launch verification
compatible with CUDA graph capture and use the same access predicates as
2 changes: 2 additions & 0 deletions docs/api_reference/warp.rst
@@ -385,6 +385,8 @@ CUDA Memory Management
   :toctree: _generated

   Allocator
   ManagedAllocator
   MemoryKind
   ScopedAllocator
   ScopedMempool
   ScopedMempoolAccess
126 changes: 126 additions & 0 deletions docs/deep_dive/allocators.rst
@@ -311,6 +311,132 @@ For temporary allocator changes, use the :class:`ScopedAllocator` context manage
a = wp.zeros(1000, dtype=wp.float32, device="cuda:0")
# Original allocator is restored here

.. _managed_memory_allocation_options:

Managed Memory Allocator
~~~~~~~~~~~~~~~~~~~~~~~~

Managed memory is CUDA-managed storage that can be addressed from CPU and GPU
code. CUDA Unified Memory manages page placement and migration, so pages may move
between CPU and GPU memory as different processors touch them. Pinned CPU memory,
by contrast, remains host memory that a GPU may access through a host mapping.
The following table compares managed memory with the other allocation options:

.. list-table::
   :header-rows: 1
   :widths: 18 29 27 26

   * - Allocation option
     - Residency and migration
     - CPU/GPU access
     - Typical use
   * - Default CUDA
     - Device memory with no automatic CPU/GPU migration.
     - CUDA kernels access it directly; CPU code uses explicit copies.
     - General GPU arrays when CPU access is staged explicitly.
   * - CUDA mempool
     - Device memory from CUDA's stream-ordered pool, with no automatic CPU/GPU
       migration.
     - Same CPU/GPU access rules as default CUDA memory, with separate
       memory-pool access controls for peer GPUs.
     - Faster repeated CUDA allocations and graph-captured allocation when
       supported.
   * - Pinned CPU
     - Host memory that does not migrate into device memory as an allocation.
     - CPU code accesses it directly; CUDA devices with unified virtual
       addressing can access it through a host mapping.
     - Asynchronous CPU/GPU copies or zero-copy access to small host-resident
       data.
   * - CUDA managed
     - CUDA Unified Memory whose pages may migrate between CPU and GPU memory.
     - CPU and GPU access follow CUDA managed-memory support and synchronization
       rules.
     - Sharing data across CPU/GPU code when migration is preferable to manual
       copies.

:class:`ManagedAllocator` creates CUDA managed-memory arrays through Warp's
allocator interface. Managed arrays keep their CUDA device metadata, but
``wp.can_access()`` and checked launch validation use CUDA managed-memory access
rules for them instead of peer-access or memory-pool-access rules.

One major reason to choose this allocator is CPU/GPU shared work: on systems
where CUDA reports compatible managed-memory access, CPU kernels can directly
read and write managed CUDA arrays instead of maintaining a separate CPU copy.
Standard Warp CUDA arrays remain non-managed and still require explicit copies
before CPU code accesses them.

The allocator object is not bound to one CUDA device and can be constructed
before choosing a CUDA device. Warp invokes it under the target device's CUDA
context, which must support CUDA managed memory, and records that context as
the owner for each pointer:

.. code:: python

    import warp as wp

    managed = wp.ManagedAllocator()
    device = wp.get_device("cuda:0")

    # Arrays created on this device inside the scope use managed memory.
    with wp.ScopedAllocator(device, managed):
        a = wp.zeros(1000, dtype=wp.float32, device=device)

Constructing a :class:`ManagedAllocator` does not promise that pages initially
reside in any device's physical memory, and it does not bypass the device's
managed-memory capability check. The CUDA device used for each allocation
identifies the owner context and array device metadata; CUDA Unified Memory
manages physical placement and migration.

Use :attr:`array.memory_kind <warp.array.memory_kind>` to inspect the observed
memory class backing a concrete :class:`warp.array`:

.. code:: python

    if a.memory_kind is wp.MemoryKind.CUDA_MANAGED:
        ...

The memory kind describes the pointer's memory class as observed by Warp, and
for CUDA arrays by CUDA pointer attributes. It does not describe the current
physical residency of CUDA managed memory, and views report the memory kind of
their owner array. Indexed arrays do not expose a single memory kind because
their data and index arrays may have different backing allocations.

To use managed memory as a persistent allocator for all CUDA devices, install one
allocator instance with :func:`set_cuda_allocator`:

.. code:: python

    managed = wp.ManagedAllocator()
    wp.set_cuda_allocator(managed)

If only some CUDA devices should use managed memory, install the same allocator
with :func:`set_device_allocator` on those devices. A single allocator instance
can serve multiple CUDA devices, but allocation fails clearly on any target
device that does not report CUDA managed-memory support.

Direct calls to ``ManagedAllocator.allocate()`` require an active CUDA context.
Array factory functions such as :func:`zeros` and :func:`empty` pass the target
device context automatically and perform the same managed-memory support check.

Managed allocations currently have a CUDA graph-capture limitation in Warp:
:class:`ManagedAllocator` does not allocate a new array while CUDA graph capture
is active. If you need managed arrays with CUDA graphs, allocate them before
capture begins and reuse the existing arrays inside the captured work. This is
an implementation limitation, not a restriction on using pre-existing managed
arrays in captured work. Separately, arrays allocated by
:class:`ManagedAllocator` cannot be exported with ``array.ipc_handle()``. If IPC
is required, choose a different allocator for the shared data, or allocate and
export device arrays before switching allocator state.

CPU access to managed arrays is hardware-dependent. Use :func:`can_access` to
check a specific managed array before CPU code reads or writes it directly:

.. code:: python

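    if wp.can_access("cpu", a):
        # The CPU can read and write the managed array in place.
        wp.launch(cpu_kernel, dim=a.size, inputs=[a], device="cpu")
    else:
        # Otherwise work on an explicit CPU copy; writes to it do not update `a`.
        a_cpu = a.to("cpu")
        wp.launch(cpu_kernel, dim=a_cpu.size, inputs=[a_cpu], device="cpu")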

Writing a Custom Allocator
~~~~~~~~~~~~~~~~~~~~~~~~~~
