diff --git a/CHANGELOG.md b/CHANGELOG.md index 40fc1bdb09..d7f458cedf 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -15,6 +15,12 @@ - Extend `wp.utils.array_scan()` to 64-bit scalar and vector types, and extend `wp.utils.radix_sort_pairs()` to 32- and 64-bit signed, unsigned, and floating-point keys with 4- or 8-byte values ([GH-1538](https://github.com/NVIDIA/warp/issues/1538)). +- Add `wp.ManagedAllocator()` for explicit CUDA managed-memory arrays. CPU kernels can use managed arrays as an + opt-in path to read and write CUDA managed-memory allocations through Unified Memory on systems where CUDA reports + compatible managed-memory access, while Warp CUDA arrays backed by non-managed memory still need explicit CPU + copies. Use `array.memory_kind` to inspect whether an array is backed by host, pinned host, CUDA device, CUDA + mempool, or CUDA managed memory. Preallocated managed arrays work in CUDA graph captures, but capture-time allocation + is a current limitation ([GH-1523](https://github.com/NVIDIA/warp/issues/1523)). ### Removed diff --git a/design/hardware-coherent-memory-access.md b/design/hardware-coherent-memory-access.md index 30e0aa5bf3..bb8315d201 100644 --- a/design/hardware-coherent-memory-access.md +++ b/design/hardware-coherent-memory-access.md @@ -7,12 +7,12 @@ - Phase 1: [GH-1461](https://github.com/NVIDIA/warp/issues/1461) - Future phases: Track with follow-up GitHub issues as they are scheduled. -**Implementation status**: Phase 1 is implemented. Phases 2--5 remain future -work. +**Implementation status**: Phases 1 and 5 are implemented. Phases 2--4, 6, +and 7 remain future work. ## Motivation -Warp currently enforces a strict rule: every array argument passed to `wp.launch()` must reside on the same device as the kernel launch target. If a user creates an array on the CPU and attempts to launch a GPU kernel that reads it, Warp raises a `RuntimeError`. This enforcement exists in `warp/_src/context.py::pack_arg`: +Before Phase 1, Warp enforced a strict rule: every array argument passed to `wp.launch()` had to reside on the same device as the kernel launch target. If a user created an array on the CPU and attempted to launch a GPU kernel that read it, Warp raised a `RuntimeError`. This enforcement existed in `warp/_src/context.py::pack_arg`: ```python # check device @@ -28,7 +28,7 @@ This restriction is correct on discrete-GPU systems (e.g., a workstation with a - **Grace C2C systems (GH200, GB200, DGX Spark)** -- Grace ARM CPU + Hopper or Blackwell GPU connected via NVLink Chip-to-Chip (C2C). These systems can report host-page-table ATS, allowing the GPU to access ordinary system memory. CPU direct access to GPU-resident CUDA managed memory depends on `cudaDevAttrDirectManagedMemAccessFromHost`; do not assume it from the product family name. - **Jetson Orin and other limited Tegra systems** -- Integrated GPUs sharing the same DRAM as the CPU, but with a limited unified memory model where ordinary system allocations are not necessarily GPU-accessible. -- **Jetson Thor** -- Tegra Blackwell SoC with CUDA-reported ATS. On a Thor development kit tested with CUDA 13.0, the GPU can directly access ordinary system allocations (`malloc`, anonymous `mmap`, and file-backed `mmap`) and host-native atomics work, but CPU direct access to `cudaMalloc` memory is still not supported. +- **Jetson Thor** -- Tegra Blackwell SoC with CUDA-reported ATS. On a Thor development kit tested with CUDA 13.0, the GPU can directly access ordinary system allocations (`malloc`, anonymous `mmap`, and file-backed `mmap`) and the CUDA hardware reports host-native atomic support, but CPU direct access to `cudaMalloc` memory is still not supported. Current Warp CPU atomics do not provide a CPU/GPU interprocessor atomic contract. - **HMM-capable discrete systems** -- Linux kernel 6.1.24+ with Heterogeneous Memory Management (HMM) enabled allows software-coherent access to all system memory from PCIe GPUs, without requiring explicit CUDA allocation APIs. On all systems where the CUDA device reports `CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS`, the strict `value.device != device` check is overly conservative and forces users into unnecessary `wp.copy()` or `.to(device)` calls that are both a performance penalty and an ergonomic burden. On HMM and ATS systems in particular, a plain `malloc`'d pointer is directly accessible from the GPU -- there is no need to copy data at all. @@ -42,7 +42,7 @@ data = wp.array([1.0, 2.0, 3.0], device="cpu") wp.launch(my_kernel, dim=3, inputs=[data], device="cuda:0") ``` -gets a `RuntimeError` even though the hardware can handle this directly. The user must write: +previously got a `RuntimeError` even though the hardware can handle this directly. The user had to write: ```python data = wp.array([1.0, 2.0, 3.0], device="cpu") @@ -63,7 +63,9 @@ CUDA exposes unified memory capabilities through device attributes. The sections Applies to Windows systems including WSL and to Tegra/Jetson devices whose CUDA attributes report limited managed access. Do not infer this from the Jetson family name alone: Jetson Thor tested with CUDA 13.0 reports `concurrentManagedAccess == 1`, `pageableMemoryAccess == 1`, and `pageableMemoryAccessUsesHostPageTables == 1`, so it does not fall into this paradigm. Characteristics: -- Only memory explicitly allocated via `cudaMallocManaged` (or `cudaMallocFromPoolAsync` with `cudaMemAllocationTypeManaged`, or `__managed__` globals) behaves as unified memory. +- Only memory explicitly allocated via `cudaMallocManaged` (or, on CUDA 13.0+ + builds, `cudaMallocFromPoolAsync` from a managed memory pool, or + `__managed__` globals) behaves as unified memory. - Managed memory starts in CPU physical memory, is bulk-migrated to the GPU when a kernel begins executing, and is bulk-migrated back on synchronization. - The CPU must not access managed memory while the GPU is active. - Oversubscription of GPU memory is not allowed. @@ -105,7 +107,7 @@ Available on Grace Hopper, Grace Blackwell (including DGX Spark), Jetson Thor, a Characteristics: - ALL system-allocated memory is GPU-accessible (same as HMM). - GPU-resident CUDA managed memory is CPU-accessible without migration only when `cudaDevAttrDirectManagedMemAccessFromHost == 1`. This attribute is independent of ATS and must be queried directly. It is false on Jetson Thor as tested with CUDA 13.0, and false on a DGX Spark / GB10 system tested with CUDA Toolkit 13.0 and driver 580.95.05. -- Native CPU-GPU atomics work when `cudaDevAttrHostNativeAtomicSupported == 1`. This is a separate capability bit and does not imply CPU access to `cudaMalloc` allocations. +- `cudaDevAttrHostNativeAtomicSupported == 1` reports a hardware/link capability. This is a separate capability bit and does not imply CPU access to `cudaMalloc` allocations or that current Warp `wp.atomic_*` operations are safe for overlapping CPU/GPU updates. - Host page tables are used for system-memory access. On systems with distinct CPU and GPU memory pools (Grace Hopper / Grace Blackwell), physical placement still matters for performance. On integrated SoCs such as Jetson Thor, the CPU and GPU share a single DRAM pool. - ATS subsumes the system-memory access capabilities of HMM. When ATS is available, HMM is automatically disabled. @@ -119,9 +121,9 @@ The previous version of this document speculated that Jetson Thor would follow t - GPU kernels successfully read and wrote ordinary `malloc`, anonymous `mmap`, file-backed `mmap`, `cudaMallocHost`, `cudaHostRegister`, and `cudaMallocManaged` allocations. - `cudaMemPrefetchAsync` succeeded for both managed memory and ordinary `malloc` memory. - Direct CPU load/store of a `cudaMalloc` pointer faulted, matching `directManagedMemAccessFromHost == 0`. -- A stress test with overlapping CPU atomic increments and GPU `atomicAdd()` produced the exact expected result for ordinary `malloc`, pinned host memory, and managed memory. +- A standalone native stress test with real CPU atomic increments and GPU `atomicAdd()` produced the exact expected result for ordinary `malloc`, pinned host memory, and managed memory. This result does not apply to current Warp CPU `wp.atomic_*` lowering, which is not a hardware atomic. -The implementation must therefore treat "GPU can access system memory", "CPU can access GPU-resident CUDA managed memory", and "native CPU-GPU atomics work" as three independent capabilities. +The implementation must therefore treat "GPU can access system memory", "CPU can access GPU-resident CUDA managed memory", and "the hardware reports native CPU-GPU atomic capability" as three independent capabilities. The third is a diagnostic and future-work input, not a current Warp `wp.atomic_*` guarantee. #### Observed DGX Spark / GB10 Behavior @@ -146,6 +148,19 @@ For Phase 1, the relevant launch feature remains GPU access to CPU arrays via `pageableMemoryAccess`; CPU direct access to GPU-resident managed memory remains attribute-gated and is not used to validate Warp default CUDA arrays. +Further testing on a DGX Spark-class GB10 system on 2026-06-05 showed that +`is_cpu_gpu_atomic_supported == True` must not be treated as a Warp atomic API +contract. Warp's CPU `wp.atomic_*` helpers currently lower to ordinary +read/modify/write operations, which assumes Warp's serial CPU kernel execution +model and is not safe when CPU and GPU work update the same address +concurrently. CUDA-side `wp.atomic_*` operations also use the normal CUDA atomic +implementation and are not documented here as system-scope host/device atomics. +Follow-on work is required before Warp can advertise CPU/GPU interprocessor +atomics. That work should include CPU hardware-atomic lowering for supported +scalar operations, GPU system-scope atomic semantics where needed, and +operation-level CUDA host atomic capability queries such as CUDA 13's +`cudaDeviceGetHostAtomicCapabilities()`. + ### Summary of Access Rules by Paradigm | Allocation type | Limited (Tegra/Win) | Full Managed Only | HMM (Software) | ATS system-memory only (Thor/GB10 observed) | ATS with direct managed host access | @@ -185,7 +200,7 @@ Both DGX Spark / GB10 and Jetson Thor use Blackwell-generation GPUs, but their m | Coherency model | Host-page-table ATS with distinct CPU/GPU memory pools | Host-page-table ATS for system memory on an integrated SoC | | `malloc` GPU-accessible | Yes | Yes | | CPU direct access to GPU-resident CUDA managed memory | No (`directManagedMemAccessFromHost == 0` on CUDA 13.0 / driver 580.95.05) | No (`directManagedMemAccessFromHost == 0` on CUDA 13.0) | -| Native CPU-GPU atomics | Yes | Yes for host-visible memory | +| Native CPU-GPU atomic hardware capability | Reports yes; current Warp CPU/GPU `wp.atomic_*` overlap unsupported | Reports yes; current Warp CPU/GPU `wp.atomic_*` overlap unsupported | | Memory topology | Grace LPDDR5X + Blackwell HBM (NUMA) | Single shared DRAM pool | | Unified memory paradigm | ATS system-memory access (Paradigm 4) | ATS system-memory access (Paradigm 4) | | Best default allocator | System allocator (`malloc`) for shared CPU/GPU data | System allocator (`malloc`) for CPU-produced GPU-readable data; `cudaMalloc` for GPU-private data | @@ -202,13 +217,17 @@ This means the implementation must query capabilities independently instead of a | R4 | Provide `wp.prefetch()` API for explicit data migration hints | Should | Performance optimization for HMM / host-page-table ATS | | R5 | Optional automatic prefetch in `wp.launch()` for cross-device arrays on coherent systems | Could | Convenience, but needs careful defaults | | R6 | `wp.copy()` should skip staging buffers when direct access is available between devices | Could | Performance optimization, marked as TODO in current code | +| R7 | Provide explicit managed-memory arrays through a built-in allocator | Should | `wp.ManagedAllocator()` integrates with existing allocator APIs and records owner-context metadata | +| R8 | Support graph-capturable managed allocation on CUDA 13+ builds when managed memory pools are available | Could | Future managed-pool backend for `wp.ManagedAllocator()`; CUDA 12.x remains direct `cudaMallocManaged()` outside capture only | **Non-goals:** -- Changing the default allocator strategy (e.g., using `cudaMallocManaged` by default on limited Tegra systems). Allocator selection is a separate concern. -- Changing CUDA graph capture semantics. Phase 1 supports using `launch_array_access_mode` during graph capture, but does not add new cross-device synchronization, placement, or capture-time migration behavior beyond the same access checks used for ordinary launches. +- Changing the default allocator strategy. Managed memory remains opt-in through `wp.ManagedAllocator()`; standard CUDA arrays continue to use Warp's default CUDA or CUDA memory-pool allocators. +- Changing CUDA graph capture semantics for cross-device access checks. Phase 1 supports using `launch_array_access_mode` during graph capture, but does not add new cross-device synchronization, placement, or capture-time migration behavior beyond the same access checks used for ordinary launches. Phase 6 separately tracks a managed allocation backend that can be recorded in CUDA graphs. - Automatically determining the optimal physical placement for every array. This is a performance tuning concern best left to the user via hints. - Proactively detecting and warning about cross-device launches at `wp.launch()` time. The hardware enforces access rules; the verification mode is available for diagnosis when needed. - Providing a top-level device-to-device access wrapper. `wp.can_access(device, resource)` is a resource-oriented API; `wp.can_access(device, device)` is not supported. Device-level/default-allocation checks remain available as `Device.can_access(other_device)`. +- Adding a custom/external allocation metadata protocol in the managed-memory phase. CUDA pointer attributes classify external pointers where possible, but unclassified pointers and unowned memory-pool pointers remain conservative until a later metadata phase. +- Providing CPU/GPU interprocessor atomics through `wp.atomic_*`. Current CPU-side Warp atomics are ordinary updates under the serial CPU execution model, and CUDA-side Warp atomics are not specified as system-scope host/device operations. A future API or mode may add this once the required CPU lowering, GPU scope, and operation-level capability checks are designed. ## Design @@ -225,10 +244,15 @@ Warp currently supports building with CUDA 12.0 through 13.2. The default toolki | `CU_DEVICE_ATTRIBUTE_HOST_NATIVE_ATOMIC_SUPPORTED` | 86 | CUDA 8.0 | 1 | | `CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS_USES_HOST_PAGE_TABLES` | 100 | CUDA 9.2 | 2 | | `CU_DEVICE_ATTRIBUTE_INTEGRATED` | 18 | CUDA 2.0 | 3 | +| `CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY` | 83 | CUDA 6.0 | 5 | | `CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS` | 89 | CUDA 8.0 | 5 | All predate Warp's minimum of CUDA 12.0, so no `#if CUDA_VERSION` compile-time guards are needed for attribute queries. The attributes are queried via `cuDeviceGetAttribute`, which Warp already loads dynamically via `cuGetProcAddress` at version 2000. The driver returns 0 for any attribute the hardware does not support, which is the correct "feature not available" default. +**Driver API entry-point policy:** Warp loads CUDA Driver API functions dynamically through `cuGetProcAddress`. When a Driver API has multiple versioned entry points, request the oldest entry-point version whose signature and semantics satisfy Warp's use case. This keeps binaries built with newer toolkits compatible with older supported drivers instead of accidentally depending on a newer ABI variant selected by toolkit header macros. + +Use a newer Driver API entry point only when Warp needs newer semantics, such as a new parameter type or behavior that the older entry point cannot express. Guard references to newer header-only types, enum values, or functions with `#if CUDA_VERSION` when they are absent from older supported toolkits, and separately gate runtime loading/calls on the driver version. This policy does not permit emulating feature enum values that older CUDA versions document as unsupported; for example, CUDA managed memory pools require CUDA 13+ headers and runtime support. + **`cuMemPrefetchAsync` (Phase 2):** This driver API has two versions: | API version | Signature | Toolkit requirement | Driver requirement | @@ -236,19 +260,28 @@ All predate Warp's minimum of CUDA 12.0, so no `#if CUDA_VERSION` compile-time g | v1 (version 8000) | `(CUdeviceptr, size_t, CUdevice, CUstream)` | CUDA 8.0+ | CUDA 8.0+ driver | | v2 (version 12080) | `(CUdeviceptr, size_t, CUmemLocation, unsigned int, CUstream)` | CUDA 12.8+ | CUDA 12.8+ driver | -In CUDA 13.0 headers, `cuMemPrefetchAsync` is `#define`'d to `cuMemPrefetchAsync_v2`. Warp must handle both via `cuGetProcAddress` dynamic dispatch, following the existing pattern used for `cuMemcpyBatchAsync`. The v1 API is sufficient for all planned use cases. The v2 API adds NUMA node targeting but is not required. When compiled with CUDA 12.0--12.7, only v1 is available; this is fine. See Phase 2 for the full dispatch implementation. +In CUDA 13.0 headers, `cuMemPrefetchAsync` is `#define`'d to `cuMemPrefetchAsync_v2`. Warp should avoid that macro-selected newer ABI and explicitly request the v1 entry point via `cuGetProcAddress` because v1 is sufficient for all planned use cases. The v2 API adds NUMA node targeting but is not required. See Phase 2 for the full dispatch implementation. + +**Managed allocation APIs (Phase 5):** `wp.ManagedAllocator()` uses `cudaMallocManaged(..., cudaMemAttachGlobal)` outside CUDA graph capture. The allocator must explicitly avoid calling `cudaMallocManaged` during stream capture: local testing on driver 580.126.20 showed `cudaMallocManaged` returns `cudaErrorStreamCaptureUnsupported` and invalidates capture. Managed arrays allocated before capture can still be used by captured kernels. + +**Managed memory-pool APIs (Phase 6):** CUDA 13.0 adds the managed memory-pool allocation type needed for graph-capturable managed allocation. On CUDA 13.0+ builds, devices with memory-pool support can use a private CUDA memory pool whose `cudaMemPoolProps.allocType` is `cudaMemAllocationTypeManaged`, with allocations made through `cudaMallocFromPoolAsync()` on the current Warp stream and freed through `cudaFreeAsync()`. This path is stream ordered and can be recorded in CUDA graphs. + +CUDA 12.x builds, including the CUDA 12.9 PyPI build, must not compile or emulate the managed-pool path: `cudaMemAllocationTypeManaged` is not defined there, and CUDA 12.9 documents `cudaMemPoolProps::allocType` as pinned-only. `cudaMallocAsync()`, `cudaFreeAsync()`, CUDA memory pools, and `cudaMallocFromPoolAsync()` were introduced before Warp's CUDA 12.0 minimum, but the managed pool allocation type itself is CUDA 13-only. Runtime support still needs to be gated by the existing memory-pool support query and by successful creation of the managed pool. + +Local testing on a Blackwell GPU with driver 580.126.20 showed `cudaMallocFromPoolAsync()` from a pool with `cudaMemAllocationTypeManaged` can be captured, instantiated, launched, and synchronized successfully. Pool creation should happen before capture begins. If an allocation during capture finds no initialized managed pool, Warp should reject the allocation clearly rather than attempting a capture-unsafe pool creation or falling back to `cudaMallocManaged()`. **Summary by toolkit version:** | Feature | CUDA 12.0 -- 12.7 | CUDA 12.8 -- 12.9 (PyPI default) | CUDA 13.0+ | |---|---|---|---| | Phase 1 (cross-device launch) | Full support | Full support | Full support | -| Phase 2 (prefetch) | v1 API only | v2 API available | v2 API available | +| Phase 2 (prefetch) | v1 API | v1 API; v2 available but not required | v1 API; v2 available but not required | | Phase 3 (auto-prefetch) | Full support (uses Phase 2 API) | Full support | Full support | | Phase 4 (`wp.copy()` optimization) | Full support | Full support | Full support | -| Phase 5 (expanded resource and allocation metadata) | Full support | Full support | Full support | +| Phase 5 (`wp.ManagedAllocator`) | Direct `cudaMallocManaged` outside capture; capture-time allocation unavailable | Direct `cudaMallocManaged` outside capture; capture-time allocation unavailable | Direct `cudaMallocManaged` outside capture; capture-time allocation unavailable | +| Phase 6 (managed memory pools) | Not available; direct fallback outside capture only | Not available; direct fallback outside capture only | Managed pool when available; direct fallback outside capture otherwise | -No phase requires a minimum toolkit version beyond CUDA 12.0. Degradation on older toolkits only affects which `cuMemPrefetchAsync` signature is available, which is handled transparently by the dynamic dispatch. +No phase requires a minimum toolkit version beyond CUDA 12.0 to compile or expose its public API. The Phase 2 prefetch wrapper uses the v1 Driver API entry point for compatibility; v2 is only needed for future NUMA-node targeting. Phase 5 does not support managed allocation during CUDA graph capture on any toolkit version. Phase 6 adds a CUDA 13-only native backend guarded at compile time and runtime; older builds keep the Phase 5 behavior and reject managed allocation during capture. ### Overview: What Each Phase Introduces @@ -260,7 +293,9 @@ Each phase introduces only the device attributes, native functions, and Python A | 2 | Future | `wp.prefetch()` for explicit data placement | `pageable_memory_access_uses_host_page_tables` (to distinguish HMM from host-page-table ATS for warning/no-op behavior) | `wp_cuda_mem_prefetch_async` | | 3 | Future | Auto-prefetch in `wp.launch()` | `is_integrated` (to avoid pointless prefetches on shared-DRAM SoCs) | None | | 4 | Future | `wp.copy()` staging-buffer optimization | None (reuses Phase 1 access predicates) | None | -| 5 | Future | Expand `wp.can_access()` to additional resources and richer managed/custom allocation metadata | `concurrent_managed_access` (to distinguish limited vs. full managed memory) | None | +| 5 | Implemented | `wp.ManagedAllocator()`, `wp.MemoryKind`, `array.memory_kind`, managed-memory-aware `wp.can_access()` and checked launches | `managed_memory`, `concurrent_managed_access` | `wp_alloc_device_managed`, `wp_free_device_managed`, `wp_cuda_pointer_get_memory_kind` | +| 6 | Future | CUDA 13 managed memory-pool backend for graph-capturable `wp.ManagedAllocator()` allocations | None; reuses `managed_memory` and memory-pool support state | Extends `wp_alloc_device_managed` / `wp_free_device_managed` | +| 7 | Future | Expand `wp.can_access()` to additional resources and custom/external allocation metadata | None expected initially | None | ### Phase 1: Cross-Device Launch Support @@ -276,9 +311,9 @@ Three CUDA device attributes are needed: - **`CU_DEVICE_ATTRIBUTE_DIRECT_MANAGED_MEM_ACCESS_FROM_HOST`** -- answers "can the CPU directly access CUDA managed memory resident on the GPU without migration?" This does not imply that Warp `wp.array(device="cuda:0")` allocations backed by `cuMemAlloc` via `CudaDefaultAllocator` can be safely passed to CPU kernels. Phase 1 exposes the capability as a device property, but `Device.can_access()` and `LaunchArrayAccessMode.CHECKED` remain conservative for CPU-to-CUDA Warp arrays because Warp's built-in CUDA arrays are not CUDA managed-memory allocations. `LaunchArrayAccessMode.RELAXED` still passes those pointers through when requested by the user. -- **`CU_DEVICE_ATTRIBUTE_HOST_NATIVE_ATOMIC_SUPPORTED`** -- answers "do CPU-GPU atomics work natively across the interconnect?" On systems where this is true (DGX Spark / GB10 and Jetson Thor as tested), a GPU `atomicAdd` targeting a CPU-resident address produces correct results via hardware coherency. On HMM systems, the same operation can silently produce wrong results -- the GPU atomic hits a page backed by CPU physical memory without hardware coherency for atomic operations. Exposing this as a device property lets users and downstream tools (e.g., documentation, `wp.prefetch()` heuristics) reason about atomic safety. This attribute must be treated independently from `direct_managed_mem_access_from_host`. +- **`CU_DEVICE_ATTRIBUTE_HOST_NATIVE_ATOMIC_SUPPORTED`** -- answers whether the CUDA device reports native CPU-GPU atomic hardware capability across the interconnect. This is not a Warp `wp.atomic_*` contract. Current CPU-side Warp atomics are plain read/modify/write operations, and CUDA-side Warp atomics are not specified as system-scope host/device operations. Exposing this as a device property lets users and downstream tools (e.g., documentation, `wp.prefetch()` heuristics) reason about hardware capability, but CPU/GPU atomic algorithms remain follow-up work. This attribute must be treated independently from `direct_managed_mem_access_from_host`. -The first attribute is needed to gate the GPU-accessing-CPU branch in `Device.can_access()`, `wp.can_access(device, array)`, and allocation-aware launch verification. The second and third are exposed as queryable device properties for users who need to reason about managed-memory host access and cross-device atomic safety. `Device.can_access()`, `wp.can_access(device, array)`, and `LaunchArrayAccessMode.CHECKED` do not use `direct_managed_mem_access_from_host` for CPU-to-CUDA default arrays because those are not CUDA managed-memory allocations. +The first attribute is needed to gate the GPU-accessing-CPU branch in `Device.can_access()`, `wp.can_access(device, array)`, and allocation-aware launch verification. The second and third are exposed as queryable device properties for users who need to reason about managed-memory host access and CUDA-reported cross-device atomic hardware capability. `Device.can_access()`, `wp.can_access(device, array)`, and `LaunchArrayAccessMode.CHECKED` do not use `direct_managed_mem_access_from_host` for CPU-to-CUDA default arrays because those are not CUDA managed-memory allocations. **Native layer changes (`warp/native/warp.cu`, `warp/native/warp.h`)** @@ -361,8 +396,10 @@ Add documented Python `Device` properties using Warp's CPU/GPU terminology rathe is_gpu_memory_access_from_cpu_supported (bool): Indicates whether CPU code can directly access CUDA managed memory resident on this device without migration. Does not imply Warp default CUDA arrays are CPU-accessible. ``False`` for CPU devices. - is_cpu_gpu_atomic_supported (bool): Indicates whether native atomic operations between CPU and GPU - memory are supported for this device. ``False`` for CPU devices. + is_cpu_gpu_atomic_supported (bool): Indicates whether the CUDA device reports native CPU/GPU atomic + hardware capability. This is not a guarantee that Warp ``wp.atomic_*`` operations can be used + concurrently from CPU and GPU kernels; current CPU-side Warp atomics are not hardware atomics. + ``False`` for CPU devices. ``` Add the properties to `Device.__init__` for CUDA devices: @@ -458,13 +495,16 @@ def can_access(device: DeviceLike, resource) -> bool: raise TypeError("wp.can_access() only supports Warp arrays in this release") ``` -`wp.can_access(device, array)` answers whether code running on `device` can directly dereference the memory backing `array`. It is allocation-aware where Warp can identify or trust the allocation: +`wp.can_access(device, array)` answers whether code running on `device` can directly dereference the memory backing `array`. It is memory-kind-aware where Warp can classify the backing pointer and has enough ownership metadata for the relevant access predicate: - Same device/context returns `True`. - CUDA device accessing a CPU array returns `True` for pinned CPU arrays on UVA CUDA devices, and otherwise follows `device.is_cpu_memory_access_from_gpu_supported`. -- CPU accessing a CUDA array returns `False` for current Warp CUDA arrays because Warp's CUDA allocators do not create CUDA managed-memory allocations. -- CUDA device accessing a CUDA array on another CUDA device uses peer access for `CudaDefaultAllocator` arrays and memory-pool access for `CudaMempoolAllocator` arrays. -- CUDA arrays backed by custom allocators or externally wrapped allocations return `False` because Warp cannot know whether peer access, memory-pool access, managed-memory semantics, or some other authorization path applies. +- CPU accessing non-managed CUDA device or memory-pool arrays returns `False`; CPU access to CUDA managed arrays follows the managed-memory predicates introduced in Phase 5. +- CUDA device accessing CUDA managed memory uses managed-memory predicates, not peer or memory-pool predicates. +- CUDA device accessing another CUDA device's ordinary CUDA device memory uses peer access, including for externally wrapped pointers that CUDA classifies as `wp.MemoryKind.CUDA_DEVICE`. +- CUDA device accessing another CUDA device's Warp-owned memory-pool allocation uses memory-pool access. +- Externally wrapped or custom CUDA memory-pool allocations remain unknown for cross-device access proof because CUDA pointer attributes identify the memory kind but not whether Warp's queried default-pool access state applies to that specific pool. +- Unclassified custom or external CUDA pointers return `False` through `wp.can_access()` and warn in checked launches. `False` therefore means "Warp cannot verify that this resource is directly accessible", not necessarily "the hardware could never access this pointer." Advanced users may still use `LaunchArrayAccessMode.RELAXED` to pass pointers through when they know the allocation is valid for the launch device. @@ -472,7 +512,7 @@ The API intentionally does not support `wp.can_access(device, device)`. Device-l Any internal or public path that has a concrete array should prefer `wp.can_access(device, array)` over `Device.can_access(array.device)`. This includes `LaunchArrayAccessMode.CHECKED` and the future `wp.copy()` staging optimization. `Device.can_access()` is useful only when no concrete resource is available and the caller accepts a coarse answer for the target device's current built-in allocation mode. -Implementation follows Warp array views to their owner allocation where possible. `wp.can_access()` remains a conservative boolean wrapper, while `LaunchArrayAccessMode.CHECKED` uses a private tri-state classifier to distinguish known-inaccessible allocations from unknown custom or external allocation provenance: +Implementation follows Warp array views to their owner allocation where possible. `wp.can_access()` remains a conservative boolean wrapper, while `LaunchArrayAccessMode.CHECKED` uses a private tri-state classifier to distinguish known-inaccessible allocations from unknown custom or external access paths: ```python def _get_array_allocator(value): @@ -565,7 +605,7 @@ if value.device != device: _raise_launch_array_access_error(kernel, arg_name, value, device) ``` -`LaunchArrayAccessMode.CHECKED` checks the actual Warp array allocation where Warp can determine it. Known-inaccessible allocations raise before launch. Unknown custom allocator or externally wrapped allocations warn through a bounded cache keyed by `(kernel, argument name, source device, launch device)` and then proceed, leaving legality to the user and hardware. +`LaunchArrayAccessMode.CHECKED` checks the actual Warp array pointer where Warp can classify it. Known-accessible pointers proceed, known-inaccessible pointers raise before launch, and unknown access cases warn through a bounded cache keyed by `(kernel, argument name, source device, launch device)` before proceeding. Unknown cases include unclassified CUDA pointers and externally wrapped or custom memory-pool pointers whose specific pool access state Warp cannot prove. The policy helper is responsible for mode validation: @@ -623,9 +663,10 @@ the default. ``LaunchArrayAccessMode.STRICT`` requires every Warp array argument to be on the launch device, matching Warp's original behavior. ``LaunchArrayAccessMode.CHECKED`` checks whether cross-device Warp array arguments are accessible from the launch device before passing their pointers to -the kernel. For Warp-owned arrays, checked mode uses the array's allocation type -where Warp can determine it. Unknown custom or externally wrapped allocation -provenance warns through a bounded per-launch-pattern cache and then proceeds. +the kernel. Checked mode uses the array's observed memory kind and ownership +metadata where Warp can determine them. Unknown access cases, including +unclassified CUDA pointers and unowned memory-pool pointers, warn through a +bounded per-launch-pattern cache and then proceed. Unlike ``verify_cuda``, this setting can be used during CUDA graph capture because checks run before each launch is recorded. For cross-GPU graph capture, @@ -680,7 +721,7 @@ Strict mode (`LaunchArrayAccessMode.STRICT`): every cross-device Warp array argu | `cpu` | `cuda:0` | **RuntimeError** | | `cuda:0` | `cuda:1` | **RuntimeError** | -Checked mode (`LaunchArrayAccessMode.CHECKED`): each Warp-owned array argument is checked with allocation-aware launch verification where Warp can determine the allocator. Unknown custom or external allocation provenance warns through a bounded per-launch-pattern cache and then proceeds. +Checked mode (`LaunchArrayAccessMode.CHECKED`): each Warp array argument is checked with memory-kind-aware launch verification where Warp can determine the backing memory class and relevant access predicate. Unknown access cases, including unclassified CUDA pointers and unowned memory-pool pointers, warn through a bounded per-launch-pattern cache and then proceed. | Launch device | Array device | Discrete GPU (no HMM) | HMM system | Jetson Thor | Host-page-table ATS (DGX Spark / GB10 observed) | |---|---|---|---|---|---| @@ -752,19 +793,14 @@ This wraps `cuMemPrefetchAsync` (driver API). The `device_ordinal` can be `-1` t - **v1** (CUDA 8.0+, version 8000): `cuMemPrefetchAsync(CUdeviceptr, size_t, CUdevice dstDevice, CUstream)` -- takes a simple `CUdevice` ordinal for the destination. - **v2** (CUDA 12.8+, version 12080): `cuMemPrefetchAsync(CUdeviceptr, size_t, CUmemLocation location, unsigned int flags, CUstream)` -- takes a `CUmemLocation` struct (supports NUMA node targeting) and flags. -In CUDA 13.0 headers, `cuMemPrefetchAsync` is `#define`'d to `cuMemPrefetchAsync_v2`. Warp dynamically loads driver entry points via `cuGetProcAddress`, so the implementation must handle both versions: +In CUDA 13.0 headers, `cuMemPrefetchAsync` is `#define`'d to `cuMemPrefetchAsync_v2`. Following Warp's Driver API entry-point policy, the implementation should avoid the macro-selected newer ABI and request the oldest sufficient entry point explicitly: ```cpp // In init_cuda_driver(), load the prefetch entry point: -#if CUDA_VERSION >= 12080 -if (driver_version >= 12080) - get_driver_entry_point("cuMemPrefetchAsync", 12080, &(void*&)pfn_cuMemPrefetchAsync_v2); -else -#endif - get_driver_entry_point("cuMemPrefetchAsync", 8000, &(void*&)pfn_cuMemPrefetchAsync_v1); +get_driver_entry_point("cuMemPrefetchAsync", 8000, &(void*&)pfn_cuMemPrefetchAsync_v1); ``` -The `wp_cuda_mem_prefetch_async` wrapper dispatches to whichever version was loaded: +The `wp_cuda_mem_prefetch_async` wrapper uses the v1 signature: ```cpp int wp_cuda_mem_prefetch_async(void* ptr, size_t size_in_bytes, @@ -773,20 +809,6 @@ int wp_cuda_mem_prefetch_async(void* ptr, size_t size_in_bytes, CUdeviceptr devPtr = (CUdeviceptr)ptr; CUstream hStream = (CUstream)stream; -#if CUDA_VERSION >= 12080 - if (pfn_cuMemPrefetchAsync_v2) { - CUmemLocation location; - if (device_ordinal >= 0) { - location.type = CU_MEM_LOCATION_TYPE_DEVICE; - location.id = device_ordinal; - } else { - location.type = CU_MEM_LOCATION_TYPE_HOST; - location.id = 0; - } - return check_cu(pfn_cuMemPrefetchAsync_v2(devPtr, size_in_bytes, - location, 0, hStream)) ? 0 : -1; - } -#endif if (pfn_cuMemPrefetchAsync_v1) { CUdevice dstDevice = (device_ordinal >= 0) ? g_devices[device_ordinal].device @@ -798,9 +820,9 @@ int wp_cuda_mem_prefetch_async(void* ptr, size_t size_in_bytes, } ``` -This pattern follows the existing `cuMemcpyBatchAsync` convention in -`cuda_util.cpp`, which uses the same `#if CUDA_VERSION >= 12080` / -`driver_version >= 12080` dispatch pattern. +This deliberately differs from APIs such as `cuMemcpyBatchAsync`, where Warp +needs newer semantics and therefore gates the newer entry point with both +`#if CUDA_VERSION` and `driver_version` checks. **Compile-time / runtime compatibility matrix for Phase 2:** @@ -808,11 +830,11 @@ This pattern follows the existing `cuMemcpyBatchAsync` convention in |---|---|---|---| | CUDA 12.0 -- 12.7 | Any 12.0+ | Yes | v1 (CUdevice) | | CUDA 12.8+ | Driver < 12.8 | Yes | v1 (CUdevice) | -| CUDA 12.8+ | Driver >= 12.8 | Yes | v2 (CUmemLocation) | +| CUDA 12.8+ | Driver >= 12.8 | Yes | v1 (CUdevice); v2 reserved for future NUMA targeting | -The v1 API is fully sufficient for the `wp.prefetch()` use case (migrate to a device or to the CPU). The v2 API adds NUMA node targeting which is not needed initially but is available when both toolkit and driver support it. +The v1 API is fully sufficient for the `wp.prefetch()` use case (migrate to a device or to the CPU). The v2 API adds NUMA node targeting, which is not needed initially. If Warp later exposes NUMA targeting, that path should add a separate v2 load guarded by both toolkit headers and runtime driver support. -**Disabling prefetch on older CUDA:** If Warp is compiled with CUDA 12.0 -- 12.7, only the v1 entry point is loaded. The v1 API works for `cudaMallocManaged` allocations on all systems, and also for system-allocated (`malloc`) memory on HMM / host-page-table ATS systems. The Python `wp.prefetch()` wrapper should catch errors from the driver (e.g., if the pointer is not in a prefetchable region) and emit a warning rather than raising, since prefetch is a performance hint. +**Disabling prefetch on older CUDA:** Warp loads the v1 entry point across toolkit versions. The v1 API works for `cudaMallocManaged` allocations on all systems, and also for system-allocated (`malloc`) memory on HMM / host-page-table ATS systems. The Python `wp.prefetch()` wrapper should catch errors from the driver (e.g., if the pointer is not in a prefetchable region) and emit a warning rather than raising, since prefetch is a performance hint. Implementation notes: - `cuMemPrefetchAsync` works on any pointer that falls within a unified memory region -- including plain `malloc` on HMM / host-page-table ATS systems, `cuMemAllocManaged` allocations, and `cuMemAlloc` allocations on systems where device allocations are host-accessible. @@ -955,54 +977,231 @@ if src.device != dest.device: This is a performance optimization and not required for correctness -- the existing staging approach works correctly on all systems. -### Phase 5: Expand Resource and Allocation Metadata (Future) +### Phase 5: Managed Allocator and Memory Kind -**Goal:** Extend the public `wp.can_access(device, resource)` API introduced in Phase 1 beyond Warp arrays, and add richer allocation metadata for cases Phase 1 must conservatively treat as unknown. Candidate future resources include hash grids and meshes. Candidate allocation refinements include future managed-memory allocators, custom allocator access declarations, and externally wrapped allocations with explicit access metadata. +**Goal:** Add an explicit managed-memory allocation path for Warp arrays without changing the meaning of `device` or the default CUDA allocator. Managed memory is a CUDA pointer memory kind, not a new device. A managed Warp array remains associated with the CUDA device used to allocate it, but `wp.can_access()` and `LaunchArrayAccessMode.CHECKED` can apply managed-memory access rules instead of treating the pointer as ordinary CUDA device memory or an unknown pointer. -This phase introduces one additional device attribute. +This phase introduces `wp.ManagedAllocator()`, `wp.MemoryKind`, `array.memory_kind`, two additional CUDA device attributes, managed native allocation wrappers, and a native CUDA pointer classifier. + +#### Public API: `wp.ManagedAllocator` -#### New device attribute: `concurrent_managed_access` +`wp.ManagedAllocator` is a top-level allocator class that satisfies the existing `Allocator` protocol. It has no device argument and no public attach-flag argument. The allocator object is not bound to one CUDA device and can be constructed before any CUDA context is current. Each allocation still happens under the target device's CUDA context, and that device must report CUDA managed-memory support. It is used through the same APIs as other CUDA allocators: -**CUDA attribute:** `CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS` +```python +managed = wp.ManagedAllocator() + +with wp.ScopedAllocator("cuda:0", managed): + data = wp.empty(1024, dtype=wp.float32, device="cuda:0") +``` + +The array's `device` remains `cuda:0`: + +```python +data.device == wp.get_device("cuda:0") +``` -This attribute distinguishes the "limited" unified memory paradigm (limited Tegra, Windows -- `concurrent_managed_access == 0`) from the "full" paradigms (`concurrent_managed_access == 1`). On limited systems, `cudaMallocManaged` allocations bulk-migrate and cannot be concurrently accessed by CPU and GPU. On full systems, managed allocations support page-granularity migration with concurrent access. +The `ManagedAllocator` constructor intentionally does not take a device. A device argument would suggest that `cudaMallocManaged` immediately places pages in that device's physical memory, which CUDA does not guarantee. The target device/context still matters for allocation API calls: Warp's array constructors push the target CUDA context before invoking the allocator, and direct calls to `ManagedAllocator.allocate()` require the caller to have already made a managed-memory-capable CUDA context current. `ManagedAllocator` records pointer-to-owner-context metadata internally so one allocator instance can serve multiple CUDA devices. Physical placement is left to CUDA Unified Memory and can be guided explicitly through `wp.prefetch()` once Phase 2 exists. -Expanded allocation metadata needs this because it must answer: "if this specific array was allocated with `cudaMallocManaged` (a future Warp managed allocator), can the GPU access it concurrently with the CPU?" The answer depends on this attribute. +`wp.ManagedAllocator()` always uses global managed-memory attach semantics. For the direct fallback path this means `cudaMallocManaged(..., cudaMemAttachGlobal)`. Warp does not expose `cudaMemAttachHost` or `cudaMemAttachSingle` in the initial API; those are specialized ownership/scheduling controls better left to custom allocators or a future explicit stream-attach API. -#### Allocator and resource tracking +Users may install one `ManagedAllocator` for all CUDA devices: -Phase 1 `wp.can_access(device, array)` and launch verification already distinguish the allocation classes Warp can identify: +```python +wp.set_cuda_allocator(wp.ManagedAllocator()) +``` -- `CpuDefaultAllocator` -- uses `wp_alloc_host` (wraps `malloc`/`calloc`) -- `CpuPinnedAllocator` -- uses `wp_alloc_pinned` (wraps `cudaMallocHost`) -- `CudaDefaultAllocator` -- uses `wp_alloc_device_default` (wraps `cuMemAlloc`) -- `CudaMempoolAllocator` -- uses `wp_alloc_device_async` (wraps `cuMemAllocAsync`) +Because allocation happens under the target device context and the allocator object stores no device of its own, sharing one instance across multiple managed-memory-capable CUDA devices is valid. The Python allocator records the owner context for each pointer so deallocation uses the same context even if another CUDA context is current. Direct calls to `ManagedAllocator.allocate()` require an active CUDA context whose device supports managed memory; array factory calls pass the target device context automatically. -On a discrete GPU without HMM: -- Pinned CPU allocations (`CpuPinnedAllocator`) ARE GPU-accessible through UVA, and Phase 1 `wp.can_access(device, array)` plus launch verification accept Warp-owned pinned CPU arrays when `device.is_uva` is true. `Device.can_access()` remains a device-level/default-allocation query and still does not distinguish pinned CPU arrays from ordinary CPU arrays. -- Default CPU allocations (`CpuDefaultAllocator`) are NOT GPU-accessible. -- Both CUDA allocators produce GPU-only memory. +#### Memory-kind inspection -Phase 5 should preserve the same public API shape: +Before this phase, access rules were inferred mostly from allocator class identity. Managed arrays and arrays received from another layer need a first-class memory-kind query so users can inspect what Warp can observe about the backing pointer without relying on a private allocator object. The public enum is: + +```python +class MemoryKind(enum.Enum): + HOST = "host" + PINNED_HOST = "pinned_host" + CUDA_DEVICE = "cuda_device" + CUDA_MEMPOOL = "cuda_mempool" + CUDA_MANAGED = "cuda_managed" + UNKNOWN = "unknown" +``` + +Expose the query through `array.memory_kind` on concrete `wp.array` instances. The property returns a `wp.MemoryKind` value and follows views to their owner allocation. CPU arrays are classified from Warp array state, including zero-sized arrays. CUDA arrays are classified from CUDA Driver API pointer attributes: + +- `CU_POINTER_ATTRIBUTE_IS_MANAGED` identifies managed memory. +- `CU_POINTER_ATTRIBUTE_MEMORY_TYPE` distinguishes host and device memory. +- `CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE` identifies CUDA memory-pool allocations. + +The CUDA classifier returns `wp.MemoryKind.UNKNOWN` if CUDA cannot classify the pointer. This lets externally wrapped managed, default-device, and memory-pool pointers report their observed CUDA memory class, while still keeping access checks conservative when Warp lacks the allocator ownership metadata needed to prove cross-device access. `wp.MemoryKind.CUDA_DEVICE` intentionally covers CUDA device memory that is not classified as managed or memory-pool memory; it is not named after one specific allocation API. + +The memory kind reports the observed pointer class only. It does not report current physical residency of managed pages, synchronization state, peer or memory-pool access authorization, or whether a CPU/GPU can safely access the pointer at that moment. Accessibility remains a separate query through `wp.can_access(device, array)`. Indexed arrays do not expose a single public memory kind because data and index buffers can be backed by different allocations; `wp.can_access()` and checked launches inspect both buffers separately. + +#### New device attributes + +Two CUDA attributes are used: + +- **`CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY`** -- answers whether the device can allocate managed memory on this system. Expose this as `Device.is_managed_memory_supported`. +- **`CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS`** -- distinguishes limited managed-memory systems from full concurrent managed-memory systems. Expose this as `Device.is_concurrent_managed_access_supported`. + +Native layer additions follow the Phase 1 pattern: + +```cpp +struct DeviceInfo { + // ... existing fields ... + int managed_memory = 0; + int concurrent_managed_access = 0; +}; +``` + +Query during device enumeration: + +```cpp +check_cu(cuDeviceGetAttribute_f( + &g_devices[i].managed_memory, + CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY, device)); +check_cu(cuDeviceGetAttribute_f( + &g_devices[i].concurrent_managed_access, + CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, device)); +``` + +Expose accessors: + +```cpp +WP_API int wp_cuda_device_get_managed_memory(int ordinal); +WP_API int wp_cuda_device_get_concurrent_managed_access(int ordinal); +``` + +Register ctypes bindings in `Runtime.__init__`, then set the Python properties on CUDA devices. CPU devices set both to `False`. + +#### Native managed allocation backend + +The native managed allocation API exposes these wrappers: + +```cpp +WP_API void* wp_alloc_device_managed(void* context, size_t size, const char* tag = nullptr); +WP_API bool wp_free_device_managed(void* context, void* ptr); +``` + +The native wrapper chooses the backend: + +1. If the context's CUDA device does not support managed memory, return `NULL` and set an error string. +2. If CUDA stream capture is active, return `NULL` and set a clear error: managed allocation during CUDA graph capture is not supported. This avoids calling `cudaMallocManaged()` inside capture and invalidating the capture. +3. Otherwise allocate with `cudaMallocManaged(&ptr, size, cudaMemAttachGlobal)`. + +Free managed allocations with `cudaFree()`, using the same deferred-free approach as `CudaDefaultAllocator` when graph captures are active. + +Local verification on a Blackwell GPU with driver 580.126.20 confirmed: + +- `cudaMallocManaged` during global or thread-local stream capture returns `cudaErrorStreamCaptureUnsupported` and invalidates capture. +- A pointer allocated by `cudaMallocManaged` before capture can be used by captured kernels successfully. + +The native wrapper must therefore guard the allocation path before calling `cudaMallocManaged` during capture. + +#### Managed access rules + +For `wp.MemoryKind.CUDA_MANAGED`, `wp.can_access(device, array)` and `LaunchArrayAccessMode.CHECKED` use managed-memory predicates, not peer or memory-pool predicates: + +- Same device/context: `True`. +- CUDA device accessing a managed CUDA array: `True` when the target CUDA device reports `is_managed_memory_supported`. Warp peer access and Warp memory-pool access are not required; CUDA Unified Memory handles migration and visibility. P2P topology can still affect placement and performance. +- CPU accessing a managed CUDA array: `True` only when the owner CUDA device reports `is_concurrent_managed_access_supported` or `is_gpu_memory_access_from_cpu_supported`. Limited managed-memory systems return `False` because CPU access depends on synchronization state that `wp.can_access()` cannot verify. +- CPU/GPU atomics are not implied by managed memory. `device.is_cpu_gpu_atomic_supported` reports only the CUDA hardware capability bit, and current Warp `wp.atomic_*` operations must not be used as overlapping CPU/GPU interprocessor atomics. + +The CPU rule is intentionally conservative. On limited managed-memory systems, CUDA permits some CPU access patterns after synchronization, but `wp.can_access()` is not a synchronization-state query and checked launch validation cannot prove that the CPU will avoid accessing the allocation while the GPU is active. + +#### Interop and array behavior + +Managed arrays remain CUDA arrays in Warp: + +- `array.device` is the CUDA device used for allocation. +- `array.pinned` is `False`. +- `__cuda_array_interface__` remains available. +- DLPack exports as CUDA, not CPU. +- `array.cptr()` remains unavailable because the array is not a CPU array. +- `array.numpy()` keeps the existing copy-to-CPU behavior in Phase 5. Zero-copy NumPy views over managed CUDA arrays are a separate feature because they would change host synchronization and lifetime expectations. +- CUDA IPC rejects managed allocations in Phase 5, matching the conservative handling for memory-pool allocations. IPC support for managed memory can be considered separately if a concrete use case appears. + +`wp.prefetch()` from Phase 2 accepts managed arrays and is the explicit physical-placement hint for users who want to move pages toward a GPU or back toward the CPU. `wp.ManagedAllocator()` itself does not promise initial residency. + +### Phase 6: CUDA 13 Managed Memory-Pool Allocation + +**Goal:** Extend `wp.ManagedAllocator()` so CUDA 13.0+ builds can allocate managed arrays during CUDA graph capture when the device supports managed memory pools. This is a native backend extension, not a new public memory kind. Arrays still report `wp.MemoryKind.CUDA_MANAGED`, and CUDA 12.x builds keep the Phase 5 `cudaMallocManaged()` behavior. + +#### Managed pool creation + +Phase 6 creates one private managed memory pool per CUDA context/device when the feature is available. The pool is not Warp's ordinary CUDA memory pool; it is a separate CUDA pool configured for managed memory: + +```cpp +#if CUDART_VERSION >= 13000 +cudaMemPoolProps props = {}; +props.allocType = cudaMemAllocationTypeManaged; +props.handleTypes = cudaMemHandleTypeNone; +props.location.type = cudaMemLocationTypeDevice; +props.location.id = device_ordinal; +cudaMemPoolCreate(&managed_pool, &props); +#endif +``` + +The CUDA 13 compile-time guard is required because `cudaMemAllocationTypeManaged` is not defined in CUDA 12.x headers. Warp should not define a local stand-in enum value or try to emulate this on CUDA 12.9: CUDA 12.9 documents memory-pool allocation type support as pinned-only. + +Managed pool creation should happen outside CUDA graph capture. During capture, the native allocator may use an already-created managed pool but must not attempt capture-unsafe pool creation or direct `cudaMallocManaged()` fallback. For captures started by Warp, initialization can happen before capture begins. For external captures, users may need to allocate one managed array before starting capture so Warp can initialize the pool for that device. + +#### Allocation and free path + +The Phase 6 `wp_alloc_device_managed()` backend chooses the allocation path in this order: + +1. If the context's CUDA device does not support managed memory, return `NULL` and set an error string. +2. If Warp was compiled with CUDA 13.0+, the device supports CUDA memory pools, and the managed pool is available, allocate with `cudaMallocFromPoolAsync(&ptr, size, managed_pool, stream)` on the current Warp stream. +3. If no managed pool is available and no CUDA stream capture is active, allocate with `cudaMallocManaged(&ptr, size, cudaMemAttachGlobal)`. This remains the only path on CUDA 12.x builds and is also the fallback on CUDA 13+ builds where managed pool creation fails. +4. If no managed pool is available and capture is active, return `NULL` and set a clear error: managed allocation during CUDA graph capture requires an initialized managed memory pool. + +Managed-pool allocations should free with `cudaFreeAsync()` and reuse or generalize Warp's existing stream-ordered graph-allocation bookkeeping from `wp_alloc_device_async()` / `wp_free_device_async()`. Direct `cudaMallocManaged()` fallback allocations should continue to free with `cudaFree()`, using the same deferred-free approach as `CudaDefaultAllocator` when graph captures are active. The native layer must track whether each managed pointer came from the direct or pool path so it can choose the correct free API and reject unknown managed pointers clearly. + +Local verification on a Blackwell GPU with driver 580.126.20 confirmed: + +- `cudaMallocManaged` during global or thread-local stream capture returns `cudaErrorStreamCaptureUnsupported` and invalidates capture. +- `cudaMallocFromPoolAsync` from a pool with `cudaMemAllocationTypeManaged` captures, instantiates, launches, and synchronizes successfully. +- A pointer allocated by `cudaMallocManaged` before capture can be used by captured kernels successfully. + +The native wrapper must therefore continue to guard the direct fallback path before calling `cudaMallocManaged` during capture, even after the managed-pool path exists. + +#### Public API and docs + +Phase 6 should not add another public allocator class. `wp.ManagedAllocator()` remains the opt-in API for CUDA managed memory. The implementation may opportunistically use a managed pool when the build, driver, and device support it, but the public promise remains observed memory kind and access validation, not initial physical residency. + +Documentation should make the graph-capture distinction explicit: + +- Managed arrays allocated before capture can be used by captured kernels on all supported builds where managed allocation itself succeeds. +- Managed allocation during capture requires CUDA 13.0+ and an initialized managed memory pool. +- CUDA 12.x builds, including Warp's CUDA 12.9 PyPI build, reject capture-time managed allocation because their only supported backend is `cudaMallocManaged()`. +- CUDA 13+ builds may still reject capture-time managed allocation when the device does not support CUDA memory pools or Warp cannot create the managed pool. + +### Phase 7: Custom and External Allocation Metadata + +Phase 5 classifies CUDA pointers through Driver API attributes, including externally wrapped managed, ordinary CUDA device, and CUDA memory-pool pointers. Classification is not the same as proving cross-device access: + +- `wp.can_access(device, array)` can use managed-memory predicates for `wp.MemoryKind.CUDA_MANAGED` and peer-access predicates for `wp.MemoryKind.CUDA_DEVICE`. +- `wp.can_access(device, array)` remains conservative for externally wrapped or custom `wp.MemoryKind.CUDA_MEMPOOL` pointers because Warp cannot prove that the pointer belongs to the default pool whose access state it queried. +- `LaunchArrayAccessMode.CHECKED` warns once per launch pattern and proceeds only for unknown access cases such as unclassified CUDA pointers or unowned memory-pool pointers. + +Phase 7 may add an explicit protocol so custom allocators or external wrappers can declare access predicates or additional allocation metadata. It can also extend `wp.can_access(device, resource)` to resources such as hash grids and meshes while preserving the resource-oriented API shape: ```python wp.can_access(device, hash_grid) wp.can_access(device, mesh) ``` -It should not add `wp.can_access(device, device)`. Device-to-device/default-allocation checks should continue to live on `Device.can_access(other_device)`. - -For custom allocators and externally wrapped allocations, Phase 5 may add an explicit metadata protocol so owners can declare which access predicate applies. Until such metadata exists, cross-device `wp.can_access(device, array)` remains conservative and returns `False` for unknown CUDA allocations. +It should still not add `wp.can_access(device, device)`. Device-to-device/default-allocation checks should continue to live on `Device.can_access(other_device)`. ## Testing Strategy -### Phase 1 tests +### Phase 1 test coverage -Add a test module `warp/tests/cuda/test_unified_memory.py` (registered in `warp/tests/unittest_suites.py`) and extend `warp/tests/test_graph.py` for CUDA graph capture coverage. +Coverage lives in `warp/tests/cuda/test_unified_memory.py` (registered in `warp/tests/unittest_suites.py`) and CUDA graph capture tests. **Attribute query tests (run on all hardware):** - Verify `is_cpu_memory_access_from_gpu_supported`, `is_gpu_memory_access_from_cpu_supported`, and `is_cpu_gpu_atomic_supported` are `bool` for CUDA devices and `False` for CPU devices. -- Do not assert that `is_cpu_gpu_atomic_supported` implies `is_gpu_memory_access_from_cpu_supported`; Jetson Thor reports native CPU-GPU atomics while still rejecting direct CPU access to `cudaMalloc` memory. +- Do not assert that `is_cpu_gpu_atomic_supported` implies `is_gpu_memory_access_from_cpu_supported`; Jetson Thor reports native CPU-GPU atomic hardware capability while still rejecting direct CPU access to `cudaMalloc` memory. +- Do not add Warp-level tests that assert overlapping CPU/GPU `wp.atomic_*` updates are correct. That behavior is unsupported until follow-up work adds real CPU atomic lowering and the required GPU system-scope semantics. **`Device.can_access()` tests (run on all hardware):** - `device.can_access(device)` is always `True` for every device. @@ -1017,8 +1216,8 @@ Add a test module `warp/tests/cuda/test_unified_memory.py` (registered in `warp/ - Same-device arrays return `True`. - CPU arrays checked from a CUDA device match `device.is_cpu_memory_access_from_gpu_supported` for pageable CPU arrays and return `True` for pinned CPU arrays when `device.is_uva` is true. - CPU checking a Warp CUDA array returns `False`. -- CUDA arrays checked from another CUDA device use peer access for default CUDA allocations and memory-pool access for CUDA memory-pool allocations. -- Cross-device CUDA arrays backed by custom allocators or externally wrapped allocations return `False` unless they are on the same device/context. +- CUDA arrays checked from another CUDA device use managed-memory predicates for managed memory, peer access for ordinary CUDA device memory, and memory-pool access for Warp-owned CUDA memory-pool allocations. +- Externally wrapped ordinary CUDA device pointers use peer access, externally wrapped managed pointers use managed-memory predicates, and externally wrapped or custom memory-pool pointers return `False` through `wp.can_access()` when Warp cannot prove the pointer's specific pool access state. - Passing a device as the second argument (`wp.can_access(device, other_device)`) raises `TypeError`. **Cross-device launch tests (hardware-dependent, skip on incapable systems):** @@ -1032,7 +1231,7 @@ Add a test module `warp/tests/cuda/test_unified_memory.py` (registered in `warp/ - With `LaunchArrayAccessMode.STRICT`: verify that any cross-device Warp array argument raises `RuntimeError`, including cases that `CHECKED` would allow, such as pinned CPU arrays on UVA CUDA devices or ordinary CPU arrays on HMM / host-page-table ATS systems. - With `LaunchArrayAccessMode.CHECKED` on a discrete GPU without HMM: verify that launching with a CPU array raises `RuntimeError` (not a CUDA fault). - With `LaunchArrayAccessMode.CHECKED` on an HMM / host-page-table ATS system: verify that GPU launches with CPU arrays still succeed (no false positive). -- With `LaunchArrayAccessMode.CHECKED`: verify that cross-device arrays backed by custom or externally wrapped CUDA allocations warn through a bounded cache keyed by `(kernel, argument name, source device, launch device)` and proceed. +- With `LaunchArrayAccessMode.CHECKED`: verify that unknown cross-device access cases, including unclassified CUDA pointers and unowned memory-pool pointers, warn through a bounded cache keyed by `(kernel, argument name, source device, launch device)` and proceed. - With `LaunchArrayAccessMode.CHECKED` during CUDA graph capture: capture and replay a same-device CUDA launch successfully. - On multi-GPU systems with a peer-access-supported pair: allocate with CUDA memory pools disabled, enable peer access before capture, pass an array from the source GPU to a kernel launched on the peer GPU with `LaunchArrayAccessMode.CHECKED`, capture and replay the graph, and verify the results. Skip cleanly when no peer-access pair exists. - On multi-GPU systems with a memory-pool-access-supported pair: allocate with CUDA memory pools enabled, enable memory-pool access before capture, pass an array from the source GPU to a kernel launched on the peer GPU with `LaunchArrayAccessMode.CHECKED`, capture and replay the graph, and verify the results. @@ -1053,10 +1252,71 @@ Add a test module `warp/tests/cuda/test_unified_memory.py` (registered in `warp/ - Enable `warp.config.auto_prefetch`, launch cross-device kernel, verify correctness. - Verify auto-prefetch is not issued on integrated GPUs (may require mocking or checking driver call counts). +### Phase 5 test coverage (managed allocator) + +**Capability and allocator tests (run on all CUDA hardware):** +- Verify `is_managed_memory_supported` and `is_concurrent_managed_access_supported` are `bool` for CUDA devices and `False` for CPU devices. +- Verify `wp.ManagedAllocator()` satisfies the `Allocator` protocol and can be constructed without an active CUDA context. +- Verify direct `ManagedAllocator.allocate()` calls without an active CUDA context raise clearly. +- On CUDA devices with managed-memory support, allocate with `wp.ManagedAllocator()` through `wp.ScopedAllocator()` and verify the resulting array has `device == cuda_device`, `pinned == False`, and `array.memory_kind == wp.MemoryKind.CUDA_MANAGED`. +- On CUDA devices without managed-memory support, allocation through `wp.ManagedAllocator()` raises a clear `RuntimeError`. +- Verify one shared `wp.ManagedAllocator()` instance works through `wp.set_cuda_allocator()` across multiple managed-memory-capable CUDA devices and frees each pointer through the recorded owner context. + +**Managed access predicate tests:** +- Same-device managed arrays return `True` from `wp.can_access(device, array)`. +- CUDA devices with `is_managed_memory_supported` return `True` for managed arrays, including managed arrays associated with another CUDA device. These tests should not require `wp.set_peer_access_enabled()` or `wp.set_mempool_access_enabled()`. +- CPU access to a managed CUDA array follows `owner.is_concurrent_managed_access_supported or owner.is_gpu_memory_access_from_cpu_supported`. +- `wp.can_access("cpu", managed_array)` returns `False` on limited managed-memory systems. +- Array views follow the owner array's memory kind through `_ref`; indexed arrays must check both data arrays and index arrays for access. +- Externally wrapped managed CUDA pointers can be classified through CUDA pointer attributes. +- Externally wrapped or custom CUDA memory-pool pointers remain conservative for cross-device access when Warp lacks ownership metadata, even if their observed memory kind is `wp.MemoryKind.CUDA_MEMPOOL`. + +**Launch and graph tests:** +- With `LaunchArrayAccessMode.CHECKED`, CUDA launches receiving managed arrays are accepted when the target CUDA device supports managed memory. +- With `LaunchArrayAccessMode.CHECKED`, CPU launches receiving managed CUDA arrays are accepted only when the owner device supports concurrent managed access or direct managed host access. +- Managed arrays allocated before CUDA graph capture can be used by captured kernels. +- Managed allocation inside CUDA graph capture raises clearly in the Phase 5 implementation. +- `cudaMallocManaged` allocation is tested outside capture on managed-memory-capable CUDA devices. + +**Interop tests:** +- Managed arrays expose `__cuda_array_interface__` and export DLPack as CUDA. +- `array.numpy()` returns the expected values through the existing copy-to-CPU path. +- `array.cptr()` raises because managed arrays are still CUDA arrays in Warp. +- CUDA IPC rejects managed arrays with a clear error. + +### Phase 5 documentation + +- `wp.ManagedAllocator` is included in the CUDA memory-management API reference. +- `wp.MemoryKind` and `array.memory_kind` documentation distinguish observed memory class from physical residency and accessibility. +- `docs/deep_dive/allocators.rst` includes scoped and global managed-allocation examples. +- `docs/deep_dive/memory_access.rst` distinguishes standard Warp CUDA arrays from managed arrays allocated through `wp.ManagedAllocator()`. +- Managed arrays are documented as not promising initial physical residency. `wp.prefetch()` is the explicit placement hint once Phase 2 exists. +- Graph-capture behavior is documented: managed arrays may be used by captured kernels, but managed allocation must happen before capture. + +### Phase 6 test coverage (managed memory pools) + +**CUDA 13 managed-pool tests:** +- On CUDA 13.0+ builds and devices that support CUDA memory pools, allocate a managed array through `wp.ManagedAllocator()` before capture to initialize the managed pool, then allocate another managed array during CUDA graph capture and verify replayed kernels can read and write it. +- Verify capture-time managed allocation uses the stream-ordered path by capturing allocation, kernel launch, and free/release-sensitive behavior without requiring host synchronization inside capture. +- Verify the same `array.memory_kind == wp.MemoryKind.CUDA_MANAGED` value for direct and managed-pool allocations; no new public memory kind should appear. +- Verify managed-pool allocations free through the stream-ordered path and direct fallback allocations still free through the direct managed path. + +**Fallback and rejection tests:** +- On CUDA 12.x builds, including CUDA 12.9, managed allocation during capture still raises clearly because the managed-pool allocation type is unavailable. +- On CUDA 13.0+ builds where managed pool creation fails or the device lacks memory-pool support, allocation outside capture may fall back to `cudaMallocManaged()`, but allocation during capture raises clearly. +- For externally started CUDA captures, verify capture-time managed allocation raises when the managed pool was not initialized before capture; pre-initializing the pool before capture should allow the allocation path when the device supports it. + +### Phase 6 documentation + +- `docs/deep_dive/allocators.rst` should explain that capture-time managed allocation is a CUDA 13+ managed-pool feature, while CUDA 12.x builds require pre-allocation before capture. +- The docs should state that `wp.ManagedAllocator()` remains the public allocator API and `wp.MemoryKind.CUDA_MANAGED` remains the observed memory kind for both direct and pool-backed managed allocations. +- Error messages should distinguish "managed allocation during capture is unsupported on this build/device" from "initialize the managed pool before capture." + ### CI considerations - The existing CI may not have HMM, ATS, Jetson Thor, or DGX Spark / GB10 hardware. Tests that require specific paradigms should use `unittest.skipUnless` based on the device attributes queried in Phase 1. -- Tests that only query attributes (Phase 1 attribute and `Device.can_access()` / `wp.can_access()` invariant tests) should run on all hardware. +- Tests that only query attributes (Phase 1 / Phase 5 attributes and `Device.can_access()` / `wp.can_access()` invariant tests) should run on all hardware. +- Phase 6 capture-allocation tests should skip unless the build uses CUDA 13.0+ and the target device reports memory-pool support; CUDA 12.x CI should continue to run the capture-time rejection tests. - Consider adding a CI label or tag for "unified memory" tests so they can be selectively run on appropriate hardware. ### Device compatibility matrix for test expectations @@ -1065,8 +1325,13 @@ Add a test module `warp/tests/cuda/test_unified_memory.py` (registered in `warp/ |---|---|---|---|---|---|---| | GPU can access CPU arrays | No | Yes | Yes | No | Yes | Yes | | CPU can access Warp default GPU arrays | No | No | No | No | No | No | -| CPU can access GPU-resident CUDA managed memory | No | No | Yes | No | No | No | -| Native CPU-GPU atomics on host-visible memory | No | No | Yes | Device-dependent | Yes | Yes | +| CPU direct access to GPU-resident CUDA managed memory | No | No | Yes | No | No | No | +| `wp.can_access(cpu, ManagedAllocator array)` | Yes if concurrent managed access or direct managed host access | Yes | Yes | No on limited systems | Yes | Yes | +| CUDA can access `ManagedAllocator` array | Yes if managed memory supported | Yes | Yes | Yes if managed memory supported | Yes | Yes | +| Managed allocation during graph capture (Phase 5) | Not supported | Not supported | Not supported | Not supported | Not supported | Not supported | +| Managed-pool allocation during graph capture (Phase 6) | CUDA 13+ managed-pool support required | CUDA 13+ managed-pool support required | CUDA 13+ managed-pool support required | CUDA 13+ managed-pool support required | CUDA 13+ managed-pool support required | CUDA 13+ managed-pool support required | +| Native CPU-GPU atomic hardware capability | No | No | Yes | Device-dependent | Reports yes | Reports yes | +| Current Warp CPU/GPU `wp.atomic_*` overlap | Unsupported | Unsupported | Unsupported | Unsupported | Unsupported | Unsupported | | Cross-device launch GPU->CPU array (`RELAXED`) | CUDA fault | OK | OK | CUDA fault | OK | OK | | Cross-device launch CPU->GPU array (`RELAXED`) | Segfault | Segfault | Segfault for Warp default arrays | Segfault | Segfault | Segfault for Warp default arrays | | Cross-device launch GPU->CPU array (`STRICT`) | RuntimeError | RuntimeError | RuntimeError | RuntimeError | RuntimeError | RuntimeError | diff --git a/design/pluggable-allocators.md b/design/pluggable-allocators.md index 587334dd37..2cfac52cbb 100644 --- a/design/pluggable-allocators.md +++ b/design/pluggable-allocators.md @@ -216,26 +216,31 @@ internals into the allocator surface. Current limitation: `wp.can_access(device, array)` and `warp.config.launch_array_access_mode = wp.config.LaunchArrayAccessMode.CHECKED` -remain conservative for arrays allocated through custom allocators. +remain conservative for arrays allocated through custom allocators when Warp +cannot classify the pointer or prove the relevant access state. Same-device launches are accepted, but cross-device launches require Warp to know whether the allocation uses default CUDA memory, CUDA memory pools, -pinned host memory, managed memory, or another memory type. The current custom -allocator protocol only returns a pointer, so cross-device arrays backed by -custom or externally wrapped allocators warn once per launch pattern in checked +pinned host memory, managed memory, or another memory type. CUDA pointer +attributes can classify externally wrapped managed and ordinary CUDA device +pointers so Warp can use managed-memory or peer-access predicates. The current +custom allocator protocol still only returns a pointer, so unclassified +pointers and externally wrapped or custom memory-pool pointers whose specific +pool access state cannot be proven warn once per launch pattern in checked mode and then proceed. Using `wp.config.LaunchArrayAccessMode.RELAXED` leaves access legality to the hardware without the diagnostic, matching the default launch path. -Future solutions must provide enough allocation provenance for +Future solutions must provide enough memory-kind and access metadata for `wp.can_access(device, array)` and `wp.config.LaunchArrayAccessMode.CHECKED` to make the same conservative decisions they make for Warp-owned allocations. At a minimum, Warp needs to distinguish the owning device and memory class for allocations that participate in cross-device launch verification, including -default CUDA device memory, CUDA memory pools, managed memory, pinned host -memory, and allocator-defined external memory. +CUDA device memory that is neither managed nor memory-pool memory, CUDA +memory pools, managed memory, pinned host memory, and allocator-defined +external memory. Any future mechanism must remain backward compatible with simple custom -allocators, preserve an "unknown" result when allocation provenance is +allocators, preserve an "unknown" result when memory metadata is unavailable or unrecognized, and avoid exposing framework-specific internals as part of the basic allocator surface. It also needs to keep launch verification compatible with CUDA graph capture and use the same access predicates as diff --git a/docs/api_reference/warp.rst b/docs/api_reference/warp.rst index 599aa54e0b..507652ca70 100644 --- a/docs/api_reference/warp.rst +++ b/docs/api_reference/warp.rst @@ -385,6 +385,8 @@ CUDA Memory Management :toctree: _generated Allocator + ManagedAllocator + MemoryKind ScopedAllocator ScopedMempool ScopedMempoolAccess diff --git a/docs/deep_dive/allocators.rst b/docs/deep_dive/allocators.rst index 37aae3735d..11d400a678 100644 --- a/docs/deep_dive/allocators.rst +++ b/docs/deep_dive/allocators.rst @@ -311,6 +311,132 @@ For temporary allocator changes, use the :class:`ScopedAllocator` context manage a = wp.zeros(1000, dtype=wp.float32, device="cuda:0") # Original allocator is restored here +.. _managed_memory_allocation_options: + +Managed Memory Allocator +~~~~~~~~~~~~~~~~~~~~~~~~ + +Managed memory is CUDA-managed storage that can be addressed from CPU and GPU +code. CUDA Unified Memory manages page placement and migration, so pages may move +between CPU and GPU memory as different processors touch them. Unlike pinned CPU +memory, which remains host memory that a GPU may access through a host mapping, +managed memory gives Warp arrays a different tradeoff from the other allocation +options: + +.. list-table:: + :header-rows: 1 + :widths: 18 29 27 26 + + * - Allocation option + - Residency and migration + - CPU/GPU access + - Typical use + * - Default CUDA + - Device memory with no automatic CPU/GPU migration. + - CUDA kernels access it directly; CPU code uses explicit copies. + - General GPU arrays when CPU access is staged explicitly. + * - CUDA mempool + - Device memory from CUDA's stream-ordered pool, with no automatic CPU/GPU + migration. + - Same CPU/GPU access rules as default CUDA memory, with separate + memory-pool access controls for peer GPUs. + - Faster repeated CUDA allocations and graph-captured allocation when + supported. + * - Pinned CPU + - Host memory that does not migrate into device memory as an allocation. + - CPU code accesses it directly; CUDA devices with unified virtual + addressing can access it through a host mapping. + - Asynchronous CPU/GPU copies or zero-copy access to small host-resident + data. + * - CUDA managed + - CUDA Unified Memory whose pages may migrate between CPU and GPU memory. + - CPU and GPU access follow CUDA managed-memory support and synchronization + rules. + - Sharing data across CPU/GPU code when migration is preferable to manual + copies. + +:class:`ManagedAllocator` creates CUDA managed-memory arrays through Warp's +allocator interface. Managed arrays keep their CUDA device metadata, but +``wp.can_access()`` and checked launch validation use CUDA managed-memory access +rules for them instead of peer-access or memory-pool-access rules. + +One major reason to choose this allocator is CPU/GPU shared work: on systems +where CUDA reports compatible managed-memory access, CPU kernels can directly +read and write managed CUDA arrays instead of maintaining a separate CPU copy. +Standard Warp CUDA arrays remain non-managed and still require explicit copies +before CPU code accesses them. + +The allocator object is not bound to one CUDA device and can be constructed +before choosing a CUDA device. Warp invokes it under the target device's CUDA +context, which must support CUDA managed memory, and records that context as +the owner for each pointer: + +.. code:: python + + managed = wp.ManagedAllocator() + device = wp.get_device("cuda:0") + + with wp.ScopedAllocator(device, managed): + a = wp.zeros(1000, dtype=wp.float32, device=device) + +Constructing a :class:`ManagedAllocator` does not promise that pages initially +reside in any device's physical memory, and it does not bypass the device's +managed-memory capability check. The CUDA device used for each allocation +identifies the owner context and array device metadata; CUDA Unified Memory +manages physical placement and migration. + +Use :attr:`array.memory_kind ` to inspect the observed +memory class backing a concrete :class:`warp.array`: + +.. code:: python + + if a.memory_kind is wp.MemoryKind.CUDA_MANAGED: + ... + +The memory kind describes the pointer's memory class as observed by Warp, and +for CUDA arrays by CUDA pointer attributes. It does not describe the current +physical residency of CUDA managed memory, and views report the memory kind of +their owner array. Indexed arrays do not expose a single memory kind because +their data and index arrays may have different backing allocations. + +To use managed memory as a persistent allocator for all CUDA devices, install one +allocator instance with :func:`set_cuda_allocator`: + +.. code:: python + + managed = wp.ManagedAllocator() + wp.set_cuda_allocator(managed) + +If only some CUDA devices should use managed memory, install the same allocator +with :func:`set_device_allocator` on those devices. A single allocator instance +can serve multiple CUDA devices, but allocation fails clearly on any target +device that does not report CUDA managed-memory support. + +Direct calls to ``ManagedAllocator.allocate()`` require an active CUDA context. +Array factory functions such as :func:`zeros` and :func:`empty` pass the target +device context automatically and perform the same managed-memory support check. + +Managed allocations currently have a CUDA graph-capture limitation in Warp: +:class:`ManagedAllocator` does not allocate a new array while CUDA graph capture +is active. If you need managed arrays with CUDA graphs, allocate them before +capture begins and reuse the existing arrays inside the captured work. This is +an implementation limitation, not a restriction on using pre-existing managed +arrays in captured work. Separately, :class:`ManagedAllocator`-managed arrays +cannot be exported with ``array.ipc_handle()``; IPC export is unsupported for +managed arrays. If IPC is required, choose a different allocator for shared data +or pre-allocate and export device arrays before switching allocator state. + +CPU access to managed arrays is hardware-dependent. Use :func:`can_access` to +check a specific managed array before CPU code reads or writes it directly: + +.. code:: python + + if wp.can_access("cpu", a): + wp.launch(cpu_kernel, dim=a.size, inputs=[a], device="cpu") + else: + a_cpu = a.to("cpu") + wp.launch(cpu_kernel, dim=a_cpu.size, inputs=[a_cpu], device="cpu") + Writing a Custom Allocator ~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/docs/deep_dive/memory_access.rst b/docs/deep_dive/memory_access.rst index c390cdef95..7894f86c5b 100644 --- a/docs/deep_dive/memory_access.rst +++ b/docs/deep_dive/memory_access.rst @@ -37,19 +37,20 @@ the device that performs the access. Device capability properties ---------------------------- -Each device exposes three CPU/GPU memory access properties: +Each device exposes CPU/GPU memory access and managed-memory properties: - :attr:`Device.is_cpu_memory_access_from_gpu_supported ` - :attr:`Device.is_gpu_memory_access_from_cpu_supported ` -- :attr:`Device.is_cpu_gpu_atomic_supported ` +- :attr:`Device.is_managed_memory_supported ` +- :attr:`Device.is_concurrent_managed_access_supported ` This deep dive focuses on how those capabilities affect cross-device launches, -managed memory, atomics, and diagnostics. +managed memory, and diagnostics. On CPU devices, these properties are always ``False``. On GPU devices, each -property describes a specific access path or operation; support for one does not -imply support for another. For example, a system can allow GPU access to CPU -memory without allowing CPU access to GPU-resident managed memory. +property describes a specific access path or allocation feature; support for one +does not imply support for another. For example, a system can allow GPU access +to CPU memory without allowing CPU access to GPU-resident managed memory. Common CPU/GPU memory models @@ -61,28 +62,23 @@ advanced users commonly need to reason about: .. list-table:: :header-rows: 1 - :widths: 24 25 25 26 + :widths: 28 36 36 * - System model - GPU access to CPU arrays - CPU access to GPU-resident managed memory - - CPU/GPU atomics * - Discrete GPU without HMM - Usually no - Usually no - - Usually no * - Discrete GPU with Linux HMM - Yes - Usually no - - Usually no * - Jetson Thor-style ATS - Yes - Platform-dependent for managed memory - - Yes, when reported by the driver * - Host-page-table ATS with distinct CPU/GPU physical memory - Yes - Only when reported by the driver - - Yes, when reported by the driver HMM stands for Heterogeneous Memory Management; for background, see NVIDIA's `HMM overview `__. @@ -136,14 +132,6 @@ CPU access to GPU-resident managed memory is a separate capability: if device.is_gpu_memory_access_from_cpu_supported: ... -.. important:: - - ``device.is_gpu_memory_access_from_cpu_supported`` reports a hardware - capability for CUDA managed memory. Warp exposes the property today, but - standard Warp CUDA arrays are not managed-memory allocations. Until Warp - provides managed-memory allocation APIs, copy CUDA arrays to ``"cpu"`` before - CPU code reads or writes them. - CUDA arrays created by standard Warp array constructors, such as :func:`zeros`, :func:`empty`, and :func:`ones`, are not CUDA managed-memory allocations. This is true whether the array comes from Warp's :ref:`mempool @@ -156,6 +144,54 @@ those arrays, use an explicit copy before CPU code reads or writes the data: a_cpu = a.to("cpu") wp.launch(cpu_kernel, dim=a_cpu.size, inputs=[a_cpu], device="cpu") +For explicit CUDA managed-memory arrays, construct a :class:`ManagedAllocator` +and install it with the existing allocator APIs. The allocator instance is not +bound to one CUDA device, but each allocation still happens under the target +device's CUDA context and that device must report CUDA managed-memory support: + +.. code:: python + + managed = wp.ManagedAllocator() + device = wp.get_device("cuda:0") + + with wp.ScopedAllocator(device, managed): + a = wp.zeros(1024, dtype=float, device=device) + +Managed arrays remain CUDA arrays in Warp: ``a.device`` is still ``"cuda:0"``, +and CUDA Unified Memory manages physical page migration. For concrete arrays, +:attr:`array.memory_kind ` reports +``wp.MemoryKind.CUDA_MANAGED``. This reports the observed pointer memory class, +not current physical residency. See +:ref:`the allocator comparison ` for how +managed memory differs from Warp's default CUDA, CUDA mempool, and pinned CPU +allocation options. + +This is the opt-in Warp allocation path for CPU kernels that need to operate +directly on CUDA-side data without maintaining a separate CPU copy. Ordinary +Warp CUDA arrays remain non-managed allocations and still need explicit copies +before CPU kernels read or write them. + +Managed arrays can be used by kernels captured in CUDA graphs when the arrays +are allocated before capture begins. In the current Warp implementation, +allocating a new managed array while CUDA graph capture is active is not +supported, so create managed arrays before capture and reuse them inside the +captured work. + +Use :func:`can_access` before CPU code directly reads or writes a managed array: + +.. code:: python + + if wp.can_access("cpu", a): + wp.launch(cpu_kernel, dim=a.size, inputs=[a], device="cpu") + else: + a_cpu = a.to("cpu") + wp.launch(cpu_kernel, dim=a_cpu.size, inputs=[a_cpu], device="cpu") + +``wp.can_access("cpu", a)`` returns ``True`` for a managed CUDA array only when +the owning CUDA device reports concurrent managed access or direct CPU access to +GPU memory. On limited managed-memory systems, Warp returns ``False`` because it +cannot prove that a direct CPU access is synchronized with GPU use. + Do not assume that GPU access to CPU memory implies CPU access to GPU-resident memory. Some systems support the former but not the latter. @@ -176,17 +212,25 @@ can directly access a specific Warp array: For CPU arrays passed to CUDA kernels, pinned CPU arrays are accepted on CUDA devices with unified virtual addressing, and unpinned CPU arrays require -``is_cpu_memory_access_from_gpu_supported``. For CUDA arrays, default CUDA -allocations use CUDA peer-access state, while memory pool allocations use +``is_cpu_memory_access_from_gpu_supported``. For CUDA arrays, managed-memory +allocations use CUDA managed-memory support on the launch device, default CUDA +allocations use CUDA peer-access state, and memory pool allocations use memory-pool access state. See :ref:`mempool_access` for the distinction between peer access for default CUDA allocations and memory-pool access for mempool allocations. +If you need to inspect what kind of memory backs a concrete :class:`warp.array`, +use :attr:`array.memory_kind `. Memory kind is +diagnostic information and an input to Warp's access checks; it does not replace +:func:`warp.can_access` for deciding whether a launch device can directly use +an array. Indexed arrays are composite resources; inspect their ``data`` and +``indices`` arrays directly when you need constituent diagnostics. + ``wp.can_access(device, array)`` returns ``False`` when Warp cannot verify that -the array is directly accessible. This includes cross-device arrays backed by -custom allocators or externally wrapped allocations whose allocation kind is not -known to Warp. A ``False`` result means "not verified accessible"; it does not -prove that the hardware could never access the pointer. +the array's memory access requirements are satisfied. This includes +unclassified pointers and classified pointers whose allocation-specific access +cannot be proven. A ``False`` result means "not verified accessible"; it does +not prove that the hardware could never access the pointer. ``wp.can_access()`` is a resource-oriented API. In this release, the second argument must be a concrete Warp array instance. Annotation-only arrays such as @@ -211,10 +255,10 @@ where no concrete array is available: For GPU kernels accessing CPU arrays, this method uses ``is_cpu_memory_access_from_gpu_supported`` because standard Warp CPU arrays use unpinned CPU memory. For CPU code accessing CUDA arrays, it returns ``False`` for -Warp CUDA arrays because the built-in CUDA allocators do not create CUDA -managed-memory allocations. For GPU/GPU pairs, it reflects the target device's -current built-in allocator mode: memory-pool access when memory pools are -enabled on the target device, and peer access otherwise. +standard Warp CUDA arrays because the built-in CUDA allocators do not create +CUDA managed-memory allocations. For GPU/GPU pairs, it reflects the target +device's current built-in allocator mode: memory-pool access when memory pools +are enabled on the target device, and peer access otherwise. ``Device.can_access()`` is not authoritative for existing arrays. An array may have been allocated before memory-pool settings changed, may use a custom @@ -248,26 +292,28 @@ If you want a clear Python error before the kernel runs, set that do not support direct CPU/GPU memory access or on multi-GPU systems where peer and memory-pool access are configured separately. -Arrays backed by custom or externally wrapped allocators are a limitation of this -diagnostic. Warp does not know the allocation kind for those arrays, so +Custom allocators and external wrappers are a limitation of this diagnostic +only when Warp cannot classify the pointer or cannot prove the specific access +requirements, such as for unowned CUDA memory-pool pointers. In those cases, ``wp.config.LaunchArrayAccessMode.CHECKED`` emits a ``UserWarning`` once per -``(kernel, argument name, source device, launch device)`` pattern and allows the -launch to proceed. Use ``wp.config.LaunchArrayAccessMode.STRICT`` if unknown allocation -provenance should be rejected, or ``wp.config.LaunchArrayAccessMode.RELAXED`` to suppress -the diagnostic. +``(kernel, argument name, source device, launch device)`` pattern and allows +the launch to proceed. Use ``wp.config.LaunchArrayAccessMode.STRICT`` if +unverified cross-device access should be rejected, or +``wp.config.LaunchArrayAccessMode.RELAXED`` to suppress the diagnostic. Objects exposing ``__array_interface__`` are accepted only for CPU launches. Warp treats that protocol as a CPU-addressable pointer and does not infer CUDA -allocation provenance from it, so ``wp.config.LaunchArrayAccessMode.CHECKED`` has no +memory kind from it, so ``wp.config.LaunchArrayAccessMode.CHECKED`` has no cross-device access decision to make for that protocol. Directly passing an object that exposes ``__cuda_array_interface__`` is different from passing a Warp array. The protocol lets Warp construct the kernel argument at launch time, but it does not identify the allocation device or -allocation kind. In this phase, ``wp.config.LaunchArrayAccessMode.CHECKED`` does not fully -verify directly passed objects exposing this protocol. Advanced users who know -such an allocation is valid are responsible for ensuring that the launch device -can legally access the pointer. +allocation-specific access state. In this phase, +``wp.config.LaunchArrayAccessMode.CHECKED`` does not fully verify directly +passed objects exposing this protocol. Advanced users who know such an +allocation is valid are responsible for ensuring that the launch device can +legally access the pointer. .. code:: python @@ -290,75 +336,6 @@ alive, CPU updates made between replays are visible to kernels on devices that can access CPU memory. -Checking CPU/GPU atomic support -------------------------------- - -Direct loads and stores do not imply atomic safety. Code that uses atomics from -both CPU and GPU code paths on the same allocation should also check -:attr:`Device.is_cpu_gpu_atomic_supported `: - -.. code:: python - - device = wp.get_device("cuda:0") - - if not device.is_cpu_gpu_atomic_supported: - raise RuntimeError("This algorithm requires CPU/GPU atomic support") - -:attr:`Device.is_cpu_gpu_atomic_supported ` -answers only whether CPU/GPU atomic operations are supported for an otherwise -accessible allocation. The allocation must still be accessible from both the CPU -and GPU, and the program must provide any required synchronization. - -For example, GPU atomics into a CPU allocation require both GPU access to CPU -memory and CPU/GPU atomic support: - -.. code:: python - - device = wp.get_device("cuda:0") - counters = wp.zeros(1, dtype=wp.int32, device="cpu") - - if ( - device.is_cpu_memory_access_from_gpu_supported - and device.is_cpu_gpu_atomic_supported - ): - wp.launch(update_counters, dim=n, inputs=[counters], device=device) - wp.synchronize_device(device) - print(counters.numpy()[0]) - -The same requirements apply when CPU and GPU work overlap. If a CPU kernel and a -GPU kernel both write the same allocation concurrently, all conflicting accesses -must use atomic operations, and the device must report CPU/GPU atomic support. -Atomicity prevents lost updates, but it does not provide a deterministic -ordering for non-commutative operations or floating-point accumulation: - -.. code:: python - - # Assume both kernels call wp.atomic_add(counters, 0, 1) once per thread. - counters = wp.zeros(1, dtype=wp.int32, device="cpu") - - if ( - device.is_cpu_memory_access_from_gpu_supported - and device.is_cpu_gpu_atomic_supported - ): - wp.launch(gpu_increment, dim=num_gpu_threads, inputs=[counters], device=device) - wp.launch(cpu_increment, dim=num_cpu_threads, inputs=[counters], device="cpu") - - wp.synchronize_device(device) - assert counters.numpy()[0] == num_gpu_threads + num_cpu_threads - -If :attr:`Device.is_cpu_gpu_atomic_supported ` -is ``False``, do not rely on concurrent CPU/GPU atomics, even on systems where -the GPU can directly load and store CPU memory. - -That does not make non-managed CUDA allocations CPU-accessible. CPU code should -still copy arrays backed by those allocations before reading or writing them: - -.. code:: python - - values = wp.zeros(1024, dtype=float, device=device) - values_cpu = values.to("cpu") - - Choosing a memory access pattern -------------------------------- @@ -374,11 +351,14 @@ allocation or access pattern to create: ``device.is_uva``. - CPU code reads or writes arrays backed by non-managed CUDA allocations: copy the data to ``"cpu"`` first. +- CPU kernels read or write Warp CUDA arrays directly: allocate those arrays + with :class:`ManagedAllocator` and use ``wp.can_access("cpu", array)`` before + launching the CPU kernel. - CPU code accesses externally provided GPU-resident CUDA managed memory: check ``device.is_gpu_memory_access_from_cpu_supported``. -- CPU and GPU both use atomics on the same allocation: make sure the allocation - is accessible from both the CPU and GPU, and check - ``device.is_cpu_gpu_atomic_supported``. +- CPU and GPU both need to update the same allocation: sequence ownership with + synchronization or use separate buffers. Do not assume Warp atomics make + overlapping CPU/GPU updates safe. - GPU kernels use arrays from another GPU: enable peer access for default CUDA allocations, or :ref:`memory-pool access ` for CUDA memory-pool allocations, then check the concrete array with diff --git a/docs/user_guide/limitations.rst b/docs/user_guide/limitations.rst index 383a312007..18a9482ed7 100644 --- a/docs/user_guide/limitations.rst +++ b/docs/user_guide/limitations.rst @@ -25,6 +25,8 @@ Kernels and User Functions * :func:`wp.atomic_add() ` does not support :class:`wp.float16 ` or :class:`wp.bfloat16 ` on GPUs with compute capability below 7.0. On such devices, the function will return ``0.0`` without modifying the target memory. +* Using ``wp.atomic_add()`` or related functions on the same memory address from + overlapping CPU and GPU kernels is currently unsupported. * :func:`wp.tid() ` cannot be called from user functions. * Modifying the value of a :class:`wp.constant() ` during runtime will not trigger recompilation of the affected kernels if the modules have already been loaded @@ -66,11 +68,12 @@ Arrays * There are currently no data types that support complex numbers. * ``wp.config.launch_array_access_mode = wp.config.LaunchArrayAccessMode.CHECKED`` only fully verifies cross-device :class:`wp.array ` arguments when - Warp can determine the allocation kind. Arrays backed by custom or externally - wrapped allocators warn and proceed in checked mode; use + Warp can classify the pointer and prove the relevant access requirements. + Custom arrays or external wrappers whose pointer kind or specific access state + cannot be verified warn and proceed in checked mode; use ``wp.config.LaunchArrayAccessMode.STRICT`` to reject cross-device launches before - checking allocator provenance. Directly passed ``__array_interface__`` or - ``__cuda_array_interface__`` objects are not fully allocation-verified. See + checking access. Directly passed ``__array_interface__`` or + ``__cuda_array_interface__`` objects are not fully access-verified. See :ref:`launch_array_access_checks` for details. Structs diff --git a/warp/__init__.py b/warp/__init__.py index 4961fb8230..475148e07d 100644 --- a/warp/__init__.py +++ b/warp/__init__.py @@ -335,7 +335,9 @@ from warp._src.context import is_peer_access_enabled as is_peer_access_enabled from warp._src.context import set_peer_access_enabled as set_peer_access_enabled +from warp._src.context import MemoryKind as MemoryKind from warp._src.context import Allocator as Allocator +from warp._src.context import ManagedAllocator as ManagedAllocator from warp._src.context import get_device_allocator as get_device_allocator from warp._src.context import set_cuda_allocator as set_cuda_allocator from warp._src.context import set_device_allocator as set_device_allocator diff --git a/warp/__init__.pyi b/warp/__init__.pyi index 8a85b58c73..b67e2533c6 100644 --- a/warp/__init__.pyi +++ b/warp/__init__.pyi @@ -168,7 +168,9 @@ from warp._src.context import set_mempool_access_enabled as set_mempool_access_e from warp._src.context import is_peer_access_supported as is_peer_access_supported from warp._src.context import is_peer_access_enabled as is_peer_access_enabled from warp._src.context import set_peer_access_enabled as set_peer_access_enabled +from warp._src.context import MemoryKind as MemoryKind from warp._src.context import Allocator as Allocator +from warp._src.context import ManagedAllocator as ManagedAllocator from warp._src.context import get_device_allocator as get_device_allocator from warp._src.context import set_cuda_allocator as set_cuda_allocator from warp._src.context import set_device_allocator as set_device_allocator diff --git a/warp/_src/context.py b/warp/_src/context.py index 4ecb5ad718..020b978088 100644 --- a/warp/_src/context.py +++ b/warp/_src/context.py @@ -3557,6 +3557,44 @@ class _ArrayAccessStatus(enum.Enum): UNKNOWN = enum.auto() +class MemoryKind(enum.Enum): + """Observed memory kind backing a Warp array. + + This describes the memory class reported by Warp and CUDA pointer + attributes. It does not describe current physical residency, migration + state, synchronization state, or whether a specific device can access the + array. Use :func:`can_access` for access checks. + """ + + HOST = "host" + """Default host memory.""" + + PINNED_HOST = "pinned_host" + """Pinned or CUDA-registered host memory.""" + + CUDA_DEVICE = "cuda_device" + """CUDA device memory that is not classified as managed or mempool memory.""" + + CUDA_MEMPOOL = "cuda_mempool" + """CUDA memory-pool allocation.""" + + CUDA_MANAGED = "cuda_managed" + """CUDA managed-memory allocation.""" + + UNKNOWN = "unknown" + """Memory kind that Warp cannot classify.""" + + +_NATIVE_MEMORY_KIND_TO_MEMORY_KIND = { + 0: MemoryKind.UNKNOWN, + 1: MemoryKind.HOST, + 2: MemoryKind.PINNED_HOST, + 3: MemoryKind.CUDA_DEVICE, + 4: MemoryKind.CUDA_MEMPOOL, + 5: MemoryKind.CUDA_MANAGED, +} + + _LAUNCH_ARRAY_ACCESS_WARNING_CACHE_SIZE = 1024 _launch_array_access_warnings_seen: collections.OrderedDict[tuple[str, str, str, str], None] = collections.OrderedDict() @@ -3717,7 +3755,80 @@ def allocate(self, size_in_bytes): return ptr def deallocate(self, ptr, size_in_bytes): - runtime.core.wp_free_device_async(self.device.context, ptr, None) + if not runtime.core.wp_free_device_async(self.device.context, ptr, None): + native_error = runtime.get_error_string() + reason = f": {native_error}" if native_error else "" + raise RuntimeError(f"Failed to free memory on device '{self.device}'{reason}") + + +class ManagedAllocator: + """Allocator that creates CUDA managed-memory arrays. + + The allocator object is not bound to one CUDA device and can be reused + across devices. Each allocation still happens under a specific CUDA context, + and that context's device must support CUDA managed memory. Warp pushes the + target CUDA context before invoking ``allocate()``, and the allocator records + that context as the owner for each pointer so it can free allocations + through the same CUDA context later. Direct calls to ``allocate()`` require + an active CUDA context. + + Managed allocation uses ``cudaMallocManaged()`` and cannot occur while CUDA + graph capture is active. Allocate managed arrays before capture begins and + reuse the existing arrays inside captured work. + """ + + def __init__(self): + self._contexts_by_ptr: dict[int, int] = {} + self._lock = threading.Lock() + + def allocate(self, size_in_bytes): + init() + + context = runtime.core.wp_cuda_context_get_current() + if not context: + raise RuntimeError("ManagedAllocator.allocate() requires an active CUDA context") + + device = runtime.get_current_cuda_device() + if not device.is_cuda: + raise RuntimeError("ManagedAllocator requires a current CUDA device") + if not device.is_managed_memory_supported: + raise RuntimeError(f"ManagedAllocator requires CUDA managed memory support on device '{device}'") + + ptr = runtime.core.wp_alloc_device_managed(context, size_in_bytes, None) + if not ptr: + reason = "" + native_error = runtime.get_error_string() + if native_error: + reason = f": {native_error}" + if not reason and device.is_capturing: + reason = ( + ": Warp error: managed allocation during CUDA graph capture is not supported. " + "Allocate before capture on this device." + ) + raise RuntimeError( + f"Failed to allocate {size_in_bytes} bytes with ManagedAllocator on device '{device}'{reason}" + ) + + _set_alloc_tag_if_tracking(ptr) + with self._lock: + self._contexts_by_ptr[ptr] = context + return ptr + + def deallocate(self, ptr, size_in_bytes): + init() + + if not ptr: + return + + with self._lock: + context = self._contexts_by_ptr.pop(ptr, None) + if context is None: + raise RuntimeError("ManagedAllocator cannot free a pointer it did not allocate") + + if not runtime.core.wp_free_device_managed(context, ptr): + with self._lock: + self._contexts_by_ptr[ptr] = context + raise RuntimeError(runtime.get_error_string()) class ContextGuard: @@ -4049,8 +4160,14 @@ class Device: memory physically resident on this device without migration. This does not imply that Warp arrays allocated on CUDA devices are CPU-accessible: Warp's built-in CUDA allocators do not create CUDA managed-memory allocations. ``False`` for CPU devices. - is_cpu_gpu_atomic_supported (bool): Indicates whether native atomic operations between CPU and GPU memory - are supported on this device. ``False`` for CPU devices. + is_cpu_gpu_atomic_supported (bool): Indicates whether the CUDA device reports native CPU/GPU atomic + hardware capability. This is not a guarantee that Warp ``wp.atomic_*`` operations can be used + concurrently from CPU and GPU kernels; current CPU-side Warp atomics are not hardware atomics. + ``False`` for CPU devices. + is_managed_memory_supported (bool): Indicates whether the CUDA device supports managed-memory allocations. + ``False`` for CPU devices. + is_concurrent_managed_access_supported (bool): Indicates whether the CUDA device supports concurrent managed + memory access. ``False`` for CPU devices. is_cubin_supported (bool): Indicates whether Warp's version of NVRTC can directly generate CUDA binary files (cubin) for this device's architecture. ``False`` for CPU devices. is_mempool_supported (bool): Indicates whether the device supports using the ``cuMemAllocAsync`` and @@ -4100,6 +4217,8 @@ def __init__(self, runtime, alias, ordinal=-1, is_primary=False, context=None): self.is_cpu_memory_access_from_gpu_supported = False self.is_gpu_memory_access_from_cpu_supported = False self.is_cpu_gpu_atomic_supported = False + self.is_managed_memory_supported = False + self.is_concurrent_managed_access_supported = False self.is_mempool_supported = False self.is_mempool_enabled = False self.is_ipc_supported = False # TODO: Support IPC for CPU arrays @@ -4128,6 +4247,10 @@ def __init__(self, runtime, alias, ordinal=-1, is_primary=False, context=None): runtime.core.wp_cuda_device_get_direct_managed_mem_access_from_host(ordinal) > 0 ) self.is_cpu_gpu_atomic_supported = runtime.core.wp_cuda_device_get_host_native_atomic_supported(ordinal) > 0 + self.is_managed_memory_supported = runtime.core.wp_cuda_device_get_managed_memory(ordinal) > 0 + self.is_concurrent_managed_access_supported = ( + runtime.core.wp_cuda_device_get_concurrent_managed_access(ordinal) > 0 + ) self.is_mempool_supported = runtime.core.wp_cuda_device_is_mempool_supported(ordinal) > 0 if platform.system() == "Linux": # Use None when IPC support cannot be determined @@ -4384,7 +4507,7 @@ def can_access(self, other): """Return whether this device can access the current built-in allocator for another device. This is a coarse device-level query. It does not inspect a specific allocation, so it does not answer - whether an existing array can be accessed. Use :func:`warp.can_access` when allocation-specific Warp array + whether an existing array can be accessed. Use :func:`warp.can_access` when memory-kind-specific Warp array logic is needed, such as for pinned CPU arrays or CUDA memory-pool allocations. """ @@ -4917,6 +5040,8 @@ def __init__(self): self.core.wp_alloc_device_default.restype = ctypes.c_void_p self.core.wp_alloc_device_async.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_char_p] self.core.wp_alloc_device_async.restype = ctypes.c_void_p + self.core.wp_alloc_device_managed.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_char_p] + self.core.wp_alloc_device_managed.restype = ctypes.c_void_p self.core.wp_float_to_half_bits.argtypes = [ctypes.c_float] self.core.wp_float_to_half_bits.restype = ctypes.c_uint16 @@ -4937,7 +5062,9 @@ def __init__(self): self.core.wp_free_device_default.argtypes = [ctypes.c_void_p, ctypes.c_void_p] self.core.wp_free_device_default.restype = None self.core.wp_free_device_async.argtypes = [ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p] - self.core.wp_free_device_async.restype = None + self.core.wp_free_device_async.restype = ctypes.c_bool + self.core.wp_free_device_managed.argtypes = [ctypes.c_void_p, ctypes.c_void_p] + self.core.wp_free_device_managed.restype = ctypes.c_bool self.core.wp_alloc_tracker_enable.argtypes = [ctypes.c_int] self.core.wp_alloc_tracker_enable.restype = None @@ -5856,6 +5983,12 @@ def __init__(self): self.core.wp_cuda_device_get_direct_managed_mem_access_from_host.restype = ctypes.c_int self.core.wp_cuda_device_get_host_native_atomic_supported.argtypes = [ctypes.c_int] self.core.wp_cuda_device_get_host_native_atomic_supported.restype = ctypes.c_int + self.core.wp_cuda_device_get_managed_memory.argtypes = [ctypes.c_int] + self.core.wp_cuda_device_get_managed_memory.restype = ctypes.c_int + self.core.wp_cuda_device_get_concurrent_managed_access.argtypes = [ctypes.c_int] + self.core.wp_cuda_device_get_concurrent_managed_access.restype = ctypes.c_int + self.core.wp_cuda_pointer_get_memory_kind.argtypes = [ctypes.c_void_p, ctypes.c_void_p] + self.core.wp_cuda_pointer_get_memory_kind.restype = ctypes.c_int self.core.wp_cuda_device_is_mempool_supported.argtypes = [ctypes.c_int] self.core.wp_cuda_device_is_mempool_supported.restype = ctypes.c_int self.core.wp_cuda_device_is_ipc_supported.argtypes = [ctypes.c_int] @@ -8176,11 +8309,42 @@ def _get_array_allocator(value: warp.array) -> Allocator | None: return None +def _get_array_memory_kind(value: warp.array) -> MemoryKind: + """Return observed memory kind for ``value``, following views to their owner.""" + + while warp._src.types.is_array(value): + ref = getattr(value, "_ref", None) + if ref is not None and warp._src.types.is_array(ref): + value = ref + continue + + device = getattr(value, "device", None) + if device is None: + return MemoryKind.UNKNOWN + + if device.is_cpu: + return MemoryKind.PINNED_HOST if getattr(value, "pinned", False) else MemoryKind.HOST + + ptr = getattr(value, "ptr", None) + if ptr is None: + return MemoryKind.UNKNOWN + + if device.is_cuda: + with device.context_guard: + native_kind = runtime.core.wp_cuda_pointer_get_memory_kind(device.context, ptr) + return _NATIVE_MEMORY_KIND_TO_MEMORY_KIND.get(native_kind, MemoryKind.UNKNOWN) + + return MemoryKind.UNKNOWN + + return MemoryKind.UNKNOWN + + def can_access(device: DeviceLike, resource) -> bool: """Return whether ``device`` can directly access ``resource``. - In this release, ``resource`` must be a concrete Warp array. The query is allocation-aware for built-in Warp - allocators and returns ``False`` for cross-device allocations whose access rules cannot be verified. + In this release, ``resource`` must be a concrete Warp array. The query uses the resource's observed memory kind + where available, including CUDA managed, CUDA device, CUDA mempool, and pinned host memory, and returns ``False`` + when access cannot be verified. Args: device: The device that needs to access ``resource``. @@ -8244,7 +8408,7 @@ def _classify_single_array_access_from_device(value: warp.array, device: Device) if device.context == value_device.context: return _ArrayAccessStatus.ACCESSIBLE - allocator = _get_array_allocator(value) + memory_kind = _get_array_memory_kind(value) if device.is_cuda and value_device.is_cpu: if value.pinned and device.is_uva: @@ -8254,39 +8418,68 @@ def _classify_single_array_access_from_device(value: warp.array, device: Device) return _ArrayAccessStatus.INACCESSIBLE if device.is_cpu and value_device.is_cuda: - if isinstance(allocator, CudaDefaultAllocator | CudaMempoolAllocator): - # Warp's CUDA arrays are not CUDA managed-memory allocations. + if memory_kind == MemoryKind.CUDA_MANAGED: + if ( + value_device.is_concurrent_managed_access_supported + or value_device.is_gpu_memory_access_from_cpu_supported + ): + return _ArrayAccessStatus.ACCESSIBLE + return _ArrayAccessStatus.INACCESSIBLE + if memory_kind in (MemoryKind.CUDA_DEVICE, MemoryKind.CUDA_MEMPOOL): return _ArrayAccessStatus.INACCESSIBLE - # Custom and externally wrapped CUDA allocations may use allocation - # kinds that Warp cannot classify yet. + # Custom and externally wrapped CUDA allocations may use memory kinds + # that Warp cannot classify yet. return _ArrayAccessStatus.UNKNOWN if device.is_cuda and value_device.is_cuda: - if isinstance(allocator, CudaMempoolAllocator): + if memory_kind == MemoryKind.CUDA_MANAGED: + if device.is_managed_memory_supported and value_device.is_managed_memory_supported: + return _ArrayAccessStatus.ACCESSIBLE + return _ArrayAccessStatus.INACCESSIBLE + if memory_kind == MemoryKind.CUDA_MEMPOOL: + allocator = _get_array_allocator(value) + if not isinstance(allocator, CudaMempoolAllocator): + # is_mempool_access_enabled() checks Warp's/default device pool. Externally wrapped or custom + # mempool pointers may come from another pool whose access flags Warp did not query. + return _ArrayAccessStatus.UNKNOWN if is_mempool_access_enabled(value_device, device): return _ArrayAccessStatus.ACCESSIBLE return _ArrayAccessStatus.INACCESSIBLE - if isinstance(allocator, CudaDefaultAllocator): + if memory_kind == MemoryKind.CUDA_DEVICE: if is_peer_access_enabled(value_device, device): return _ArrayAccessStatus.ACCESSIBLE return _ArrayAccessStatus.INACCESSIBLE - # Custom and externally wrapped allocations do not expose enough - # information for launch array access checks to choose the correct access API. + # Unclassified memory kinds do not expose enough information for launch + # array access checks to choose the correct access API. return _ArrayAccessStatus.UNKNOWN return _ArrayAccessStatus.INACCESSIBLE +def _format_array_memory_kind(value: warp.array) -> str: + kind_names = [] + for backing_array in _iter_array_access_backing_arrays(value): + kind_name = _get_array_memory_kind(backing_array).name + if kind_name not in kind_names: + kind_names.append(kind_name) + + if len(kind_names) == 1: + return f"memory_kind={kind_names[0]}" + + return f"memory_kinds=[{', '.join(kind_names)}]" + + def _raise_launch_array_access_error(kernel, arg_name: str, value: warp.array, device: Device) -> None: + memory_kind = _format_array_memory_kind(value) raise RuntimeError( f"Error launching kernel '{kernel.key}', trying to launch on device='{device}', " - f"but input array for argument '{arg_name}' is on device={value.device}, " - f"whose array allocation is not accessible or cannot be verified as accessible from " - f"'{device}'. Move the array to '{device}', enable the required peer/coherent access, " + f"but input array for argument '{arg_name}' is on device={value.device} ({memory_kind}), " + f"whose memory is not accessible or cannot be verified as accessible from " + f"'{device}'. Move the array to '{device}', enable the required peer/mempool/coherent access, " f"or set warp.config.launch_array_access_mode = warp.config.LaunchArrayAccessMode.RELAXED " - f"only if this launch is valid for the hardware and allocation type." + f"only if this launch is valid for the hardware and memory kind." ) @@ -8302,9 +8495,9 @@ def _warn_unknown_launch_array_access(kernel, arg_name: str, value: warp.array, log_warning( f"warp.config.LaunchArrayAccessMode.CHECKED cannot verify cross-device access for kernel '{kernel.key}' " - f"argument '{arg_name}' from device={value.device} to launch device='{device}': the array uses " - "an unknown allocator or externally wrapped allocation. The launch will proceed but may result " - "in errors. Use warp.config.LaunchArrayAccessMode.STRICT to reject this launch, or " + f"argument '{arg_name}' from device={value.device} to launch device='{device}' " + f"({_format_array_memory_kind(value)}): Warp cannot verify access requirements for this memory kind or allocation. " + "The launch will proceed but may result in errors. Use warp.config.LaunchArrayAccessMode.STRICT to reject this launch, or " "warp.config.LaunchArrayAccessMode.RELAXED to suppress this diagnostic.", category=UserWarning, stacklevel=3, diff --git a/warp/_src/types.py b/warp/_src/types.py index 74aff6176c..c8ba1b99b7 100644 --- a/warp/_src/types.py +++ b/warp/_src/types.py @@ -3660,8 +3660,15 @@ def __del__(self): if not hasattr(self, "device") or self.device is None or self.ptr is None: return try: - with self.device.context_guard: + managed_allocator_type = getattr(warp, "ManagedAllocator", None) + if managed_allocator_type is not None and isinstance( + getattr(self, "_allocator", None), managed_allocator_type + ): + # ManagedAllocator deallocates through its recorded owner context. self.deleter(self.ptr, self.capacity) + else: + with self.device.context_guard: + self.deleter(self.ptr, self.capacity) except (TypeError, AttributeError): # Suppress TypeError and AttributeError when callables become None during shutdown pass @@ -3786,6 +3793,11 @@ def __str__(self): def __repr__(self): return type_repr(self) + @property + def memory_kind(self): + """Observed memory kind backing this array.""" + return warp._src.context._get_array_memory_kind(self) + def __getitem__(self, key): if isinstance(key, int): if self.ndim == 1: @@ -4514,18 +4526,30 @@ def ipc_handle(self) -> bytes: RuntimeError: The array is not associated with a CUDA device. RuntimeError: The CUDA device does not appear to support IPC. RuntimeError: The array was allocated using the :ref:`mempool memory allocator `. + RuntimeError: The array was allocated using the managed-memory allocator. + RuntimeError: The array wraps external CUDA memory. + RuntimeError: The array is a view into another allocation. """ if self.device is None or not self.device.is_cuda: raise RuntimeError("IPC requires a CUDA device") + + memory_kind = warp._src.context._get_array_memory_kind(self) + if memory_kind == warp._src.context.MemoryKind.CUDA_MANAGED: + raise RuntimeError("IPC is not supported for managed-memory arrays") elif self.device.is_ipc_supported is False: raise RuntimeError("IPC does not appear to be supported on this CUDA device") - elif isinstance(self._allocator, warp._src.context.CudaMempoolAllocator): + + if memory_kind == warp._src.context.MemoryKind.CUDA_MEMPOOL: raise RuntimeError( "Currently, IPC is only supported for arrays using the default memory allocator.\n" "See https://nvidia.github.io/warp/stable/deep_dive/allocators.html for instructions on how to disable\n" f"the mempool allocator on device {self.device}." ) + elif getattr(self, "_ref", None) is not None: + raise RuntimeError("IPC is not supported for array views") + elif not isinstance(getattr(self, "_allocator", None), warp._src.context.CudaDefaultAllocator): + raise RuntimeError("IPC is not supported for externally wrapped arrays") # Allocate a buffer for the data (64-element char array) ipc_handle_buffer = (ctypes.c_char * 64)() diff --git a/warp/native/warp.cpp b/warp/native/warp.cpp index 8e9cbbb7e9..e2affb2f13 100644 --- a/warp/native/warp.cpp +++ b/warp/native/warp.cpp @@ -979,11 +979,20 @@ void* wp_alloc_device_default(void* context, size_t s, const char* tag) { return void* wp_alloc_device_async(void* context, size_t s, const char* tag) { return NULL; } +void* wp_alloc_device_managed(void* context, size_t s, const char* tag) { return NULL; } + void wp_free_device(void* context, void* ptr) { } void wp_free_device_default(void* context, void* ptr) { } -void wp_free_device_async(void* context, void* ptr, void** dbg_node_ret) { } +bool wp_free_device_async(void* context, void* ptr, void** dbg_node_ret) { return true; } + +bool wp_free_device_managed(void* context, void* ptr) +{ + // Match the other no-CUDA device free stubs: there is no device allocator + // state to release, so freeing any pointer is treated as a no-op. + return true; +} bool wp_memcpy_h2d(void* context, void* dest, void* src, size_t n, void* stream) { return false; } @@ -1031,6 +1040,9 @@ WP_API int wp_cuda_device_is_uva(int ordinal) { return 0; } WP_API int wp_cuda_device_get_pageable_memory_access(int ordinal) { return 0; } WP_API int wp_cuda_device_get_direct_managed_mem_access_from_host(int ordinal) { return 0; } WP_API int wp_cuda_device_get_host_native_atomic_supported(int ordinal) { return 0; } +WP_API int wp_cuda_device_get_managed_memory(int ordinal) { return 0; } +WP_API int wp_cuda_device_get_concurrent_managed_access(int ordinal) { return 0; } +WP_API int wp_cuda_pointer_get_memory_kind(void* context, void* ptr) { return WP_MEMORY_KIND_UNKNOWN; } WP_API int wp_cuda_device_is_mempool_supported(int ordinal) { return 0; } WP_API int wp_cuda_device_is_ipc_supported(int ordinal) { return 0; } WP_API int wp_cuda_device_set_mempool_release_threshold(int ordinal, uint64_t threshold) { return 0; } diff --git a/warp/native/warp.cu b/warp/native/warp.cu index efe7f3b5d9..731deada49 100644 --- a/warp/native/warp.cu +++ b/warp/native/warp.cu @@ -151,6 +151,8 @@ struct DeviceInfo { int pageable_memory_access = 0; int direct_managed_mem_access_from_host = 0; int host_native_atomic_supported = 0; + int managed_memory = 0; + int concurrent_managed_access = 0; int is_mempool_supported = 0; int sm_count = 0; int is_ipc_supported = -1; @@ -245,6 +247,11 @@ static std::unordered_map g_captures; // See wp_alloc_device_async() and wp_free_device_async(). static std::unordered_map g_graph_allocs; +static std::unordered_map g_managed_direct_allocs; +// Pointers CUDA already reclaimed while destroying their owner context. +static std::unordered_map g_destroyed_context_managed_direct_allocs; +static std::mutex g_managed_alloc_mutex; + // Memory that cannot be freed immediately gets queued here. // Call free_deferred_allocs() to release. static std::vector g_deferred_free_list; @@ -301,6 +308,12 @@ int cuda_init() check_cu(cuDeviceGetAttribute_f( &g_devices[i].host_native_atomic_supported, CU_DEVICE_ATTRIBUTE_HOST_NATIVE_ATOMIC_SUPPORTED, device )); + check_cu( + cuDeviceGetAttribute_f(&g_devices[i].managed_memory, CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY, device) + ); + check_cu(cuDeviceGetAttribute_f( + &g_devices[i].concurrent_managed_access, CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, device + )); check_cu(cuDeviceGetAttribute_f( &g_devices[i].is_mempool_supported, CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED, device )); @@ -405,6 +418,36 @@ static ContextInfo* get_context_info(CUcontext ctx) static inline ContextInfo* get_context_info(void* context) { return get_context_info(static_cast(context)); } +static inline CUcontext get_managed_alloc_context(void* context) +{ + CUcontext ctx = static_cast(context); + return ctx ? ctx : get_current_context(); +} + +static void mark_context_managed_direct_allocs_released(CUcontext ctx) +{ + std::vector released_ptrs; + + { + std::lock_guard lock(g_managed_alloc_mutex); + for (auto iter = g_managed_direct_allocs.begin(); iter != g_managed_direct_allocs.end();) { + if (iter->second == ctx) { + void* ptr = iter->first; + g_destroyed_context_managed_direct_allocs[ptr] = ctx; + released_ptrs.push_back(ptr); + iter = g_managed_direct_allocs.erase(iter); + } else { + ++iter; + } + } + } + + if (g_alloc_tracker.enabled) { + for (void* ptr : released_ptrs) + g_alloc_tracker.record_free(ptr); + } +} + static inline StreamInfo* get_stream_info(CUstream stream) { auto it = g_streams.find(stream); @@ -639,7 +682,8 @@ static int process_deferred_graph_destroy_callbacks(void* context = NULL) GraphAllocInfo& alloc_info = alloc_iter->second; alloc_info.graph_destroyed = true; if (!alloc_info.ref_exists) { - wp_free_device_async(alloc_info.context, ptr); + if (!wp_free_device_async(alloc_info.context, ptr)) + fprintf(stderr, "Warp warning: failed to free destroyed graph allocation asynchronously\n"); } } } @@ -649,7 +693,8 @@ static int process_deferred_graph_destroy_callbacks(void* context = NULL) if (tmp_info.context) { // GPU alloc if (tmp_info.is_async) { - wp_free_device_async(tmp_info.context, tmp_info.ptr); + if (!wp_free_device_async(tmp_info.context, tmp_info.ptr)) + fprintf(stderr, "Warp warning: failed to free temporary graph allocation asynchronously\n"); } else { wp_free_device_default(tmp_info.context, tmp_info.ptr); } @@ -744,10 +789,12 @@ void wp_free_device(void* context, void* ptr) int ordinal = wp_cuda_context_get_device_ordinal(context); // use stream-ordered allocator if available - if (wp_cuda_device_is_mempool_supported(ordinal)) - wp_free_device_async(context, ptr); - else + if (wp_cuda_device_is_mempool_supported(ordinal)) { + if (!wp_free_device_async(context, ptr)) + fprintf(stderr, "Warp warning: failed to free device allocation asynchronously\n"); + } else { wp_free_device_default(context, ptr); + } } void* wp_alloc_device_default(void* context, size_t s, const char* tag) @@ -778,6 +825,42 @@ void wp_free_device_default(void* context, void* ptr) } } +static void record_stream_ordered_graph_alloc(void* ptr, void* context, CUstream stream, const char* caller) +{ + if (!ptr || !wp_cuda_stream_is_capturing(stream)) + return; + + uint64_t capture_id = get_capture_id(stream); + auto capture_iter = g_captures.find(capture_id); + if (capture_iter == g_captures.end()) + return; + + GraphAllocInfo alloc_info; + alloc_info.capture_id = capture_id; + alloc_info.context = context ? context : get_current_context(); + alloc_info.ref_exists = true; + alloc_info.graph_destroyed = false; + + std::vector deps; + if (get_capture_dependencies(stream, deps)) { + for (cudaGraphNode_t node : deps) { + CUgraphNodeType node_type; + if (check_cu(cuGraphNodeGetType_f(node, &node_type)) && node_type == CU_GRAPH_NODE_TYPE_MEM_ALLOC) { + cudaMemAllocNodeParams params; + if (check_cuda(cudaGraphMemAllocNodeGetParams(node, ¶ms)) && params.dptr == ptr) { + alloc_info.node = node; + break; + } + } + } + } + + if (!alloc_info.node) + fprintf(stderr, "Warp warning: %s: failed to find memory allocation node\n", caller); + + g_graph_allocs[ptr] = alloc_info; +} + void* wp_alloc_device_async(void* context, size_t s, const char* tag) { // stream-ordered allocations don't rely on the current context, @@ -793,48 +876,46 @@ void* wp_alloc_device_async(void* context, size_t s, const char* tag) void* ptr = NULL; check_cuda(cudaMallocAsync(&ptr, s, stream)); - if (ptr) { - // if the stream is capturing, the allocation requires special handling - if (wp_cuda_stream_is_capturing(stream)) { - // check if this is a known capture - uint64_t capture_id = get_capture_id(stream); - auto capture_iter = g_captures.find(capture_id); - if (capture_iter != g_captures.end()) { - // remember graph allocation details - GraphAllocInfo alloc_info; - alloc_info.capture_id = capture_id; - alloc_info.context = context ? context : get_current_context(); - alloc_info.ref_exists = true; // user reference created and returned here - alloc_info.graph_destroyed = false; // graph not destroyed yet - - // find the MemAllocNode that was just added - std::vector deps; - if (get_capture_dependencies(stream, deps)) { - for (cudaGraphNode_t node : deps) { - CUgraphNodeType node_type; - if (check_cu(cuGraphNodeGetType_f(node, &node_type))) { - if (node_type == CU_GRAPH_NODE_TYPE_MEM_ALLOC) { - cudaMemAllocNodeParams params; - if (check_cuda(cudaGraphMemAllocNodeGetParams(node, ¶ms))) { - if (params.dptr == ptr) { - alloc_info.node = node; - break; - } - } - } - } - } - } + record_stream_ordered_graph_alloc(ptr, context, stream, __FUNCTION__); - // Warn if the node is not found. This is unlikely and it's not a critical error, - // but we must also handle this situation in wp_free_device_async(). - if (!alloc_info.node) { - fprintf(stderr, "Warp warning: %s: failed to find memory allocation node\n", __FUNCTION__); - } + if (g_alloc_tracker.enabled && ptr) + g_alloc_tracker.record_alloc(ptr, s, ALLOC_KIND_DEVICE, wp_cuda_context_get_device_ordinal(context), tag); - g_graph_allocs[ptr] = alloc_info; - } - } + return ptr; +} + +void* wp_alloc_device_managed(void* context, size_t s, const char* tag) +{ + ContextGuard guard(context); + + ContextInfo* context_info = get_context_info(context); + if (!context_info || !context_info->device_info) + return NULL; + + CUcontext owner_context = get_managed_alloc_context(context); + DeviceInfo* device_info = context_info->device_info; + if (!device_info->managed_memory) + return NULL; + + CUstream stream = context_info->stream; + void* ptr = NULL; + bool is_capturing = wp_cuda_stream_is_capturing(stream); + + if (is_capturing) { + wp::set_error_string( + "Warp error: managed allocation during CUDA graph capture is not supported. " + "Allocate before capture on this device." + ); + return NULL; + } + + if (!check_cuda(cudaMallocManaged(&ptr, s, cudaMemAttachGlobal))) + return NULL; + + if (ptr) { + std::lock_guard lock(g_managed_alloc_mutex); + g_destroyed_context_managed_direct_allocs.erase(ptr); + g_managed_direct_allocs[ptr] = owner_context; } if (g_alloc_tracker.enabled && ptr) @@ -843,11 +924,69 @@ void* wp_alloc_device_async(void* context, size_t s, const char* tag) return ptr; } -void wp_free_device_async(void* context, void* ptr, void** dbg_node_ret) +bool wp_free_device_managed(void* context, void* ptr) { - if (g_alloc_tracker.enabled && ptr) + if (!ptr) + return true; + + CUcontext owner_context = get_managed_alloc_context(context); + { + std::lock_guard lock(g_managed_alloc_mutex); + + auto direct_iter = g_managed_direct_allocs.find(ptr); + if (direct_iter == g_managed_direct_allocs.end()) { + auto destroyed_iter = g_destroyed_context_managed_direct_allocs.find(ptr); + if (destroyed_iter != g_destroyed_context_managed_direct_allocs.end()) { + if (destroyed_iter->second != owner_context) { + wp::set_error_string( + "Warp error: managed device allocation %p owned by destroyed context %p, requested by %p", ptr, + static_cast(destroyed_iter->second), static_cast(owner_context) + ); + return false; + } + + g_destroyed_context_managed_direct_allocs.erase(destroyed_iter); + return true; + } + + wp::set_error_string("Warp error: unknown managed device allocation %p", ptr); + return false; + } + + if (direct_iter->second != owner_context) { + wp::set_error_string( + "Warp error: managed device allocation %p owned by different context (owner=%p, requester=%p)", ptr, + static_cast(direct_iter->second), static_cast(owner_context) + ); + return false; + } + + g_managed_direct_allocs.erase(direct_iter); + } + + ContextGuard guard(context); + + if (g_captures.empty()) { + if (!check_cuda(cudaFree(ptr))) { + std::lock_guard lock(g_managed_alloc_mutex); + g_managed_direct_allocs[ptr] = owner_context; + return false; + } + } else { + deferred_free(ptr, context, false); + } + + if (g_alloc_tracker.enabled) g_alloc_tracker.record_free(ptr); + return true; +} + +static bool wp_free_device_async_impl(void* context, void* ptr, void** dbg_node_ret) +{ + if (!ptr) + return true; + // stream-ordered allocators generally don't rely on the current context, // but we set the context here for consistent behaviour ContextGuard guard(context); @@ -866,7 +1005,8 @@ void wp_free_device_async(void* context, void* ptr, void** dbg_node_ret) // cudaFreeAsync on the null stream does not block or trigger synchronization, but it postpones // the deallocation until a synchronization point is reached, so preceding work on this pointer // should safely complete. - check_cuda(cudaFreeAsync(ptr, NULL)); + if (!check_cuda(cudaFreeAsync(ptr, NULL))) + return false; } else { // We must defer the free operation until graph capture completes. deferred_free(ptr, context, true); @@ -875,6 +1015,7 @@ void wp_free_device_async(void* context, void* ptr, void** dbg_node_ret) // get the graph allocation details GraphAllocInfo& alloc_info = alloc_iter->second; uint64_t capture_id = alloc_info.capture_id; + auto forget_graph_alloc = [&]() { g_graph_allocs.erase(alloc_iter); }; // check if the capture is still active auto capture_iter = g_captures.find(capture_id); @@ -883,8 +1024,8 @@ void wp_free_device_async(void* context, void* ptr, void** dbg_node_ret) cudaGraph_t graph = get_capture_graph(capture->stream); if (!graph) { fprintf(stderr, "Warp warning: %s: failed to get capture graph\n", __FUNCTION__); - g_graph_allocs.erase(alloc_iter); - return; + forget_graph_alloc(); + return false; } cudaGraphNode_t free_node = NULL; @@ -894,8 +1035,8 @@ void wp_free_device_async(void* context, void* ptr, void** dbg_node_ret) std::vector alloc_leaf_nodes; if (!get_dependent_leaf_nodes(alloc_info.node, alloc_leaf_nodes)) { fprintf(stderr, "Warp warning: %s: failed to get allocation-dependent nodes\n", __FUNCTION__); - g_graph_allocs.erase(alloc_iter); - return; + forget_graph_alloc(); + return false; } // Add a mem free node. All graph leaf nodes that are descendants of the alloc node @@ -906,8 +1047,8 @@ void wp_free_device_async(void* context, void* ptr, void** dbg_node_ret) &free_node, graph, alloc_leaf_nodes.data(), alloc_leaf_nodes.size(), ptr ))) { fprintf(stderr, "Warp warning: %s: failed to add a memory free node\n", __FUNCTION__); - g_graph_allocs.erase(alloc_iter); - return; + forget_graph_alloc(); + return false; } // Update the capture dependencies for affected child streams, if the streams are still alive. @@ -949,9 +1090,17 @@ void wp_free_device_async(void* context, void* ptr, void** dbg_node_ret) new_deps.push_back(dep); } } - check_cu(cuStreamUpdateCaptureDependencies_f( - other_stream, new_deps.data(), new_deps.size(), CU_STREAM_SET_CAPTURE_DEPENDENCIES - )); + if (!check_cu(cuStreamUpdateCaptureDependencies_f( + other_stream, new_deps.data(), new_deps.size(), + CU_STREAM_SET_CAPTURE_DEPENDENCIES + ))) { + fprintf( + stderr, "Warp warning: %s: failed to update capture dependencies\n", + __FUNCTION__ + ); + forget_graph_alloc(); + return false; + } } } } @@ -962,14 +1111,24 @@ void wp_free_device_async(void* context, void* ptr, void** dbg_node_ret) // deallocating. This produces correct graphs, but may introduce unnecessary stream serialization with // multi-stream captures. std::vector leaf_nodes; - if (get_graph_leaf_nodes(graph, leaf_nodes)) { - if (check_cuda( - cudaGraphAddMemFreeNode(&free_node, graph, leaf_nodes.data(), leaf_nodes.size(), ptr) - )) { - check_cu(cuStreamUpdateCaptureDependencies_f( - capture->stream, &free_node, 1, CU_STREAM_SET_CAPTURE_DEPENDENCIES - )); - } + if (!get_graph_leaf_nodes(graph, leaf_nodes)) { + fprintf(stderr, "Warp warning: %s: failed to get graph leaf nodes\n", __FUNCTION__); + forget_graph_alloc(); + return false; + } + if (!check_cuda( + cudaGraphAddMemFreeNode(&free_node, graph, leaf_nodes.data(), leaf_nodes.size(), ptr) + )) { + fprintf(stderr, "Warp warning: %s: failed to add a memory free node\n", __FUNCTION__); + forget_graph_alloc(); + return false; + } + if (!check_cu(cuStreamUpdateCaptureDependencies_f( + capture->stream, &free_node, 1, CU_STREAM_SET_CAPTURE_DEPENDENCIES + ))) { + fprintf(stderr, "Warp warning: %s: failed to update capture dependencies\n", __FUNCTION__); + forget_graph_alloc(); + return false; } } @@ -993,7 +1152,8 @@ void wp_free_device_async(void* context, void* ptr, void** dbg_node_ret) cudaGetLastError(); } else { // check for other errors - check_cuda(res); + if (!check_cuda(res)) + return false; } } else { // We must defer the operation until graph capture completes. @@ -1009,6 +1169,16 @@ void wp_free_device_async(void* context, void* ptr, void** dbg_node_ret) } } } + + if (g_alloc_tracker.enabled) + g_alloc_tracker.record_free(ptr); + + return true; +} + +bool wp_free_device_async(void* context, void* ptr, void** dbg_node_ret) +{ + return wp_free_device_async_impl(context, ptr, dbg_node_ret); } bool wp_memcpy_h2d(void* context, void* dest, void* src, size_t n, void* stream) @@ -2489,6 +2659,59 @@ int wp_cuda_device_get_host_native_atomic_supported(int ordinal) return 0; } +int wp_cuda_device_get_managed_memory(int ordinal) +{ + if (ordinal >= 0 && ordinal < int(g_devices.size())) + return g_devices[ordinal].managed_memory; + return 0; +} + +int wp_cuda_device_get_concurrent_managed_access(int ordinal) +{ + if (ordinal >= 0 && ordinal < int(g_devices.size())) + return g_devices[ordinal].concurrent_managed_access; + return 0; +} + +int wp_cuda_pointer_get_memory_kind(void* context, void* ptr) +{ + if (!ptr) + return WP_MEMORY_KIND_UNKNOWN; + + ContextGuard guard(context); + + unsigned int is_managed = 0; + CUresult managed_result + = cuPointerGetAttribute_f(&is_managed, CU_POINTER_ATTRIBUTE_IS_MANAGED, reinterpret_cast(ptr)); + if (managed_result != CUDA_SUCCESS) + return WP_MEMORY_KIND_UNKNOWN; + if (is_managed) + return WP_MEMORY_KIND_CUDA_MANAGED; + + unsigned int memory_type = 0; + CUresult memory_type_result + = cuPointerGetAttribute_f(&memory_type, CU_POINTER_ATTRIBUTE_MEMORY_TYPE, reinterpret_cast(ptr)); + if (memory_type_result != CUDA_SUCCESS) + return WP_MEMORY_KIND_UNKNOWN; + + if (memory_type == CU_MEMORYTYPE_HOST) + return WP_MEMORY_KIND_PINNED_HOST; + + if (memory_type != CU_MEMORYTYPE_DEVICE) + return WP_MEMORY_KIND_UNKNOWN; + + CUmemoryPool mempool = NULL; + CUresult mempool_result + = cuPointerGetAttribute_f(&mempool, CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE, reinterpret_cast(ptr)); + if (mempool_result != CUDA_SUCCESS) + return WP_MEMORY_KIND_CUDA_DEVICE; + + if (mempool) + return WP_MEMORY_KIND_CUDA_MEMPOOL; + + return WP_MEMORY_KIND_CUDA_DEVICE; +} + int wp_cuda_device_is_mempool_supported(int ordinal) { if (ordinal >= 0 && ordinal < int(g_devices.size())) @@ -2716,6 +2939,8 @@ void wp_cuda_context_destroy(void* context) if (info->conditional_module) check_cu(cuModuleUnload_f(info->conditional_module)); + mark_context_managed_direct_allocs_released(ctx); + g_contexts.erase(ctx); } @@ -3634,7 +3859,10 @@ void* wp_cuda_graph_insert_free_node(void* context, void* alloc_node_) // use wp_free_device_async() and get the free node it added void* free_node = NULL; - wp_free_device_async(context, ptr, &free_node); + if (!wp_free_device_async(context, ptr, &free_node) || !free_node) { + wp::set_error_string("Warp error: failed to insert MemFreeNode"); + return NULL; + } return free_node; } diff --git a/warp/native/warp.h b/warp/native/warp.h index 605c54b4ea..df78cced8e 100644 --- a/warp/native/warp.h +++ b/warp/native/warp.h @@ -18,6 +18,15 @@ #define WP_CUDA_GRAPH_CAPTURE_MODE_THREAD_LOCAL 1 #define WP_CUDA_GRAPH_CAPTURE_MODE_RELAXED 2 +enum wp_memory_kind { + WP_MEMORY_KIND_UNKNOWN = 0, + WP_MEMORY_KIND_HOST = 1, + WP_MEMORY_KIND_PINNED_HOST = 2, + WP_MEMORY_KIND_CUDA_DEVICE = 3, + WP_MEMORY_KIND_CUDA_MEMPOOL = 4, + WP_MEMORY_KIND_CUDA_MANAGED = 5, +}; + struct timing_result_t; // this is the core runtime API exposed on the DLL level @@ -54,12 +63,14 @@ WP_API void* wp_alloc_pinned(size_t s, const char* tag = nullptr); WP_API void* wp_alloc_device(void* context, size_t s, const char* tag = nullptr); WP_API void* wp_alloc_device_default(void* context, size_t s, const char* tag = nullptr); WP_API void* wp_alloc_device_async(void* context, size_t s, const char* tag = nullptr); +WP_API void* wp_alloc_device_managed(void* context, size_t s, const char* tag = nullptr); WP_API void wp_free_host(void* ptr); WP_API void wp_free_pinned(void* ptr); WP_API void wp_free_device(void* context, void* ptr); // uses cudaFreeAsync() if supported, cudaFree() otherwise WP_API void wp_free_device_default(void* context, void* ptr); // uses cudaFree() -WP_API void wp_free_device_async(void* context, void* ptr, void** dbg_node_ret = nullptr); // uses cudaFreeAsync() +WP_API bool wp_free_device_async(void* context, void* ptr, void** dbg_node_ret = nullptr); // uses cudaFreeAsync() +WP_API bool wp_free_device_managed(void* context, void* ptr); WP_API bool wp_memcpy_h2h(void* dest, void* src, size_t n); WP_API bool wp_memcpy_h2d(void* context, void* dest, void* src, size_t n, void* stream = WP_CURRENT_STREAM); @@ -475,6 +486,9 @@ WP_API int wp_cuda_device_is_uva(int ordinal); WP_API int wp_cuda_device_get_pageable_memory_access(int ordinal); WP_API int wp_cuda_device_get_direct_managed_mem_access_from_host(int ordinal); WP_API int wp_cuda_device_get_host_native_atomic_supported(int ordinal); +WP_API int wp_cuda_device_get_managed_memory(int ordinal); +WP_API int wp_cuda_device_get_concurrent_managed_access(int ordinal); +WP_API int wp_cuda_pointer_get_memory_kind(void* context, void* ptr); WP_API int wp_cuda_device_is_mempool_supported(int ordinal); WP_API int wp_cuda_device_is_ipc_supported(int ordinal); WP_API int wp_cuda_device_set_mempool_release_threshold(int ordinal, uint64_t threshold); diff --git a/warp/tests/cuda/test_unified_memory.py b/warp/tests/cuda/test_unified_memory.py index 06c338c157..97b0c20c11 100644 --- a/warp/tests/cuda/test_unified_memory.py +++ b/warp/tests/cuda/test_unified_memory.py @@ -6,7 +6,7 @@ These tests cover Warp's conservative memory-access capability reporting, default launch behavior for mixed-device array arguments, and opt-in launch array access checks through ``wp.config.launch_array_access_mode`` enum modes. They -also check that checked mode uses allocation-specific CUDA access rules +also check that checked mode uses memory-kind-specific CUDA access rules where possible: ordinary CPU memory, pinned CPU memory, default CUDA allocations, CUDA memory pool allocations, and array views backed by a parent allocation. @@ -95,6 +95,8 @@ def test_unified_memory_device_capabilities(test, device): "is_cpu_memory_access_from_gpu_supported", "is_gpu_memory_access_from_cpu_supported", "is_cpu_gpu_atomic_supported", + "is_managed_memory_supported", + "is_concurrent_managed_access_supported", ): test.assertIsInstance(getattr(device, attr), bool) @@ -102,6 +104,8 @@ def test_unified_memory_device_capabilities(test, device): test.assertFalse(device.is_cpu_memory_access_from_gpu_supported) test.assertFalse(device.is_gpu_memory_access_from_cpu_supported) test.assertFalse(device.is_cpu_gpu_atomic_supported) + test.assertFalse(device.is_managed_memory_supported) + test.assertFalse(device.is_concurrent_managed_access_supported) def test_unified_memory_launch_array_access_mode_config(test, device): @@ -188,7 +192,7 @@ def test_unified_memory_checked_rejects_indexedarray_with_inaccessible_indices(t with emulate_non_coherent_uva_cuda_device(device): test.assertFalse(wp.can_access(device, src)) with launch_array_access_mode(wp.config.LaunchArrayAccessMode.CHECKED): - with test.assertRaisesRegex(RuntimeError, "array allocation is not accessible or cannot be verified"): + with test.assertRaisesRegex(RuntimeError, "memory is not accessible or cannot be verified"): wp.launch(read_indexed_cpu_write_gpu, dim=src.size, inputs=[src], outputs=[dst], device=device) @@ -219,7 +223,7 @@ def test_unified_memory_verify_rejects_gpu_reading_cpu_when_unsupported(test, de dst = wp.empty(4, dtype=wp.float32, device=device) with launch_array_access_mode(wp.config.LaunchArrayAccessMode.CHECKED): - with test.assertRaisesRegex(RuntimeError, "array allocation is not accessible or cannot be verified"): + with test.assertRaisesRegex(RuntimeError, "memory is not accessible or cannot be verified"): wp.launch(read_cpu_write_gpu, dim=src.size, inputs=[src], outputs=[dst], device=device, record_cmd=True) @@ -227,16 +231,21 @@ def test_unified_memory_verify_rejects_cpu_reading_gpu_when_unsupported(test, de """Warp default CUDA allocations are not treated as CPU-accessible managed memory. CUDA exposes a host-to-managed-memory capability, but Warp's built-in CUDA - arrays are allocated with default device allocation APIs. Checked launch + arrays are allocated with CUDA malloc or memory-pool APIs. Checked launch verification must therefore reject CPU launches that receive those arrays. """ - src = wp.array(np.arange(4, dtype=np.float32), dtype=wp.float32, device=device) + with wp.ScopedMempool(device, False): + src = wp.array(np.arange(4, dtype=np.float32), dtype=wp.float32, device=device) dst = wp.empty(4, dtype=wp.float32, device="cpu") + test.assertIs(src.memory_kind, wp.MemoryKind.CUDA_DEVICE) + with launch_array_access_mode(wp.config.LaunchArrayAccessMode.CHECKED): - with test.assertRaisesRegex(RuntimeError, "array allocation is not accessible or cannot be verified"): + with test.assertRaises(RuntimeError) as exception_context: wp.launch(read_gpu_write_cpu, dim=src.size, inputs=[src], outputs=[dst], device="cpu", record_cmd=True) + test.assertIn("memory is not accessible or cannot be verified", str(exception_context.exception)) + test.assertIn("memory_kind=CUDA_DEVICE", str(exception_context.exception)) def test_unified_memory_relaxed_allows_cpu_launch_with_gpu_array(test, device): @@ -347,6 +356,191 @@ def test_unified_memory_array_view_allocator_lookup_uses_parent_array(test, devi test.assertIs(wp._src.context._get_array_allocator(src_slice), src._allocator) +def test_unified_memory_cuda_memory_kind_queries(test, device): + """array.memory_kind is derived from CUDA pointer attributes.""" + + with wp.ScopedMempool(device, False): + default_arr = wp.empty(4, dtype=wp.float32, device=device) + + test.assertIs(default_arr.memory_kind, wp.MemoryKind.CUDA_DEVICE) + test.assertIs(default_arr[1:].memory_kind, wp.MemoryKind.CUDA_DEVICE) + + if device.is_mempool_supported: + with wp.ScopedMempool(device, True): + mempool_arr = wp.empty(4, dtype=wp.float32, device=device) + test.assertIs(mempool_arr.memory_kind, wp.MemoryKind.CUDA_MEMPOOL) + test.assertIs(mempool_arr[1:].memory_kind, wp.MemoryKind.CUDA_MEMPOOL) + + if device.is_managed_memory_supported: + managed = wp.ManagedAllocator() + with wp.ScopedAllocator(device, managed): + managed_arr = wp.empty(4, dtype=wp.float32, device=device) + test.assertIs(managed_arr.memory_kind, wp.MemoryKind.CUDA_MANAGED) + test.assertIs(managed_arr[1:].memory_kind, wp.MemoryKind.CUDA_MANAGED) + + +def test_unified_memory_wrapped_cuda_pointer_memory_kind(test, device): + """Externally wrapped CUDA pointers are classified through CUDA attributes.""" + + if not device.is_managed_memory_supported: + test.skipTest(f"{device} does not support CUDA managed memory") + + cpu = wp.get_device("cpu") + managed = wp.ManagedAllocator() + with wp.ScopedAllocator(device, managed): + owner = wp.empty(4, dtype=wp.float32, device=device) + + wrapped = wp.array(ptr=owner.ptr, dtype=wp.float32, shape=owner.shape, device=device, copy=False) + + test.assertIs(wrapped.memory_kind, wp.MemoryKind.CUDA_MANAGED) + expected_cpu_access = ( + device.is_concurrent_managed_access_supported or device.is_gpu_memory_access_from_cpu_supported + ) + test.assertEqual(wp.can_access(cpu, wrapped), expected_cpu_access) + + +def test_unified_memory_managed_allocator_can_access(test, device): + """Managed arrays use managed-memory access rules, not peer or mempool access rules.""" + + if not device.is_managed_memory_supported: + test.skipTest(f"{device} does not support CUDA managed memory") + + cpu = wp.get_device("cpu") + managed = wp.ManagedAllocator() + + with wp.ScopedAllocator(device, managed): + arr = wp.empty(8, dtype=wp.float32, device=device) + + expected_cpu_access = ( + device.is_concurrent_managed_access_supported or device.is_gpu_memory_access_from_cpu_supported + ) + test.assertEqual(wp.can_access(cpu, arr), expected_cpu_access) + + for target in wp.get_cuda_devices(): + with test.subTest(target=target): + test.assertEqual(wp.can_access(target, arr), target.is_managed_memory_supported) + + +def test_unified_memory_ipc_handle_rejects_unsupported_arrays(test, device): + """CUDA IPC handles are only exposed for representable CUDA device allocations.""" + + cases = [] + + if device.is_managed_memory_supported: + managed = wp.ManagedAllocator() + with wp.ScopedAllocator(device, managed): + managed_arr = wp.empty(4, dtype=wp.float32, device=device) + + cases.extend( + ( + ("managed array", managed_arr, "managed-memory arrays"), + ("managed array view", managed_arr[1:], "managed-memory arrays"), + ) + ) + + if device.is_ipc_supported is not False: + with wp.ScopedMempool(device, False): + cuda_arr = wp.empty(8, dtype=wp.float32, device=device) + + external_view = wp.array( + ptr=cuda_arr.ptr + 2 * cuda_arr.strides[0], dtype=wp.float32, shape=(4,), device=device + ) + cases.extend( + ( + ("default CUDA array view", cuda_arr[2:], "array views"), + ("externally wrapped CUDA pointer", external_view, "externally wrapped arrays"), + ) + ) + + if not cases: + test.skipTest(f"{device} does not support CUDA managed memory or IPC") + + for name, arr, message in cases: + with test.subTest(kind=name): + with test.assertRaisesRegex(RuntimeError, message): + arr.ipc_handle() + + +def test_unified_memory_checked_cpu_launch_with_managed_array(test, device): + """Checked mode accepts CPU access to managed arrays only on devices that report host access support.""" + + if not device.is_managed_memory_supported: + test.skipTest(f"{device} does not support CUDA managed memory") + + managed = wp.ManagedAllocator() + src_np = np.arange(4, dtype=np.float32) + with wp.ScopedAllocator(device, managed): + src = wp.array(src_np, dtype=wp.float32, device=device) + + dst = wp.empty(4, dtype=wp.float32, device="cpu") + expected_cpu_access = ( + device.is_concurrent_managed_access_supported or device.is_gpu_memory_access_from_cpu_supported + ) + + with launch_array_access_mode(wp.config.LaunchArrayAccessMode.CHECKED): + if expected_cpu_access: + # The GPU-side initialization of ``src`` must complete before a CPU kernel reads it. + wp.synchronize_device(device) + wp.launch(read_gpu_write_cpu, dim=src.size, inputs=[src], outputs=[dst], device="cpu") + np.testing.assert_allclose(dst.numpy(), src_np + 3.0) + else: + with test.assertRaisesRegex(RuntimeError, "memory is not accessible or cannot be verified"): + wp.launch( + read_gpu_write_cpu, + dim=src.size, + inputs=[src], + outputs=[dst], + device="cpu", + record_cmd=True, + ) + + +def test_unified_memory_managed_array_cross_device_graph_capture(test, device): + """Preallocated managed arrays can be used by captured kernels on another CUDA device.""" + + if wp.get_cuda_device_count() < 2: + test.skipTest("Multi-GPU not available") + if not check_p2p(): + test.skipTest("Peer-to-Peer transfers not supported") + if not device.is_managed_memory_supported: + test.skipTest(f"{device} does not support CUDA managed memory") + + target = next(d for d in wp.get_cuda_devices() if d != device) + if not target.is_managed_memory_supported: + test.skipTest(f"{target} does not support CUDA managed memory") + + managed = wp.ManagedAllocator() + with wp.ScopedAllocator(device, managed): + src = wp.array(np.arange(8, dtype=np.float32), dtype=wp.float32, device=device) + + # The target graph reads src from another device; wait for its H2D initialization. + wp.synchronize_device(device) + test.assertTrue(wp.can_access(target, src)) + + dst = wp.empty(src.size, dtype=wp.float32, device=target) + wp.load_module(device=target) + with launch_array_access_mode(wp.config.LaunchArrayAccessMode.CHECKED): + with wp.ScopedCapture(device=target, force_module_load=False) as capture: + wp.launch(read_cpu_write_gpu, dim=src.size, inputs=[src], outputs=[dst], device=target) + + wp.capture_launch(capture.graph) + np.testing.assert_allclose(dst.numpy(), np.arange(8, dtype=np.float32) * 2.0) + + +def test_unified_memory_managed_allocator_rejects_capture_allocation(test, device): + """ManagedAllocator rejects allocation during CUDA graph capture.""" + + if not device.is_managed_memory_supported: + test.skipTest(f"{device} does not support CUDA managed memory") + + wp.load_module(device=device) + managed = wp.ManagedAllocator() + with wp.ScopedAllocator(device, managed): + with test.assertRaisesRegex(RuntimeError, "managed allocation during CUDA graph capture"): + with wp.ScopedCapture(device=device, force_module_load=False): + wp.empty(8, dtype=wp.float32, device=device) + + devices = get_test_devices() cuda_devices = get_cuda_test_devices() @@ -374,13 +568,46 @@ def test_unified_memory_unknown_access_warning_cache_is_bounded(self): cache.update(saved_cache) @unittest.skipUnless(wp.is_cuda_available(), "CUDA not available") - def test_unified_memory_checked_warns_once_for_custom_allocator(self): - """CHECKED warns once for unknown custom allocator provenance. + def test_unified_memory_checked_warns_once_for_unknown_memory_kind_launch(self): + """CHECKED launch validation warns once for UNKNOWN memory kind access.""" + + cache = warp_context._launch_array_access_warnings_seen + saved_cache = cache.copy() + cache.clear() + try: + device = wp.get_device("cuda:0") + cpu = wp.get_device("cpu") + n = 4 + + src = wp.empty(n, dtype=wp.float32, device=device) + dst = wp.empty(n, dtype=wp.float32, device=cpu) + + with ( + patch.object(warp_context, "_get_array_memory_kind", return_value=wp.MemoryKind.UNKNOWN), + patch("warp._src.context.log_warning") as mock_log_warning, + launch_array_access_mode(wp.config.LaunchArrayAccessMode.CHECKED), + ): + cmd0 = wp.launch(read_gpu_write_cpu, dim=n, inputs=[src], outputs=[dst], device=cpu, record_cmd=True) + cmd1 = wp.launch(read_gpu_write_cpu, dim=n, inputs=[src], outputs=[dst], device=cpu, record_cmd=True) + + self.assertIsInstance(cmd0, wp.Launch) + self.assertIsInstance(cmd1, wp.Launch) + matching = [ + call for call in mock_log_warning.call_args_list if "cannot verify cross-device access" in call.args[0] + ] + self.assertEqual(len(matching), 1) + self.assertIn("memory_kind=UNKNOWN", matching[0].args[0]) + finally: + cache.clear() + cache.update(saved_cache) + + @unittest.skipUnless(wp.is_cuda_available(), "CUDA not available") + def test_unified_memory_checked_rejects_custom_allocator_cuda_device_pointer(self): + """CHECKED rejects custom allocator CUDA device pointers from CPU launches. - The delegating allocator returns an ordinary CUDA pointer, but the - launch verifier only sees an unknown custom allocator. Checked mode - should warn about the unverified cross-device launch pattern without - warning again for the same kernel/argument/device combination. + The delegating allocator returns an ordinary CUDA pointer whose memory + kind can be identified. CPU access to CUDA device memory is known to be + inaccessible, so checked mode must reject the launch instead of warning. """ device = wp.get_device("cuda:0") @@ -395,24 +622,19 @@ def test_unified_memory_checked_warns_once_for_custom_allocator(self): self.assertFalse(wp.can_access(cpu, src)) with launch_array_access_mode(wp.config.LaunchArrayAccessMode.CHECKED): - with patch("warp._src.context.log_warning") as mock_log_warning: - cmd0 = wp.launch(read_gpu_write_cpu, dim=n, inputs=[src], outputs=[dst], device=cpu, record_cmd=True) - cmd1 = wp.launch(read_gpu_write_cpu, dim=n, inputs=[src], outputs=[dst], device=cpu, record_cmd=True) + with self.assertRaises(RuntimeError) as exception_context: + wp.launch(read_gpu_write_cpu, dim=n, inputs=[src], outputs=[dst], device=cpu, record_cmd=True) - self.assertIsInstance(cmd0, wp.Launch) - self.assertIsInstance(cmd1, wp.Launch) - matching = [ - call for call in mock_log_warning.call_args_list if "cannot verify cross-device access" in call.args[0] - ] - self.assertEqual(len(matching), 1) + self.assertIn("memory is not accessible or cannot be verified", str(exception_context.exception)) + self.assertIn("memory_kind=CUDA_DEVICE", str(exception_context.exception)) @unittest.skipUnless(wp.is_cuda_available(), "CUDA not available") def test_unified_memory_strict_rejects_custom_allocator_cross_device(self): """STRICT still rejects cross-device arrays with custom allocators. Strict mode intentionally restores Warp's old same-device policy before - allocator-specific reachability matters. This keeps custom allocator - provenance from creating a loophole in strict validation. + memory-kind-specific reachability matters. This keeps custom allocators + from creating a loophole in strict validation. """ device = wp.get_device("cuda:0") @@ -430,11 +652,11 @@ def test_unified_memory_strict_rejects_custom_allocator_cross_device(self): @unittest.skipUnless(wp.is_cuda_available(), "CUDA not available") def test_unified_memory_relaxed_does_not_warn_for_custom_allocator(self): - """RELAXED keeps passing unknown custom allocator launches through silently. + """RELAXED keeps passing custom allocator launches through silently. Relaxed mode is the default pass-through policy for users who already - know their hardware and allocation are valid. Unknown custom allocator - provenance should not emit the checked-mode diagnostic in this mode. + know their hardware and allocation are valid. Custom allocator CUDA + device pointers should not emit the checked-mode diagnostic in this mode. """ device = wp.get_device("cuda:0") @@ -571,7 +793,7 @@ def test_unified_memory_verify_uses_mempool_access_for_cuda_mempool_allocations( An array allocated while the source device's mempool is enabled needs the CUDA mempool access predicate. The companion rejection test keeps peer access enabled while mempool access is disabled, so the pair - isolates the allocation-specific mempool rule without executing this + isolates the memory-kind-specific mempool rule without executing this peer kernel in a recently changed pool-access state with peer access disabled. """ @@ -611,7 +833,7 @@ def test_unified_memory_verify_uses_parent_allocator_for_cuda_mempool_slices(sel This covers the view case for CUDA mempool-backed storage. Checked verification must follow the slice's parent allocation and then apply - mempool access rules, rather than falling back to unknown provenance. + mempool access rules, rather than falling back to an unknown memory kind. """ target_device, peer_device = get_cuda_device_pair_with_mempool_access_support() @@ -641,6 +863,63 @@ def test_unified_memory_verify_uses_parent_allocator_for_cuda_mempool_slices(sel wp.set_peer_access_enabled(target_device, peer_device, peer_access_saved) wp.set_mempool_access_enabled(target_device, peer_device, mempool_access_saved) + @unittest.skipUnless( + get_cuda_device_pair_with_mempool_access_support(), "Requires devices with mempool access support" + ) + @unittest.skipUnless(check_p2p(), "Peer-to-Peer transfers not supported") + def test_unified_memory_checked_warns_for_unowned_cuda_mempool_pointer(self): + """CHECKED treats unowned CUDA mempool pointers as unknown access.""" + + cache = warp_context._launch_array_access_warnings_seen + saved_cache = cache.copy() + cache.clear() + try: + target_device, peer_device = get_cuda_device_pair_with_mempool_access_support() + n = 8 + + with wp.ScopedMempool(target_device, True): + owner = wp.array(np.arange(n, dtype=np.float32), dtype=wp.float32, device=target_device) + src = wp.array(ptr=owner.ptr, dtype=wp.float32, shape=owner.shape, device=target_device, copy=False) + dst = wp.empty(n, dtype=wp.float32, device=peer_device) + + self.assertIs(owner.memory_kind, wp.MemoryKind.CUDA_MEMPOOL) + self.assertIs(src.memory_kind, wp.MemoryKind.CUDA_MEMPOOL) + self.assertIsNone(warp_context._get_array_allocator(src)) + + with ( + patch.object(warp_context, "is_mempool_access_enabled", return_value=True) as mock_access, + patch("warp._src.context.log_warning") as mock_log_warning, + launch_array_access_mode(wp.config.LaunchArrayAccessMode.CHECKED), + ): + cmd0 = wp.launch( + read_cpu_write_gpu, + dim=n, + inputs=[src], + outputs=[dst], + device=peer_device, + record_cmd=True, + ) + cmd1 = wp.launch( + read_cpu_write_gpu, + dim=n, + inputs=[src], + outputs=[dst], + device=peer_device, + record_cmd=True, + ) + + self.assertIsInstance(cmd0, wp.Launch) + self.assertIsInstance(cmd1, wp.Launch) + mock_access.assert_not_called() + matching = [ + call for call in mock_log_warning.call_args_list if "cannot verify cross-device access" in call.args[0] + ] + self.assertEqual(len(matching), 1) + self.assertIn("memory_kind=CUDA_MEMPOOL", matching[0].args[0]) + finally: + cache.clear() + cache.update(saved_cache) + @unittest.skipUnless( get_cuda_device_pair_with_mempool_access_support(), "Requires devices with mempool access support" ) @@ -678,9 +957,7 @@ def test_unified_memory_verify_rejects_mempool_allocation_without_mempool_access with launch_array_access_mode(wp.config.LaunchArrayAccessMode.CHECKED): mock_access.reset_mock() - with self.assertRaisesRegex( - RuntimeError, "array allocation is not accessible or cannot be verified" - ): + with self.assertRaisesRegex(RuntimeError, "memory is not accessible or cannot be verified"): wp.launch( read_cpu_write_gpu, dim=n, @@ -778,5 +1055,48 @@ def test_unified_memory_verify_rejects_mempool_allocation_without_mempool_access test_unified_memory_array_view_allocator_lookup_uses_parent_array, devices=devices, ) +add_function_test( + TestUnifiedMemory, + "test_unified_memory_cuda_memory_kind_queries", + test_unified_memory_cuda_memory_kind_queries, + devices=cuda_devices, +) +add_function_test( + TestUnifiedMemory, + "test_unified_memory_wrapped_cuda_pointer_memory_kind", + test_unified_memory_wrapped_cuda_pointer_memory_kind, + devices=cuda_devices, +) +add_function_test( + TestUnifiedMemory, + "test_unified_memory_managed_allocator_can_access", + test_unified_memory_managed_allocator_can_access, + devices=cuda_devices, +) +add_function_test( + TestUnifiedMemory, + "test_unified_memory_ipc_handle_rejects_unsupported_arrays", + test_unified_memory_ipc_handle_rejects_unsupported_arrays, + devices=cuda_devices, +) +add_function_test( + TestUnifiedMemory, + "test_unified_memory_checked_cpu_launch_with_managed_array", + test_unified_memory_checked_cpu_launch_with_managed_array, + devices=cuda_devices, +) +add_function_test( + TestUnifiedMemory, + "test_unified_memory_managed_array_cross_device_graph_capture", + test_unified_memory_managed_array_cross_device_graph_capture, + devices=cuda_devices, +) +add_function_test( + TestUnifiedMemory, + "test_unified_memory_managed_allocator_rejects_capture_allocation", + test_unified_memory_managed_allocator_rejects_capture_allocation, + devices=cuda_devices, + check_output=False, +) if __name__ == "__main__": unittest.main(verbosity=2) diff --git a/warp/tests/test_allocator.py b/warp/tests/test_allocator.py index b22ecb041c..c50f312bfc 100644 --- a/warp/tests/test_allocator.py +++ b/warp/tests/test_allocator.py @@ -147,6 +147,36 @@ def test_protocol_conformance_cpu(self): self.assertIsInstance(cpu.default_allocator, Allocator) self.assertIsInstance(cpu.pinned_allocator, Allocator) + def test_public_memory_kind_for_cpu_arrays(self): + """array.memory_kind reports observed CPU memory kind.""" + + a = wp.empty(4, dtype=wp.float32, device="cpu") + self.assertIs(a.memory_kind, wp.MemoryKind.HOST) + self.assertIs(a[1:].memory_kind, wp.MemoryKind.HOST) + + zero_size = wp.empty(0, dtype=wp.float32, device="cpu") + self.assertIs(zero_size.memory_kind, wp.MemoryKind.HOST) + + if wp.is_cuda_available(): + pinned = wp.empty(4, dtype=wp.float32, device="cpu", pinned=True) + self.assertIs(pinned.memory_kind, wp.MemoryKind.PINNED_HOST) + + zero_size_pinned = wp.empty(0, dtype=wp.float32, device="cpu", pinned=True) + self.assertIs(zero_size_pinned.memory_kind, wp.MemoryKind.PINNED_HOST) + + data = np.empty(4, dtype=np.float32) + wrapped = wp.array(data, dtype=wp.float32, device="cpu", copy=False) + self.assertEqual(wrapped.ptr, data.ctypes.data) + self.assertIs(wrapped.memory_kind, wp.MemoryKind.HOST) + + annotation = wp.array(dtype=wp.float32) + self.assertIs(annotation.memory_kind, wp.MemoryKind.UNKNOWN) + + indices = wp.array([0, 1], dtype=wp.int32, device="cpu") + indexed = wp.indexedarray(a, indices) + self.assertFalse(hasattr(indexed, "memory_kind")) + self.assertFalse(hasattr(wp.indexedarray[wp.float32], "memory_kind")) + def test_protocol_conformance_cuda(test, device): """Built-in CUDA allocators satisfy the Allocator protocol.""" @@ -154,6 +184,7 @@ def test_protocol_conformance_cuda(test, device): test.assertIsInstance(device.default_allocator, Allocator) if device.is_mempool_supported: test.assertIsInstance(device.mempool_allocator, Allocator) + test.assertIsInstance(wp.ManagedAllocator(), Allocator) add_function_test( @@ -265,6 +296,46 @@ def test_scoped_allocator(test, device): test.assertIs(wp.get_device_allocator(device), original) +def test_managed_allocator_allocates_on_selected_device(test, device): + """ManagedAllocator uses the CUDA device selected by ScopedAllocator.""" + + if not device.is_managed_memory_supported: + test.skipTest(f"{device} does not support CUDA managed memory") + + managed = wp.ManagedAllocator() + + with wp.ScopedAllocator(device, managed): + a = wp.zeros(8, dtype=wp.float32, device=device) + + test.assertEqual(a.device, device) + test.assertFalse(a.pinned) + test.assertIs(a.memory_kind, wp.MemoryKind.CUDA_MANAGED) + np.testing.assert_allclose(a.numpy(), np.zeros(8, dtype=np.float32)) + + +@unittest.skipUnless(wp.get_cuda_device_count() >= 2, "Multi-GPU not available") +def test_managed_allocator_deallocates_from_recorded_context(test, device): + """ManagedAllocator frees with the pointer's owner context, not the current one.""" + + if not device.is_managed_memory_supported: + test.skipTest(f"{device} does not support CUDA managed memory") + + other = wp.get_device(f"cuda:{(device.ordinal + 1) % wp.get_cuda_device_count()}") + + managed = wp.ManagedAllocator() + with device.context_guard: + ptr = managed.allocate(16) + + try: + with other.context_guard: + managed.deallocate(ptr, 16) + ptr = None + finally: + if ptr: + with device.context_guard: + managed.deallocate(ptr, 16) + + def test_scoped_allocator_restores_on_exception(test, device): """ScopedAllocator restores allocator even if body raises.""" device = wp.get_device(device) @@ -326,6 +397,19 @@ def test_zero_size_allocation(test, device): ]: add_function_test(TestCustomAllocator, fn.__name__, fn, devices=cuda_test_devices) +add_function_test( + TestCustomAllocator, + "test_managed_allocator_allocates_on_selected_device", + test_managed_allocator_allocates_on_selected_device, + devices=cuda_test_devices, +) + +add_function_test( + TestCustomAllocator, + "test_managed_allocator_deallocates_from_recorded_context", + test_managed_allocator_deallocates_from_recorded_context, + devices=cuda_test_devices, +) # -- RMM allocator ----------------------------------------------------------