## What's the problem?

An ABI mismatch between the C++ and Python definitions of device functions operating on custom (a.k.a. "storage" or "struct") types leads to silent failures, illegal memory accesses, and other undefined behavior.
### Example 1
One example of this is #3789.
### Example 2

If we change the existing Python test for `reduce_into` to operate on structs composed of `int8` values rather than `int32`, it fails:
#### Diff

```diff
diff --git a/python/cuda_parallel/tests/test_reduce_api.py b/python/cuda_parallel/tests/test_reduce_api.py
index 6087a263e..e5424363f 100644
--- a/python/cuda_parallel/tests/test_reduce_api.py
+++ b/python/cuda_parallel/tests/test_reduce_api.py
@@ -190,14 +190,14 @@ def test_reduce_struct_type():
 
     @gpu_struct
     class Pixel:
-        r: np.int32
-        g: np.int32
-        b: np.int32
+        r: np.int8
+        g: np.int8
+        b: np.int8
 
     def max_g_value(x, y):
         return x if x.g > y.g else y
 
-    d_rgb = cp.random.randint(0, 256, (10, 3), dtype=np.int32).view(Pixel.dtype)
+    d_rgb = cp.random.randint(0, 64, (10, 3), dtype=np.int8).view(Pixel.dtype)
     d_out = cp.empty(1, Pixel.dtype)
     h_init = Pixel(0, 0, 0)
 
@@ -209,7 +209,7 @@ def test_reduce_struct_type():
     _ = reduce_into(d_temp_storage, d_rgb, d_out, d_rgb.size, h_init)
 
     h_rgb = d_rgb.get()
-    expected = h_rgb[h_rgb.view("int32")[:, 1].argmax()]
+    expected = h_rgb[h_rgb.view("int8")[:, 1].argmax()]
     np.testing.assert_equal(expected["g"], d_out.get()["g"])
     # example-end reduce-struct
 
```
#### Test Output

```
_______________________________________________________________________________ test_reduce_struct_type _______________________________________________________________________________

    def test_reduce_struct_type():
        # example-begin reduce-struct
        import cupy as cp
        import numpy as np
        from cuda.parallel.experimental import algorithms
        from cuda.parallel.experimental.struct import gpu_struct

        @gpu_struct
        class Pixel:
            r: np.int8
            g: np.int8
            b: np.int8

        def max_g_value(x, y):
            return x if x.g > y.g else y

        d_rgb = cp.random.randint(0, 64, (10, 3), dtype=np.int8).view(Pixel.dtype)
        d_out = cp.empty(1, Pixel.dtype)
        h_init = Pixel(0, 0, 0)

        reduce_into = algorithms.reduce_into(d_rgb, d_out, max_g_value, h_init)
        temp_storage_bytes = reduce_into(None, d_rgb, d_out, d_rgb.size, h_init)
        d_temp_storage = cp.empty(temp_storage_bytes, dtype=np.uint8)
        _ = reduce_into(d_temp_storage, d_rgb, d_out, d_rgb.size, h_init)

        h_rgb = d_rgb.get()
        expected = h_rgb[h_rgb.view("int8")[:, 1].argmax()]
>       np.testing.assert_equal(expected["g"], d_out.get()["g"])

tests/test_reduce_api.py:214:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../../miniforge3/envs/cupy-dev/lib/python3.10/site-packages/numpy/_utils/__init__.py:85: in wrapper
    return fun(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (<built-in function eq>, array([62], dtype=int8), array([0], dtype=int8)), kwds = {'err_msg': '', 'header': 'Arrays are not equal', 'strict': False, 'verbose': True}

    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)
E           AssertionError:
E           Arrays are not equal
E
E           Mismatched elements: 1 / 1 (100%)
E           Max absolute difference among violations: 62
E           Max relative difference among violations: inf
E            ACTUAL: array([62], dtype=int8)
E            DESIRED: array([0], dtype=int8)

../../../../miniforge3/envs/cupy-dev/lib/python3.10/contextlib.py:79: AssertionError
=============================================================================== short test summary info ===============================================================================
FAILED tests/test_reduce_api.py::test_reduce_struct_type - AssertionError:
```
## Why is this happening?
Consider an `extern "C"` device function accepting two structs as inputs:

```c++
struct MyStruct {
  char a;
  char b;
  char c;
};

extern "C" __device__ void foo(MyStruct a, MyStruct b) {
}
```

Here's the corresponding PTX:

```
.visible .func foo(
        .param .align 1 .b8 foo_param_0[3],
        .param .align 1 .b8 foo_param_1[3]
)
{
        ret;
}
```
Now consider defining this device function using `numba.cuda`, with `@gpu_struct` defining the type of the arguments (keeping in mind that `gpu_struct` uses `StructModel` to define the underlying numba type):
```python
import numba
from numba import cuda
import numpy as np

from cuda.parallel.experimental.struct import gpu_struct


@gpu_struct
class MyStruct:
    a: np.int8
    b: np.int8
    c: np.int8


def op(a, b):
    pass


tp = numba.typeof(MyStruct(1, 2, 3))
ptx, _ = cuda.compile(op, (tp, tp))
print(ptx)
```
Here's the output of the above script:
```
.visible .func (.param .b64 func_retval0) op(
        .param .b32 op_param_0,
        .param .b32 op_param_1,
        .param .b32 op_param_2,
        .param .b32 op_param_3,
        .param .b32 op_param_4,
        .param .b32 op_param_5
)
{
        .reg .b64 %rd<2>;

        mov.u64 %rd1, 0;
        st.param.b64 [func_retval0+0], %rd1;
        ret;
}
```
Comparing the two PTX listings, we see an ABI difference: on the C++ side, each struct argument is a single byte-array parameter of type `.b8[N]`, where `N` is the size of the struct in bytes. On the numba side, however, the struct has been decomposed into its members, and the function accepts one parameter per member, with each 8-bit member promoted to a `.b32` parameter.
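As a quick host-side sanity check (a `ctypes` sketch, separate from the CUDA code above), we can reproduce the C++ layout of `MyStruct` and confirm it occupies 3 bytes with 1-byte alignment, exactly matching the `.param .align 1 .b8 foo_param_0[3]` parameter in the first PTX listing:

```python
import ctypes


# Host-side mirror of the C++ MyStruct (three chars): three 1-byte fields,
# so the whole struct is 3 bytes with 1-byte alignment -- the same layout
# the C++ PTX encodes as `.param .align 1 .b8 foo_param_0[3]`.
class MyStruct(ctypes.Structure):
    _fields_ = [
        ("a", ctypes.c_int8),
        ("b", ctypes.c_int8),
        ("c", ctypes.c_int8),
    ]


print(ctypes.sizeof(MyStruct), ctypes.alignment(MyStruct))  # 3 1
```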
## How to fix it?
In an offline sync with @gevtushenko, we decided that the appropriate fix is for operators on the C++ side to take their arguments by pointer rather than by value. Thus, instead of:
```c++
extern "C" __device__ {0} OP_NAME(VALUE_T lhs, VALUE_T rhs);

struct op_wrapper {{
  __device__ {0} operator()(VALUE_T lhs, VALUE_T rhs) const {{
    return OP_NAME(lhs, rhs);
  }}
}};
```
We would have:
```c++
extern "C" __device__ void OP_NAME(VALUE_T* lhs, VALUE_T* rhs, {0}* out);

struct op_wrapper {{
  __device__ {0} operator()(VALUE_T lhs, VALUE_T rhs) const {{
    {0} ret;
    OP_NAME(&lhs, &rhs, &ret);
    return ret;
  }}
}};
```
On the numba side, we can take the user-provided binary op and wrap it in an op with the above pointer-based signature.
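A minimal sketch of what that numba-side wrapping could look like (the helper name is illustrative, not the actual implementation; one-element sequences stand in for pointers so the adapter logic can be exercised on the host, whereas the real wrapper would be compiled for the device with pointer-typed arguments):

```python
# Illustrative sketch: adapt a user-provided binary op `op(a, b) -> result`
# to the pointer-style signature void(value_t*, value_t*, value_t*) that the
# C++ side now expects.
def make_pointer_wrapper(op):
    def wrapper(lhs_ptr, rhs_ptr, out_ptr):
        # Dereference both inputs, apply the user's op, and write the
        # result through the output pointer.
        out_ptr[0] = op(lhs_ptr[0], rhs_ptr[0])

    return wrapper


# Host-side check, with one-element lists standing in for device pointers:
maximum = make_pointer_wrapper(lambda a, b: a if a > b else b)
out = [None]
maximum([3], [7], out)
print(out[0])  # 7
```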
I have confirmed that this fix works and resolves the issues we're seeing.