## What's the problem?

An ABI mismatch between the C++ and Python definitions of device functions operating on custom (a.k.a. "storage" or "struct") types leads to silent failures, illegal memory accesses, and other undefined behavior.
### Example 1
One example of this is #3789.
### Example 2

If we change the existing Python test for `reduce_into` to operate on structs composed of `int8` values rather than `int32`, it fails:
#### Diff

```diff
diff --git a/python/cuda_parallel/tests/test_reduce_api.py b/python/cuda_parallel/tests/test_reduce_api.py
index 6087a263e..e5424363f 100644
--- a/python/cuda_parallel/tests/test_reduce_api.py
+++ b/python/cuda_parallel/tests/test_reduce_api.py
@@ -190,14 +190,14 @@ def test_reduce_struct_type():
 
     @gpu_struct
     class Pixel:
-        r: np.int32
-        g: np.int32
-        b: np.int32
+        r: np.int8
+        g: np.int8
+        b: np.int8
 
     def max_g_value(x, y):
         return x if x.g > y.g else y
 
-    d_rgb = cp.random.randint(0, 256, (10, 3), dtype=np.int32).view(Pixel.dtype)
+    d_rgb = cp.random.randint(0, 64, (10, 3), dtype=np.int8).view(Pixel.dtype)
     d_out = cp.empty(1, Pixel.dtype)
     h_init = Pixel(0, 0, 0)
 
@@ -209,7 +209,7 @@ def test_reduce_struct_type():
     _ = reduce_into(d_temp_storage, d_rgb, d_out, d_rgb.size, h_init)
 
     h_rgb = d_rgb.get()
-    expected = h_rgb[h_rgb.view("int32")[:, 1].argmax()]
+    expected = h_rgb[h_rgb.view("int8")[:, 1].argmax()]
     np.testing.assert_equal(expected["g"], d_out.get()["g"])
     # example-end reduce-struct
 
```
#### Test Output

```
_______________________________________________________________________________ test_reduce_struct_type _______________________________________________________________________________

    def test_reduce_struct_type():
        # example-begin reduce-struct
        import cupy as cp
        import numpy as np
        from cuda.parallel.experimental import algorithms
        from cuda.parallel.experimental.struct import gpu_struct

        @gpu_struct
        class Pixel:
            r: np.int8
            g: np.int8
            b: np.int8

        def max_g_value(x, y):
            return x if x.g > y.g else y

        d_rgb = cp.random.randint(0, 64, (10, 3), dtype=np.int8).view(Pixel.dtype)
        d_out = cp.empty(1, Pixel.dtype)
        h_init = Pixel(0, 0, 0)

        reduce_into = algorithms.reduce_into(d_rgb, d_out, max_g_value, h_init)
        temp_storage_bytes = reduce_into(None, d_rgb, d_out, d_rgb.size, h_init)
        d_temp_storage = cp.empty(temp_storage_bytes, dtype=np.uint8)
        _ = reduce_into(d_temp_storage, d_rgb, d_out, d_rgb.size, h_init)

        h_rgb = d_rgb.get()
        expected = h_rgb[h_rgb.view("int8")[:, 1].argmax()]
>       np.testing.assert_equal(expected["g"], d_out.get()["g"])

tests/test_reduce_api.py:214:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../../miniforge3/envs/cupy-dev/lib/python3.10/site-packages/numpy/_utils/__init__.py:85: in wrapper
    return fun(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (<built-in function eq>, array([62], dtype=int8), array([0], dtype=int8)), kwds = {'err_msg': '', 'header': 'Arrays are not equal', 'strict': False, 'verbose': True}

    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)
E           AssertionError:
E           Arrays are not equal
E
E           Mismatched elements: 1 / 1 (100%)
E           Max absolute difference among violations: 62
E           Max relative difference among violations: inf
E            ACTUAL: array([62], dtype=int8)
E            DESIRED: array([0], dtype=int8)

../../../../miniforge3/envs/cupy-dev/lib/python3.10/contextlib.py:79: AssertionError
=============================================================================== short test summary info ===============================================================================
FAILED tests/test_reduce_api.py::test_reduce_struct_type - AssertionError:
```
## Why is this happening?
Consider an `extern "C"` device function accepting two structs as inputs:

```c++
struct MyStruct {
  char a;
  char b;
  char c;
};

extern "C" __device__ void foo(MyStruct a, MyStruct b) {
}
```

Here's the corresponding PTX:

```
.visible .func foo(
        .param .align 1 .b8 foo_param_0[3],
        .param .align 1 .b8 foo_param_1[3]
)
{
        ret;
}
```
Now consider defining this device function using `numba.cuda`, with `@gpu_struct` defining the type of the arguments (keeping in mind that `gpu_struct` uses `StructModel` to define the underlying numba type):
```python
import numba
from numba import cuda
import numpy as np

from cuda.parallel.experimental.struct import gpu_struct


@gpu_struct
class MyStruct:
    a: np.int8
    b: np.int8
    c: np.int8


def op(a, b):
    pass


tp = numba.typeof(MyStruct(1, 2, 3))
ptx, _ = cuda.compile(op, (tp, tp))
print(ptx)
```
Here's the output of the above script:
```
.visible .func (.param .b64 func_retval0) op(
        .param .b32 op_param_0,
        .param .b32 op_param_1,
        .param .b32 op_param_2,
        .param .b32 op_param_3,
        .param .b32 op_param_4,
        .param .b32 op_param_5
)
{
        .reg .b64 %rd<2>;

        mov.u64 %rd1, 0;
        st.param.b64 [func_retval0+0], %rd1;
        ret;
}
```
Comparing the two PTX listings, we see an ABI difference: on the C++ side, each struct argument is a single byte-array parameter of type `.b8[N]`, where `N` is the size of the struct in bytes. On the numba side, however, the struct has been decomposed into its members, and the function accepts one parameter per member, with each 8-bit member promoted to a `.b32` parameter.
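As a quick host-side sanity check (a `ctypes` sketch, separate from the CUDA code above), we can reproduce the C++ layout of `MyStruct` and confirm it occupies 3 bytes with 1-byte alignment, exactly matching the `.param .align 1 .b8 foo_param_0[3]` parameter in the first PTX listing:

```python
import ctypes


# Host-side mirror of the C++ MyStruct (three chars): three 1-byte fields,
# so the whole struct is 3 bytes with 1-byte alignment -- the same layout
# the C++ PTX encodes as `.param .align 1 .b8 foo_param_0[3]`.
class MyStruct(ctypes.Structure):
    _fields_ = [
        ("a", ctypes.c_int8),
        ("b", ctypes.c_int8),
        ("c", ctypes.c_int8),
    ]


print(ctypes.sizeof(MyStruct), ctypes.alignment(MyStruct))  # 3 1
```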
## How to fix it?
In an offline sync with @gevtushenko, we decided that the appropriate fix is for operators on the C++ side to take their arguments by pointer rather than by value. Thus, instead of:
```c++
extern "C" __device__ {0} OP_NAME(VALUE_T lhs, VALUE_T rhs);

struct op_wrapper {{
  __device__ {0} operator()(VALUE_T lhs, VALUE_T rhs) const {{
    return OP_NAME(lhs, rhs);
  }}
}};
```
We would have:
```c++
extern "C" __device__ void OP_NAME(VALUE_T* lhs, VALUE_T* rhs, {0}* out);

struct op_wrapper {{
  __device__ {0} operator()(VALUE_T lhs, VALUE_T rhs) const {{
    {0} ret;
    OP_NAME(&lhs, &rhs, &ret);
    return ret;
  }}
}};
```
On the numba side, we can take the user-provided binary op and wrap it in an op with the above pointer-based signature.
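A minimal sketch of what that numba-side wrapping could look like (the helper name is illustrative, not the actual implementation; one-element sequences stand in for pointers so the adapter logic can be exercised on the host, whereas the real wrapper would be compiled for the device with pointer-typed arguments):

```python
# Illustrative sketch: adapt a user-provided binary op `op(a, b) -> result`
# to the pointer-style signature void(value_t*, value_t*, value_t*) that the
# C++ side now expects.
def make_pointer_wrapper(op):
    def wrapper(lhs_ptr, rhs_ptr, out_ptr):
        # Dereference both inputs, apply the user's op, and write the
        # result through the output pointer.
        out_ptr[0] = op(lhs_ptr[0], rhs_ptr[0])

    return wrapper


# Host-side check, with one-element lists standing in for device pointers:
maximum = make_pointer_wrapper(lambda a, b: a if a > b else b)
out = [None]
maximum([3], [7], out)
print(out[0])  # 7
```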
I have confirmed that this fix works and resolves the issues we're seeing.