[EXPERIMENT] Replace `numba-cuda` with `numba-cuda-mlir` by shwina · Pull Request #9421 · NVIDIA/cccl

shwina · 2026-06-12T12:51:36Z

Description

closes

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.

cuda.compute is migrating its JIT/struct machinery from numba-cuda to numba-cuda-mlir (the MLIR-based successor). This first step adds the dependency and a single import surface; no behavior changes yet.

- pyproject: add numba-cuda-mlir[cu12]/[cu13] to the cu12/cu13/sysctk runtime extras, alongside numba-cuda (which is still needed by _compile_op_to_llvm_bitcode on the v2/HostJIT path). Ignore numba_cuda_mlir.* in mypy like numba.*.
- _mlir.py: central re-export of the numba-cuda-mlir symbols the migration uses (cuda.compile, type system, typing/lowering extension API, data models, MLIR builder + llvm/arith dialects), plus small from_numpy_dtype/as_numpy_dtype/struct_field_position helpers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@intrinsic

Move the user-operator compilation path off numba-cuda and onto numba-cuda-mlir (the MLIR-based successor). The gpu_struct typing/lowering machinery still uses numba-cuda and is migrated separately.

- _odr_helpers.py: the void* operator wrappers are now ordinary Python device functions compiled with abi="c", instead of hand-written @intrinsic LLVM-IR codegen. A void* argument is a typed CPointer parameter (ABI-identical to void*); loads/stores are ptr[0] indexing; numba-cuda-mlir inlines the user op into the wrapper. The unused iterator advance/dereference wrappers are dropped (iterators compile their device code via C++, not numba). Stateful state is unpacked from the packed void* via a CPointer(CPointer(dtype)) view; heterogeneous state dtypes are rejected (no pure-Python int->typed-pointer cast).
- _compile_op_to_llvm_bitcode: numba-cuda-mlir's cuda.compile only emits ptx/ltoir, so the v2 (HostJIT) LLVM bitcode is produced by extracting LLVM IR from its internal MLIR -> LLVM translation (one step before libnvvm) and lowering that to bitcode with llvmlite.
- _jit.py: op compilation, return-type inference, stateful-op compilation, and the POD/pointer TypeDescriptor<->numba conversions now use numba-cuda-mlir (via the _mlir surface). Both v1 (ltoir) and v2 (bitcode) compile the same numba-cuda-mlir-jitted wrapper.
- _mlir.py: add compile_to_llvm_ir() encapsulating the MLIR->LLVM IR extraction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Reimplement the gpu_struct type machinery against numba-cuda-mlir's MLIR extension API, completing the migration: _jit.py no longer imports classic numba at all (numba-cuda stays a dependency only for cuda.coop).

Typing:
- The struct type subclasses numba_cuda_mlir.types.Type; tuple/struct conversions still use can_convert_from + Conversion.safe.
- Data model: register_model + PrimitiveModel building the backend type as an MLIR llvm.StructType.new_identified over the fields' MLIR value types (replaces numba-cuda's models.StructModel).
- Field access typing uses an AttributeTemplate via typing_registry (replaces make_attribute_wrapper, which has no MLIR equivalent).
- Constructor typing uses a ConcreteTemplate registered with typing_registry.register_global (replaces the numba.cuda cudadecl registry).

Lowering (MLIR instead of llvmlite/cgutils):
- Field getattr: lower_getattr_generic + llvm.extractvalue.
- Constructor: lower() + llvm.UndefOp/insertvalue.
- tuple->struct and struct->struct casts: lower_cast + llvm extract/insertvalue (a tuple value is a Python sequence of MLIR values pre-concretization).

_mlir.py: export Conversion.

Validated headlessly: struct construct + field access compile to LTO-IR via the same MLIR pattern. The aggregate casts are the area most in need of validation against the struct test suite.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- test_void_ptr_wrapper_validation.py: rewrite against the new _make_wrapper_name API (the @intrinsic-era _ArgMode / _ArgSpec / _create_void_ptr_wrapper internals were removed); keep the sanitize_identifier coverage.
- test_merge_sort.py: xfail the unsigned compare_op cases. The test comparator np.uint8(lhs < rhs) hits a numba-cuda-mlir bug that miscompiles unsigned integer comparison as signed; signed/float comparators are unaffected. Remove once numba-cuda-mlir fixes it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-06-12T13:03:26Z

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Overview

This experimental PR begins migrating NVIDIA CCCL's JIT compilation and struct machinery from the classic numba-cuda to numba-cuda-mlir, modernizing the compilation pipeline to use MLIR-based lowering and addressing upstream limitations in the process.

Key Changes

New Central Export Surface

python/cuda_cccl/cuda/compute/_mlir.py (152 lines added)
A new internal (underscore-prefixed) module serving as the central re-export point for numba-cuda-mlir functionality. Exports key components including cuda, types, errors, numpy_support, signature, and lowering/typing infrastructure (overload, lowering_registry, typing_registry, etc.). Provides wrapper helpers for NumPy dtype conversion (from_numpy_dtype, as_numpy_dtype) and a new compile_to_llvm_ir(pyfunc, sig, abi_name: str) -> str function that compiles device functions through the MLIR pipeline, extracts the LLVM IR text, and returns it as a string ready to be lowered to bitcode. Includes an MLIR attribute helper (struct_field_position).

JIT Compilation Migration

python/cuda_cccl/cuda/compute/_jit.py (+276/-209 lines)
Comprehensive refactoring replacing direct Numba APIs with the _mlir pipeline:
- _compile_op_to_llvm_bitcode rewritten to obtain LLVM IR via _mlir.compile_to_llvm_ir, parse/verify with llvmlite, emit debug artifacts, and return bitcode bytes
- CCCL struct typing/casting system reimplemented on _mlir infrastructure; _StructBase now derives from _mlir type base
- Struct field validation/conversion and can_convert_from logic updated to use _mlir tuple/literal types and conversions
- Attribute access lowering implemented via _mlir typing registry templates and MLIR lowerings
- Struct indexing and construction lowering added through _mlir.overload and _mlir.lowering_registry
- Tuple-to-struct and struct-to-struct casts rewritten to operate on MLIR values using _mlir.extractvalue and convert (replacing Numba cgutils)
- Return-type inference and op compilation switched to _mlir.cuda.compile with explicit device=True, abi="c", and output="ltoir" parameters
- Stateful op compilation refactored to pass state as typed CPointer arguments; wrapper creation now uses create_stateful_op_void_ptr_wrapper with state_dtypes

Wrapper Generation Refactoring

python/cuda_cccl/cuda/compute/_odr_helpers.py (+143/-299 lines)
Replaced LLVM IR/intrinsic-based codegen with a Python source construction approach for void* wrappers:
- New helper functions create unique wrapper symbol names and build wrapper device functions via exec()-ed Python source
- create_op_void_ptr_wrapper reimplemented to call a cuda.jit(device=True)-compiled operator and store results through result[0]
- create_stateful_op_void_ptr_wrapper API simplified: now accepts state_dtypes instead of state_array_types/state_info, enforces uniform captured-dtype support
- Removed wrapper constructors: create_advance_void_ptr_wrapper, create_input_dereference_void_ptr_wrapper, create_output_dereference_void_ptr_wrapper
- Removed argument-mode enum, state unpacking LLVM IR helpers, and intrinsic-based wrapper codegen
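
The source-generation approach described above can be sketched in plain Python. This is a minimal illustration, not the PR's code: `make_wrapper_name` and `build_wrapper` are hypothetical stand-ins for `_make_wrapper_name` and the wrapper builder, one-element lists stand in for typed `CPointer` arguments, and the real wrappers are additionally compiled as device functions, which this sketch omits.

```python
import itertools
import re

# Monotonic counter so that two wrappers for the same op never share a name.
_counter = itertools.count()


def sanitize_identifier(name: str) -> str:
    """Replace characters that are not valid in a Python identifier."""
    cleaned = re.sub(r"\W", "_", name)
    # Identifiers cannot start with a digit.
    return f"_{cleaned}" if cleaned[:1].isdigit() else cleaned


def make_wrapper_name(op_name: str) -> str:
    """Hypothetical stand-in for _make_wrapper_name."""
    return f"wrapped_{sanitize_identifier(op_name)}_{next(_counter)}"


def build_wrapper(op, num_inputs: int):
    """Generate `def name(arg_0, ..., result): result[0] = op(arg_0[0], ...)`."""
    name = make_wrapper_name(op.__name__)
    params = [f"arg_{i}" for i in range(num_inputs)]
    call_args = ", ".join(f"{p}[0]" for p in params)
    source = (
        f"def {name}({', '.join([*params, 'result'])}):\n"
        f"    result[0] = op({call_args})\n"
    )
    # The op is injected through the exec namespace, not through the source text.
    namespace = {"op": op}
    exec(source, namespace)
    return namespace[name]
```

Calling `build_wrapper(add, 2)` for a two-argument `add` yields a function that reads each input through index 0 and stores the result through `result[0]`, mirroring the `ptr[0]` load/store convention the commit message describes.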

Dependency and Configuration

python/cuda_cccl/pyproject.toml (+13/-2 lines)
Added numba-cuda-mlir[cu12]>=0.3.0 and numba-cuda-mlir[cu13]>=0.3.0 as dependencies across all CUDA 12/13 extras variants. Updated Mypy configuration to ignore numba_cuda_mlir.* module patterns.

Test Infrastructure Updates

python/cuda_cccl/tests/compute/conftest.py (+111 lines)
New centralized test infrastructure to mark compute tests as expected-to-fail for known numba-cuda-mlir upstream issues. Adds _upstream_xfail_reason() helper that matches test names/nodeids against known failure conditions and applies pytest.mark.xfail(strict=False) accordingly.
python/cuda_cccl/tests/conftest.py (+45 lines)
Shared Pytest configuration for the test suite. Introduces _EXAMPLE_XFAILS mapping to mark specific compute example tests as xfail based on numba-cuda-mlir limitations, with pytest_collection_modifyitems hook to attach xfail markers.
python/cuda_cccl/tests/compute/test_void_ptr_wrapper_validation.py (+23/-37 lines)
Test rewritten from validating _create_void_ptr_wrapper to validating _make_wrapper_name post-sanitization. Updated imports and test suite now covers acceptance and rejection cases for unique identifier generation.

Technical Details

Compilation Pipeline

The migration from classic Numba to numba-cuda-mlir updates the operator compilation workflow:

- Compiles to optimized MLIR with ltoir output
- Extracts LLVM IR text from the MLIR module
- Parses and verifies IR with llvmlite
- Generates bitcode suitable for offline compilation

Struct Type Handling

Struct types now use MLIR's native llvm.StructType with MLIR-based lowering operations (extractvalue, insertvalue, undef) rather than Numba's cgutils struct proxies. The data model registration uses AttributeTemplate, ConcreteTemplate, and register_model from the _mlir infrastructure.

Known Issues

A merge_sort test has been xfailed for specific unsigned-comparison cases due to a known numba-cuda-mlir miscompile.
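
To see why only the unsigned cases are affected, the following sketch reproduces the class of error in plain Python: it reinterprets 8-bit unsigned values as signed before comparing them. It illustrates the symptom under that assumption only; it does not show what numba-cuda-mlir actually emits, and the function names are made up for this example.

```python
import struct


def as_signed8(value: int) -> int:
    """Reinterpret the bits of an unsigned 8-bit value as a signed 8-bit value."""
    return struct.unpack("b", struct.pack("B", value))[0]


def unsigned_less(lhs: int, rhs: int) -> bool:
    """The comparison the test comparator intends."""
    return lhs < rhs


def signed_less_on_unsigned_bits(lhs: int, rhs: int) -> bool:
    """The same comparison performed as if the operands were signed."""
    return as_signed8(lhs) < as_signed8(rhs)
```

For operands below 128 both functions agree, so a sort over small values can pass by accident. Once an operand has its top bit set (for example 200, which reinterprets as -56), the two comparisons disagree and the resulting order is wrong.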

Breaking Changes

create_stateful_op_void_ptr_wrapper signature changed: now accepts state_dtypes (list of MLIR dtypes) instead of state_array_types and state_info
Removed wrapper functions: create_advance_void_ptr_wrapper, create_input_dereference_void_ptr_wrapper, create_output_dereference_void_ptr_wrapper
All JIT compilation paths now route through _mlir.cuda.compile instead of direct Numba APIs

Notes

The PR title marks this as experimental. While the public API surface (e.g., to_jit_op_adapter) remains unchanged, internal compilation machinery is significantly restructured. Test infrastructure updates anticipate ongoing upstream issues with numba-cuda-mlir during stabilization.

Walkthrough

This PR migrates CCCL's JIT compilation infrastructure from direct Numba APIs to numba-cuda-mlir. The changes include a new _mlir.py module providing the backend entry point, wrapper codegen refactored from Numba intrinsics to Python source generation, struct type system ported to MLIR, type conversions updated to use MLIR type infrastructure, and all op compilation paths switched to _mlir.cuda.compile.

Changes

CCCL JIT migration from Numba to numba-cuda-mlir

| Layer / File(s) | Summary |
| --- | --- |
| MLIR backend module and dtype helpers `python/cuda_cccl/cuda/compute/_jit.py`, `python/cuda_cccl/cuda/compute/_mlir.py` | New `_mlir.py` module re-exports numba-cuda-mlir components (cuda, types, errors, numpy_support, signature); provides dtype converters (`from_numpy_dtype`, `as_numpy_dtype`) and `compile_to_llvm_ir(pyfunc, sig, abi_name)` that compiles via MLIR pipeline to optimized LLVM IR text. Imports updated in `_jit.py` to use `_mlir` backend. |
| Wrapper generation refactor to Python source `python/cuda_cccl/cuda/compute/_odr_helpers.py` | Replaces Numba intrinsic/LLVM IR codegen with Python source generation; implements `_make_wrapper_name` for unique valid identifiers; reimplements `create_op_void_ptr_wrapper` and `create_stateful_op_void_ptr_wrapper` to build device wrappers via `exec`, with new API signature accepting `state_dtypes` instead of `state_array_types` and `state_info`. |
| Struct type system migration to MLIR `python/cuda_cccl/cuda/compute/_jit.py` | `_StructBase` base class changed from `numba.types.Type` to `_mlir.types.Type`; struct field validation and `can_convert_from` use `_mlir.types.UniTuple`/`_mlir.types.Tuple` and `_mlir.Conversion`; attribute access lowering implemented via `_mlir` typing registry templates; struct getitem, construction, and tuple/struct casts added via `_mlir.overload` and `_mlir.lower_cast` using MLIR extractvalue/convert. |
| Type descriptor conversions using MLIR APIs `python/cuda_cccl/cuda/compute/_jit.py` | `type_descriptor_to_numba` now passes through `_mlir.types.Type` and creates pointers via `_mlir.types.CPointer`; `_convert_type_descriptor_to_numba` uses `_mlir.as_numba_type` and `_mlir.from_numpy_dtype`; `_ensure_function_structs_registered` and `_numba_type_to_type_descriptor` updated to use `_mlir` APIs instead of Numba dtype converters. |
| LLVM bitcode generation and op compilation `python/cuda_cccl/cuda/compute/_jit.py` | `_compile_op_to_llvm_bitcode` rewritten to obtain LLVM IR via `_mlir.compile_to_llvm_ir`, parse with `llvmlite`, emit optional debug artifacts, and return bitcode; `_infer_return_type` and `_compile_op_impl` switched to `_mlir.cuda.compile` with `device=True`, `abi="c"`, `output="ltoir"`. |
| Stateful op compilation with MLIR pointer types `python/cuda_cccl/cuda/compute/_jit.py` | `_compile_stateful_op` derives `state_dtypes` from NumPy array dtypes and builds `state_ptr_types` as `_mlir.types.CPointer`; infers output type via `_mlir.cuda.compile` with pointer-typed state inputs; creates wrapper using updated `create_stateful_op_void_ptr_wrapper(op, sig, state_dtypes)` API, removing prior Numba Array type and state_info bookkeeping. |
| Dependency and mypy configuration `python/cuda_cccl/pyproject.toml` | Adds `numba-cuda-mlir[cu12/cu13]>=0.3.0` to cu12, cu13, sysctk12, sysctk13 optional-dependency extras; adds `numba_cuda_mlir.*` to mypy override module ignore list. |
| Test infrastructure for xfail handling `python/cuda_cccl/tests/conftest.py`, `python/cuda_cccl/tests/compute/conftest.py`, `python/cuda_cccl/tests/compute/test_void_ptr_wrapper_validation.py` | Adds pytest hooks to mark compute example tests xfail based on known `numba-cuda-mlir` issues via `_EXAMPLE_XFAILS` mapping and `_upstream_xfail_reason` helper; updates wrapper validation tests to validate `_make_wrapper_name` and `sanitize_identifier` functions. |

Possibly related issues

Replace the numba-cuda dependency with numba-cuda-mlir #9408: This PR implements the migration from Numba CUDA to numba-cuda-mlir directly addressed in that issue, rewriting compilation pipelines, struct typing, type conversions, and wrapper codegen to use the new backend.

Suggested reviewers

Jacobfaib
davebayer
miscco

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (3)

python/cuda_cccl/tests/conftest.py (1)

35-45: ⚡ Quick win

suggestion: Name extraction logic is duplicated in tests/compute/conftest.py line 238. Consider extracting getattr(item, "originalname", None) or item.name.split("[")[0] into a shared helper function to ensure consistency if the logic changes.

python/cuda_cccl/tests/compute/conftest.py (2)

136-229: ⚖️ Poor tradeoff

suggestion: The _upstream_xfail_reason function spans 94 lines with nested conditionals checking five distinct issues. Consider splitting into issue-specific helper functions (e.g., _check_issue_123, _check_issue_121) to improve maintainability and testability. Each helper could return the reason string or None.

238-238: ⚡ Quick win

suggestion: Name extraction logic getattr(item, "originalname", None) or item.name.split("[")[0] is duplicated from tests/conftest.py line 37. Consider consolidating into a shared helper to avoid drift.
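
Both conftest suggestions can be sketched together: one shared helper for the name extraction, and a list of small per-issue checkers in place of a single long function. This is a rough sketch under stated assumptions. The helper and checker names, the match rules, and the reason strings are hypothetical, and a `SimpleNamespace` would stand in for a pytest item when trying it out. In a real hook, a returned reason would be attached to the item as an xfail marker with `strict=False`.

```python
from typing import Callable, Optional, Tuple


def base_test_name(item) -> str:
    """Return the test function name without its parametrization suffix."""
    return getattr(item, "originalname", None) or item.name.split("[")[0]


def _check_unsigned_compare(name: str, nodeid: str) -> Optional[str]:
    # Hypothetical match rule, for illustration only.
    if name == "test_merge_sort" and "uint" in nodeid:
        return "unsigned integer comparison is miscompiled as signed"
    return None


def _check_example_limitation(name: str, nodeid: str) -> Optional[str]:
    # Hypothetical match rule, for illustration only.
    if "examples/" in nodeid and name.endswith("_struct"):
        return "example relies on a feature not yet supported upstream"
    return None


# One entry per known upstream issue; each returns a reason string or None.
_CHECKERS: Tuple[Callable[[str, str], Optional[str]], ...] = (
    _check_unsigned_compare,
    _check_example_limitation,
)


def upstream_xfail_reason(item) -> Optional[str]:
    """Return the first matching xfail reason for a collected test item."""
    name = base_test_name(item)
    for check in _CHECKERS:
        reason = check(name, item.nodeid)
        if reason is not None:
            return reason
    return None
```

Keeping each rule in its own function lets a rule be unit-tested in isolation and deleted cleanly once the corresponding upstream issue is fixed.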

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c3c453b3-c3dd-4683-b958-f22ef9e34b75

📥 Commits

Reviewing files that changed from the base of the PR and between 071630f and 5dacf4a.

📒 Files selected for processing (7)

python/cuda_cccl/cuda/compute/_jit.py
python/cuda_cccl/cuda/compute/_mlir.py
python/cuda_cccl/cuda/compute/_odr_helpers.py
python/cuda_cccl/pyproject.toml
python/cuda_cccl/tests/compute/conftest.py
python/cuda_cccl/tests/compute/test_void_ptr_wrapper_validation.py
python/cuda_cccl/tests/conftest.py

coderabbitai · 2026-06-12T13:03:29Z

        def can_convert_from(self, typingctx, other):
-            if isinstance(other, types.UniTuple):
+            if isinstance(other, _mlir.types.UniTuple):
                tuple_size = other.count
                if tuple_size == len(field_types):
-                    return Conversion.safe
+                    return _mlir.Conversion.safe



⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

important: Validate UniTuple element convertibility before returning Conversion.safe.

This branch accepts any same-length UniTuple, even when its element type cannot convert to every target field type. That makes heterogeneous struct casts type-check here and fail later in the cast lowering, unlike the _mlir.types.Tuple branch which already does the per-field check.
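
The per-element check being asked for can be sketched with stand-in types. Everything here is an assumption made for illustration: `Conversion`, `UniTuple`, and `ToyTypingContext` are toy replacements for the backend's classes, and the sketch assumes the typing context exposes a `can_convert(fromty, toty)` method that returns `None` when no conversion exists, as classic Numba's does.

```python
from dataclasses import dataclass
from enum import Enum


class Conversion(Enum):
    """Stand-in for the backend's Conversion enum."""
    safe = "safe"


@dataclass(frozen=True)
class UniTuple:
    """Stand-in for a homogeneous tuple type: `count` elements of `dtype`."""
    dtype: str
    count: int


class ToyTypingContext:
    """Stand-in typing context that knows a fixed set of legal conversions."""
    _allowed = {("int32", "int32"), ("int32", "int64"), ("int32", "float64")}

    def can_convert(self, fromty, toty):
        return Conversion.safe if (fromty, toty) in self._allowed else None


def can_convert_unituple(typingctx, other, field_types):
    """Accept a UniTuple only if its element type converts to every field."""
    if not isinstance(other, UniTuple) or other.count != len(field_types):
        return None
    if all(typingctx.can_convert(other.dtype, ft) is not None for ft in field_types):
        return Conversion.safe
    return None
```

With this shape, a same-length tuple whose element type cannot reach one of the fields is rejected during typing instead of failing later in cast lowering.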

coderabbitai · 2026-06-12T13:03:29Z

+    # State arrays are passed to the (transformed) op as typed pointers; the op
+    # body indexes them (``state[i]``), which works on a CPointer.  See
+    # _odr_helpers.create_stateful_op_void_ptr_wrapper for how the packed state
+    # void* is unpacked into one CPointer per state array.
+    state_dtypes = [_mlir.from_numpy_dtype(get_dtype(s)) for s in state_arrays]
+    state_ptr_types = [_mlir.types.CPointer(dt) for dt in state_dtypes]



⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

important: Typing captured state arrays as CPointer silently drops array semantics.

The transformed operator body is unchanged, but each captured device array is now compiled as a bare pointer. Stateful operators that use .shape, multidimensional indexing, slicing, or any array attribute/method will stop typing/lowering because those operations are not available on CPointer. As per coding guidelines, python/cuda_cccl/**/*: Focus on Python API stability, CUDA array interoperability, memory ownership, JIT/NVRTC/nvJitLink behavior, package boundaries, user-defined operator correctness, tests, and examples.

Source: Coding guidelines

coderabbitai · 2026-06-12T13:03:29Z

+        _, return_type = _mlir.cuda.compile(
+            op,
+            all_numba_input_types,
+            device=True,
+            abi_info={"abi_name": abi_name},
+            output="ltoir",
        )
        # Convert return type to TypeDescriptor
-        output_type = cccl_types.from_numpy_dtype(
-            numba.np.numpy_support.as_dtype(return_type)
-        )
+        output_type = cccl_types.from_numpy_dtype(_mlir.as_numpy_dtype(return_type))


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: Stateful return-type inference no longer supports gpu_struct results.

This path converts return_type through _mlir.as_numpy_dtype, which only works for NumPy-backed scalar/POD types. The stateless path already uses _numba_type_to_type_descriptor; stateful operators returning a registered struct will now fail here or be reported as the wrong type.

coderabbitai · 2026-06-12T13:03:29Z

+    unique_state_dtypes = set(state_dtypes)
+    if len(unique_state_dtypes) > 1:
+        raise NotImplementedError(
+            "stateful operators that capture device arrays of differing dtypes "
+            f"are not supported (got {sorted(map(str, unique_state_dtypes))}); "
+            "all captured arrays must share a dtype"
+        )
+    state_dtype = state_dtypes[0]

+    op_device = cuda.jit(device=True)(op)

-def create_input_dereference_void_ptr_wrapper(deref_fn, state_ptr_type, value_type):
-    """Creates a wrapper function for input iterator dereference method.
+    # sig.args == (state_0, ..., state_{num_states-1}, input_0, ..., input_{K-1})
+    input_types = list(sig.args)[num_states:]
+    return_type = sig.return_type

-    The wrapper takes 2 void* arguments:
-    - state pointer
-    - result pointer (function writes result here)
-    """
-    arg_specs = [
-        _ArgSpec(state_ptr_type, _ArgMode.PTR),
-        _ArgSpec(types.CPointer(value_type), _ArgMode.PTR),
-    ]
-    inner_sig = types.void(state_ptr_type, types.CPointer(value_type))
-    return _create_void_ptr_wrapper(deref_fn, deref_fn.__name__, arg_specs, inner_sig)
+    wrapper_name = _make_wrapper_name(op.__name__)
+    input_names = [f"arg_{i}" for i in range(len(input_types))]

+    # states[j] reinterprets the j-th packed pointer as CPointer(state_dtype).
+    state_args = ", ".join(f"states[{j}]" for j in range(num_states))
+    input_args = ", ".join(f"{name}[0]" for name in input_names)
+    call_args = ", ".join(a for a in (state_args, input_args) if a)
+    reconstruct = _is_gpu_struct_type(return_type) and _op_returns_tuple(
+        op_device, sig.args
+    )
+    body, extra_namespace = _result_store_body(call_args, return_type, reconstruct)
+
+    wrapper_func = _build_wrapper(
+        wrapper_name,
+        ["states", *input_names, "result"],
+        body,
+        op_device,
+        extra_namespace,
+    )

-def create_output_dereference_void_ptr_wrapper(deref_fn, state_ptr_type, value_type):
-    """Creates a wrapper function for output iterator dereference method.
-
-    The wrapper takes 2 void* arguments:
-    - state pointer
-    - value pointer (value to write)
-    """
-    arg_specs = [
-        _ArgSpec(state_ptr_type, _ArgMode.PTR),
-        _ArgSpec(value_type, _ArgMode.LOAD),
-    ]
-    inner_sig = types.void(state_ptr_type, value_type)
-    return _create_void_ptr_wrapper(deref_fn, deref_fn.__name__, arg_specs, inner_sig)
+    wrapper_sig = types.void(
+        types.CPointer(types.CPointer(state_dtype)),
+        *(types.CPointer(t) for t in input_types),
+        types.CPointer(return_type),
+    )


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

important: Heterogeneous captured state arrays become unsupported in this wrapper path. _jit._compile_stateful_op still builds one pointer type per captured array, but this code collapses the packed states blob to CPointer(CPointer(state_dtype)) and raises on mixed state_dtypes. That makes previously valid stateful ops fail as soon as they capture arrays with different element types. Preserve per-slot pointer reconstruction here instead of requiring a single shared dtype, and add a mixed-dtype captured-state regression test once this is fixed. As per coding guidelines, "Focus on Python API stability, CUDA array interoperability, memory ownership, JIT/NVRTC/nvJitLink behavior, package boundaries, user-defined operator correctness, tests, and examples."

Source: Coding guidelines

coderabbitai · 2026-06-12T13:03:29Z

 cu12 = [
  "cuda-cccl[minimal-cu12]",
+  # numba / numba-cuda: used by cuda.coop (Numba-CUDA cooperative primitives).
  "numba>=0.60.0",
  "numba-cuda[cu12]>=0.23.0,!=0.27.*,!=0.28.*,!=0.29.*,!=0.30.0",
+  # numba-cuda-mlir: backend that JIT-compiles cuda.compute user operators and
+  # gpu_struct types (the MLIR-based successor to numba-cuda).
+  "numba-cuda-mlir[cu12]>=0.3.0",
 ]
 cu13 = [
  "cuda-cccl[minimal-cu13]",
+  # numba / numba-cuda: used by cuda.coop (Numba-CUDA cooperative primitives).
  "numba>=0.60.0",
-  "numba-cuda[cu13]>=0.23.0,!=0.27.*,!=0.28.*,!=0.29.*,!=0.30.0"
+  "numba-cuda[cu13]>=0.23.0,!=0.27.*,!=0.28.*,!=0.29.*,!=0.30.0",
+  # numba-cuda-mlir: backend that JIT-compiles cuda.compute user operators and
+  # gpu_struct types (the MLIR-based successor to numba-cuda).
+  "numba-cuda-mlir[cu13]>=0.3.0",
 ]
 sysctk12 = [
  "cuda-cccl[minimal-sysctk12]",
  "numba>=0.60.0",
-  "numba-cuda[cu12]>=0.23.0,!=0.27.*,!=0.28.*,!=0.29.*,!=0.30.0"
+  "numba-cuda[cu12]>=0.23.0,!=0.27.*,!=0.28.*,!=0.29.*,!=0.30.0",
+  "numba-cuda-mlir[cu12]>=0.3.0",
 ]
 sysctk13 = [
  "cuda-cccl[minimal-sysctk13]",
  "numba>=0.60.0",
  "numba-cuda[cu13]>=0.23.0,!=0.27.*,!=0.28.*,!=0.29.*,!=0.30.0",
+  "numba-cuda-mlir[cu13]>=0.3.0",
 ]


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify numba-cuda-mlir package, version, and extras exist on PyPI
echo "=== Checking numba-cuda-mlir on PyPI ==="
curl -s https://pypi.org/pypi/numba-cuda-mlir/json | jq -r '
  "Latest version: " + .info.version,
  "Available versions: " + ([.releases | keys[] | select(. >= "0.3.0")] | join(", ")),
  ""
'
echo "=== Checking for extras in version 0.3.0 ==="
curl -s https://pypi.org/pypi/numba-cuda-mlir/0.3.0/json | jq -r '
  .releases["0.3.0"][] | select(.packagetype == "bdist_wheel") | .filename
' | head -5
echo ""
echo "=== Checking for cu12/cu13 in package metadata ==="
curl -s https://pypi.org/pypi/numba-cuda-mlir/0.3.0/json | jq -r '
  .info.requires_dist // [] | map(select(contains("cu12") or contains("cu13"))) | .[]
' | head -10

Repository: NVIDIA/cccl

Length of output: 597

Add an upper bound for numba-cuda-mlir in cu12/cu13 extras
python/cuda_cccl/pyproject.toml currently specifies numba-cuda-mlir[cu12/cu13] >= 0.3.0 without an upper cap; add an upper bound (e.g., <0.4.0) or pin to ==0.3.0 to avoid future breaking dependency changes.

github-actions · 2026-06-12T13:39:19Z

😬 CI Workflow Results

🟥 Finished in 46m 07s: Pass: 29%/51 | Total: 3h 48m | Max: 36m 34s

shwina and others added 4 commits June 12, 2026 06:26

shwina requested a review from a team as a code owner June 12, 2026 12:51

shwina requested a review from tpn June 12, 2026 12:51

github-project-automation Bot added this to CCCL Jun 12, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 12, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 12, 2026

coderabbitai Bot reviewed Jun 12, 2026

shwina wants to merge 4 commits into NVIDIA:main from shwina:python/numba-mlir-3-gpu-struct
