
feat: add stable install tests #837

Open
gforsyth wants to merge 21 commits into rapidsai:main from gforsyth:stable_install_testing

Conversation

Contributor

@gforsyth gforsyth commented Apr 7, 2026

This is a first pass at rapidsai/build-planning#227.

It adds nightly tests that try to install the latest stable version of RAPIDS on all supported Python and CUDA versions, on amd64 and arm64, with both pip and conda.

The tests are currently limited to:

  • does it install successfully (and, for pip, can we install the available packages from upstream PyPI)
  • can we import the installed libraries without any symbol lookup errors
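The import check can be this simple because unresolved symbols surface at import time. A minimal sketch of the loop (using stand-in standard-library module names, since the real scripts iterate over the installed RAPIDS libraries):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in module names; the actual scripts loop over the RAPIDS libraries
# (cudf, cuml, cugraph, ...). A bare import in a fresh interpreter is enough
# to surface "undefined symbol" / missing-shared-library errors.
for lib in json ssl sqlite3; do
    python3 -c "import ${lib}"
    echo "${lib}: import OK"
done
```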

This should wait until after the 26.04 release before merging.

@gforsyth gforsyth requested a review from a team as a code owner April 7, 2026 20:40
@gforsyth gforsyth requested a review from bdice April 7, 2026 20:40
Member

@jameslamb jameslamb left a comment


Thanks for doing this!

I nosed in and left a couple comments for your consideration... I've spent the last couple weeks deep in RAPIDS' wheel-testing setup (e.g. for rapidsai/build-planning#256), and that informed some of what I was looking for here.

Overall I do really like the structure! Especially running the jobs in parallel and using the nvidia/cuda:*-base-* image for wheels.

And I agree that "install everything + import the libraries" is a good target for this first pass.


set -euo pipefail

STABLE_RAPIDS_VERSION="26.4.*"

Contributor Author


Done!


STABLE_RAPIDS_VERSION="26.4.*"
SUPPORTED_PYTHON_VERSIONS=(3.11 3.12 3.13 3.14)
SUPPORTED_CUDA_VERSIONS=("cu12" "cu13")
Member


Would you consider also having this test the bounds (12.2, 12.9, 13.0, 13.1)?

That could be done by adding a requirement like this to the pip install calls:

cuda-toolkit[all]==${cuda_major_minor}.*

That'd be a really helpful extension of rapidsai/build-planning#256, and I think it'd help us catch conflicts that aren't easily caught in individual projects' CI.

"cuvs-${CUDA_SUFFIX}==${STABLE_RAPIDS_VERSION}"
--extra-index-url=https://pypi.nvidia.com
)

Member


Similar to my comments on conda, I think this would be a more powerful test if it combined all the imports into one.

I'd structure it roughly like this:

# get all the pypi.nvidia.com stuff
wheels_dir=$(mktemp -d)
pip download \
  --isolated \
  --index-url https://pypi.nvidia.com \
  --prefer-binary \
  --no-deps \
  -d "${wheels_dir}" \
  "cucim-${CUDA_SUFFIX}==${STABLE_RAPIDS_VERSION}" \
  "cugraph-${CUDA_SUFFIX}==${STABLE_RAPIDS_VERSION}" \
  "cuvs-${CUDA_SUFFIX}==${STABLE_RAPIDS_VERSION}" \
  "cuxfilter-${CUDA_SUFFIX}==${STABLE_RAPIDS_VERSION}" \
  "libcugraph-${CUDA_SUFFIX}==${STABLE_RAPIDS_VERSION}" \
  "libcuvs-${CUDA_SUFFIX}==${STABLE_RAPIDS_VERSION}" \
  "pylibcugraph-${CUDA_SUFFIX}==${STABLE_RAPIDS_VERSION}"

pip install \
  --isolated \
  --index-url https://pypi.org/simple \
  --prefer-binary \
  "${PIP_INSTALL_PYPI[@]}" \
  "${wheels_dir}"/*.whl

python -c "import cudf; import dask_cudf; ...; import cuvs"

Would you consider something like that?

container_image: "nvidia/cuda:12.9.1-base-ubuntu-24.04"
script: |
  ./ci/stable_install/install_and_test_pip.sh --cuda cu12
test-stable-install-pip-cuda-13-amd64:
Member


We could cut the number of jobs here in half by having matrix elements inside of each of these for amd64 and arm64. Like this: https://github.com/rapidsai/cuvs/blob/cbb9db5697eeebcb03a6ed198b7d4386ce14a301/.github/workflows/pr.yaml#L365-L374

Would you consider that?

Member


That language was a little imprecise... cut the number of configurations in half. The number of jobs would be unchanged.

gforsyth and others added 2 commits April 8, 2026 09:39
Comment on lines +6 to +21
function testImports {
    unset imports
    while [[ $# -gt 0 ]]; do
        # run standalone import test
        rapids-logger "Standalone import test for $1"
        python -c "import $1" || rapids-logger "Test failed for: $1"
        rapids-logger "Passed"
        # add import to array for combined import test before shifting
        imports+=("$1")
        shift
    done
    import_cmd=$(printf "import %s; " "${imports[@]}")
    rapids-logger "Combined import test for: ${imports[*]}"
    python -c "${import_cmd}" || rapids-logger "Test failed for: ${imports[*]}"
    rapids-logger "Passed"
}
Contributor Author


This does enough now that I broke it out into a separate script so it can be sourced by both test scripts.

Imports each library individually in separate Python sessions, then imports all of them sequentially in the same Python session.

Member


I like it a lot!!!

Comment on lines +84 to +98
WHEELS_DIR=$(mktemp -d)
pip download \
    --isolated \
    --index-url https://pypi.nvidia.com \
    --prefer-binary \
    --no-deps \
    -d "${WHEELS_DIR}" \
    "cucim-${CUDA_SUFFIX}==${STABLE_RAPIDS_VERSION}" \
    "cugraph-${CUDA_SUFFIX}==${STABLE_RAPIDS_VERSION}" \
    "cuvs-${CUDA_SUFFIX}==${STABLE_RAPIDS_VERSION}" \
    "cuxfilter-${CUDA_SUFFIX}==${STABLE_RAPIDS_VERSION}" \
    "libcugraph-${CUDA_SUFFIX}==${STABLE_RAPIDS_VERSION}" \
    "libcuvs-${CUDA_SUFFIX}==${STABLE_RAPIDS_VERSION}" \
    "nx-cugraph-${CUDA_SUFFIX}==${STABLE_RAPIDS_VERSION}" \
    "pylibcugraph-${CUDA_SUFFIX}==${STABLE_RAPIDS_VERSION}"
Contributor Author


This does an unnecessary double-download for each CUDA major version, but avoiding it would require annoying and error-prone state tracking.

Member


Thanks for calling it out. I think that's totally fine.

Member

@jameslamb jameslamb left a comment


Thanks for the import changes, I think we'll be really happy to have those!

I left a couple more suggestions, and one that I think is worth blocking the PR over.

I'd love to also see actual runs of these jobs. Since this was opened from your fork, we won't be able to trigger test.yaml manually with workflow dispatch... could you temporarily add these jobs to pr.yaml so we can see them in PR CI? That could be reverted before this is merged.

set -euo pipefail

function testImports {
unset imports
Member


Suggested change
-    unset imports
+    local -a imports=()

Would something like this be slightly safer than unset? I think (untested) this would create a function-scoped imports on each call.
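The difference is easy to check in isolation. A quick standalone sketch (not part of the PR's scripts) showing that `local -a` gives each call its own array, while the outer variable of the same name survives untouched:

```shell
#!/usr/bin/env bash
set -euo pipefail

imports=("leftover" "state")   # caller-level array of the same name

with_local() {
    local -a imports=()        # function-scoped: shadows the outer array
    imports+=("cudf")
    echo "inside:  ${imports[*]}"
}

with_local                      # prints "inside:  cudf"
echo "outside: ${imports[*]}"   # prints "outside: leftover state"
```

An `unset imports` in the function body would instead destroy the caller-level array, which is the subtle difference the suggestion avoids.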


while [[ $# -gt 0 ]]; do
    # run standalone import test
    rapids-logger "Standalone import test for $1"
    python -c "import $1" || rapids-logger "Test failed for: $1"
Member


I don't think this will fail the CI jobs when the imports fail, and I think it should, right? I think we want a big ❌ and a job failure, not for us to need to remember to go read the logs.

This || is going to swallow the exit code of python -c "import $1": the whole command list exits 0 because the logging statement succeeds.

And I don't see any other logic that's doing something like "grep for Test failed and exit 1 if you find any" in the calling scripts.

I recommend hard-coding in like python -c "import some_nonsense" here and checking that the CI job fails when we want.
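The swallowed exit code is easy to reproduce outside CI. A standalone sketch (not the PR's script) contrasting the || pattern with one way of propagating the failure:

```shell
#!/usr/bin/env bash
set -euo pipefail

# The reviewed pattern: the OR-list succeeds because the logger succeeds,
# so `set -e` (and the CI job) never sees the failing command.
false || echo "Test failed (logged)"
echo "after OR-list, exit status is $?"      # prints 0

# One way to keep the log line AND surface the failure:
run_import() {
    if ! false; then
        echo "Test failed (logged)" >&2
        return 1
    fi
}
if run_import; then
    echo "unexpected success"
else
    echo "failure propagated with status $?"  # prints 1
fi
```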

Contributor Author


Yep, you're right. For now I'm going to remove the || and let the errors percolate up. I can put up a follow-up at some point that sticks all the output in a named pipe or something and then greps over all of it for errors.


@gforsyth gforsyth force-pushed the stable_install_testing branch from 237840e to 1597c63 Compare April 13, 2026 16:00
@gforsyth
Contributor Author

Ahh, ok, so shared-workflows definitely assumes we're running a RAPIDS image: rapids-github-info requires that git is installed, and the sccache setup step assumes that curl and jq are installed.

Even the beefy 5 GB+ cuda-devel images don't have those dependencies installed.

I think I'm going to open a PR to add a "bootstrap script" option to custom_job, that can execute quick install commands before the rest of the shared-workflow steps get kicked off.

@gforsyth
Contributor Author

Even with the bootstrap script this doesn't work.
actions/checkout runs before any script is executed, and because the nvidia/cuda images don't have git installed, checkout falls back to downloading a tarball snapshot, so we don't end up with a proper git clone.

I genuinely didn't think that creating our own version of nvidia/cuda that also includes curl, git, and jq would be the best path forward, but I'm starting to lean in that direction.

@gforsyth gforsyth force-pushed the stable_install_testing branch from c3455ee to 0e78bc0 Compare April 13, 2026 19:20
@gforsyth gforsyth force-pushed the stable_install_testing branch from b3a477a to 0d88656 Compare April 14, 2026 16:14
@gforsyth
Contributor Author

Met with the cucim team and we noted that cucim links against the CUDA driver (will investigate why). Because of this, we see this import error for cucim:

>>> import cucim
dlopen error libcuda.so.1: cannot open shared object file: No such file or directory
 missing cuda symbols while dynamic loading
 cuFile initialization failed

This only occurs on CPU runners because libcuda.so.1 isn't available.
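A CPU-runner guard along these lines could sidestep this. A hypothetical sketch, not part of the PR: it checks the dynamic-linker cache for the driver library before running driver-linked imports.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical guard: libcuda.so.1 is shipped by the CUDA *driver*, not the
# toolkit, so it is absent on CPU-only runners. Check the linker cache and
# skip driver-linked imports (e.g. cucim) when it is missing.
if ldconfig -p 2>/dev/null | grep -q 'libcuda\.so\.1'; then
    echo "libcuda.so.1 found: running driver-linked import tests"
else
    echo "libcuda.so.1 not found: skipping driver-linked imports (e.g. cucim)"
fi
```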
