Conversation
mckornfield
commented
Apr 2, 2026
- Change runner gpu labels to ones I saw in the console
Pull request overview
Updates the GPU CI workflow to run GPU smoke and E2E tests on a different self-hosted runner label, aligning the workflow configuration with currently available/observed runner labels.
Changes:
- Switched `gpu-smoke-test` to run on `nemo-ci-aws-gpu-x2`.
- Switched `gpu-e2e-test` to run on `nemo-ci-aws-gpu-x2`.
```diff
   needs: changes
   if: ${{ needs.changes.outputs.src == 'true' || needs.changes.outputs.test == 'true' || github.event_name == 'workflow_dispatch' }}
   timeout-minutes: 30
-  runs-on: linux-amd64-gpu-a100-latest-1
+  runs-on: nemo-ci-aws-gpu-x2
   strategy:
```
The runner label was changed here, but the workflow’s header comment and the workflows README still describe GPU tests as running on on-prem A100 runners (linux-amd64-gpu-a100-latest-1). Please update those references (and the README runner table / GPU Tests section) to match the new runs-on label and hardware/location, so operational docs don’t drift.
```diff
   if: ${{ needs.changes.outputs.src == 'true' || needs.changes.outputs.test == 'true' || github.event_name == 'workflow_dispatch' }}
   timeout-minutes: 55
-  runs-on: linux-amd64-gpu-a100-latest-1
+  runs-on: nemo-ci-aws-gpu-x2
```
Same as above: switching the E2E job runner label here should be accompanied by updating the documented runner label/hardware (the workflows README currently states E2E runs on A100 linux-amd64-gpu-a100-latest-1, and this file’s top comment says on-prem). Otherwise readers will assume the wrong environment when investigating test failures/perf differences.
Suggested change:
```diff
-  runs-on: nemo-ci-aws-gpu-x2
+  runs-on: linux-amd64-gpu-a100-latest-1
```
Signed-off-by: Matt Kornfield <mkornfield@nvidia.com>
Force-pushed from 5e1574b to 2de7400
Signed-off-by: Matt Kornfield <mkornfield@nvidia.com>
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
kendrickb-nvidia
left a comment
Who manages these AWS runners? Do we have access to them?
Previously Aaron had talked with the NeMo CI team, and as I understand it they only had Azure runners, but maybe that's changed now.
Please respond to the Copilot comments too.
And lastly, let's change this to nightly for the moment. The canceling upon another merge to main, plus conditionally running these, doesn't really give us useful signal right now (even if the jobs were running).
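For reference, a nightly-only trigger could look something like the sketch below (the 02:00 UTC cron matches the schedule added later in this PR; keeping `workflow_dispatch` for manual runs is an assumption):

```yaml
on:
  schedule:
    - cron: '0 2 * * *'   # nightly at 02:00 UTC
  workflow_dispatch:
```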
Signed-off-by: Matt Kornfield <mkornfield@nvidia.com>
Signed-off-by: Matt Kornfield <mkornfield@nvidia.com>
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
```yaml
push:
  branches:
    - main
    - "pull-request/[0-9]+"
```
push.branches uses "pull-request/[0-9]+", but GitHub Actions branch filters are glob patterns (not regex). This pattern will only match branch names that literally end with +, so pushes to copy-pr-bot branches like pull-request/123 won’t trigger the workflow. Switch to a glob like pull-request/* (or a more specific glob if needed) so PR GPU checks actually run.
Suggested change:
```diff
-      - "pull-request/[0-9]+"
+      - "pull-request/*"
```
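To see why the original pattern never matches real PR branches, the stdlib `fnmatch` module is a reasonable stand-in for this simple case (note this is only an approximation: unlike GitHub's branch filters, fnmatch's `*` can also match `/`):

```python
from fnmatch import fnmatch

# Glob semantics, as in GitHub Actions branch filters: '[0-9]' is a
# character class matching ONE digit, but '+' is a literal character,
# not a regex-style repetition operator.
print(fnmatch('pull-request/123', 'pull-request/[0-9]+'))  # False: branch would need a literal '+'
print(fnmatch('pull-request/1+', 'pull-request/[0-9]+'))   # True: one digit followed by '+'
print(fnmatch('pull-request/123', 'pull-request/*'))       # True: the suggested glob matches
```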
```yaml
on:
  schedule:
    - cron: '0 2 * * *'
  push:
    branches:
      - main
      - "pull-request/[0-9]+"
  workflow_dispatch:
```
PR title/description indicates only runner label updates, but this change adds a nightly schedule trigger and removes pushes to main. If that trigger/behavior change is intended, please update the PR description (and ensure stakeholders are aware); otherwise, limit this PR to label updates.
```yaml
    fetch-depth: 0

- name: Install make
  run: apt-get update && apt-get install -y --no-install-recommends make
```
This step always runs apt-get update && apt-get install ... even when make is already present, which adds avoidable time and external dependency (apt mirrors) to every run. Add a small guard (e.g., check command -v make) so the install only happens when needed.
Suggested change:
```diff
-  run: apt-get update && apt-get install -y --no-install-recommends make
+  run: |
+    if ! command -v make >/dev/null 2>&1; then
+      apt-get update
+      apt-get install -y --no-install-recommends make
+    fi
```
```yaml
- name: Check GPU availability
  run: |
    uv run python -c "import torch; print('cuda available:', torch.cuda.is_available()); print('device count:', torch.cuda.device_count())"
```
“Check GPU availability” only prints torch.cuda.is_available()/device count and will still succeed when CUDA isn’t available. Either fail explicitly when CUDA is unavailable (for a clearer, earlier failure) or rename the step to indicate it’s informational logging.
Suggested change:
```diff
-    uv run python -c "import torch; print('cuda available:', torch.cuda.is_available()); print('device count:', torch.cuda.device_count())"
+    uv run python -c "import sys, torch; available = torch.cuda.is_available(); count = torch.cuda.device_count(); print('cuda available:', available); print('device count:', count); sys.exit('CUDA is not available on this runner') if not available else sys.exit('No CUDA devices detected on this runner') if count < 1 else None"
```
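The suggested one-liner is dense; the same decision logic, factored into a function that can be exercised without a GPU, might look like the sketch below (the function name and the torch wiring in the comment are illustrative, not from the PR):

```python
from typing import Optional

def check_cuda(available: bool, count: int) -> Optional[str]:
    """Return an error message when no usable CUDA device is present, else None."""
    if not available:
        return 'CUDA is not available on this runner'
    if count < 1:
        return 'No CUDA devices detected on this runner'
    return None

# Hypothetical wiring inside the workflow step:
#   import torch
#   err = check_cuda(torch.cuda.is_available(), torch.cuda.device_count())
#   if err:
#       raise SystemExit(err)  # non-zero exit fails the step early

print(check_cuda(False, 0))  # CUDA is not available on this runner
print(check_cuda(True, 2))   # None
```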
```diff
 | -------------------------------------------------- | ---------------------------------------- | ----------------------------------------------------- |
 | [ci-checks.yml](ci-checks.yml) | Push to `main`, PRs, manual | Format, typecheck, unit tests, and CPU smoke tests |
-| [gpu-tests.yml](gpu-tests.yml) | Push to `main`/`pull-request/*`, manual | GPU smoke tests (required) and E2E tests (A100) |
+| [gpu-tests.yml](gpu-tests.yml) | Nightly , manual | GPU smoke tests (required) and E2E tests |
```
The workflow overview row lists the trigger as Nightly , manual (extra space before the comma) and no longer mentions the push trigger for pull-request/* branches, which is how copy-pr-bot runs GPU tests for PRs. Update the trigger text to reflect the actual triggers (schedule + pushes to pull-request/* + manual), and fix the spacing.
Suggested change:
```diff
-| [gpu-tests.yml](gpu-tests.yml) | Nightly , manual | GPU smoke tests (required) and E2E tests |
+| [gpu-tests.yml](gpu-tests.yml) | Nightly, push to `pull-request/*`, manual | GPU smoke tests (required) and E2E tests |
```
```diff
 ## GPU Tests Workflow

-The `gpu-tests.yml` workflow runs on pushes to `main` and `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`:
+The `gpu-tests.yml` workflow runs on a schedule and using `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`:
```
This sentence is inaccurate/unclear: the workflow is triggered by pushes to pull-request/* branches (not “using” them), and pull-request/* is the important detail for copy-pr-bot. Consider rephrasing to “runs on a schedule and on pushes to pull-request/* (via copy-pr-bot), and can also be triggered manually …”.
Suggested change:
```diff
-The `gpu-tests.yml` workflow runs on a schedule and using `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`:
+The `gpu-tests.yml` workflow runs on a schedule and on pushes to `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`:
```
kendrickb-nvidia
left a comment
Approving to remove this from running on merge, which is useless given the conditional triggering and the cancellation on the next merge. It probably still doesn't actually work properly for nightly GPU testing, but we need to track down how to properly get access to GPUs in GitHub runners.