Conversation
mckornfield
commented
Apr 2, 2026
- Change runner gpu labels to ones I saw in the console
Pull request overview
Updates the GPU CI workflow to run GPU smoke and E2E tests on a different self-hosted runner label, aligning the workflow configuration with currently available/observed runner labels.
Changes:
- Switched `gpu-smoke-test` to run on `nemo-ci-aws-gpu-x2`.
- Switched `gpu-e2e-test` to run on `nemo-ci-aws-gpu-x2`.
```diff
   needs: changes
   if: ${{ needs.changes.outputs.src == 'true' || needs.changes.outputs.test == 'true' || github.event_name == 'workflow_dispatch' }}
   timeout-minutes: 30
-  runs-on: linux-amd64-gpu-a100-latest-1
+  runs-on: nemo-ci-aws-gpu-x2
   strategy:
```
The runner label was changed here, but the workflow’s header comment and the workflows README still describe GPU tests as running on on-prem A100 runners (linux-amd64-gpu-a100-latest-1). Please update those references (and the README runner table / GPU Tests section) to match the new runs-on label and hardware/location, so operational docs don’t drift.
```diff
   if: ${{ needs.changes.outputs.src == 'true' || needs.changes.outputs.test == 'true' || github.event_name == 'workflow_dispatch' }}
   timeout-minutes: 55
-  runs-on: linux-amd64-gpu-a100-latest-1
+  runs-on: nemo-ci-aws-gpu-x2
```
Same as above: switching the E2E job runner label here should be accompanied by updating the documented runner label/hardware (the workflows README currently states E2E runs on A100 linux-amd64-gpu-a100-latest-1, and this file’s top comment says on-prem). Otherwise readers will assume the wrong environment when investigating test failures/perf differences.
Suggested change:
```diff
-  runs-on: nemo-ci-aws-gpu-x2
+  runs-on: linux-amd64-gpu-a100-latest-1
```
Signed-off-by: Matt Kornfield <mkornfield@nvidia.com>
Force-pushed from 5e1574b to 2de7400
Signed-off-by: Matt Kornfield <mkornfield@nvidia.com>
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
kendrickb-nvidia
left a comment
Who manages these AWS runners? Do we have access to them?
Previously Aaron had talked with the NeMo CI team, and as I understand it they only had Azure runners, but maybe that's changed now.
Please respond to the Copilot comments too.
And lastly, let's change this to nightly for the moment. The canceling upon another merge to main, plus conditionally running these, doesn't really give us useful signal right now (even if the jobs were running).
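For reference, a nightly-only trigger could look something like the sketch below (the 02:00 UTC cron matches the schedule added later in this PR; keeping `workflow_dispatch` for manual runs is an assumption):

```yaml
on:
  schedule:
    - cron: '0 2 * * *'   # nightly at 02:00 UTC
  workflow_dispatch:
```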
Signed-off-by: Matt Kornfield <mkornfield@nvidia.com>
Signed-off-by: Matt Kornfield <mkornfield@nvidia.com>
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
```yaml
push:
  branches:
    - main
    - "pull-request/[0-9]+"
```
push.branches uses "pull-request/[0-9]+", but GitHub Actions branch filters are glob patterns (not regex). This pattern will only match branch names that literally end with +, so pushes to copy-pr-bot branches like pull-request/123 won’t trigger the workflow. Switch to a glob like pull-request/* (or a more specific glob if needed) so PR GPU checks actually run.
Suggested change:
```diff
-      - "pull-request/[0-9]+"
+      - "pull-request/*"
```
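To see why the original pattern never matches real PR branches, the stdlib `fnmatch` module is a reasonable stand-in for this simple case (note this is only an approximation: unlike GitHub's branch filters, fnmatch's `*` can also match `/`):

```python
from fnmatch import fnmatch

# Glob semantics, as in GitHub Actions branch filters: '[0-9]' is a
# character class matching ONE digit, but '+' is a literal character,
# not a regex-style repetition operator.
print(fnmatch('pull-request/123', 'pull-request/[0-9]+'))  # False: branch would need a literal '+'
print(fnmatch('pull-request/1+', 'pull-request/[0-9]+'))   # True: one digit followed by '+'
print(fnmatch('pull-request/123', 'pull-request/*'))       # True: the suggested glob matches
```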
```yaml
on:
  schedule:
    - cron: '0 2 * * *'
  push:
    branches:
      - main
      - "pull-request/[0-9]+"
  workflow_dispatch:
```
PR title/description indicates only runner label updates, but this change adds a nightly schedule trigger and removes pushes to main. If that trigger/behavior change is intended, please update the PR description (and ensure stakeholders are aware); otherwise, limit this PR to label updates.
```yaml
    fetch-depth: 0

- name: Install make
  run: apt-get update && apt-get install -y --no-install-recommends make
```
This step always runs apt-get update && apt-get install ... even when make is already present, which adds avoidable time and external dependency (apt mirrors) to every run. Add a small guard (e.g., check command -v make) so the install only happens when needed.
Suggested change:
```diff
-  run: apt-get update && apt-get install -y --no-install-recommends make
+  run: |
+    if ! command -v make >/dev/null 2>&1; then
+      apt-get update
+      apt-get install -y --no-install-recommends make
+    fi
```
```yaml
- name: Check GPU availability
  run: |
    uv run python -c "import torch; print('cuda available:', torch.cuda.is_available()); print('device count:', torch.cuda.device_count())"
```
“Check GPU availability” only prints torch.cuda.is_available()/device count and will still succeed when CUDA isn’t available. Either fail explicitly when CUDA is unavailable (for a clearer, earlier failure) or rename the step to indicate it’s informational logging.
Suggested change:
```diff
-    uv run python -c "import torch; print('cuda available:', torch.cuda.is_available()); print('device count:', torch.cuda.device_count())"
+    uv run python -c "import sys, torch; available = torch.cuda.is_available(); count = torch.cuda.device_count(); print('cuda available:', available); print('device count:', count); sys.exit('CUDA is not available on this runner') if not available else sys.exit('No CUDA devices detected on this runner') if count < 1 else None"
```
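The suggested one-liner is dense; the same decision logic, factored into a function that can be exercised without a GPU, might look like the sketch below (the function name and the torch wiring in the comment are illustrative, not from the PR):

```python
from typing import Optional

def check_cuda(available: bool, count: int) -> Optional[str]:
    """Return an error message when no usable CUDA device is present, else None."""
    if not available:
        return 'CUDA is not available on this runner'
    if count < 1:
        return 'No CUDA devices detected on this runner'
    return None

# Hypothetical wiring inside the workflow step:
#   import torch
#   err = check_cuda(torch.cuda.is_available(), torch.cuda.device_count())
#   if err:
#       raise SystemExit(err)  # non-zero exit fails the step early

print(check_cuda(False, 0))  # CUDA is not available on this runner
print(check_cuda(True, 2))   # None
```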
```diff
 | -------------------------------------------------- | ---------------------------------------- | ----------------------------------------------------- |
 | [ci-checks.yml](ci-checks.yml) | Push to `main`, PRs, manual | Format, typecheck, unit tests, and CPU smoke tests |
-| [gpu-tests.yml](gpu-tests.yml) | Push to `main`/`pull-request/*`, manual | GPU smoke tests (required) and E2E tests (A100) |
+| [gpu-tests.yml](gpu-tests.yml) | Nightly , manual | GPU smoke tests (required) and E2E tests |
```
The workflow overview row lists the trigger as Nightly , manual (extra space before the comma) and no longer mentions the push trigger for pull-request/* branches, which is how copy-pr-bot runs GPU tests for PRs. Update the trigger text to reflect the actual triggers (schedule + pushes to pull-request/* + manual), and fix the spacing.
Suggested change:
```diff
-| [gpu-tests.yml](gpu-tests.yml) | Nightly , manual | GPU smoke tests (required) and E2E tests |
+| [gpu-tests.yml](gpu-tests.yml) | Nightly, push to `pull-request/*`, manual | GPU smoke tests (required) and E2E tests |
```
```diff
 ## GPU Tests Workflow

-The `gpu-tests.yml` workflow runs on pushes to `main` and `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`:
+The `gpu-tests.yml` workflow runs on a schedule and using `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`:
```
This sentence is inaccurate/unclear: the workflow is triggered by pushes to pull-request/* branches (not “using” them), and pull-request/* is the important detail for copy-pr-bot. Consider rephrasing to “runs on a schedule and on pushes to pull-request/* (via copy-pr-bot), and can also be triggered manually …”.
Suggested change:
```diff
-The `gpu-tests.yml` workflow runs on a schedule and using `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`:
+The `gpu-tests.yml` workflow runs on a schedule and on pushes to `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`:
```
kendrickb-nvidia
left a comment
Approving to remove this from running on merge, which is useless given the conditional triggering and the cancellation on the next merge. It probably still doesn't actually work properly for nightly GPU testing, but we need to track down how to properly get access to GPUs in GitHub runners.