chore: update gpu test job setup#341

Merged
mckornfield merged 4 commits into main from gpu-runner-label-maybe/mck
Apr 3, 2026

Conversation

@mckornfield
Collaborator

  • Change runner gpu labels to ones I saw in the console

@mckornfield mckornfield requested a review from a team as a code owner April 2, 2026 04:24
Copilot AI review requested due to automatic review settings April 2, 2026 04:24
Contributor

Copilot AI left a comment


Pull request overview

Updates the GPU CI workflow to run GPU smoke and E2E tests on a different self-hosted runner label, aligning the workflow configuration with currently available/observed runner labels.

Changes:

  • Switched gpu-smoke-test to run on nemo-ci-aws-gpu-x2.
  • Switched gpu-e2e-test to run on nemo-ci-aws-gpu-x2.


Comment on lines 60 to 64
needs: changes
if: ${{ needs.changes.outputs.src == 'true' || needs.changes.outputs.test == 'true' || github.event_name == 'workflow_dispatch' }}
timeout-minutes: 30
runs-on: linux-amd64-gpu-a100-latest-1
runs-on: nemo-ci-aws-gpu-x2
strategy:

Copilot AI Apr 2, 2026


The runner label was changed here, but the workflow’s header comment and the workflows README still describe GPU tests as running on on-prem A100 runners (linux-amd64-gpu-a100-latest-1). Please update those references (and the README runner table / GPU Tests section) to match the new runs-on label and hardware/location, so operational docs don’t drift.

if: ${{ needs.changes.outputs.src == 'true' || needs.changes.outputs.test == 'true' || github.event_name == 'workflow_dispatch' }}
timeout-minutes: 55
runs-on: linux-amd64-gpu-a100-latest-1
runs-on: nemo-ci-aws-gpu-x2

Copilot AI Apr 2, 2026


Same as above: switching the E2E job runner label here should be accompanied by updating the documented runner label/hardware (the workflows README currently states E2E runs on A100 linux-amd64-gpu-a100-latest-1, and this file’s top comment says on-prem). Otherwise readers will assume the wrong environment when investigating test failures/perf differences.

Suggested change
runs-on: nemo-ci-aws-gpu-x2
runs-on: linux-amd64-gpu-a100-latest-1

Signed-off-by: Matt Kornfield <mkornfield@nvidia.com>
@mckornfield mckornfield force-pushed the gpu-runner-label-maybe/mck branch from 5e1574b to 2de7400 Compare April 2, 2026 04:35
Signed-off-by: Matt Kornfield <mkornfield@nvidia.com>
Copilot AI review requested due to automatic review settings April 2, 2026 15:32
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.



Collaborator

@kendrickb-nvidia kendrickb-nvidia left a comment


Who manages these AWS runners? Do we have access to them?

Previously Aaron had talked with a nemo CI team and they only had azure runners as I understand, but maybe that's changed now.

Please respond to copilot comments too.

And lastly, let's change this to nightly for the moment. The canceling upon another merge to main plus conditionally running these doesn't really give us useful signal right now (even if the jobs were running).

@mckornfield
Collaborator Author

  1. No clue on the management, I just saw them on the runner tab. They also appear to not have gpus, which is hilarious
  2. I know he talked about them but just based on what runners are visible in the admin view, I don't think those runners are there anymore, or we have to get them added
  3. copilot boo
  4. Yep definitely, on merge is not helpful

Signed-off-by: Matt Kornfield <mkornfield@nvidia.com>
Signed-off-by: Matt Kornfield <mkornfield@nvidia.com>
Copilot AI review requested due to automatic review settings April 3, 2026 18:04
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.



push:
branches:
- main
- "pull-request/[0-9]+"

Copilot AI Apr 3, 2026


push.branches uses "pull-request/[0-9]+", but GitHub Actions branch filters are glob patterns (not regex). This pattern will only match branch names that literally end with +, so pushes to copy-pr-bot branches like pull-request/123 won’t trigger the workflow. Switch to a glob like pull-request/* (or a more specific glob if needed) so PR GPU checks actually run.

Suggested change
- "pull-request/[0-9]+"
- "pull-request/*"

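The glob-vs-regex distinction in this comment can be illustrated with Python's `fnmatch`, which implements shell-style globbing. This is only an approximation: GitHub Actions has its own filter-pattern dialect (documented in its filter pattern cheat sheet), so the exact matcher behavior on GitHub's side is an assumption to verify against the Actions docs.

```python
from fnmatch import fnmatch

# Under shell-style globbing, "[0-9]" matches a single digit and "+" is a
# literal plus sign, so the pattern "pull-request/[0-9]+" only matches
# names like "pull-request/1+" rather than "pull-request/123".
print(fnmatch("pull-request/123", "pull-request/[0-9]+"))  # False
print(fnmatch("pull-request/1+", "pull-request/[0-9]+"))   # True

# The suggested glob matches any branch name under the prefix.
print(fnmatch("pull-request/123", "pull-request/*"))       # True
```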
Comment on lines 24 to 30
on:
schedule:
- cron: '0 2 * * *'
push:
branches:
- main
- "pull-request/[0-9]+"
workflow_dispatch:

Copilot AI Apr 3, 2026


PR title/description indicates only runner label updates, but this change adds a nightly schedule trigger and removes pushes to main. If that trigger/behavior change is intended, please update the PR description (and ensure stakeholders are aware); otherwise, limit this PR to label updates.

fetch-depth: 0

- name: Install make
run: apt-get update && apt-get install -y --no-install-recommends make

Copilot AI Apr 3, 2026


This step always runs apt-get update && apt-get install ... even when make is already present, which adds avoidable time and external dependency (apt mirrors) to every run. Add a small guard (e.g., check command -v make) so the install only happens when needed.

Suggested change
run: apt-get update && apt-get install -y --no-install-recommends make
run: |
if ! command -v make >/dev/null 2>&1; then
apt-get update
apt-get install -y --no-install-recommends make
fi

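The guarded-install pattern in this suggestion can be sketched as a standalone POSIX shell snippet. `ensure_tool` is a hypothetical helper name, and `sh` stands in for `make` so the example is deterministic on any POSIX system; in the real workflow the missing branch would run the `apt-get` commands.

```shell
#!/bin/sh
# Hypothetical helper: run the expensive install only when the tool is absent.
ensure_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 already present, skipping install"
  else
    # In CI this branch would run:
    #   apt-get update && apt-get install -y --no-install-recommends "$1"
    echo "$1 missing, installing"
  fi
}

ensure_tool sh    # prints "sh already present, skipping install"
```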

- name: Check GPU availability
run: |
uv run python -c "import torch; print('cuda available:', torch.cuda.is_available()); print('device count:', torch.cuda.device_count())"

Copilot AI Apr 3, 2026


“Check GPU availability” only prints torch.cuda.is_available()/device count and will still succeed when CUDA isn’t available. Either fail explicitly when CUDA is unavailable (for a clearer, earlier failure) or rename the step to indicate it’s informational logging.

Suggested change
uv run python -c "import torch; print('cuda available:', torch.cuda.is_available()); print('device count:', torch.cuda.device_count())"
uv run python -c "import sys, torch; available = torch.cuda.is_available(); count = torch.cuda.device_count(); print('cuda available:', available); print('device count:', count); sys.exit('CUDA is not available on this runner') if not available else sys.exit('No CUDA devices detected on this runner') if count < 1 else None"

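The fail-fast behavior this comment asks for can be separated from torch so the logic is testable without a GPU. `check_cuda` is a hypothetical helper; in the workflow step its arguments would come from `torch.cuda.is_available()` and `torch.cuda.device_count()`.

```python
import sys

def check_cuda(available: bool, device_count: int) -> None:
    """Print the CUDA status and exit non-zero when no usable device is present."""
    print("cuda available:", available)
    print("device count:", device_count)
    if not available:
        sys.exit("CUDA is not available on this runner")
    if device_count < 1:
        sys.exit("no CUDA devices detected on this runner")

check_cuda(True, 2)  # a healthy runner: prints the status lines and returns
```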
| -------------------------------------------------- | ---------------------------------------- | ----------------------------------------------------- |
| [ci-checks.yml](ci-checks.yml) | Push to `main`, PRs, manual | Format, typecheck, unit tests, and CPU smoke tests |
| [gpu-tests.yml](gpu-tests.yml) | Push to `main`/`pull-request/*`, manual | GPU smoke tests (required) and E2E tests (A100) |
| [gpu-tests.yml](gpu-tests.yml) | Nightly , manual | GPU smoke tests (required) and E2E tests |

Copilot AI Apr 3, 2026


The workflow overview row lists the trigger as Nightly , manual (extra space before the comma) and no longer mentions the push trigger for pull-request/* branches, which is how copy-pr-bot runs GPU tests for PRs. Update the trigger text to reflect the actual triggers (schedule + pushes to pull-request/* + manual), and fix the spacing.

Suggested change
| [gpu-tests.yml](gpu-tests.yml) | Nightly , manual | GPU smoke tests (required) and E2E tests |
| [gpu-tests.yml](gpu-tests.yml) | Nightly, push to `pull-request/*`, manual | GPU smoke tests (required) and E2E tests |

## GPU Tests Workflow

The `gpu-tests.yml` workflow runs on pushes to `main` and `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`:
The `gpu-tests.yml` workflow runs on a schedule and using `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`:

Copilot AI Apr 3, 2026


This sentence is inaccurate/unclear: the workflow is triggered by pushes to pull-request/* branches (not “using” them), and pull-request/* is the important detail for copy-pr-bot. Consider rephrasing to “runs on a schedule and on pushes to pull-request/* (via copy-pr-bot), and can also be triggered manually …”.

Suggested change
The `gpu-tests.yml` workflow runs on a schedule and using `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`:
The `gpu-tests.yml` workflow runs on a schedule and on pushes to `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`:

@mckornfield mckornfield changed the title chore: update gpu runner labels chore: update gpu test job setup Apr 3, 2026
Collaborator

@kendrickb-nvidia kendrickb-nvidia left a comment


Approving to remove from run on merge which is useless with the conditional triggering and canceling on next merge. Probably still doesn't actually work properly to test with GPU nightly but we need to track down how to properly get access to GPUs in github runners.

@mckornfield mckornfield merged commit 13223bf into main Apr 3, 2026
19 of 20 checks passed
@mckornfield mckornfield deleted the gpu-runner-label-maybe/mck branch April 3, 2026 18:58