-
Notifications
You must be signed in to change notification settings - Fork 0
chore: update gpu test job setup #341
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -12,7 +12,7 @@ All workflows that use `.github/actions/setup-python-env` now default to the ver | |||||
| | Workflow | Trigger | Description | | ||||||
| | -------------------------------------------------- | ---------------------------------------- | ----------------------------------------------------- | | ||||||
| | [ci-checks.yml](ci-checks.yml) | Push to `main`, PRs, manual | Format, typecheck, unit tests, and CPU smoke tests | | ||||||
| | [gpu-tests.yml](gpu-tests.yml) | Push to `main`/`pull-request/*`, manual | GPU smoke tests (required) and E2E tests (A100) | | ||||||
| | [gpu-tests.yml](gpu-tests.yml) | Nightly , manual | GPU smoke tests (required) and E2E tests | | ||||||
| | [conventional-commit.yml](conventional-commit.yml) | PRs | Validates PR titles follow conventional commit format | | ||||||
| | [docs.yml](docs.yml) | Push to `main` (docs paths) | Builds and deploys documentation to GitHub Pages | | ||||||
| | [release.yml](release.yml) | Manual dispatch | Builds and publishes package to Test PyPI or PyPI (production) | | ||||||
|
|
@@ -133,10 +133,10 @@ All jobs run on `ubuntu-latest` (GitHub-hosted). | |||||
|
|
||||||
| ## GPU Tests Workflow | ||||||
|
|
||||||
| The `gpu-tests.yml` workflow runs on pushes to `main` and `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`: | ||||||
| The `gpu-tests.yml` workflow runs on a schedule and using `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`: | ||||||
|
||||||
| The `gpu-tests.yml` workflow runs on a schedule and using `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`: | |
| The `gpu-tests.yml` workflow runs on a schedule and on pushes to `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`: |
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -22,9 +22,10 @@ | |||||||||||||
| name: GPU Tests | ||||||||||||||
|
|
||||||||||||||
| on: | ||||||||||||||
| schedule: | ||||||||||||||
| - cron: '0 2 * * *' | ||||||||||||||
| push: | ||||||||||||||
| branches: | ||||||||||||||
| - main | ||||||||||||||
| - "pull-request/[0-9]+" | ||||||||||||||
|
||||||||||||||
| - "pull-request/[0-9]+" | |
| - "pull-request/*" |
Copilot
AI
Apr 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR title/description indicates only runner label updates, but this change adds a nightly schedule trigger and removes pushes to main. If that trigger/behavior change is intended, please update the PR description (and ensure stakeholders are aware); otherwise, limit this PR to label updates.
Copilot
AI
Apr 2, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The runner label was changed here, but the workflow’s header comment and the workflows README still describe GPU tests as running on on-prem A100 runners (linux-amd64-gpu-a100-latest-1). Please update those references (and the README runner table / GPU Tests section) to match the new runs-on label and hardware/location, so operational docs don’t drift.
mckornfield marked this conversation as resolved.
Show resolved
Hide resolved
Copilot
AI
Apr 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This step always runs apt-get update && apt-get install ... even when make is already present, which adds avoidable time and external dependency (apt mirrors) to every run. Add a small guard (e.g., check command -v make) so the install only happens when needed.
| run: apt-get update && apt-get install -y --no-install-recommends make | |
| run: | | |
| if ! command -v make >/dev/null 2>&1; then | |
| apt-get update | |
| apt-get install -y --no-install-recommends make | |
| fi |
Copilot
AI
Apr 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
“Check GPU availability” only prints torch.cuda.is_available()/device count and will still succeed when CUDA isn’t available. Either fail explicitly when CUDA is unavailable (for a clearer, earlier failure) or rename the step to indicate it’s informational logging.
| uv run python -c "import torch; print('cuda available:', torch.cuda.is_available()); print('device count:', torch.cuda.device_count())" | |
| uv run python -c "import sys, torch; available = torch.cuda.is_available(); count = torch.cuda.device_count(); print('cuda available:', available); print('device count:', count); sys.exit('CUDA is not available on this runner') if not available else sys.exit('No CUDA devices detected on this runner') if count < 1 else None" |
mckornfield marked this conversation as resolved.
Show resolved
Hide resolved
Copilot
AI
Apr 2, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above: switching the E2E job runner label here should be accompanied by updating the documented runner label/hardware (the workflows README currently states E2E runs on A100 linux-amd64-gpu-a100-latest-1, and this file’s top comment says on-prem). Otherwise readers will assume the wrong environment when investigating test failures/perf differences.
| runs-on: nemo-ci-aws-gpu-x2 | |
| runs-on: linux-amd64-gpu-a100-latest-1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The workflow overview row lists the trigger as
Nightly , manual(extra space before the comma) and no longer mentions the push trigger forpull-request/*branches, which is how copy-pr-bot runs GPU tests for PRs. Update the trigger text to reflect the actual triggers (schedule + pushes topull-request/*+ manual), and fix the spacing.