Commit 13223bf

chore: update gpu test job setup (#341)
* Change runner gpu labels to ones I saw in the console

---------

Signed-off-by: Matt Kornfield <mkornfield@nvidia.com>
1 parent 4e616ab commit 13223bf

2 files changed

Lines changed: 24 additions & 9 deletions

.github/workflows/README.md

Lines changed: 6 additions & 6 deletions
@@ -12,7 +12,7 @@ All workflows that use `.github/actions/setup-python-env` now default to the ver
 | Workflow | Trigger | Description |
 | -------------------------------------------------- | ---------------------------------------- | ----------------------------------------------------- |
 | [ci-checks.yml](ci-checks.yml) | Push to `main`, PRs, manual | Format, typecheck, unit tests, and CPU smoke tests |
-| [gpu-tests.yml](gpu-tests.yml) | Push to `main`/`pull-request/*`, manual | GPU smoke tests (required) and E2E tests (A100) |
+| [gpu-tests.yml](gpu-tests.yml) | Nightly , manual | GPU smoke tests (required) and E2E tests |
 | [conventional-commit.yml](conventional-commit.yml) | PRs | Validates PR titles follow conventional commit format |
 | [docs.yml](docs.yml) | Push to `main` (docs paths) | Builds and deploys documentation to GitHub Pages |
 | [release.yml](release.yml) | Manual dispatch | Builds and publishes package to Test PyPI or PyPI (production) |
@@ -133,10 +133,10 @@ All jobs run on `ubuntu-latest` (GitHub-hosted).
 
 ## GPU Tests Workflow
 
-The `gpu-tests.yml` workflow runs on pushes to `main` and `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`:
+The `gpu-tests.yml` workflow runs on a schedule and using `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`:
 
-- GPU Smoke Tests: Quick smoke tests on `linux-amd64-gpu-a100-latest-1` (A100) with a 30-minute job timeout and 20-minute step timeout. Required for merge.
-- GPU E2E Tests: End-to-end tests on `linux-amd64-gpu-a100-latest-1` (A100) with a 55-minute job timeout and 45-minute step timeout. Informational -- failures produce a warning but don't block merge.
+- GPU Smoke Tests: Quick smoke tests on a gpu runner with a 30-minute job timeout and 20-minute step timeout. Required for merge.
+- GPU E2E Tests: End-to-end tests on a gpu runner with a 55-minute job timeout and 45-minute step timeout. Informational -- failures produce a warning but don't block merge.
 - GPU CI Status: Aggregation job -- single required check for branch protection. Fails if smoke tests fail; warns if E2E tests fail.
 
 The `changes` (Detect Changes) job always runs, including on `workflow_dispatch`. `dorny/paths-filter` outputs `true` for all filters when there is no base commit to diff against, so downstream jobs always run on a manual dispatch. The job must not be conditionally skipped: a skipped `needs` dependency causes downstream jobs to be skipped even when their own `if` condition would pass.
@@ -154,8 +154,8 @@ To trigger from the PR UI and get a status check result, use `/sync` -- see [On-
 | Workflow | Job | Runner Label | Type |
 | --- | --- | --- | --- |
 | CI Checks | All jobs | `ubuntu-latest` | GitHub-hosted |
-| GPU Tests | GPU Smoke Tests | `linux-amd64-gpu-a100-latest-1` | NVIDIA self-hosted GPU (A100) |
-| GPU Tests | GPU E2E Tests | `linux-amd64-gpu-a100-latest-1` | NVIDIA self-hosted GPU (A100) |
+| GPU Tests | GPU Smoke Tests | `nemo-ci-aws-gpu-x2` | NVIDIA self-hosted GPU |
+| GPU Tests | GPU E2E Tests | `nemo-ci-aws-gpu-x2` | NVIDIA self-hosted GPU |
 | GPU Tests | Detect Changes, GPU CI Status | `linux-amd64-cpu4` | NVIDIA self-hosted CPU (4-core) |
 | Dev Wheel | All jobs | `linux-amd64-cpu4` | NVIDIA self-hosted CPU (4-core) |
 | Internal Release | All jobs | `linux-amd64-cpu4` | NVIDIA self-hosted CPU (4-core) |
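
The skip rule in the README text above (a skipped `needs` dependency skips downstream jobs even when their own `if` would pass) is easy to misread, so here is a rough Python model of it. This is a sketch, not GitHub's implementation: `gpu_job_condition` and `job_runs` are hypothetical names, and the model assumes the default behaviour where an `if` with no status function such as `always()` only gets a chance to run when every job in `needs` succeeded.

```python
def gpu_job_condition(src: str, test: str, event_name: str) -> bool:
    """Mirror of the `if:` expression on the GPU jobs in gpu-tests.yml.

    Job outputs are strings in GitHub Actions, hence the comparison with "true".
    """
    return src == "true" or test == "true" or event_name == "workflow_dispatch"


def job_runs(needs_result: str, condition: bool) -> bool:
    """Assumed model: an implicit success() gate on `needs`, then the job's own `if`."""
    return needs_result == "success" and condition


# Manual dispatch with `changes` succeeded: the GPU job runs.
manual = job_runs("success", gpu_job_condition("false", "false", "workflow_dispatch"))

# Same event, but `changes` was skipped: outputs are empty and the job is skipped,
# even though its own condition is still true.
skipped = job_runs("skipped", gpu_job_condition("", "", "workflow_dispatch"))

print("manual dispatch runs:", manual)
print("runs after skipped dependency:", skipped)
```

Under this model, a run triggered by any event other than `workflow_dispatch` (including the new `schedule` trigger) depends entirely on the `src` and `test` filter outputs, because the expression only special-cases manual dispatch.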

.github/workflows/gpu-tests.yml

Lines changed: 18 additions & 3 deletions
@@ -22,9 +22,10 @@
 name: GPU Tests
 
 on:
+  schedule:
+    - cron: '0 2 * * *'
   push:
     branches:
-      - main
       - "pull-request/[0-9]+"
   workflow_dispatch:
 

@@ -60,7 +61,7 @@ jobs:
     needs: changes
     if: ${{ needs.changes.outputs.src == 'true' || needs.changes.outputs.test == 'true' || github.event_name == 'workflow_dispatch' }}
     timeout-minutes: 30
-    runs-on: linux-amd64-gpu-a100-latest-1
+    runs-on: nemo-ci-aws-gpu-x2
     strategy:
       fail-fast: false
       matrix:
@@ -71,6 +72,9 @@ jobs:
         with:
           fetch-depth: 0
 
+      - name: Install make
+        run: apt-get update && apt-get install -y --no-install-recommends make
+
       - name: Setup Python environment
         uses: ./.github/actions/setup-python-env
         with:
@@ -80,6 +84,10 @@ jobs:
       - name: Bootstrap CUDA environment
         run: make bootstrap-nss cu128
 
+      - name: Check GPU availability
+        run: |
+          uv run python -c "import torch; print('cuda available:', torch.cuda.is_available()); print('device count:', torch.cuda.device_count())"
+
       - name: Run GPU smoke tests
         timeout-minutes: 20
         run: make test-smoke-gpu
@@ -89,13 +97,16 @@ jobs:
     needs: changes
     if: ${{ needs.changes.outputs.src == 'true' || needs.changes.outputs.test == 'true' || github.event_name == 'workflow_dispatch' }}
     timeout-minutes: 55
-    runs-on: linux-amd64-gpu-a100-latest-1
+    runs-on: nemo-ci-aws-gpu-x2
     steps:
       - name: checkout
         uses: actions/checkout@v6
         with:
           fetch-depth: 0
 
+      - name: Install make
+        run: apt-get update && apt-get install -y --no-install-recommends make
+
       - name: Setup Python environment
         uses: ./.github/actions/setup-python-env
         with:
@@ -105,6 +116,10 @@ jobs:
       - name: Bootstrap CUDA environment
         run: make bootstrap-nss cu128
 
+      - name: Check GPU availability
+        run: |
+          uv run python -c "import torch; print('cuda available:', torch.cuda.is_available()); print('device count:', torch.cuda.device_count())"
+
       - name: Run GPU E2E tests
         timeout-minutes: 45
         run: make test-e2e
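
The "Check GPU availability" step added in this commit only prints what `torch` reports, so it passes even when no GPU is visible. If a hard failure is wanted, something along the lines of the sketch below could replace the one-liner. It is not part of the commit: `evaluate_gpu`, `probe_torch`, `main`, and the `min_devices` parameter are hypothetical names introduced here, and the only `torch` calls used are the two already present in the workflow.

```python
import importlib


def evaluate_gpu(cuda_available: bool, device_count: int, min_devices: int = 1) -> tuple[bool, str]:
    """Return (ok, message) for the CUDA state reported by torch."""
    ok = cuda_available and device_count >= min_devices
    message = (
        f"cuda available: {cuda_available}; "
        f"device count: {device_count}; required: {min_devices}"
    )
    return ok, message


def probe_torch() -> tuple[bool, int]:
    """Ask torch for the CUDA state; report (False, 0) if torch is not installed."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return False, 0
    return bool(torch.cuda.is_available()), int(torch.cuda.device_count())


def main(min_devices: int = 1) -> int:
    """Print the status line and return a process exit code (0 = GPUs found)."""
    ok, message = evaluate_gpu(*probe_torch(), min_devices=min_devices)
    print(message)
    return 0 if ok else 1
```

Calling `sys.exit(main())` from a script run by the step would make the job fail before the test targets start. The commit does not say how many GPUs the `nemo-ci-aws-gpu-x2` runner exposes, so the default of one device is an assumption to adjust.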
