12 changes: 6 additions & 6 deletions .github/workflows/README.md
@@ -12,7 +12,7 @@ All workflows that use `.github/actions/setup-python-env` now default to the ver
| Workflow | Trigger | Description |
| -------------------------------------------------- | ---------------------------------------- | ----------------------------------------------------- |
| [ci-checks.yml](ci-checks.yml) | Push to `main`, PRs, manual | Format, typecheck, unit tests, and CPU smoke tests |
-| [gpu-tests.yml](gpu-tests.yml) | Push to `main`/`pull-request/*`, manual | GPU smoke tests (required) and E2E tests (A100) |
+| [gpu-tests.yml](gpu-tests.yml) | Nightly , manual | GPU smoke tests (required) and E2E tests |

Copilot AI Apr 3, 2026

The workflow overview row lists the trigger as `Nightly , manual` (extra space before the comma) and no longer mentions the push trigger for `pull-request/*` branches, which is how copy-pr-bot runs GPU tests for PRs. Update the trigger text to reflect the actual triggers (schedule + pushes to `pull-request/*` + manual), and fix the spacing.

Suggested change
-| [gpu-tests.yml](gpu-tests.yml) | Nightly , manual | GPU smoke tests (required) and E2E tests |
+| [gpu-tests.yml](gpu-tests.yml) | Nightly, push to `pull-request/*`, manual | GPU smoke tests (required) and E2E tests |

Copilot uses AI. Check for mistakes.
| [conventional-commit.yml](conventional-commit.yml) | PRs | Validates PR titles follow conventional commit format |
| [docs.yml](docs.yml) | Push to `main` (docs paths) | Builds and deploys documentation to GitHub Pages |
| [release.yml](release.yml) | Manual dispatch | Builds and publishes package to Test PyPI or PyPI (production) |
@@ -133,10 +133,10 @@ All jobs run on `ubuntu-latest` (GitHub-hosted).

## GPU Tests Workflow

-The `gpu-tests.yml` workflow runs on pushes to `main` and `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`:
+The `gpu-tests.yml` workflow runs on a schedule and using `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`:

Copilot AI Apr 3, 2026

This sentence is inaccurate/unclear: the workflow is triggered by pushes to `pull-request/*` branches (not “using” them), and `pull-request/*` is the important detail for copy-pr-bot. Consider rephrasing to “runs on a schedule and on pushes to `pull-request/*` (via copy-pr-bot), and can also be triggered manually …”.

Suggested change
-The `gpu-tests.yml` workflow runs on a schedule and using `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`:
+The `gpu-tests.yml` workflow runs on a schedule and on pushes to `pull-request/*` branches (via copy-pr-bot), and can also be triggered manually via `workflow_dispatch`:

-- GPU Smoke Tests: Quick smoke tests on `linux-amd64-gpu-a100-latest-1` (A100) with a 30-minute job timeout and 20-minute step timeout. Required for merge.
-- GPU E2E Tests: End-to-end tests on `linux-amd64-gpu-a100-latest-1` (A100) with a 55-minute job timeout and 45-minute step timeout. Informational -- failures produce a warning but don't block merge.
+- GPU Smoke Tests: Quick smoke tests on a gpu runner with a 30-minute job timeout and 20-minute step timeout. Required for merge.
+- GPU E2E Tests: End-to-end tests on a gpu runner with a 55-minute job timeout and 45-minute step timeout. Informational -- failures produce a warning but don't block merge.
- GPU CI Status: Aggregation job -- single required check for branch protection. Fails if smoke tests fail; warns if E2E tests fail.
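
The aggregation rule in the last bullet can be sketched in a few lines of Python. This is a minimal model, not the workflow's actual implementation: the function name `gpu_ci_status` is hypothetical, the result strings follow the values GitHub Actions reports in `needs.<job>.result`, and treating every state other than `failure` as non-blocking is an assumption the source does not confirm.

```python
from typing import List, Tuple


def gpu_ci_status(smoke_result: str, e2e_result: str) -> Tuple[int, List[str]]:
    """Return (exit_code, warnings) for the aggregation job.

    Hypothetical helper: models only the stated rule
    (a smoke failure blocks, an E2E failure warns).
    """
    warnings: List[str] = []
    if e2e_result == "failure":
        # Informational job: surface the failure without blocking merge.
        warnings.append("GPU E2E tests failed (non-blocking)")
    # Assumption: only an explicit "failure" of the smoke job fails the check.
    exit_code = 1 if smoke_result == "failure" else 0
    return exit_code, warnings
```

Because branch protection only needs to watch this one result, the E2E job can stay informational without being listed as a required check itself.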

The `changes` (Detect Changes) job always runs, including on `workflow_dispatch`. `dorny/paths-filter` outputs `true` for all filters when there is no base commit to diff against, so downstream jobs always run on a manual dispatch. The job must not be conditionally skipped: a skipped `needs` dependency causes downstream jobs to be skipped even when their own `if` condition would pass.
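
To make the skip-propagation point concrete, here is a small Python model of the decision described above. It is a simplified sketch, not GitHub's scheduler: `downstream_runs` is a hypothetical name, and the model covers only the case in the text (a job with a single `needs: changes` dependency and no status-check function such as `always()` in its `if`).

```python
def downstream_runs(changes_result: str, src: str, test: str, event_name: str) -> bool:
    """Model whether a GPU test job runs, given the outcome of `changes`.

    Filter outputs are the strings 'true'/'false', as in the workflow's `if`.
    """
    if changes_result != "success":
        # A skipped (or failed) dependency means the job is skipped
        # regardless of what its own `if` expression would evaluate to.
        return False
    return src == "true" or test == "true" or event_name == "workflow_dispatch"
```

For example, `downstream_runs("skipped", "true", "true", "workflow_dispatch")` is `False` even though every term of the job's own condition holds, which is why the `changes` job is left unconditional.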
@@ -154,8 +154,8 @@ To trigger from the PR UI and get a status check result, use `/sync` -- see [On-
| Workflow | Job | Runner Label | Type |
| --- | --- | --- | --- |
| CI Checks | All jobs | `ubuntu-latest` | GitHub-hosted |
-| GPU Tests | GPU Smoke Tests | `linux-amd64-gpu-a100-latest-1` | NVIDIA self-hosted GPU (A100) |
-| GPU Tests | GPU E2E Tests | `linux-amd64-gpu-a100-latest-1` | NVIDIA self-hosted GPU (A100) |
+| GPU Tests | GPU Smoke Tests | `nemo-ci-aws-gpu-x2` | NVIDIA self-hosted GPU |
+| GPU Tests | GPU E2E Tests | `nemo-ci-aws-gpu-x2` | NVIDIA self-hosted GPU |
| GPU Tests | Detect Changes, GPU CI Status | `linux-amd64-cpu4` | NVIDIA self-hosted CPU (4-core) |
| Dev Wheel | All jobs | `linux-amd64-cpu4` | NVIDIA self-hosted CPU (4-core) |
| Internal Release | All jobs | `linux-amd64-cpu4` | NVIDIA self-hosted CPU (4-core) |
21 changes: 18 additions & 3 deletions .github/workflows/gpu-tests.yml
@@ -22,9 +22,10 @@
name: GPU Tests

 on:
+  schedule:
+    - cron: '0 2 * * *'
   push:
     branches:
-      - main
       - "pull-request/[0-9]+"

Copilot AI Apr 3, 2026

`push.branches` uses `"pull-request/[0-9]+"`, but GitHub Actions branch filters are glob patterns (not regex). This pattern will only match branch names that literally end with `+`, so pushes to copy-pr-bot branches like `pull-request/123` won’t trigger the workflow. Switch to a glob like `pull-request/*` (or a more specific glob if needed) so PR GPU checks actually run.

Suggested change
-      - "pull-request/[0-9]+"
+      - "pull-request/*"
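
The regex-versus-glob distinction behind this comment can be checked locally. The sketch below uses Python's `fnmatch` as a stand-in for plain shell-style globbing and `re` for regex matching; it is not GitHub's matcher, and the helper names are hypothetical. GitHub's workflow filter syntax is its own dialect (its documentation lists `+` as a quantifier for the preceding character), so whether the original pattern matches `pull-request/123` in Actions should be confirmed against that documentation rather than inferred from this example.

```python
import fnmatch
import re

BRANCH = "pull-request/123"  # example branch name of the kind copy-pr-bot pushes


def glob_match(pattern: str, branch: str = BRANCH) -> bool:
    """Shell-style glob match: here `+` is an ordinary literal character."""
    return fnmatch.fnmatchcase(branch, pattern)


def regex_match(pattern: str, branch: str = BRANCH) -> bool:
    """Regex full match: here `+` means 'one or more of the preceding item'."""
    return re.fullmatch(pattern, branch) is not None
```

Under shell-style semantics `glob_match("pull-request/[0-9]+")` is `False` for `pull-request/123`, while `regex_match` with the same pattern is `True`; the outcome in a workflow depends on which of these behaviours the Actions filter dialect follows for `+`.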

   workflow_dispatch:
Comment on lines 24 to 30

Copilot AI Apr 3, 2026

PR title/description indicates only runner label updates, but this change adds a nightly schedule trigger and removes pushes to main. If that trigger/behavior change is intended, please update the PR description (and ensure stakeholders are aware); otherwise, limit this PR to label updates.
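
For readers less familiar with cron syntax, the added expression `'0 2 * * *'` means minute 0 of hour 2 every day, and GitHub evaluates `schedule` triggers in UTC. The sketch below computes the next nominal trigger time for that one expression. `next_nightly_run` is a hypothetical helper, not a general cron parser, and actual start times may lag because scheduled runs can be delayed.

```python
from datetime import datetime, timedelta, timezone


def next_nightly_run(now: datetime) -> datetime:
    """Next 02:00 UTC strictly after `now` (models cron '0 2 * * *' only).

    `now` must be timezone-aware.
    """
    now_utc = now.astimezone(timezone.utc)
    candidate = now_utc.replace(hour=2, minute=0, second=0, microsecond=0)
    if candidate <= now_utc:
        # Today's 02:00 has already passed (or is exactly now): use tomorrow's.
        candidate += timedelta(days=1)
    return candidate
```

With pushes to `main` removed from the triggers, a change merged just after 02:00 UTC would wait close to a full day for its first scheduled GPU run unless someone dispatches the workflow manually.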

@@ -60,7 +61,7 @@ jobs:
     needs: changes
     if: ${{ needs.changes.outputs.src == 'true' || needs.changes.outputs.test == 'true' || github.event_name == 'workflow_dispatch' }}
     timeout-minutes: 30
-    runs-on: linux-amd64-gpu-a100-latest-1
+    runs-on: nemo-ci-aws-gpu-x2
     strategy:
Comment on lines 61 to 65

Copilot AI Apr 2, 2026

The runner label was changed here, but the workflow’s header comment and the workflows README still describe GPU tests as running on on-prem A100 runners (`linux-amd64-gpu-a100-latest-1`). Please update those references (and the README runner table / GPU Tests section) to match the new `runs-on` label and hardware/location, so operational docs don’t drift.

       fail-fast: false
       matrix:
@@ -71,6 +72,9 @@ jobs:
         with:
           fetch-depth: 0

+      - name: Install make
+        run: apt-get update && apt-get install -y --no-install-recommends make

Copilot AI Apr 3, 2026

This step always runs `apt-get update && apt-get install ...` even when `make` is already present, which adds avoidable time and an external dependency (apt mirrors) to every run. Add a small guard (e.g., check `command -v make`) so the install only happens when needed.

Suggested change
-        run: apt-get update && apt-get install -y --no-install-recommends make
+        run: |
+          if ! command -v make >/dev/null 2>&1; then
+            apt-get update
+            apt-get install -y --no-install-recommends make
+          fi

       - name: Setup Python environment
         uses: ./.github/actions/setup-python-env
         with:
@@ -80,6 +84,10 @@ jobs:
       - name: Bootstrap CUDA environment
         run: make bootstrap-nss cu128

+      - name: Check GPU availability
+        run: |
+          uv run python -c "import torch; print('cuda available:', torch.cuda.is_available()); print('device count:', torch.cuda.device_count())"

Copilot AI Apr 3, 2026

“Check GPU availability” only prints `torch.cuda.is_available()`/device count and will still succeed when CUDA isn’t available. Either fail explicitly when CUDA is unavailable (for a clearer, earlier failure) or rename the step to indicate it’s informational logging.

Suggested change
-          uv run python -c "import torch; print('cuda available:', torch.cuda.is_available()); print('device count:', torch.cuda.device_count())"
+          uv run python -c "import sys, torch; available = torch.cuda.is_available(); count = torch.cuda.device_count(); print('cuda available:', available); print('device count:', count); sys.exit('CUDA is not available on this runner') if not available else sys.exit('No CUDA devices detected on this runner') if count < 1 else None"
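
The suggested one-liner packs two failure conditions into nested conditional expressions, which is hard to read in a workflow file. A more readable equivalent might look like the following sketch. `gpu_check_error` and `main` are hypothetical names; only `torch.cuda.is_available()` and `torch.cuda.device_count()` come from the original step, and `torch` is imported lazily so the decision logic can be exercised without a GPU or a PyTorch install.

```python
import sys
from typing import Optional


def gpu_check_error(available: bool, count: int) -> Optional[str]:
    """Return an error message if no usable CUDA device is present, else None."""
    if not available:
        return "CUDA is not available on this runner"
    if count < 1:
        return "No CUDA devices detected on this runner"
    return None


def main() -> None:
    import torch  # imported here so the helper above is testable without torch

    available = torch.cuda.is_available()
    count = torch.cuda.device_count()
    print("cuda available:", available)
    print("device count:", count)
    error = gpu_check_error(available, count)
    if error is not None:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(error)

# In CI, a script containing this code would end with a call to main().
```

Keeping the check in a small script rather than a `python -c` string also avoids shell-quoting issues, at the cost of one more file to maintain.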

       - name: Run GPU smoke tests
         timeout-minutes: 20
         run: make test-smoke-gpu
@@ -89,13 +97,16 @@ jobs:
     needs: changes
     if: ${{ needs.changes.outputs.src == 'true' || needs.changes.outputs.test == 'true' || github.event_name == 'workflow_dispatch' }}
     timeout-minutes: 55
-    runs-on: linux-amd64-gpu-a100-latest-1
+    runs-on: nemo-ci-aws-gpu-x2

Copilot AI Apr 2, 2026

Same as above: switching the E2E job runner label here should be accompanied by updating the documented runner label/hardware (the workflows README currently states E2E runs on A100 `linux-amd64-gpu-a100-latest-1`, and this file’s top comment says on-prem). Otherwise readers will assume the wrong environment when investigating test failures/perf differences.

Suggested change
-    runs-on: nemo-ci-aws-gpu-x2
+    runs-on: linux-amd64-gpu-a100-latest-1

     steps:
       - name: checkout
         uses: actions/checkout@v6
         with:
           fetch-depth: 0

+      - name: Install make
+        run: apt-get update && apt-get install -y --no-install-recommends make
+
       - name: Setup Python environment
         uses: ./.github/actions/setup-python-env
         with:
@@ -105,6 +116,10 @@ jobs:
       - name: Bootstrap CUDA environment
         run: make bootstrap-nss cu128

+      - name: Check GPU availability
+        run: |
+          uv run python -c "import torch; print('cuda available:', torch.cuda.is_available()); print('device count:', torch.cuda.device_count())"
+
       - name: Run GPU E2E tests
         timeout-minutes: 45
         run: make test-e2e