Skip to content

Add nightly job#142

Merged
skyw merged 3 commits intomainfrom
skyw/nightly_ci
Mar 20, 2026
Merged

Add nightly job#142
skyw merged 3 commits intomainfrom
skyw/nightly_ci

Conversation

@skyw
Copy link
Contributor

@skyw skyw commented Mar 20, 2026

No description provided.

Signed-off-by: Hao Wu <skyw@nvidia.com>
@copy-pr-bot
Copy link

copy-pr-bot bot commented Mar 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@skyw
Copy link
Contributor Author

skyw commented Mar 20, 2026

@greptile

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 20, 2026

Greptile Summary

This PR extends the GitHub Actions CI/CD pipeline with a dedicated nightly job that runs L1_Tests_GPU only on the schedule trigger, keeping those longer/heavier tests out of the regular per-PR flow. It pairs the new test job with a publish-nightly-test-results publisher, scopes the existing publish-test-results download pattern to L0_* artifacts only, and adds a name-based jq exclusion in the Nemo_CICD_Test gate so that nightly failures don't block PR merges.

Key changes:

  • New cicd-nightly-tests job: runs only on schedule, uses the existing test-template action with a 60-minute timeout, uploads results under test-results-L1_Tests_GPU.
  • New publish-nightly-test-results job: downloads test-results-L1_* artifacts and publishes them with check_name: "Nightly Test Results".
  • publish-test-results download pattern narrowed from test-results-*test-results-L0_* to avoid accidentally collecting L1 artifacts.
  • Nemo_CICD_Test jq filters updated to exclude jobs named L1_Tests_GPU, so nightly failures are non-blocking.
  • Two style-level observations: the L1_Tests_GPU string is hardcoded in two places in the gate script and must stay in sync with the job's name: field; and the is_ci_workload/force_run_all fallback branches inside the nightly job's if condition are unreachable on schedule events.

Confidence Score: 4/5

  • Safe to merge — nightly job is isolated from the PR gate and all changes are additive with no impact on existing non-schedule runs.
  • The design is sound: the nightly job only fires on schedule events and its failures are deliberately excluded from the merge gate. The two flagged items are style/maintenance concerns (hardcoded string coupling and dead conditional branches) rather than correctness bugs.
  • No files require special attention, though the hardcoded L1_Tests_GPU string in the jq filters at lines 233-234 should be kept in sync with the name: field on line 127.

Important Files Changed

Filename Overview
.github/workflows/cicd-main.yml Adds cicd-nightly-tests (L1_Tests_GPU on schedule) and publish-nightly-test-results jobs; narrows publish-test-results artifact pattern to L0 only; filters L1 job name out of Nemo_CICD_Test gate logic. Two style issues: hardcoded job-name string in the jq exclusion filter and unreachable fallback conditions in the nightly job's if block.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[pre-flight] --> B[cicd-wait-in-queue]
    A --> C[cicd-container-build]
    B --> C
    C --> D[cicd-unit-tests\nL0_Tests_GPU / L0_Tests_CPU]
    C --> E[cicd-nightly-tests\nL1_Tests_GPU\n⏰ schedule only]

    D --> F[publish-test-results\npattern: test-results-L0_*]
    E --> G[publish-nightly-test-results\npattern: test-results-L1_*\n⏰ schedule only]

    D --> H[Nemo_CICD_Test\ngate job\nexcludes L1_Tests_GPU from\nfailure/cancel counts]
    C --> H
    A --> H

    H --> I[Coverage_Fake / Coverage]

    style E fill:#f0e68c,stroke:#c0a030
    style G fill:#f0e68c,stroke:#c0a030
Loading

Comments Outside Diff (2)

  1. .github/workflows/cicd-main.yml, line 233-234 (link)

    Hardcoded job name tightly coupled to exclusion filter

    The jq filters on these two lines rely on the hardcoded string "L1_Tests_GPU", which must exactly match the name: field of the cicd-nightly-tests job (line 127). If the nightly job is ever renamed, these filters will silently start counting its failures/cancellations as blocking failures in the gate job — which is the opposite of the intended behavior.

    Consider extracting it into a shared environment variable or a comment pairing it to line 127, to make the coupling explicit:

    Alternatively, filtering by the L0_ prefix (matching the existing cicd-unit-tests matrix scripts) is more robust: only L0 jobs affect the gate, while any future L1+ nightly jobs are automatically excluded.

  2. .github/workflows/cicd-main.yml, line 129-136 (link)

    Nightly job fallback conditions are unreachable on schedule events

    The inner || branches — needs.pre-flight.outputs.is_ci_workload == 'true' and needs.pre-flight.outputs.force_run_all == 'true' — are only reached when success() is false (i.e., when cicd-container-build fails). On a scheduled run these pre-flight outputs will be 'false' because they are driven by PR labels. In practice, if the container build fails on a nightly run, no branch of this condition will be true and the nightly tests will be skipped, which is the desired behavior — but the extra branches create a false impression that they could help recover from a build failure. Consider simplifying to:

Last reviewed commit: "isolate artifact"

Signed-off-by: Hao Wu <skyw@nvidia.com>
@skyw skyw marked this pull request as ready for review March 20, 2026 22:49
@skyw skyw requested a review from a team as a code owner March 20, 2026 22:49
Signed-off-by: Hao Wu <skyw@nvidia.com>
@skyw skyw merged commit 8a44b09 into main Mar 20, 2026
4 checks passed
@skyw skyw deleted the skyw/nightly_ci branch March 20, 2026 23:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants