26 commits
0539dbe
Implement ciflow/rocm.
akashveramd Dec 5, 2025
19a5a15
Removed tracking_issue & ciflow_tracking_issue.
akashveramd Dec 5, 2025
a2a55a7
Removed ciflow/rocm tag.
akashveramd Dec 5, 2025
cc29b39
Added condition to run ROCm workflow when label is added.
akashveramd Dec 9, 2025
4485cd1
Moved dynamic matrix creation to a separate reusable workflow. Added …
akashveramd Dec 10, 2025
132a3eb
Using set instead of write in set-matrix.
akashveramd Dec 10, 2025
75a1d53
Removed trailing slash from 'ciflow/8gpu' label.
akashveramd Dec 10, 2025
8f8d9df
Refactored set-matrix.yaml. Added option to create ROCm matrix for pu…
akashveramd Dec 10, 2025
9417b90
Fixed indentation and Lint in set-matrix.yaml.
akashveramd Dec 10, 2025
924444d
Fixed indentation in set-matrix.yaml.
akashveramd Dec 10, 2025
e2b8f12
Run ROCm workflow only when PR label added.
akashveramd Dec 11, 2025
db8cb50
Changed the name to TRIGGERED_8GPU_LABEL. Setting default value for T…
akashveramd Dec 11, 2025
c07b45d
Minor comment change.
akashveramd Dec 11, 2025
a53e303
re-ordered if conditions in set-matrix.yaml.
akashveramd Dec 11, 2025
b5d35a4
Dummy commit
akashveramd Dec 11, 2025
1b429ca
Separated condition for PR label trigger in set-matrix.yaml.
akashveramd Dec 11, 2025
878e01b
DEBUG:rolled back original if conditions in set-matrix.yaml and added…
akashveramd Dec 11, 2025
b001e4f
Dummy commit
akashveramd Dec 11, 2025
17e7fd5
DEBUG: Using 4 GPU ROCm runner.
akashveramd Dec 11, 2025
e54daee
Dummy commit
akashveramd Dec 12, 2025
255ab3d
Using 8 GPU ROCm runner.
akashveramd Dec 12, 2025
ffd1da3
Corrected indentation in labeler.yml. Removed debug statements from s…
akashveramd Dec 12, 2025
c6822c9
Corrected indentation in pytorch-probot.yml.
akashveramd Dec 12, 2025
4fd8306
Simplify pull_request trigger
huydhn Dec 13, 2025
158571f
Don't run both CUDA and ROCm job when ciflow/8gpu is there
huydhn Dec 13, 2025
a2f63ab
Remove HAS_8GPU_LABEL
huydhn Dec 13, 2025
6 changes: 6 additions & 0 deletions .github/labeler.yml
@@ -0,0 +1,6 @@
"ciflow/8gpu":
- .ci/docker/**
- .github/workflows/**
- scripts/**
- tests/**
- torchtitan/**
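The labeler configuration above maps path globs to the `ciflow/8gpu` label. A minimal Python sketch of that idea follows. It assumes a simple "any changed file matches any pattern" rule and uses `fnmatchcase` as a stand-in matcher; the real bot may interpret `**` differently, and the names `LABEL_PATTERNS` and `labels_for` are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns copied from the labeler.yml diff above.
LABEL_PATTERNS = {
    "ciflow/8gpu": [
        ".ci/docker/**",
        ".github/workflows/**",
        "scripts/**",
        "tests/**",
        "torchtitan/**",
    ],
}


def labels_for(changed_files):
    """Return the sorted labels whose patterns match at least one changed file.

    Assumption: fnmatchcase approximates the bot's glob matching. In fnmatch,
    '*' also matches '/', so 'torchtitan/**' covers nested paths.
    """
    return sorted(
        label
        for label, patterns in LABEL_PATTERNS.items()
        if any(fnmatchcase(path, pat) for path in changed_files for pat in patterns)
    )


print(labels_for(["torchtitan/train.py", "README.md"]))  # ['ciflow/8gpu']
print(labels_for(["README.md"]))  # []
```

Under this model, a PR that touches only files outside the five listed directories would not receive the label.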
3 changes: 3 additions & 0 deletions .github/pytorch-probot.yml
@@ -0,0 +1,3 @@
ciflow_push_tags:
- ciflow/8gpu
labeler_config: labeler.yml
30 changes: 3 additions & 27 deletions .github/workflows/integration_test_8gpu_features.yaml
@@ -3,6 +3,8 @@ name: 8 GPU Feature Tests
 on:
   push:
     branches: [ main ]
+    tags:
+      - ciflow/8gpu/*
Comment on lines +6 to +7
Contributor

What's this for -- why do we need both pull_request and tags?

Collaborator Author

@akashveramd akashveramd Dec 11, 2025

As I understand it, PR workflows and tag workflows are completely independent. Tags provide the CI flow: a tag can be pushed to trigger a CI run on a specific commit, even after the PR is closed. Tags can also be used for versioning releases.
The pull_request trigger runs workflows while the PR is open.

     paths-ignore:
       - 'torchtitan/experiments/**'
   pull_request:
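The review thread above distinguishes tag pushes from `pull_request` events. A small Python sketch of that distinction, keyed on the event name and git ref, is shown here. The function name `describe_trigger`, the returned strings, and the PR number `123` in the example refs are illustrative assumptions, not part of the workflow.

```python
def describe_trigger(event_name, ref):
    """Classify a workflow run by its event name and git ref."""
    # A pushed tag arrives as a 'push' event whose ref starts with refs/tags/.
    if event_name == "push" and ref.startswith("refs/tags/ciflow/8gpu/"):
        return "ciflow tag push"
    if event_name == "push" and ref == "refs/heads/main":
        return "push to main"
    # pull_request runs exist only while the PR itself is generating events.
    if event_name == "pull_request":
        return "pull request"
    return "other"


print(describe_trigger("push", "refs/tags/ciflow/8gpu/123"))  # ciflow tag push
print(describe_trigger("push", "refs/heads/main"))  # push to main
print(describe_trigger("pull_request", "refs/pull/123/merge"))  # pull request
```

Because a tag push is a `push` event rather than a `pull_request` event, the `tags:` filter and the `pull_request:` trigger select disjoint sets of runs.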
@@ -27,33 +29,7 @@ permissions:
 jobs:
   # Step 1: Dynamically compute the matrix based on conditions
   set-matrix:
-    runs-on: ubuntu-latest
-    outputs:
-      matrix: ${{ steps.set.outputs.matrix }}
-    steps:
-      - id: set
-        run: |
-          # Decide which matrix entries to include based on event type
-          if [[ "${{ github.event_name }}" == "push" && "${{ github.ref }}" == "refs/heads/main" ]] || [[ "${{ github.event_name }}" == "schedule" ]]; then
-            # Include both CUDA and ROCm
-            echo '{"include":[
-              {"name":"cuda","runner":"linux.g5.48xlarge.nvidia.gpu","gpu-arch-type":"cuda","gpu-arch-version":"12.6","docker-image":"torchtitan-ubuntu-20.04-clang12","index-url":"https://download.pytorch.org/whl/nightly/cu126"},
-              {"name":"rocm","runner":"linux.rocm.gpu.gfx942.8","gpu-arch-type":"rocm","gpu-arch-version":"7.0","docker-image":"torchtitan-rocm-ubuntu-22.04-clang12","index-url":"https://download.pytorch.org/whl/nightly/rocm7.0"}
-            ]}' > matrix.json
-          else
-            # Include only CUDA
-            echo '{"include":[
-              {"name":"cuda","runner":"linux.g5.48xlarge.nvidia.gpu","gpu-arch-type":"cuda","gpu-arch-version":"12.6","docker-image":"torchtitan-ubuntu-20.04-clang12","index-url":"https://download.pytorch.org/whl/nightly/cu126"}
-            ]}' > matrix.json
-          fi
-
-          # Export matrix to job outputs
-          {
-            echo 'matrix<<EOF'
-            cat matrix.json
-            echo 'EOF'
-          } >> $GITHUB_OUTPUT
-
+    uses: ./.github/workflows/set-matrix.yaml

   # Step 2: Use the dynamic matrix in the build-test job
   build-test:
78 changes: 78 additions & 0 deletions .github/workflows/set-matrix.yaml
@@ -0,0 +1,78 @@
name: Set Matrix

on:
  workflow_call:
    outputs:
      matrix:
        description: dynamically set matrix
        value: ${{ jobs.set.outputs.matrix }}

jobs:
  set:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.set.outputs.matrix }}
    env:
      # Event flags evaluated by github actions before the step runs:
      IS_MAIN_PUSH: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
      IS_SCHEDULE: ${{ github.event_name == 'schedule' }}
      IS_PR: ${{ github.event_name == 'pull_request' }}
      IS_8GPU_TAG: ${{ startsWith(github.ref, 'refs/tags/ciflow/8gpu/') }}
Contributor

why do we need this, if we already have the one above?

Collaborator Author

Tags and labels are independent events, so I created a separate variable for the label trigger.

      TRIGGERED_8GPU_LABEL: ${{ github.event_name == 'pull_request' && github.event.action == 'labeled' }}

    steps:
      - id: set
        run: |
          # Define ROCm matrix
          ROCM_MATRIX='{
            "name": "rocm",
            "runner": "linux.rocm.gpu.gfx942.8",
            "gpu-arch-type": "rocm",
            "gpu-arch-version": "7.0",
            "docker-image": "torchtitan-rocm-ubuntu-22.04-clang12",
            "index-url": "https://download.pytorch.org/whl/nightly/rocm7.0"
          }'

          # Define CUDA matrix
          CUDA_MATRIX='{
            "name": "cuda",
            "runner": "linux.g5.48xlarge.nvidia.gpu",
            "gpu-arch-type": "cuda",
            "gpu-arch-version": "12.6",
            "docker-image": "torchtitan-ubuntu-20.04-clang12",
            "index-url": "https://download.pytorch.org/whl/nightly/cu126"
          }'

          # Default to 'false' for any unset environment variable
          IS_MAIN_PUSH="${IS_MAIN_PUSH:-false}"
          IS_SCHEDULE="${IS_SCHEDULE:-false}"
          IS_PR="${IS_PR:-false}"
          IS_8GPU_TAG="${IS_8GPU_TAG:-false}"
          TRIGGERED_8GPU_LABEL="${TRIGGERED_8GPU_LABEL:-false}"

          # Decide which matrix entries to include based on event type
          # Run ROCm only for a ciflow/8gpu tag push OR when a label is added to the PR
          if [[ "$IS_8GPU_TAG" == "true" || "$TRIGGERED_8GPU_LABEL" == "true" ]]; then
            cat > matrix.json <<JSON
          {"include": [$ROCM_MATRIX]}
          JSON

          # Run CUDA and ROCm for push to main OR the cron schedule
          elif [[ ("$IS_MAIN_PUSH" == "true" || "$IS_SCHEDULE" == "true") ]]; then
            cat > matrix.json <<JSON
          {"include": [$CUDA_MATRIX,$ROCM_MATRIX]}
          JSON

          # Run CUDA only by default (includes normal PR events that are not a label addition)
          else
            cat > matrix.json <<JSON
          {"include": [$CUDA_MATRIX]}
          JSON
          fi

          # Export matrix to job outputs
          {
            echo 'matrix<<EOF'
            cat matrix.json
            echo 'EOF'
          } >> "$GITHUB_OUTPUT"
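
The branch logic in set-matrix.yaml above can be checked offline with a small Python model. This is a sketch under stated assumptions rather than part of the PR: `select_matrix`, `as_github_output`, and `names` are hypothetical helpers, the PR number `123` in the refs is made up, and the matrix entries are copied from the diff.

```python
import json

# Matrix entries copied from set-matrix.yaml above.
ROCM = {
    "name": "rocm",
    "runner": "linux.rocm.gpu.gfx942.8",
    "gpu-arch-type": "rocm",
    "gpu-arch-version": "7.0",
    "docker-image": "torchtitan-rocm-ubuntu-22.04-clang12",
    "index-url": "https://download.pytorch.org/whl/nightly/rocm7.0",
}
CUDA = {
    "name": "cuda",
    "runner": "linux.g5.48xlarge.nvidia.gpu",
    "gpu-arch-type": "cuda",
    "gpu-arch-version": "12.6",
    "docker-image": "torchtitan-ubuntu-20.04-clang12",
    "index-url": "https://download.pytorch.org/whl/nightly/cu126",
}


def select_matrix(event_name, ref, action=None):
    """Mirror the if/elif/else chain of the bash step (hypothetical helper)."""
    is_main_push = event_name == "push" and ref == "refs/heads/main"
    is_schedule = event_name == "schedule"
    is_8gpu_tag = ref.startswith("refs/tags/ciflow/8gpu/")
    triggered_label = event_name == "pull_request" and action == "labeled"

    if is_8gpu_tag or triggered_label:
        include = [ROCM]  # ROCm only
    elif is_main_push or is_schedule:
        include = [CUDA, ROCM]  # both backends
    else:
        include = [CUDA]  # default: CUDA only
    return {"include": include}


def as_github_output(matrix):
    """Format the matrix with the multiline 'name<<DELIMITER' output syntax."""
    return "matrix<<EOF\n" + json.dumps(matrix) + "\nEOF\n"


def names(matrix):
    return [entry["name"] for entry in matrix["include"]]


print(names(select_matrix("push", "refs/heads/main")))  # ['cuda', 'rocm']
print(names(select_matrix("push", "refs/tags/ciflow/8gpu/123")))  # ['rocm']
print(names(select_matrix("pull_request", "refs/pull/123/merge", "labeled")))  # ['rocm']
print(names(select_matrix("pull_request", "refs/pull/123/merge", "synchronize")))  # ['cuda']
```

One thing the model makes visible: as written in this snapshot, the label condition tests only that the action is `labeled`, not which label was added, so adding any label to a PR would select the ROCm-only matrix.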