33 commits
e7a9e0b
Added support to run torchtitan tests on ROCm.
akashveramd Jun 4, 2025
04a1718
Added rocm ci support for integration_test_h100.
akashveramd Jun 5, 2025
7894f3f
Fixed a bug in build script. Removed ubuntu-cuda folder, instead usin…
akashveramd Jun 7, 2025
041c04b
Added tests.integration_tests.features during rebase.
akashveramd Jun 11, 2025
19863fb
Modified docker-builds.yml to build rocm docker image for torchtitan.
akashveramd Jun 13, 2025
cacfd75
Fixed runner for cuda and rocm images in docker-builds.yml.
akashveramd Jun 18, 2025
0f89cb6
Added TEST_WITH_ROCM environment variable for running tests on rocm. …
akashveramd Jun 19, 2025
21838e0
Made additional changes to tests.integration_tests.features during re…
akashveramd Jun 24, 2025
98c7a65
Changed runner to i-0962598bd0e8298b3 for building ROCm docker image.
akashveramd Jun 29, 2025
9a28776
Changed runner to linux.12xlarge for building ROCm docker image.
akashveramd Jun 30, 2025
ab45e78
Changed runner to linux.2xlarge for building ROCm docker image.
akashveramd Jun 30, 2025
56bf930
Resolved conflict in .github.workflows.integration_test_8gpu_models d…
akashveramd Jul 3, 2025
74dbc4a
Changed rocm docker image name in docker-builds.yml.
akashveramd Jul 3, 2025
07a4a73
Reverted the changes to integration_test_8gpu_h100.yaml.
akashveramd Jul 9, 2025
be0ecb5
Empty dummy commit.
akashveramd Jul 16, 2025
0f5048e
Increased the timeout to 45 minutes to override timeout used in linux…
akashveramd Jul 17, 2025
7b5dcdf
Empty dummy commit.
akashveramd Jul 17, 2025
2512cf5
Added aws setup in the integration_test_8gpu workflow.
akashveramd Sep 23, 2025
c23e65b
Performed rebase and made changes to include code refactoring done up…
akashveramd Sep 26, 2025
a99db9f
Changed rocm runner name.
akashveramd Sep 26, 2025
3d331bc
Added a change to run build-test after aws-setup.
akashveramd Sep 26, 2025
7d359dd
Changed the test name in integration_test_8gpu.yaml workflow file.
akashveramd Sep 26, 2025
0f5c57f
Fixed id-token permission issue in integration_test_8gpu.yaml.
akashveramd Sep 29, 2025
a8368a2
Added id-token permission issue inside aws-setup job in integration_t…
akashveramd Sep 29, 2025
36fb0e5
To test workflow, switched to 4 GPU runner as they are relatively eas…
akashveramd Sep 29, 2025
1fba2ab
Moved permissions section for id-token outside the aws-setup job.
akashveramd Sep 30, 2025
acbedff
Using move_aws_steps_inside_setup_rocm branch to do aws authenticatio…
akashveramd Oct 6, 2025
f8577d7
Using linux linux_job_v2.yml in akashveramd fork having aws setup onl…
akashveramd Oct 7, 2025
9626083
Using linux_job_v2.yml having id-token write permissions from pytorch…
akashveramd Oct 7, 2025
6985eec
Removed integration_test_8gpu.yaml and added ROCm workflow to run fea…
akashveramd Oct 8, 2025
b9ab5c2
Using 7311 branch for linux_job_v2.yml.
akashveramd Oct 9, 2025
184e8be
Using 4 GPU runner for ROCm.
akashveramd Oct 9, 2025
11cb25d
Switched back to main branch for linux_job_v2.yml.
akashveramd Oct 11, 2025
10 changes: 9 additions & 1 deletion .ci/docker/build.sh
@@ -13,14 +13,20 @@ shift
 echo "Building ${IMAGE_NAME} Docker image"

 OS=ubuntu
-OS_VERSION=20.04
 CLANG_VERSION=""
 PYTHON_VERSION=3.11
 MINICONDA_VERSION=24.3.0-0

 case "${IMAGE_NAME}" in
   torchtitan-ubuntu-20.04-clang12)
+    OS_VERSION=20.04
     CLANG_VERSION=12
+    BASE_IMAGE=nvidia/cuda:12.4.1-cudnn-runtime-ubuntu${OS_VERSION}
     ;;
+  torchtitan-rocm-ubuntu-22.04-clang12)
+    OS_VERSION=22.04
+    CLANG_VERSION=12
+    BASE_IMAGE=rocm/dev-ubuntu-${OS_VERSION}:latest
+    ;;
   *)
     echo "Invalid image name ${IMAGE_NAME}"
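For readers less familiar with shell `case` dispatch, the image-name handling in this hunk can be sketched in Python as below. This is only an illustration of the mapping visible in the diff. `ImageConfig`, `resolve_image_config`, and `docker_build_args` are hypothetical names that do not exist in the repository, and only the build arguments visible in the diff are reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageConfig:
    """Hypothetical container for the values build.sh sets per image."""
    os_version: str
    clang_version: str
    base_image: str


def resolve_image_config(image_name: str) -> ImageConfig:
    """Mirror the case statement: one branch per supported image name."""
    if image_name == "torchtitan-ubuntu-20.04-clang12":
        os_version = "20.04"
        base_image = f"nvidia/cuda:12.4.1-cudnn-runtime-ubuntu{os_version}"
    elif image_name == "torchtitan-rocm-ubuntu-22.04-clang12":
        os_version = "22.04"
        base_image = f"rocm/dev-ubuntu-{os_version}:latest"
    else:
        # Corresponds to the `*)` fallback branch in build.sh.
        raise ValueError(f"Invalid image name {image_name}")
    return ImageConfig(os_version, "12", base_image)


def docker_build_args(cfg: ImageConfig, python_version: str = "3.11") -> list[str]:
    """Build the --build-arg pairs shown in the diff (remaining args omitted)."""
    pairs = {
        "BASE_IMAGE": cfg.base_image,
        "OS_VERSION": cfg.os_version,
        "CLANG_VERSION": cfg.clang_version,
        "PYTHON_VERSION": python_version,
    }
    args: list[str] = []
    for key, value in pairs.items():
        args += ["--build-arg", f"{key}={value}"]
    return args
```

Because `OS_VERSION` is now set inside each branch, the base image string is derived per image rather than from a single script-wide default.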
@@ -30,6 +36,7 @@ esac
 docker build \
   --no-cache \
   --progress=plain \
+  --build-arg "BASE_IMAGE=${BASE_IMAGE}" \
   --build-arg "OS_VERSION=${OS_VERSION}" \
   --build-arg "CLANG_VERSION=${CLANG_VERSION}" \
   --build-arg "PYTHON_VERSION=${PYTHON_VERSION}" \
@@ -38,3 +45,4 @@ docker build \
   -f "${OS}"/Dockerfile \
   "$@" \
   .
+
4 changes: 2 additions & 2 deletions .ci/docker/ubuntu/Dockerfile
@@ -1,6 +1,6 @@
-ARG OS_VERSION
+ARG BASE_IMAGE

-FROM nvidia/cuda:12.4.1-cudnn-runtime-ubuntu${OS_VERSION}
+FROM ${BASE_IMAGE}

 ARG OS_VERSION

7 changes: 5 additions & 2 deletions .github/workflows/docker-builds.yml
@@ -22,13 +22,16 @@ concurrency:

 jobs:
   docker-build:
-    runs-on: [self-hosted, linux.2xlarge]
-    timeout-minutes: 240
     strategy:
       fail-fast: false
       matrix:
         include:
           - docker-image-name: torchtitan-ubuntu-20.04-clang12
+            runner: [self-hosted, linux.2xlarge]
+          - docker-image-name: torchtitan-rocm-ubuntu-22.04-clang12
+            runner: [linux.2xlarge]
+    runs-on: ${{ matrix.runner }}
+    timeout-minutes: 240
     env:
       DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/torchtitan/${{ matrix.docker-image-name }}
     steps:
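As a rough illustration of what the per-image `runner` key achieves, the matrix expansion in this hunk can be modeled in Python. This sketch is not how GitHub Actions is implemented; `MATRIX_INCLUDE`, `ECR_PREFIX`, and `expand_jobs` are hypothetical names, and the values are copied from the diff.

```python
# Matrix entries as they appear in docker-builds.yml after this change.
MATRIX_INCLUDE = [
    {
        "docker-image-name": "torchtitan-ubuntu-20.04-clang12",
        "runner": ["self-hosted", "linux.2xlarge"],
    },
    {
        "docker-image-name": "torchtitan-rocm-ubuntu-22.04-clang12",
        "runner": ["linux.2xlarge"],
    },
]

# Registry prefix taken from the DOCKER_IMAGE env entry in the workflow.
ECR_PREFIX = "308535385114.dkr.ecr.us-east-1.amazonaws.com/torchtitan"


def expand_jobs(matrix_include: list[dict]) -> list[dict]:
    """Produce one job description per matrix entry."""
    return [
        {
            "runs-on": entry["runner"],
            "timeout-minutes": 240,
            "DOCKER_IMAGE": f"{ECR_PREFIX}/{entry['docker-image-name']}",
        }
        for entry in matrix_include
    ]


if __name__ == "__main__":
    for job in expand_jobs(MATRIX_INCLUDE):
        print(job["runs-on"], job["DOCKER_IMAGE"])
```

Each entry yields its own job, so the CUDA and ROCm images are built on the runner labels listed next to them.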
41 changes: 32 additions & 9 deletions .github/workflows/integration_test_8gpu_features.yaml
@@ -1,4 +1,5 @@
 name: 8 GPU Feature Tests
+
 on:
   push:
     branches: [ main ]
@@ -19,18 +20,40 @@ defaults:
   run:
     shell: bash -l -eo pipefail {0}

+permissions:
+  id-token: write
+  contents: read
+
 jobs:
   build-test:
     uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
+    strategy:
+      matrix:
+        include:
+          - name: cuda
+            runner: linux.g5.48xlarge.nvidia.gpu
+            gpu-arch-type: cuda
+            gpu-arch-version: "12.6"
+            # This image is faster to clone than the default, but it lacks CC needed by triton
+            # (1m25s vs 2m37s).
+            docker-image: torchtitan-ubuntu-20.04-clang12
+            index-url: https://download.pytorch.org/whl/nightly/cu126
+            is-rocm: 0
+          - name: rocm
+            runner: linux.rocm.gpu.gfx942.4
+            gpu-arch-type: rocm
+            gpu-arch-version: "6.4"
+            docker-image: torchtitan-rocm-ubuntu-22.04-clang12
+            index-url: https://download.pytorch.org/whl/nightly/rocm6.4
+            is-rocm: 1
     with:
-      runner: linux.g5.48xlarge.nvidia.gpu
-      gpu-arch-type: cuda
-      gpu-arch-version: "12.6"
-      # This image is faster to clone than the default, but it lacks CC needed by triton
-      # (1m25s vs 2m37s).
-      docker-image: torchtitan-ubuntu-20.04-clang12
+      runner: ${{ matrix.runner }}
+      gpu-arch-type: ${{ matrix.gpu-arch-type }}
+      gpu-arch-version: ${{ matrix.gpu-arch-version }}
+      docker-image: ${{ matrix.docker-image }}
       repository: pytorch/torchtitan
       upload-artifact: outputs
+      timeout: 45
       script: |
         set -eux

@@ -44,9 +67,9 @@ jobs:

         pip config --user set global.progress_bar off

-        python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126
+        python -m pip install --force-reinstall --pre torch --index-url ${{ matrix.index-url }}

-        USE_CPP=0 python -m pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
+        USE_CPP=0 python -m pip install --pre torchao --index-url ${{ matrix.index-url }}

         mkdir artifacts-to-be-uploaded
-        python -m tests.integration_tests.run_tests --test_suite features artifacts-to-be-uploaded --ngpu 8
+        TEST_WITH_ROCM=${{ matrix.is-rocm }} python -m tests.integration_tests.run_tests --test_suite features artifacts-to-be-uploaded --ngpu 4
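To make the effect of the `${{ matrix.* }}` substitutions concrete, the sketch below renders the three changed script lines for each matrix entry. It is a plain-Python approximation of the templating, not the workflow engine itself. `MATRIX` and `render_commands` are hypothetical names, and the values are copied from the diff.

```python
# The two matrix entries, reduced to the keys the script lines use.
MATRIX = [
    {
        "name": "cuda",
        "index-url": "https://download.pytorch.org/whl/nightly/cu126",
        "is-rocm": 0,
    },
    {
        "name": "rocm",
        "index-url": "https://download.pytorch.org/whl/nightly/rocm6.4",
        "is-rocm": 1,
    },
]


def render_commands(entry: dict, ngpu: int = 4) -> list[str]:
    """Substitute one matrix entry into the changed script lines."""
    url = entry["index-url"]
    return [
        f"python -m pip install --force-reinstall --pre torch --index-url {url}",
        f"USE_CPP=0 python -m pip install --pre torchao --index-url {url}",
        f"TEST_WITH_ROCM={entry['is-rocm']} python -m tests.integration_tests.run_tests "
        f"--test_suite features artifacts-to-be-uploaded --ngpu {ngpu}",
    ]


if __name__ == "__main__":
    for entry in MATRIX:
        print(f"# {entry['name']}")
        print("\n".join(render_commands(entry)))
```

Note that the script body is shared by both entries, so as written in this diff the `--ngpu 4` value applies to the CUDA job as well as the ROCm job.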
11 changes: 11 additions & 0 deletions tests/integration_tests/run_tests.py
@@ -24,6 +24,13 @@
 }


+# tests skipped for ROCm
+skip_for_rocm_test_list = [
+    "model_only_hf_checkpoint",
+]
+TEST_WITH_ROCM = os.getenv("TEST_WITH_ROCM", "0") == "1"
+
+
 def _run_cmd(cmd):
     return subprocess.run([cmd], text=True, shell=True)
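The `TEST_WITH_ROCM` line compares the raw environment string against `"1"`, so only that exact value enables the ROCm path. A minimal sketch of the same check, using a hypothetical helper `rocm_enabled` that accepts an explicit mapping so it can be exercised without touching the real environment:

```python
import os
from typing import Mapping, Optional


def rocm_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when TEST_WITH_ROCM is exactly the string "1"."""
    env = os.environ if environ is None else environ
    # Same comparison as the diff: unset defaults to "0", which is falsy here.
    return env.get("TEST_WITH_ROCM", "0") == "1"


if __name__ == "__main__":
    print(rocm_enabled({"TEST_WITH_ROCM": "1"}))     # exact match enables
    print(rocm_enabled({}))                           # unset disables
    print(rocm_enabled({"TEST_WITH_ROCM": "true"}))  # other strings disable
```

This matches how the workflow passes the flag, since `is-rocm` is rendered as `0` or `1` in the matrix.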

@@ -87,6 +94,10 @@ def run_tests(args, test_list: list[OverrideDefinitions]):
         if test_flavor.disabled:
             continue

+        # Skip the test for ROCm
+        if TEST_WITH_ROCM and test_flavor.test_name in skip_for_rocm_test_list:
+            continue
+
         # Check if we have enough GPUs
         if args.ngpu < test_flavor.ngpu:
             logger.info(
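The filtering order in this hunk (disabled check, then ROCm skip list, then GPU count) can be sketched as a standalone function. `Flavor` is a hypothetical stand-in for `OverrideDefinitions` carrying only the three fields the hunk reads, and `select_tests` is an illustrative name rather than anything in `run_tests.py`. The sketch also assumes an under-provisioned test is skipped after the log call, which the truncated hunk does not show.

```python
from dataclasses import dataclass


@dataclass
class Flavor:
    """Hypothetical stand-in for OverrideDefinitions (only the fields used here)."""
    test_name: str
    ngpu: int = 4
    disabled: bool = False


# Same entry as skip_for_rocm_test_list in the diff.
SKIP_FOR_ROCM = ["model_only_hf_checkpoint"]


def select_tests(flavors: list[Flavor], ngpu: int, test_with_rocm: bool) -> list[str]:
    """Return the names of the tests that would run, in order."""
    selected = []
    for flavor in flavors:
        if flavor.disabled:
            continue
        # Skip the test for ROCm
        if test_with_rocm and flavor.test_name in SKIP_FOR_ROCM:
            continue
        # Assumption: tests needing more GPUs than available are skipped.
        if ngpu < flavor.ngpu:
            continue
        selected.append(flavor.test_name)
    return selected
```

With this shape, adding another ROCm exclusion only requires appending its test name to the skip list.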