Enable ROCm CI support #1786

akashveramd · 2025-10-02T18:42:42Z

This PR is based on the original PR #1260.
The original PR was created from a different fork and had issues setting up AWS inside the workflow, because the workflow was running from a forked PR.


pytorch-bot · 2025-10-02T18:42:49Z

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

tianyu-l · 2025-10-02T22:16:15Z

.github/workflows/integration_test_8gpu.yaml

The name is a bit generic. Can we name it integration_test_8gpu_features_amd.yaml or integration_test_8gpu_features_rocm.yaml?

@tianyu-l: I didn't include amd / rocm or a features tag in the file name because integration_test_8gpu.yaml runs tests for both rocm and cuda. Currently it runs only the features tests, because that is what we have tested so far on rocm. In the future, however, we would prefer to use the same workflow file to also run the other tests that run for cuda, so as to reduce maintenance overhead. In that case we will remove the --test_suite features flag from the command that runs the integration tests and bring in additional changes to support running the other tests on both rocm and cuda.

The reason we have multiple yaml files for cuda testing is that:

- they (feature tests, model tests, simplefsdp tests, etc.) can be run in parallel, so as not to block dev efficiency
- we can run certain tests only when some relevant files/folders are touched: https://github.com/pytorch/torchtitan/pull/1786/files#diff-e327f3f247423713ee949ef4eef6b82de392abca8c53137159d82f073510c4f9R3-R10

I don't think these could be done if we merge everything together.
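For readers unfamiliar with the second point, here is a minimal sketch of a path-filtered trigger in GitHub Actions. The workflow name, the listed paths, and the runner label are placeholders chosen for illustration (assumptions, not the repository's actual configuration); only the `on.pull_request.paths` mechanism itself is standard.

```yaml
# Hypothetical workflow: the name, paths, and runner below are placeholders.
name: 8 GPU Feature Tests (sketch)

on:
  push:
    branches: [main]
  pull_request:
    # The workflow is triggered for a PR only if at least one changed
    # file matches one of these patterns.
    paths:
      - "torchtitan/**"                 # assumed source folder
      - "tests/integration_tests/**"    # assumed test folder
      - ".github/workflows/integration_test_8gpu_features.yaml"

jobs:
  feature-tests:
    runs-on: ubuntu-latest  # placeholder runner label
    steps:
      - uses: actions/checkout@v4
      - run: echo "run the feature test suite here"
```

Because the filter is declared per workflow file, keeping each test suite in its own file lets each suite have its own set of relevant paths.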

@tianyu-l: In that case, instead of renaming, I can run the rocm workflow inside the existing integration_test_8gpu_features.yaml.

Will they run sequentially or in parallel?

Since they are defined using matrix strategy, they should be created as two separate jobs running on two different runners. So, they should run in parallel.
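A minimal sketch of the matrix approach described above, assuming plain `runs-on` jobs rather than the reusable workflow the PR actually calls. The runner labels and the job and step contents are hypothetical; the 45-minute timeout is taken from the commit list.

```yaml
jobs:
  build-test:
    strategy:
      # Keep the other matrix job running if one of them fails.
      fail-fast: false
      matrix:
        include:
          - name: cuda
            runner: example-cuda-8gpu-runner  # hypothetical runner label
          - name: rocm
            runner: example-rocm-8gpu-runner  # hypothetical runner label
    name: integration-${{ matrix.name }}
    # Each matrix entry becomes its own job on its own runner.
    runs-on: ${{ matrix.runner }}
    timeout-minutes: 45
    steps:
      - uses: actions/checkout@v4
      - run: echo "running feature tests on ${{ matrix.name }}"
```

GitHub Actions expands each `matrix.include` entry into a separate job. The jobs are scheduled independently, so they run concurrently as long as a runner is available for each and `max-parallel` is not set to limit them.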

akashveramd added 26 commits September 25, 2025 23:48

Added support to run torchtitan tests on ROCm.

e7a9e0b

Added rocm ci support for integration_test_h100.

04a1718

Fixed a bug in build script. Removed ubuntu-cuda folder, instead using ubuntu folder for cuda Dockerfile.

7894f3f

Added tests.integration_tests.features during rebase.

041c04b

Modified docker-builds.yml to build rocm docker image for torchtitan.

19863fb

Fixed runner for cuda and rocm images in docker-builds.yml.

cacfd75

Added TEST_WITH_ROCM environment variable for running tests on rocm. Fixed error in integration_tests.py. Fixed lint errors.

0f89cb6

Made additional changes to tests.integration_tests.features during rebase.

21838e0

Changed runner to i-0962598bd0e8298b3 for building ROCm docker image.

98c7a65

Changed runner to linux.12xlarge for building ROCm docker image.

9a28776

Changed runner to linux.2xlarge for building ROCm docker image.

ab45e78

Resolved conflict in .github.workflows.integration_test_8gpu_models during rebase.

56bf930

Changed rocm docker image name in docker-builds.yml.

74dbc4a

Reverted the changes to integration_test_8gpu_h100.yaml.

07a4a73

Empty dummy commit.

be0ecb5

Increased the timeout to 45 minutes to override timeout used in linux_job_v2.yml for integration_test_8gpu.yaml.

0f5048e

Empty dummy commit.

7b5dcdf

Added aws setup in the integration_test_8gpu workflow.

2512cf5

Performed rebase and made changes to include code refactoring done upstream.

c23e65b

Changed rocm runner name.

a99db9f

Added a change to run build-test after aws-setup.

3d331bc

Changed the test name in integration_test_8gpu.yaml workflow file.

7d359dd

Fixed id-token permission issue in integration_test_8gpu.yaml.

0f5c57f

Added id-token permission issue inside aws-setup job in integration_test_8gpu.yaml.

a8368a2

To test workflow, switched to 4 GPU runner as they are relatively easily available to run the workflow.

36fb0e5

Moved permissions section for id-token outside the aws-setup job.

1fba2ab
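The last few commits concern where the id-token permission is declared and the order of the aws-setup and build-test jobs. A rough sketch of that layout follows. It assumes the permission was moved to workflow level, and the step contents and runner labels are placeholders rather than the PR's actual workflow.

```yaml
# Workflow-level permissions apply to every job unless a job overrides them.
permissions:
  id-token: write  # lets jobs request an OIDC token, e.g. for AWS credential setup
  contents: read   # lets jobs check out the repository

jobs:
  aws-setup:
    runs-on: ubuntu-latest  # placeholder runner label
    steps:
      - run: echo "configure AWS credentials here"

  build-test:
    needs: aws-setup        # start only after aws-setup has succeeded
    runs-on: ubuntu-latest  # placeholder runner label
    steps:
      - run: echo "run integration tests here"
```

Declaring `permissions` at the top level keeps the setting in one place. A `permissions` block inside a single job would apply to that job only.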

akashveramd self-assigned this Oct 2, 2025

akashveramd requested review from tianyu-l, fegin and wwwjn as code owners October 2, 2025 18:42

akashveramd requested a review from wconstab as a code owner October 2, 2025 18:42

pytorch-bot bot added ciflow/rocm module: rocm labels Oct 2, 2025

meta-cla bot added the CLA Signed label Oct 2, 2025

tianyu-l reviewed Oct 2, 2025
