akashveramd
Collaborator

This PR is based on the original PR #1260.
The original PR was created from a different fork and had issues setting up AWS inside the workflow, since the workflow was running from a forked PR.

…Fixed error in integration_tests.py. Fixed lint errors.
@akashveramd akashveramd self-assigned this Oct 2, 2025
@akashveramd akashveramd requested a review from wconstab as a code owner October 2, 2025 18:42
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 2, 2025
pytorch-bot bot commented Oct 2, 2025

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

Contributor

The name is a bit generic. Can we name it integration_test_8gpu_features_amd.yaml or integration_test_8gpu_features_rocm.yaml?

Collaborator Author

@tianyu-l: I didn't include amd/rocm or a features tag in the file name because integration_test_8gpu.yaml runs tests for both ROCm and CUDA. It currently runs only the features tests because that is all we have tested on ROCm so far. In the future, however, we would prefer to use the same workflow file to also run the other tests that run on CUDA, to reduce maintenance overhead. At that point we will remove the --test_suite features flag from the command that runs the integration tests and make the additional changes needed to run the other tests on both ROCm and CUDA.

Contributor

The reason we have multiple yaml files for CUDA testing is that

  1. they (feature tests, model tests, simplefsdp tests, etc.) can run in parallel, so they don't block dev efficiency
  2. we can run certain tests only when relevant files/folders are touched https://github.com/pytorch/torchtitan/pull/1786/files#diff-e327f3f247423713ee949ef4eef6b82de392abca8c53137159d82f073510c4f9R3-R10

I don't think either of these would be possible if we merged everything together.
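
For readers unfamiliar with the second point, a minimal sketch of a path-filtered trigger in GitHub Actions might look like the fragment below. The directory pattern is a hypothetical placeholder, not the actual filter from the linked PR; only the workflow file name is taken from this thread.

```yaml
# Sketch only: the paths listed here are illustrative assumptions,
# not the filters used in the linked change.
on:
  push:
    branches: [main]
  pull_request:
    paths:
      # Hypothetical source directory whose changes should trigger this suite.
      - "hypothetical/feature_dir/**"
      # Re-run when the workflow file itself is edited.
      - ".github/workflows/integration_test_8gpu_features.yaml"
```

With a filter like this, a pull request that touches none of the listed paths does not start the workflow, which is harder to arrange when every suite lives in one file with one trigger.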

Collaborator Author

@akashveramd akashveramd Oct 3, 2025

@tianyu-l: In that case, instead of renaming, I can run the rocm workflow inside the existing integration_test_8gpu_features.yaml.

Contributor

Will they run sequentially or in parallel?

Collaborator Author

Since they are defined using a matrix strategy, they should be created as two separate jobs on two different runners, so they should run in parallel.
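
As a rough illustration of that layout, a matrix with one CUDA entry and one ROCm entry could be sketched as follows. The runner labels, job name, and script path are hypothetical placeholders; only the `--test_suite features` flag comes from this discussion, and the real workflow may be structured differently.

```yaml
# Sketch only: runner labels, job names, and the script path are assumptions.
name: 8 GPU Feature Tests (sketch)

on:
  pull_request:

jobs:
  feature-tests:
    strategy:
      fail-fast: false  # let one platform finish even if the other fails
      matrix:
        include:
          - name: cuda
            runner: hypothetical-cuda-8gpu-runner
          - name: rocm
            runner: hypothetical-rocm-8gpu-runner
    # One job is created per matrix entry, each on its own runner.
    name: feature-tests-${{ matrix.name }}
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
      - name: Run feature integration tests
        # Placeholder path; any other required arguments are omitted here.
        run: python path/to/integration_tests.py --test_suite features
```

GitHub Actions schedules matrix jobs concurrently by default, subject to runner availability and any `max-parallel` setting, so the two entries would not wait on each other.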

Labels
ciflow/rocm, CLA Signed, module: rocm