-
Notifications
You must be signed in to change notification settings - Fork 550
Enable ROCm CI support #1786
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Enable ROCm CI support #1786
Conversation
…g ubuntu folder for cuda Dockerfile.
…Fixed error in integration_tests.py. Fixed lint errors.
…_job_v2.yml for integration_test_8gpu.yaml.
…ily available to run the workflow.
No ciflow labels are configured for this repo. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is a bit generic. Can be name it integration_test_8gpu_features_amd.yaml
or integration_test_8gpu_features_rocm.yaml
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tianyu-l: The reason I didn't include amd / rocm or features tag in naming the file, because the integration_test_8gpu.yaml runs tests for both rocm & cuda. Currently it runs only features tests because that is what we have tested so far on rocm. However, in future we would prefer to use the same workflow file to also run other tests that runs for cuda, so as to reduce maintenance overhead. In that case we will remove --test_suite features tag from the command which runs the integration tests and bring in additional changes to support running other tests on both rocm and cuda.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason we have multiple yaml files for cuda testing is because
- they (feature tests, model tests, simplefsdp tests, etc.) can be run in parallel, to not block dev efficiency
- we could run certain test only when some relevant files/folders are touched https://github.com/pytorch/torchtitan/pull/1786/files#diff-e327f3f247423713ee949ef4eef6b82de392abca8c53137159d82f073510c4f9R3-R10
I don't think these could be done if we merge everything together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tianyu-l: In that case, instead of renaming, I can run the rocm workflow inside the existing integration_test_8gpu_features.yaml.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will they run sequentially or in parallel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since they are defined using matrix strategy, they should be created as two separate jobs running on two different runners. So, they should run in parallel.
This PR is based out of the original PR #1260.
The original PR was created in a different fork, and it was having issues setting up aws inside the workflow. Since the workflow was running from a forked PR.