[CI] Add more ported distributed cases #2082
base: main
Conversation
Please split the test scope into a CI scope and a nightly full scope.
inputs:
  ut_name:
    required: true
-    required: true
+    required: false
ze = xpu_list[i+1];
} else {
ze = i;
if [ "${{ inputs.ut_name }}" == "xpu_distributed" ];then
Are there any assumptions baked in here? Can we detect the topology directly and dynamically on the test node? Please consider the scenarios below (a rough sketch follows the list):
- No Xelink group: fail the job
- 1 Xelink group: launch 1 worker
- 2 Xelink groups: launch 2 workers
- ...
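One possible direction, as a rough non-authoritative sketch: derive the number of Xelink groups from the node itself instead of hard-coding it. The xpu-smi invocation, the "XL" marker in its output, and the two-GPUs-per-group assumption below are all assumptions about the environment, not the workflow's actual logic.

# Hypothetical sketch only: detect Xelink groups dynamically and size the
# worker count from them. The xpu-smi command, the "XL" marker, and the
# 2-GPUs-per-group assumption are unverified assumptions.
detect_xelink_groups() {
  local matrix linked
  matrix=$(xpu-smi topology -m 2>/dev/null) || { echo 0; return; }
  # Count GPU rows that report at least one Xelink ("XL") connection.
  linked=$(echo "$matrix" | grep -c "XL" || true)
  # Assume each group pairs 2 Xelink-connected GPUs.
  echo $(( linked / 2 ))
}

num_groups=$(detect_xelink_groups)
if [ "$num_groups" -eq 0 ]; then
  echo "No Xelink group detected, failing the distributed UT job"
  exit 1
fi
echo "Detected $num_groups Xelink group(s), launching $num_groups worker(s)"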
.github/workflows/_linux_ut.yml (Outdated)
runner:
  runs-on: ${{ inputs.runner }}
  name: get-runner
  name: get-runner
Why do we have this change?
This PR intends to add more ported distributed cases to the torch-xpu-ops CI, and adds pytest-xdist for the distributed UT.
With 2 work groups, the distributed UT time will increase to 2h20min
(reference: 3h3m for 1 work group, https://github.com/intel/torch-xpu-ops/actions/runs/17902859755/job/50907350984).
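For context on the pytest-xdist piece: pytest-xdist splits a test session across multiple worker processes via its -n option. A minimal sketch of how a run might be parallelized is below; the test path is a placeholder, not necessarily the workflow's actual value.

# Minimal sketch: parallelize the ported distributed cases with pytest-xdist.
# "test/xpu/distributed" is a placeholder path, not the confirmed layout.
pip install pytest pytest-xdist
python -m pytest -n 2 -v test/xpu/distributed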
disable_e2e
disable_ut