Support manual resource elastic for allreduce #1714

Open
yifeng-x wants to merge 3 commits into intelligent-machine-learning:master from yifeng-x:allreduce_resource_elastic

@yifeng-x

What changes were proposed in this pull request?

Support dynamic manual adjustment of training task resources (CPU/memory/GPU count) at runtime.

Why are the changes needed?

When training fails because of insufficient resources, for example an out-of-memory (OOM) error, the system can automatically adjust resources and restart training.
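
The adjust-and-restart behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `NodeResource`, `adjust_on_oom`, `run_with_elastic_resources`, the doubling factor, and the memory cap are all hypothetical names and values chosen for the example, and a Python `MemoryError` stands in for whatever OOM signal the job master actually observes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NodeResource:
    """Hypothetical per-node resource request (not the project's real type)."""
    cpu: float
    memory_mb: int
    gpu_num: int


def adjust_on_oom(current, factor=2.0, max_memory_mb=65536):
    """Return a new request with memory scaled by `factor`, capped at `max_memory_mb`."""
    new_memory = min(int(current.memory_mb * factor), max_memory_mb)
    return replace(current, memory_mb=new_memory)


def run_with_elastic_resources(train_fn, resource, max_restarts=3):
    """Run `train_fn(resource)`; on MemoryError, enlarge memory and restart.

    Re-raises the error once `max_restarts` restarts have been used.
    """
    for attempt in range(max_restarts + 1):
        try:
            return train_fn(resource)
        except MemoryError:
            if attempt == max_restarts:
                raise
            resource = adjust_on_oom(resource)
```

A manual adjustment would follow the same path, with the new `NodeResource` supplied by the user instead of computed by `adjust_on_oom`.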

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

yifeng-x and others added 3 commits April 20, 2026 11:19
…-machine-learning#1718)

* stash 20260304

* add link to detail

* add mock page

* add job config

* add job config

* done failover consanguinity

* fix

* rm dashboard service

* optimized html

* optimized

* rm md

* optimized

* optimized

* add dashboard configuration

* update test config

* fix

* fix

* retrigger

* lint

* optimized

* fix ut

@yifeng-x force-pushed the allreduce_resource_elastic branch from e4e1df1 to d015efd on April 20, 2026 03:22

2 participants