add xpu monitor for dlrover #1290

@majieyue

Description

Background

DLRover is an elastic deep learning framework that tolerates faults such as process failures and Pod loss. Because LLM training runs at large scale and lasts a long time, many errors do not show up as one of the process failures above; instead, the job hangs for a long time. During such a hang, xPU metrics and logs may help detect these errors.

Requirement

We need an xPU metrics monitor that runs either inside the elastic agent or as a DaemonSet on each node. The monitor collects xPU metrics such as utilization, memory usage, temperature, Tensor Core usage, and internal traffic such as NVLink and PCIe.
Although there are many xPU vendors on the market, we can start with NVIDIA...
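
To make the requirement concrete, here is a minimal sketch of what one collection step could look like for NVIDIA GPUs. It is not DLRover code: the names `GpuSample`, `parse_nvidia_smi_csv`, and `collect_once` are hypothetical, and it assumes `nvidia-smi` is on the PATH of the agent or DaemonSet container. It only covers utilization, memory, and temperature; Tensor Core usage and NVLink/PCIe traffic would likely need a profiling-capable source such as NVIDIA DCGM.

```python
import csv
import io
import subprocess
from dataclasses import dataclass
from typing import List

# Fields understood by `nvidia-smi --query-gpu`.
QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total,temperature.gpu"


@dataclass
class GpuSample:
    """One reading for one GPU (hypothetical structure, for illustration)."""
    index: int
    utilization_pct: float
    memory_used_mib: float
    memory_total_mib: float
    temperature_c: float


def parse_nvidia_smi_csv(text: str) -> List[GpuSample]:
    """Parse `--format=csv,noheader,nounits` output into samples."""
    samples = []
    for row in csv.reader(io.StringIO(text)):
        fields = [f.strip() for f in row]
        if len(fields) != 5:
            continue  # blank or unexpected line
        try:
            samples.append(GpuSample(int(fields[0]), *map(float, fields[1:])))
        except ValueError:
            continue  # a field such as "[N/A]" could not be parsed; skip the row
    return samples


def collect_once() -> List[GpuSample]:
    """Run nvidia-smi once and return one sample per visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True, timeout=10,
    ).stdout
    return parse_nvidia_smi_csv(out)
```

A monitor loop could call `collect_once()` on a fixed interval and export the samples. Keeping the parser separate from the subprocess call means it can be tested on a machine without a GPU, and a different vendor backend could later return the same `GpuSample` records.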

Labels

feature, wip (issue or pr with 'wip' will ignore expiration), xpu_timer