add xpu monitor for dlrover

# Background
DLRover is an elastic deep learning framework that tolerates faults such as process failures and Pod loss. Because LLM training runs at large scale and usually for a long time, many errors do not cause any of the process failures above; instead, the job hangs for a long time. During such a hang, xPU metrics and logs may help detect these errors.

# Requirement
We need an xPU metrics monitor that runs either inside the elastic agent or as a DaemonSet on each node. The monitor collects xPU metrics such as utilization, memory usage, temperature, tensor core usage, and internal traffic such as NVLink and PCIe.
Although there are many xPU vendors on the market, we can start with NVIDIA...
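
As a rough illustration of the NVIDIA starting point, here is a minimal sketch of one collection pass. It assumes `nvidia-smi` is installed on the node and shells out to its `--query-gpu` CSV interface. The names `GpuSample`, `parse_nvidia_smi_csv` and `collect_gpu_samples` are hypothetical and are not part of DLRover. Only utilization, memory and temperature are covered; tensor core usage and NVLink/PCIe traffic would need a different data source and are left out.

```python
import csv
import io
import subprocess
from dataclasses import dataclass
from typing import List, Optional

# Fields requested from `nvidia-smi --query-gpu=...`.
QUERY_FIELDS = [
    "index",
    "utilization.gpu",
    "memory.used",
    "memory.total",
    "temperature.gpu",
]


@dataclass
class GpuSample:
    """One reading for one GPU (hypothetical container, not a DLRover type)."""
    index: int
    utilization_pct: Optional[float]
    memory_used_mib: Optional[float]
    memory_total_mib: Optional[float]
    temperature_c: Optional[float]


def _to_float(cell: str) -> Optional[float]:
    """Return None for cells that are not numeric, e.g. '[N/A]'."""
    try:
        return float(cell)
    except ValueError:
        return None


def parse_nvidia_smi_csv(text: str) -> List[GpuSample]:
    """Parse output produced with --format=csv,noheader,nounits."""
    samples = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        idx, util, used, total, temp = (cell.strip() for cell in row)
        samples.append(
            GpuSample(
                index=int(idx),
                utilization_pct=_to_float(util),
                memory_used_mib=_to_float(used),
                memory_total_mib=_to_float(total),
                temperature_c=_to_float(temp),
            )
        )
    return samples


def collect_gpu_samples(timeout_s: float = 10.0) -> List[GpuSample]:
    """Run nvidia-smi once and return one sample per visible GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(
        cmd, capture_output=True, text=True, check=True, timeout=timeout_s
    )
    return parse_nvidia_smi_csv(result.stdout)
```

A monitor in the elastic agent or a DaemonSet could call `collect_gpu_samples()` periodically and report the samples. The parser is kept separate so it can be exercised on a machine without a GPU.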


add xpu monitor for dlrover #1290
