A clear and concise description of what you want to happen.
The current network-related health checks (check_iblink, check_hca, check_ethlink, check_pci) do a solid job validating link state and physical connectivity, but they stop at the "is the link up?" level. In large-scale HPC/AI training environments with hundreds or thousands of nodes, network fabric health goes far beyond link presence — and degraded fabric performance is one of the most common (and hardest to diagnose) causes of poor collective communication and training slowdowns.
This proposal suggests adding optional network fabric monitoring that covers the runtime performance side of the interconnect, complementing the existing link-level checks.
Proposed checks / collectors
1. InfiniBand Port Counters (check_ib_counters)
Read from /sys/class/infiniband/{device}/ports/{port}/counters/
Alert on non-zero or rapidly increasing error counters (configurable thresholds via manifest): SymbolErrorCounter, LinkErrorRecoveryCounter, LinkDownedCounter, PortRcvErrors, PortRcvConstraintErrors, PortXmitDiscards, ExcessiveBufferOverrunErrors, LocalLinkIntegrityErrors
Sample traffic counters for telemetry: PortXmitData, PortRcvData, PortXmitPkts, PortRcvPkts
This is the single highest-value addition — non-zero error counters are the #1 indicator of fabric issues that silently degrade NCCL performance
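As a rough illustration of what check_ib_counters could look like, here is a minimal sketch that scans the sysfs counter files and flags any value above a configured threshold. The snake_case file names mirror how the kernel exposes these counters under /sys/class/infiniband; the function name and the threshold-dict format are assumptions for illustration, not the repo's actual API.

```python
from pathlib import Path

# Counters whose non-zero values usually indicate fabric problems
# (snake_case sysfs names corresponding to the IB counters listed above).
ERROR_COUNTERS = [
    "symbol_error", "link_error_recovery", "link_downed",
    "port_rcv_errors", "port_rcv_constraint_errors",
    "port_xmit_discards", "excessive_buffer_overrun_errors",
    "local_link_integrity_errors",
]

def scan_ib_error_counters(root="/sys/class/infiniband", thresholds=None):
    """Return {(device, port, counter): value} for counters above threshold.

    `thresholds` maps counter name -> max acceptable value (default 0,
    i.e. any non-zero error counter is reported).
    """
    thresholds = thresholds or {}
    findings = {}
    for dev in Path(root).glob("*"):
        for port in (dev / "ports").glob("*"):
            for name in ERROR_COUNTERS:
                counter_file = port / "counters" / name
                if not counter_file.is_file():
                    continue  # counter not exposed by this HCA/driver
                value = int(counter_file.read_text().strip())
                if value > thresholds.get(name, 0):
                    findings[(dev.name, port.name, name)] = value
    return findings
```

The `root` parameter keeps the function testable without real hardware; in a health check it would default to the live sysfs tree.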
2. IB Congestion / Adaptive Routing state
Verify ECN (Explicit Congestion Notification) configuration on HCA ports
Check adaptive routing status when supported by the SM (e.g., UFM-managed fabrics)
3. RDMA Resource Health (check_rdma)
Query the rdma tool or sysfs for active QP (Queue Pair) count and state
4. IB Subnet Manager Reachability
Verify the SM is reachable, via sminfo or sysfs (sm_lid)
5. Network Performance Telemetry (monitoring collector)
Periodic IB counter sampling and delta computation (rate of errors/traffic)
Export as OTel metrics: ib.port.xmit_data_rate, ib.port.rcv_errors_rate, etc.
Integration with existing TelemetryContext / sink infrastructure
Could live under gcm/monitoring/cli/ as ib_fabric_monitor.py
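The delta computation at the heart of the telemetry collector could be sketched as below. The class and method names are illustrative only, not part of GCM; the point is turning monotonically increasing IB counters into per-second rates while tolerating counter resets.

```python
import time

class CounterDeltaSampler:
    """Sketch: convert monotonically increasing counter readings into
    per-second rates, as a periodic monitoring loop would."""

    def __init__(self):
        self._last = {}  # counter name -> (timestamp, value)

    def sample(self, name, value, now=None):
        """Record a reading; return the rate/sec since the previous
        sample, or None on the first sample or after a counter reset."""
        now = time.monotonic() if now is None else now
        prev = self._last.get(name)
        self._last[name] = (now, value)
        if prev is None:
            return None  # no baseline yet
        dt = now - prev[0]
        if dt <= 0 or value < prev[1]:
            return None  # clock anomaly or counter reset: skip this interval
        return (value - prev[1]) / dt
```

Each returned rate would then be emitted as the corresponding OTel metric (e.g., ib.port.rcv_errors_rate) through the existing sink infrastructure.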
Why this matters for HPC/AI workloads
In a multi-node training job, a single node with a degraded IB port can bottleneck the entire collective (AllReduce, AllGather). The existing checks will pass — the link is physically up, the rate matches, the firmware is correct — but the port may be dropping packets, hitting CRC errors, or suffering from congestion due to misconfigured routing. These issues show up as:
Unexplained training slowdowns (5-30% throughput regression)
Intermittent NCCL timeouts that are hard to reproduce
"Stragglers" in profiling traces with no obvious GPU or storage cause
Having fabric-level visibility in GCM would let operators catch these issues during node health validation (pre-job) and during runtime monitoring, rather than debugging after a 10-hour training run fails at step 45,000.
Suggested implementation approach
The IB port counters check fits naturally as a new health check following the existing pattern:
Protocol-based CheckEnv with sysfs reads (like check_iblink already does for link state)
Manifest-driven thresholds for what constitutes acceptable error counts
Killswitch via FeatureValueHealthChecksFeatures
The telemetry collector could follow the same pattern as nvml_monitor.py — periodic sampling with delta computation.
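To make the shape of that pattern concrete, here is a hedged sketch of the check, assuming a CheckEnv-like protocol with a sysfs read hook. The real protocol, manifest schema, and field names live in the repo and will almost certainly differ; this only shows how manifest-driven thresholds could drive the pass/fail decision.

```python
from dataclasses import dataclass, field
from typing import Protocol

class CheckEnv(Protocol):
    """Assumed shape of the check environment; the real protocol is in the repo."""
    def read_sysfs(self, path: str) -> str: ...

@dataclass
class IbCountersCheck:
    """Illustrative check: compare sysfs counters against manifest thresholds."""
    device: str = "mlx5_0"
    port: int = 1
    # Manifest-supplied: counter name -> max acceptable value.
    thresholds: dict = field(default_factory=lambda: {"symbol_error": 0})

    def run(self, env: CheckEnv) -> tuple[bool, list[str]]:
        failures = []
        for counter, limit in self.thresholds.items():
            path = (f"/sys/class/infiniband/{self.device}/ports/"
                    f"{self.port}/counters/{counter}")
            value = int(env.read_sysfs(path).strip())
            if value > limit:
                failures.append(f"{counter}={value} exceeds threshold {limit}")
        return (not failures, failures)
```

Routing all I/O through the env object keeps the check unit-testable with a fake environment, which is presumably why the existing checks use the protocol in the first place.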
A clear and concise description of any alternative solutions or features you've considered, if any.
UFM / NVIDIA Fabric Manager integration: Some of this data is available through UFM's REST API, but that creates an external dependency and not all deployments use UFM. Reading directly from sysfs keeps it self-contained, consistent with how check_iblink already works.
perfquery / ibdiagnet wrappers: Could shell out to perfquery for counter reads, but sysfs is faster, doesn't require additional tools, and avoids the subprocess overhead in tight monitoring loops.
Existing check_iblink expansion: Could add counters to check_iblink, but that check is already complex (~300 lines). A separate check keeps responsibilities clear and allows independent killswitch control.
Additional context
This is especially relevant for clusters running large-scale distributed training (FSDP, DDP) where NCCL's performance is directly tied to fabric health. The pattern is well-established in production HPC environments — tools like ibqueryerrors exist precisely because link-up != link-healthy.
Related: the existing check_pci already validates PCIe link width/speed to the HCA, so the full path from GPU → PCIe → HCA → IB fabric → SM would be covered end-to-end.