
[Enhancement] Network fabric monitoring: IB counters, congestion metrics, and RDMA health for HPC clusters #103

@gustcol

Description


The current network-related health checks (check_iblink, check_hca, check_ethlink, check_pci) do a solid job validating link state and physical connectivity, but they stop at the "is the link up?" level. In large-scale HPC/AI training environments with hundreds or thousands of nodes, network fabric health goes far beyond link presence — and degraded fabric performance is one of the most common (and hardest to diagnose) causes of poor collective communication and training slowdowns.

This proposal suggests adding optional network fabric monitoring that covers the runtime performance side of the interconnect, complementing the existing link-level checks.

Proposed checks / collectors

1. InfiniBand Port Counters (check_ib_counters)

  • Read from /sys/class/infiniband/{device}/ports/{port}/counters/
  • Track error counters: SymbolErrorCounter, LinkErrorRecoveryCounter, LinkDownedCounter, PortRcvErrors, PortRcvConstraintErrors, PortXmitDiscards, ExcessiveBufferOverrunErrors, LocalLinkIntegrityErrors
  • Track throughput counters: PortXmitData, PortRcvData, PortXmitPkts, PortRcvPkts
  • Alert on non-zero or rapidly increasing error counters (configurable thresholds via manifest)
  • This is the single highest-value addition: non-zero error counters are the number-one indicator of fabric issues that silently degrade NCCL performance

2. IB Congestion / Adaptive Routing state

  • Verify ECN (Explicit Congestion Notification) configuration on HCA ports
  • Check adaptive routing status when supported by the SM (e.g., UFM-managed fabrics)
  • Validate QoS / SL (Service Level) configuration matches expected topology

3. RDMA Resource Health (check_rdma)

  • Query rdma or sysfs for active QP (Queue Pair) count and state
  • Detect excessive QP errors or QPs stuck in error state
  • Validate RDMA device capabilities against expected configuration (e.g., RoCEv2 vs IB mode, GID table sanity)
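
As a sketch of the QP-state portion of check_rdma, the snippet below tallies QP states by parsing `rdma res show qp` output from iproute2. The line format used in the parser (whitespace-separated `state <VALUE>` pairs) is typical but can vary across iproute2 versions, so this is an assumption to validate against the target distribution:

```python
import subprocess

def parse_qp_states(rdma_output: str) -> dict[str, int]:
    """Tally QP states from `rdma res show qp` text output."""
    states: dict[str, int] = {}
    for line in rdma_output.splitlines():
        tokens = line.split()
        # Look for "state <VALUE>" token pairs anywhere on the line.
        for i, tok in enumerate(tokens[:-1]):
            if tok == "state":
                states[tokens[i + 1]] = states.get(tokens[i + 1], 0) + 1
    return states

def qp_errors(states: dict[str, int]) -> int:
    """Number of QPs stuck in the ERR state."""
    return states.get("ERR", 0)

def qp_state_histogram(timeout_s: float = 5.0) -> dict[str, int]:
    """Run the rdma tool and return a state histogram for all QPs."""
    out = subprocess.run(["rdma", "res", "show", "qp"], capture_output=True,
                         text=True, timeout=timeout_s, check=True).stdout
    return parse_qp_states(out)
```

Keeping the parsing separate from the subprocess call makes the logic unit-testable without RDMA hardware present.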

4. IB Subnet Manager Reachability

  • Validate that the node can reach the SM (via sminfo or sysfs sm_lid)
  • Detect SM flapping (multiple SM changes in a short window) — a common source of intermittent fabric instability
  • Validate SM priority/GUID consistency across HA SM pairs

5. Network Performance Telemetry (monitoring collector)

  • Periodic IB counter sampling and delta computation (rate of errors/traffic)
  • Export as OTel metrics: ib.port.xmit_data_rate, ib.port.rcv_errors_rate, etc.
  • Integration with existing TelemetryContext / sink infrastructure
  • Could live under gcm/monitoring/cli/ as ib_fabric_monitor.py

Why this matters for HPC/AI workloads

In a multi-node training job, a single node with a degraded IB port can bottleneck the entire collective (AllReduce, AllGather). The existing checks will pass — the link is physically up, the rate matches, the firmware is correct — but the port may be dropping packets, hitting CRC errors, or suffering from congestion due to misconfigured routing. These issues show up as:

  • Unexplained training slowdowns (5-30% throughput regression)
  • Intermittent NCCL timeouts that are hard to reproduce
  • "Stragglers" in profiling traces with no obvious GPU or storage cause

Having fabric-level visibility in GCM would let operators catch these issues during node health validation (pre-job) and during runtime monitoring, rather than debugging after a 10-hour training run fails at step 45,000.

Suggested implementation approach

The IB port counters check fits naturally as a new health check following the existing pattern:

  • Protocol-based CheckEnv with sysfs reads (like check_iblink already does for link state)
  • Manifest-driven thresholds for what constitutes acceptable error counts
  • Killswitch via FeatureValueHealthChecksFeatures

The telemetry collector could follow the same pattern as nvml_monitor.py — periodic sampling with delta computation.
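
To make the shape concrete, the check could be a thin wrapper over a Protocol-based environment, in the spirit of the pattern above. The `CheckEnv` interface and the thresholds mapping here are hypothetical stand-ins for the project's actual API and manifest schema:

```python
from typing import Protocol

class CheckEnv(Protocol):
    """Hypothetical environment interface; the real CheckEnv may differ."""
    def read_sysfs(self, path: str) -> str: ...

def check_ib_counters(env: CheckEnv, device: str, port: int,
                      thresholds: dict[str, int]) -> bool:
    """Pass when every thresholded counter is at or below its limit."""
    base = f"/sys/class/infiniband/{device}/ports/{port}/counters"
    for name, limit in thresholds.items():
        if int(env.read_sysfs(f"{base}/{name}")) > limit:
            return False
    return True
```

Routing all sysfs access through the environment keeps the check trivially testable with a fake, which matters for CI on machines without IB hardware.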

Alternatives considered

  • UFM / NVIDIA Fabric Manager integration: Some of this data is available through UFM's REST API, but that creates an external dependency and not all deployments use UFM. Reading directly from sysfs keeps it self-contained, consistent with how check_iblink already works.
  • perfquery / ibdiagnet wrappers: Could shell out to perfquery for counter reads, but sysfs is faster, doesn't require additional tools, and avoids the subprocess overhead in tight monitoring loops.
  • Existing check_iblink expansion: Could add counters to check_iblink, but that check is already complex (~300 lines). A separate check keeps responsibilities clear and allows independent killswitch control.

Additional context

This is especially relevant for clusters running large-scale distributed training (FSDP, DDP) where NCCL's performance is directly tied to fabric health. The pattern is well-established in production HPC environments — tools like ibqueryerrors exist precisely because link-up != link-healthy.

Related: the existing check_pci already validates PCIe link width/speed to the HCA, so the full path from GPU → PCIe → HCA → IB fabric → SM would be covered end-to-end.
