
[Enhancement] Network fabric monitoring: IB counters, congestion metrics, and RDMA health for HPC clusters #103

@gustcol

Description


The current network-related health checks (check_iblink, check_hca, check_ethlink, check_pci) do a solid job validating link state and physical connectivity, but they stop at the "is the link up?" level. In large-scale HPC/AI training environments with hundreds or thousands of nodes, network fabric health goes far beyond link presence — and degraded fabric performance is one of the most common (and hardest to diagnose) causes of poor collective communication and training slowdowns.

This proposal suggests adding optional network fabric monitoring that covers the runtime performance side of the interconnect, complementing the existing link-level checks.

Proposed checks / collectors

1. InfiniBand Port Counters (check_ib_counters)

  • Read from /sys/class/infiniband/{device}/ports/{port}/counters/
  • Track error counters: SymbolErrorCounter, LinkErrorRecoveryCounter, LinkDownedCounter, PortRcvErrors, PortRcvConstraintErrors, PortXmitDiscards, ExcessiveBufferOverrunErrors, LocalLinkIntegrityErrors
  • Track throughput counters: PortXmitData, PortRcvData, PortXmitPkts, PortRcvPkts
  • Alert on non-zero or rapidly increasing error counters (configurable thresholds via manifest)
  • This is the single highest-value addition: non-zero error counters are the number-one indicator of fabric issues that silently degrade NCCL performance

2. IB Congestion / Adaptive Routing state

  • Verify ECN (Explicit Congestion Notification) configuration on HCA ports
  • Check adaptive routing status when supported by the SM (e.g., UFM-managed fabrics)
  • Validate QoS / SL (Service Level) configuration matches expected topology

3. RDMA Resource Health (check_rdma)

  • Query rdma or sysfs for active QP (Queue Pair) count and state
  • Detect excessive QP errors or QPs stuck in error state
  • Validate RDMA device capabilities against expected configuration (e.g., RoCEv2 vs IB mode, GID table sanity)
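
As a sketch of the QP-state portion of check_rdma, the snippet below tallies QP states by parsing `rdma res show qp` output from iproute2. The line format used in the parser (whitespace-separated `state <VALUE>` pairs) is typical but can vary across iproute2 versions, so this is an assumption to validate against the target distribution:

```python
import subprocess

def parse_qp_states(rdma_output: str) -> dict[str, int]:
    """Tally QP states from `rdma res show qp` text output."""
    states: dict[str, int] = {}
    for line in rdma_output.splitlines():
        tokens = line.split()
        # Look for "state <VALUE>" token pairs anywhere on the line.
        for i, tok in enumerate(tokens[:-1]):
            if tok == "state":
                states[tokens[i + 1]] = states.get(tokens[i + 1], 0) + 1
    return states

def qp_errors(states: dict[str, int]) -> int:
    """Number of QPs stuck in the ERR state."""
    return states.get("ERR", 0)

def qp_state_histogram(timeout_s: float = 5.0) -> dict[str, int]:
    """Run the rdma tool and return a state histogram for all QPs."""
    out = subprocess.run(["rdma", "res", "show", "qp"], capture_output=True,
                         text=True, timeout=timeout_s, check=True).stdout
    return parse_qp_states(out)
```

Keeping the parsing separate from the subprocess call makes the logic unit-testable without RDMA hardware present.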

4. IB Subnet Manager Reachability

  • Validate that the node can reach the SM (via sminfo or sysfs sm_lid)
  • Detect SM flapping (multiple SM changes in a short window) — a common source of intermittent fabric instability
  • Validate SM priority/GUID consistency across HA SM pairs

5. Network Performance Telemetry (monitoring collector)

  • Periodic IB counter sampling and delta computation (rate of errors/traffic)
  • Export as OTel metrics: ib.port.xmit_data_rate, ib.port.rcv_errors_rate, etc.
  • Integration with existing TelemetryContext / sink infrastructure
  • Could live under gcm/monitoring/cli/ as ib_fabric_monitor.py

Why this matters for HPC/AI workloads

In a multi-node training job, a single node with a degraded IB port can bottleneck the entire collective (AllReduce, AllGather). The existing checks will pass — the link is physically up, the rate matches, the firmware is correct — but the port may be dropping packets, hitting CRC errors, or suffering from congestion due to misconfigured routing. These issues show up as:

  • Unexplained training slowdowns (5-30% throughput regression)
  • Intermittent NCCL timeouts that are hard to reproduce
  • "Stragglers" in profiling traces with no obvious GPU or storage cause

Having fabric-level visibility in GCM would let operators catch these issues during node health validation (pre-job) and during runtime monitoring, rather than debugging after a 10-hour training run fails at step 45,000.

Suggested implementation approach

The IB port counters check fits naturally as a new health check following the existing pattern:

  • Protocol-based CheckEnv with sysfs reads (like check_iblink already does for link state)
  • Manifest-driven thresholds for what constitutes acceptable error counts
  • Killswitch via FeatureValueHealthChecksFeatures

The telemetry collector could follow the same pattern as nvml_monitor.py — periodic sampling with delta computation.
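
To make the shape concrete, the check could be a thin wrapper over a Protocol-based environment, in the spirit of the pattern above. The `CheckEnv` interface and the thresholds mapping here are hypothetical stand-ins for the project's actual API and manifest schema:

```python
from typing import Protocol

class CheckEnv(Protocol):
    """Hypothetical environment interface; the real CheckEnv may differ."""
    def read_sysfs(self, path: str) -> str: ...

def check_ib_counters(env: CheckEnv, device: str, port: int,
                      thresholds: dict[str, int]) -> bool:
    """Pass when every thresholded counter is at or below its limit."""
    base = f"/sys/class/infiniband/{device}/ports/{port}/counters"
    for name, limit in thresholds.items():
        if int(env.read_sysfs(f"{base}/{name}")) > limit:
            return False
    return True
```

Routing all sysfs access through the environment keeps the check trivially testable with a fake, which matters for CI on machines without IB hardware.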

Alternatives considered

  • UFM / NVIDIA Fabric Manager integration: Some of this data is available through UFM's REST API, but that creates an external dependency and not all deployments use UFM. Reading directly from sysfs keeps it self-contained, consistent with how check_iblink already works.
  • perfquery / ibdiagnet wrappers: Could shell out to perfquery for counter reads, but sysfs is faster, doesn't require additional tools, and avoids the subprocess overhead in tight monitoring loops.
  • Existing check_iblink expansion: Could add counters to check_iblink, but that check is already complex (~300 lines). A separate check keeps responsibilities clear and allows independent killswitch control.

Additional context

This is especially relevant for clusters running large-scale distributed training (FSDP, DDP) where NCCL's performance is directly tied to fabric health. The pattern is well-established in production HPC environments — tools like ibqueryerrors exist precisely because link-up != link-healthy.

Related: the existing check_pci already validates PCIe link width/speed to the HCA, so the full path from GPU → PCIe → HCA → IB fabric → SM would be covered end-to-end.
