Closed

24 commits
f312b33
fixed the style and the claude feedback
theap06 Mar 11, 2026
31ace9b
Fix HAL nox setup and health_checks ctx obj handling
theap06 Mar 11, 2026
2eb1cff
Skip backend probe for injected nvml monitor objects
theap06 Mar 11, 2026
778ca5a
Initialize health_checks ctx obj when missing
theap06 Mar 12, 2026
14bb10b
Preserve health_checks obj while storing backend
theap06 Mar 12, 2026
b2ea528
adding myself to the README fiel
theap06 Mar 12, 2026
08045b9
fix: align health_checks ctx.obj initialization with gcm CLI pattern
theap06 Mar 12, 2026
7301505
fix(ci): mkdir -p venv bin before copying Rust cargo binaries
theap06 Mar 12, 2026
b84a29e
fix(ci): create venv in build_deb if cache miss, source cargo env bef…
theap06 Mar 12, 2026
e870583
fix(ci): create venv on cache miss in common-setup action
theap06 Mar 12, 2026
e1c318b
revert(ci): remove pip install fallbacks, keep only cp fix for build_deb
theap06 Mar 12, 2026
0179f52
fix(health_checks): avoid overriding click obj for subcommands
theap06 Mar 12, 2026
46af65b
fixed the fragility in the conditional for healthchecks function base…
theap06 Mar 12, 2026
9d3f8b3
fixed the error handling for healthchecks
theap06 Mar 12, 2026
8755085
fixed the error handling for healthchecks
theap06 Mar 12, 2026
399e276
preserved the non-dict objects and applied the defensive init of dict
theap06 Mar 12, 2026
6a78c2f
fixed the health_checks() for the health_checks.py. the previous ci w…
theap06 Mar 13, 2026
409af19
fixed the health_checks function in health_checks.py by running pass …
theap06 Mar 14, 2026
333dcc1
Addressing PR review comments: removing telemetry files and unnecessa…
theap06 Mar 18, 2026
ccc9129
Restore Rust installation step from main
theap06 Mar 18, 2026
effaef8
Refactor legacy tools to use Accelerator HAL Adapter
theap06 Mar 20, 2026
c6b0855
Fix flaky integration test in test_gcm.py and update test plan
theap06 Mar 20, 2026
9ef8a60
Fix mock patching in test_check_nvidia_smi_hal_parity.py
theap06 Mar 20, 2026
b7298e1
Merge branch 'main' into feat/accelerator-hal-upstream-clean
theap06 Mar 29, 2026
2 changes: 1 addition & 1 deletion .flake8
@@ -1,5 +1,5 @@
[flake8]
ignore = E501, W503, N806, N813, N818, E704
ignore = E501, W503, N806, N813, N818, E704, E203
max-line-length = 88
exclude = gcm/gen,*.pyi
# relative imports break buck
2 changes: 1 addition & 1 deletion README.md
@@ -46,7 +46,7 @@ Facebook has adopted a Code of Conduct that we expect project participants to ad

## The Team

GPU Cluster Monitoring is actively maintained by [Lucca Bertoncini](https://github.com/luccabb), [Caleb Ho](https://github.com/calebho), [Apostolos Kokolis](https://github.com/A-Kokolis), [Liao Hu](https://github.com/L1A0), [Thanh Nguyen](https://github.com/giongto35), [Billy Campoli](https://github.com/tooji) with a number of contributions coming from talented individuals (in no particular order, and non-exhaustive): [Jörg Doku](https://github.com/Jorghi12), [Vivian Peng](https://github.com/vzpeng), [Parth Malani](https://github.com/pmmalani), [Kalyan Saladi](https://github.com/skalyan), [Shubho Sengupta](https://github.com/shubho), [Leo Huang](https://github.com/lifeihuang), [Robert Vincent](https://github.com/bvincent-penguin), [Max Wang](https://github.com/mxw), [Sujit Verma](https://github.com/sujitoc), [Teng Li](https://github.com/teng-li), [James Taylor](https://github.com/jamestaylr), [Xiaodong Ma](https://github.com/xman1979), [Chris Henry](https://github.com/chenry3), [Jakob Johnson](https://github.com/jj10306), [Kareem Sakher](https://github.com/kjsakher), [Abinesh Ramakrishnan](https://github.com/ibanesh), [Nabib Ahmed](https://github.com/nahmed3536), [Yong Li](https://github.com/yonglimeta), [Junjie Qian](https://github.com/junjieqian), [David Watson](https://github.com/davidewatson), [Guanyu Wu](https://github.com/kwu-penguin), [Jaromir Latal](https://github.com/jermenkoo), [Samuel Doud](https://github.com/SamuelDoud), [Yidi Wu](https://github.com/ydwu4), [Xinyuan Zhang](https://github.com/xinyuanzzz), [Neha Saxena](https://github.com/nehasaxena210), [Gustavo Lima](https://github.com/gustcol), [Quang D. Nguyen](https://github.com/qu-ngx).
GPU Cluster Monitoring is actively maintained by [Lucca Bertoncini](https://github.com/luccabb), [Caleb Ho](https://github.com/calebho), [Apostolos Kokolis](https://github.com/A-Kokolis), [Liao Hu](https://github.com/L1A0), [Thanh Nguyen](https://github.com/giongto35), [Billy Campoli](https://github.com/tooji) with a number of contributions coming from talented individuals (in no particular order, and non-exhaustive): [Jörg Doku](https://github.com/Jorghi12), [Vivian Peng](https://github.com/vzpeng), [Parth Malani](https://github.com/pmmalani), [Kalyan Saladi](https://github.com/skalyan), [Shubho Sengupta](https://github.com/shubho), [Leo Huang](https://github.com/lifeihuang), [Robert Vincent](https://github.com/bvincent-penguin), [Max Wang](https://github.com/mxw), [Sujit Verma](https://github.com/sujitoc), [Teng Li](https://github.com/teng-li), [James Taylor](https://github.com/jamestaylr), [Xiaodong Ma](https://github.com/xman1979), [Chris Henry](https://github.com/chenry3), [Jakob Johnson](https://github.com/jj10306), [Kareem Sakher](https://github.com/kjsakher), [Abinesh Ramakrishnan](https://github.com/ibanesh), [Nabib Ahmed](https://github.com/nahmed3536), [Yong Li](https://github.com/yonglimeta), [Junjie Qian](https://github.com/junjieqian), [David Watson](https://github.com/davidewatson), [Guanyu Wu](https://github.com/kwu-penguin), [Jaromir Latal](https://github.com/jermenkoo), [Samuel Doud](https://github.com/SamuelDoud), [Yidi Wu](https://github.com/ydwu4), [Xinyuan Zhang](https://github.com/xinyuanzzz), [Neha Saxena](https://github.com/nehasaxena210), [Achintya Paningapalli](https://github.com/theap06), [Gustavo Lima](https://github.com/gustcol), [Quang D. Nguyen](https://github.com/qu-ngx).

Feel free to contribute and add your name!

60 changes: 60 additions & 0 deletions TEST_PLAN.md
@@ -0,0 +1,60 @@
# Test Plan: Accelerator HAL Migration

This document outlines the test plan to verify that the migration to the Accelerator HAL (Hardware Abstraction Layer) preserves existing functionality for NVML-based monitoring and health checks.

## Objective

Ensure that all existing NVML paths (`nvml_monitor` and `check_nvidia_smi`) continue to function identically after being refactored to use the `AcceleratorManager` and `NVMLBackend` interface.

## Coverage Areas

1. **Metric Collection (`nvml_monitor`)**: Verifying GPU metrics (utilization, memory, power, temperature, clocks, ECC) are collected correctly.
2. **Health Checks (`check_nvidia_smi`)**: Verifying GPU presence, running processes, and error detection.
3. **Error Handling**: Ensuring that backend unavailability or device errors are handled gracefully and logged appropriately.

## Test Cases

### 1. Unit Tests

Run existing unit tests to verify no regressions in logic.

```bash
pytest gcm/tests/test_accelerator_hal.py
pytest gcm/tests/health_checks_tests/test_check_nvidia_smi.py
pytest gcm/tests/test_nvml_monitor.py
```

### 2. Manual Verification (Stubbed)

Since we cannot run on actual GPU hardware in this environment, we rely on the stubbed NVML library used in tests.

#### A. NVML Monitor

**Refactored Logic:**
`nvml_monitor` now instantiates `AcceleratorManager`, probes backends, and uses `AcceleratorTelemetryAdapter` to interact with device handles provided by `NVMLBackend`.

**Verification Step:**
Verify that `nvml_monitor.py` correctly fetches device count and metrics via the adapter. The adapter ensures that underlying `pynvml` calls are routed through the `AcceleratorManager`'s backend instance.
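The routing described above can be sketched with stand-ins. The names `AcceleratorManager`, `NVMLBackend`, and `AcceleratorTelemetryAdapter` come from this PR, but the bodies below are hypothetical stubs for illustration, not the shipped implementation:

```python
# Hypothetical sketch: legacy telemetry calls routed through a
# manager-owned backend. Names mirror the PR; all bodies are stubs.
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class FakeBackend:
    """Stand-in for NVMLBackend: serves canned device handles."""

    devices: List[str] = field(default_factory=lambda: ["0", "1"])

    def enumerate_devices(self) -> List[str]:
        return list(self.devices)


@dataclass
class Manager:
    """Stand-in for AcceleratorManager: owns the backend instance."""

    backend_factory: Callable[[], FakeBackend] = FakeBackend
    _backend: Optional[FakeBackend] = None

    def backend(self) -> FakeBackend:
        # Lazily create the backend so probing stays deferred.
        if self._backend is None:
            self._backend = self.backend_factory()
        return self._backend


class TelemetryAdapter:
    """Stand-in for AcceleratorTelemetryAdapter: exposes the legacy
    DeviceTelemetryClient surface on top of the manager."""

    def __init__(self, manager: Manager) -> None:
        self._manager = manager

    def get_device_count(self) -> int:
        # Legacy callers keep calling get_device_count(); the answer
        # now comes from the manager's backend.
        return len(self._manager.backend().enumerate_devices())


adapter = TelemetryAdapter(Manager())
print(adapter.get_device_count())  # 2 with the canned two-device backend
```

The point of the shape is that `nvml_monitor` never talks to `pynvml` directly anymore; every call funnels through the adapter into the manager's backend instance.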

#### B. Health Checks

**Refactored Logic:**
`check_nvidia_smi` now instantiates `AcceleratorManager` and uses `AcceleratorTelemetryAdapter` to perform checks.

**Verification Step:**
Verify that `check_nvidia_smi.py` correctly detects GPU count and running processes via the adapter.
Also verify `gcm/tests/test_gcm.py::test_health_checks_backend_nvml_full_run`, which exercises the full health-check loop; that test was updated to tolerate potential extra output from check execution.

## Refactoring Status

- **`gcm/accelerator`**: Core HAL interfaces and NVML backend implementation are complete.
- **`nvml_monitor.py`**: Refactored to use `AcceleratorManager` via `AcceleratorTelemetryAdapter`.
- **`check_nvidia_smi.py`**: Refactored to use `AcceleratorManager` via `AcceleratorTelemetryAdapter`.
- **Legacy Shim**: Added `gcm/monitoring/accelerator_adapter.py` to bridge `DeviceTelemetryClient` calls to the HAL backend, ensuring 100% backward compatibility for methods not yet fully exposed in `MetricSet` (e.g., specific ECC error counts).

## Rollout Strategy

1. **Phase 1 (Current PR)**: Introduce HAL, migrate all NVML usage to `AcceleratorManager` via adapter shim.
2. **Phase 2 (Future)**: Update `nvml_monitor` logic to use `AcceleratorManager.read_metrics()` directly, removing dependency on `DeviceTelemetryClient` interface once `MetricSet` is expanded to cover all needs.

This incremental approach ensures that the new architecture is active immediately while minimizing risk to existing business logic.
30 changes: 30 additions & 0 deletions gcm/accelerator/__init__.py
@@ -0,0 +1,30 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
from gcm.accelerator.backend import (
    AcceleratorBackend,
    BackendName,
    DeviceHandle,
    ProbeResult,
)
from gcm.accelerator.errors import (
    AcceleratorError,
    BackendUnavailableError,
    UnsupportedOperationError,
)
from gcm.accelerator.manager import AcceleratorManager
from gcm.accelerator.metrics import MetricRequest, MetricSet
from gcm.accelerator.registry import default_backend_factories

__all__ = [
    "AcceleratorBackend",
    "AcceleratorError",
    "AcceleratorManager",
    "BackendName",
    "BackendUnavailableError",
    "DeviceHandle",
    "MetricRequest",
    "MetricSet",
    "ProbeResult",
    "UnsupportedOperationError",
    "default_backend_factories",
]
49 changes: 49 additions & 0 deletions gcm/accelerator/backend.py
@@ -0,0 +1,49 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Callable, List, Protocol

from gcm.accelerator.metrics import MetricRequest, MetricSet


class BackendName(str, Enum):
    NVML = "nvml"


@dataclass(frozen=True)
class ProbeResult:
    backend: BackendName
    healthy: bool
    reason: str
    library_path: str | None = None
    driver_version: str | None = None
    probed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass(frozen=True)
class DeviceHandle:
    backend: BackendName
    id: str
    vendor: str
    model: str | None = None
    bus_id: str | None = None
    serial: str | None = None


class AcceleratorBackend(Protocol):
    def name(self) -> BackendName: ...

    def probe(self) -> ProbeResult: ...

    def enumerate_devices(self) -> List[DeviceHandle]: ...

    def read_metrics(
        self, device: DeviceHandle, request: MetricRequest
    ) -> MetricSet: ...

    def close(self) -> None: ...


BackendFactory = Callable[[], AcceleratorBackend]
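Because `AcceleratorBackend` is a `typing.Protocol`, backends conform structurally rather than by inheritance. A minimal sketch, using condensed stand-ins for the dataclasses above rather than the shipped code:

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass(frozen=True)
class Handle:  # condensed stand-in for DeviceHandle
    id: str
    vendor: str


class Backend(Protocol):  # condensed stand-in for AcceleratorBackend
    def enumerate_devices(self) -> List[Handle]: ...

    def close(self) -> None: ...


class InMemoryBackend:
    """Satisfies Backend structurally, without inheriting from it."""

    def enumerate_devices(self) -> List[Handle]:
        return [Handle(id="0", vendor="test")]

    def close(self) -> None:
        pass


def count_devices(backend: Backend) -> int:
    # Accepts anything with the right methods, so tests can inject
    # an in-memory backend in place of the real NVML one.
    devices = backend.enumerate_devices()
    backend.close()
    return len(devices)


print(count_devices(InMemoryBackend()))  # 1
```

This is what lets the test suite inject stub backends into `AcceleratorManager` without touching real GPU hardware.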
2 changes: 2 additions & 0 deletions gcm/accelerator/backends/__init__.py
@@ -0,0 +1,2 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
187 changes: 187 additions & 0 deletions gcm/accelerator/backends/nvml.py
@@ -0,0 +1,187 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, Optional, TypeVar

from gcm.accelerator.backend import BackendName, DeviceHandle, ProbeResult
from gcm.accelerator.errors import BackendUnavailableError, UnsupportedOperationError
from gcm.accelerator.metrics import MetricRequest, MetricSet
from gcm.accelerator.probe import find_and_load_library
from gcm.monitoring.device_telemetry_client import (
    DeviceTelemetryClient,
    DeviceTelemetryException,
)
from gcm.monitoring.utils.error import safe_call
from gcm.schemas.gpu.application_clock import ApplicationClockInfo

from gcm.schemas.gpu.memory import GPUMemory
from gcm.schemas.gpu.utilization import GPUUtilization

_NAMES = ["nvidia-ml"]
_PATHS = [
    "/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1",
    "/usr/lib64/libnvidia-ml.so.1",
    "/usr/lib/libnvidia-ml.so.1",
]

_T = TypeVar("_T")


def _default_nvml_client_factory() -> DeviceTelemetryClient:
    # Keep the import lazy so this package can still be imported in
    # environments where pynvml is unavailable.
    from gcm.monitoring.device_telemetry_nvml import NVMLDeviceTelemetryClient

    return NVMLDeviceTelemetryClient()


@dataclass
class NVMLBackend:
    telemetry_client_factory: Callable[[], DeviceTelemetryClient] = (
        _default_nvml_client_factory
    )
    _client: Optional[DeviceTelemetryClient] = field(
        default=None, init=False, repr=False
    )
    _handles: dict[str, Any] = field(default_factory=dict, init=False, repr=False)

    def name(self) -> BackendName:
        return BackendName.NVML

    def _ensure_client(self) -> DeviceTelemetryClient:
        if self._client is None:
            self._client = self.telemetry_client_factory()
        return self._client

    def probe(self) -> ProbeResult:
        path = find_and_load_library(_NAMES, _PATHS)
        if path is None:
            raise BackendUnavailableError("NVML shared library not found")
        client = self._ensure_client()
        try:
            client.get_device_count()
        except DeviceTelemetryException as e:
            raise BackendUnavailableError("NVML initialization failed") from e
        return ProbeResult(
            backend=self.name(),
            healthy=True,
            reason="ready",
            library_path=path,
            probed_at=datetime.now(timezone.utc),
        )

    def enumerate_devices(self) -> list[DeviceHandle]:
        client = self._ensure_client()
        try:
            device_count = client.get_device_count()
            devices: list[DeviceHandle] = []
            for index in range(device_count):
                model: Optional[str] = None

                # Check cache first or fetch handle
                dev_id = str(index)
                if dev_id in self._handles:
                    handle = self._handles[dev_id]
                else:
                    handle = client.get_device_by_index(index)
                    self._handles[dev_id] = handle

                model_getter = getattr(handle, "get_name", None)
                if callable(model_getter):
                    maybe_model = self._safe_call(model_getter)
                    if isinstance(maybe_model, str):
                        model = maybe_model
                devices.append(
                    DeviceHandle(
                        backend=self.name(),
                        id=dev_id,
                        vendor="nvidia",
                        model=model,
                    )
                )
            return devices
        except DeviceTelemetryException as e:
            raise UnsupportedOperationError("NVML enumerate_devices failed") from e

    @staticmethod
    def _safe_call(func: Callable[[], _T]) -> _T | None:
        return safe_call(func, DeviceTelemetryException, logger_name=__name__)

    def read_metrics(self, device: DeviceHandle, _request: MetricRequest) -> MetricSet:
        # TODO: Wire MetricRequest.include_process_info once process telemetry
        # is available through HAL MetricSet.
        client = self._ensure_client()
        try:
            if device.id in self._handles:
                handle = self._handles[device.id]
            else:
                index = int(device.id)
                handle = client.get_device_by_index(index)
                self._handles[device.id] = handle
        except (ValueError, DeviceTelemetryException) as e:
            raise UnsupportedOperationError(
                f"invalid NVML device id: {device.id}"
            ) from e

        utilization: GPUUtilization | None = self._safe_call(
            handle.get_utilization_rates
        )
        memory: GPUMemory | None = self._safe_call(handle.get_memory_info)
        temperature: int | None = self._safe_call(handle.get_temperature)
        power_usage: int | None = self._safe_call(handle.get_power_usage)
        power_limit: int | None = self._safe_call(handle.get_enforced_power_limit)
        clocks: ApplicationClockInfo | None = self._safe_call(handle.get_clock_freq)
        ecc_corrected: int | None = self._safe_call(
            handle.get_ecc_corrected_volatile_total
        )
        ecc_uncorrected: int | None = self._safe_call(
            handle.get_ecc_uncorrected_volatile_total
        )

        return MetricSet(
            timestamp=datetime.now(timezone.utc),
            core_util_pct=(float(utilization.gpu) if utilization is not None else None),
            mem_util_pct=(
                float(utilization.memory) if utilization is not None else None
            ),
            mem_total_bytes=(int(memory.total) if memory is not None else None),
            mem_used_bytes=(int(memory.used) if memory is not None else None),
            temp_c=(float(temperature) if temperature is not None else None),
            power_w=(float(power_usage) / 1000.0 if power_usage is not None else None),
            power_limit_w=(
                float(power_limit) / 1000.0 if power_limit is not None else None
            ),
            sm_clock_mhz=(int(clocks.graphics_freq) if clocks is not None else None),
            mem_clock_mhz=(int(clocks.memory_freq) if clocks is not None else None),
            ecc_corrected=(int(ecc_corrected) if ecc_corrected is not None else None),
            ecc_uncorrected=(
                int(ecc_uncorrected) if ecc_uncorrected is not None else None
            ),
        )

    def get_raw_handle(self, device_id: str) -> Any:
        client = self._ensure_client()
        if device_id in self._handles:
            return self._handles[device_id]

        try:
            index = int(device_id)
            handle = client.get_device_by_index(index)
            self._handles[device_id] = handle
            return handle
        except (ValueError, DeviceTelemetryException) as e:
            raise UnsupportedOperationError(
                f"invalid NVML device id: {device_id}"
            ) from e

    def close(self) -> None:
        client = self._client
        self._client = None
        if client is None:
            return None

        close_method = getattr(client, "close", None)
        if callable(close_method):
            close_method()
        return None
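The handle cache shared by `enumerate_devices`, `read_metrics`, and `get_raw_handle` above is a plain get-or-create pattern keyed by device id. A condensed, self-contained sketch (the client here is a hypothetical stand-in for `DeviceTelemetryClient`):

```python
from typing import Any, Dict


class FakeClient:
    """Stand-in for DeviceTelemetryClient: hands out opaque handles
    and counts how often it is asked."""

    def __init__(self) -> None:
        self.calls = 0

    def get_device_by_index(self, index: int) -> str:
        self.calls += 1
        return f"handle-{index}"


class HandleCache:
    def __init__(self, client: FakeClient) -> None:
        self._client = client
        self._handles: Dict[str, Any] = {}

    def get(self, device_id: str) -> Any:
        # Get-or-create: repeated lookups reuse the cached handle, so
        # the underlying client is queried at most once per device.
        if device_id not in self._handles:
            self._handles[device_id] = self._client.get_device_by_index(
                int(device_id)
            )
        return self._handles[device_id]


client = FakeClient()
cache = HandleCache(client)
cache.get("0")
cache.get("0")
print(client.calls)  # 1: the second lookup hit the cache
```

Caching matters here because `read_metrics` runs on every collection tick; without it, each tick would re-resolve every device handle through NVML.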
26 changes: 26 additions & 0 deletions gcm/accelerator/errors.py
@@ -0,0 +1,26 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
from dataclasses import dataclass

from gcm.accelerator.backend import BackendName


class AcceleratorError(Exception):
    """Base exception type for accelerator HAL failures."""


class BackendUnavailableError(AcceleratorError):
    """Raised when backend probe fails due to missing runtime dependencies."""


class UnsupportedOperationError(AcceleratorError):
    """Raised when an operation is not implemented by a backend."""


@dataclass(frozen=True)
class BackendOperationError(AcceleratorError):
    backend: BackendName
    operation: str

    def __str__(self) -> str:
        return f"backend={self.backend.value} operation={self.operation}"
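A single base class means callers can handle any HAL failure with one `except` clause. A short usage sketch with condensed stand-ins for the hierarchy above (not the shipped module):

```python
class AcceleratorError(Exception):
    """Condensed stand-in for the HAL base exception."""


class BackendUnavailableError(AcceleratorError):
    """Condensed stand-in: probe failed, runtime deps missing."""


def probe_or_report() -> str:
    try:
        # Simulate a failed probe, e.g. the NVML library is absent.
        raise BackendUnavailableError("NVML shared library not found")
    except AcceleratorError as e:
        # Subclasses are caught by the base type, so callers need only
        # one except clause to cover all HAL failure modes.
        return f"probe failed: {e}"


print(probe_or_report())  # probe failed: NVML shared library not found
```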