add python accelerator HAL for multi-vendor backends #86
theap06 wants to merge 24 commits into facebookresearch:main
Could I get some feedback? This is tackling my aforementioned feature request on supporting other hardware.
luccabb left a comment
adds scaffold backends for ROCm, LevelZero, TPU, and Neuron
let's just keep NVML since it's the only backend supported in this PR
this should be a full run of the cmd with the corresponding output for each
force-pushed from 44b06c0 to 49fa293
luccabb left a comment
gcm --backend=nvml --sink=stdout ...
healthchecks ... --backend=nvml --sink=stdout ...
this should be a full run of the cmd with the corresponding output for each
@luccabb the hardware backend works and the functionality works on my end with a gpu instance. let me know if you have any other feedback
@luccabb could I get some feedback? I added the testing for HAL in the pr message.
@claude can you review this pr?
I'll analyze this and get back to you.
@claude can you review this pr?
Claude finished @luccabb's task in 1m 52s

PR Review: add python accelerator HAL for multi-vendor backends

Overall, this is a clean, well-structured HAL abstraction. The module layout, error hierarchy, Protocol-based interface, and test coverage are all solid. A few issues need to be addressed before merging.

Critical Issues

1. In both

    def main(detach: bool, backend: str) -> None:
        _ = backend  # ← never forwarded anywhere

The option appears in

2. In

    model = client.get_device_by_index(index).get_vbios_version()

Minor Issues

3. In

    def close(self) -> None:
        self._client = None  # ← just drops the reference; no shutdown call
        return None

If

4. In

5. In

    @dataclass
    class NVMLBackend(AcceleratorBackend):

Observations / Non-blocking

Summary

The HAL design is solid and the tests are thorough. The main issue is that the
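Items 1 and 3 of the review above can be illustrated with a minimal sketch. It is not the PR's code: `FakeClient`, its `shutdown()` method, `SketchBackend`, `BACKEND_FACTORIES`, and `make_backend` are hypothetical names, and it assumes the vendor client holds a resource that needs an explicit release. The idea is only that the `backend` argument gets resolved to a backend instance rather than discarded, and that `close()` releases the handle before dropping the reference.

```python
from typing import Callable, Dict, Optional


class FakeClient:
    """Stand-in for a vendor client; shutdown() is a hypothetical method."""

    def __init__(self) -> None:
        self.shut_down = False

    def shutdown(self) -> None:
        self.shut_down = True


class SketchBackend:
    """Hypothetical backend that owns a client handle."""

    def __init__(self, client: FakeClient) -> None:
        self._client: Optional[FakeClient] = client

    def close(self) -> None:
        # Release the handle before dropping the reference; a second
        # close() finds None and does nothing.
        client, self._client = self._client, None
        if client is not None:
            client.shutdown()


# Hypothetical registry mapping the CLI value to a backend factory.
BACKEND_FACTORIES: Dict[str, Callable[[], SketchBackend]] = {
    "nvml": lambda: SketchBackend(FakeClient()),
}


def make_backend(backend: str) -> SketchBackend:
    """Resolve the backend option instead of discarding it."""
    factory = BACKEND_FACTORIES.get(backend)
    if factory is None:
        raise ValueError(f"unknown backend: {backend!r}")
    return factory()


def main(detach: bool, backend: str) -> None:
    instance = make_backend(backend)  # the option now reaches the backend layer
    try:
        pass  # the collection loop would go here
    finally:
        instance.close()
```

Rejecting unknown names with an error, rather than silently falling back to NVML, keeps a mistyped option from looking like a successful run.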
@claude review my pr
Claude finished @luccabb's task in 1m 8s

PR Re-Review: add python accelerator HAL for multi-vendor backends

One Issue Blocking Merge

The PR removes these 5 lines from

    - name: Install Rust
      run: |
        curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
        cp -r $HOME/.cargo/bin/* $HOME/.cache/venv-ci/bin/

This step was added to

The author's explanation ("reverted it since I removed the related code from this PR") suggests they added this step in their branch and are now reverting their own addition, but this step was already on

Fix: Restore those 5 lines so the workflow file is identical to

Everything else from all prior reviews remains fixed. This is the only remaining change needed before merge.
@luccabb added back the rust dep
force-pushed from 8694aef to 5e0fbb9
you want this PR to be in a state where (1) ALL existing NVML paths from monitoring and health checks go through the new accelerator interface and behavior is the same as before the PR. After (1) is accomplished, add a test plan that exercises every code path that was updated, with before and after sections to prove that it's the same functionality.
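One way to read the before/after request above is as a parity check: feed the same fake readings to the old direct path and to the new interface path, then require identical output. The sketch below is an assumption-heavy illustration, not code from this repository; `legacy_collect`, `HalCollector`, `assert_parity`, and the `gpu_util` field are all hypothetical names.

```python
from typing import Callable, Dict, List

Reading = Dict[str, int]


def legacy_collect(read_util: Callable[[int], int], count: int) -> List[Reading]:
    """Stand-in for the pre-PR path that talks to the NVML client directly."""
    return [{"index": i, "gpu_util": read_util(i)} for i in range(count)]


class HalCollector:
    """Stand-in for the post-PR path that goes through the accelerator interface."""

    def __init__(self, read_util: Callable[[int], int], count: int) -> None:
        self._read_util = read_util
        self._count = count

    def collect(self) -> List[Reading]:
        return [
            {"index": i, "gpu_util": self._read_util(i)}
            for i in range(self._count)
        ]


def assert_parity(read_util: Callable[[int], int], count: int) -> List[Reading]:
    """Run both paths on the same fake device readings and compare the results."""
    before = legacy_collect(read_util, count)
    after = HalCollector(read_util, count).collect()
    assert before == after, f"behavior changed: {before!r} != {after!r}"
    return after
```

A real test plan would still need the full command runs and captured output the reviewer asked for; a check like this only guards against regressions in the collected values.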
force-pushed from 5e0fbb9 to d694a52
force-pushed from d694a52 to 3989d0d
force-pushed from 3989d0d to ccc9129
addressed
@luccabb could i get some feedback on this pr
@luccabb it should be ready to merge
@luccabb will close this pr.

Introduce a hardware-agnostic accelerator abstraction layer with normalized metrics, backend management, and runtime probing. Includes a functional NVML backend plus ROCm/LevelZero/TPU/Neuron scaffolds and dedicated HAL tests.
Adds a Python-first hardware-agnostic accelerator HAL at gcm/monitoring/accelerator.
Decouples telemetry collection from NVML-only assumptions via a common backend interface and normalized metrics.
Implements functional NVMLBackend; adds scaffold backends for ROCm, LevelZero, TPU, and Neuron.
Implements Feature Request #74.
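For readers unfamiliar with the pattern, the description above (a common backend interface, normalized metrics, runtime probing) can be sketched roughly as follows. Only the name `AcceleratorBackend` comes from this PR; the method names, the `DeviceMetrics` fields, `FakeBackend`, and `select_backends` are assumptions made for illustration and may not match the actual module.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol


@dataclass(frozen=True)
class DeviceMetrics:
    """Hypothetical vendor-neutral metrics record; None means 'not reported'."""

    index: int
    utilization_pct: Optional[float] = None
    memory_used_bytes: Optional[int] = None
    memory_total_bytes: Optional[int] = None


class AcceleratorBackend(Protocol):
    """Structural interface a backend is assumed to satisfy in this sketch."""

    name: str

    def probe(self) -> bool: ...

    def device_count(self) -> int: ...

    def read_metrics(self, index: int) -> DeviceMetrics: ...

    def close(self) -> None: ...


class FakeBackend:
    """In-memory backend so the sketch runs without any vendor library."""

    def __init__(self, name: str, available: bool = True) -> None:
        self.name = name
        self._available = available

    def probe(self) -> bool:
        return self._available

    def device_count(self) -> int:
        return 2

    def read_metrics(self, index: int) -> DeviceMetrics:
        return DeviceMetrics(
            index=index,
            utilization_pct=50.0,
            memory_used_bytes=1024,
            memory_total_bytes=4096,
        )

    def close(self) -> None:
        pass


def select_backends(
    candidates: List[AcceleratorBackend],
) -> Dict[str, AcceleratorBackend]:
    """Runtime probing: keep only the backends whose probe() succeeds."""
    return {b.name: b for b in candidates if b.probe()}
```

Because `Protocol` is structural, `FakeBackend` satisfies the interface without inheriting from it, which is what makes it cheap to substitute fakes in tests.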
Test Plan:
Ran HAL tests:
12 passed