fix: GPU telemetry endpoint reachability logging + fix tests that weren't asserting properly #397

ilana-n · 2025-10-24T18:02:31Z

GPU telemetry endpoint reachability logging

Problem

When the default DCGM endpoints (localhost:9400 and localhost:9401) were reachable, the system logged "2/0 endpoints reachable" instead of "2/2 endpoints reachable".

Root Cause

The _compute_endpoints_for_display method in telemetry_manager.py returned an empty list when _user_explicitly_configured_telemetry was False, causing endpoints_configured to be empty even if endpoints_reachable contained 2 endpoints.

Fix

Updated the endpoint display logic to show reachable default endpoints regardless of configuration method:

If reachable defaults exist → always include them in display
Properly combine user-provided endpoints with reachable defaults
Only return empty list when no defaults are reachable AND user didn't configure telemetry

Testing

Added assertions to two tests to ensure we test for this proper logging: test_configure_sends_enabled_status_when_endpoints_reachable and test_configure_no_shutdown_when_no_endpoints_reachable
Also fixed some old tests that were using print statements instead of proper assertions

codecov · 2025-10-24T18:05:22Z

Codecov Report

❌ Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/aiperf/gpu_telemetry/telemetry_manager.py	83.33%	0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

…not actually asserting

…lector

ajcasagrande

LGTM!

ilana-n requested review from ajcasagrande and lkomali October 24, 2025 18:02

github-actions bot added the fix label Oct 24, 2025

ilana-n added 3 commits October 24, 2025 13:45

fix: the 2/0 problem when defaults reachable and fix tests that were …

cc6a495

…not actually asserting

fix: added assertions to tests to ensure proper gpu telemetry logging

22b6d08

fix: remove mocked waiting and sync callbacks from telemetry data col…

ec0cf4e

…lector

ilana-n force-pushed the ilana/fix-telemetry-endpoints-logs branch from a963b9d to ec0cf4e Compare October 24, 2025 20:48

fix: test

a44ccc4

ajcasagrande approved these changes Oct 24, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

fix: GPU telemetry endpoint reachability logging + fix tests that weren't asserting properly #397

fix: GPU telemetry endpoint reachability logging + fix tests that weren't asserting properly #397

Uh oh!

ilana-n commented Oct 24, 2025

Uh oh!

codecov bot commented Oct 24, 2025 •

edited

Loading

Uh oh!

ajcasagrande left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

fix: GPU telemetry endpoint reachability logging + fix tests that weren't asserting properly #397

Are you sure you want to change the base?

fix: GPU telemetry endpoint reachability logging + fix tests that weren't asserting properly #397

Uh oh!

Conversation

ilana-n commented Oct 24, 2025

GPU telemetry endpoint reachability logging

Problem

Root Cause

Fix

Testing

Uh oh!

codecov bot commented Oct 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ajcasagrande left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Oct 24, 2025 •

edited

Loading