@ilana-n
Contributor

@ilana-n ilana-n commented Oct 13, 2025

Problem

When running multiple server tests sequentially in test_docs_end_to_end, DCGM containers on port 9401 and vLLM containers on port 8000 weren't being properly cleaned up, causing "port already allocated" errors when the next test tried to start.

Root Cause

  • Dynamo's docker compose down returns before containers fully stop
  • vLLM containers have random names, so the wildcard stop by name never matched them and they were left running

Solution

  1. Dynamo cleanup:

    • Added 3-second sleep after docker compose down to ensure full cleanup of all related services including DCGM before proceeding
  2. vLLM cleanup:

    • Grep the image name for vllm to find containers (instead of matching on the container name, which is randomly assigned)
    • Explicitly stop dcgm-exporter by name

Each test now properly cleans up after itself, ensuring the next test starts with a clean environment.
Also, the vLLM GPU Telemetry documentation now runs successfully. Dynamo GPU Telemetry is still being debugged as tracked in this Linear ticket.
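
The image-based lookup described in the Solution can be sketched in Python as follows. This is a minimal illustration, not the code in test_runner.py: the helper names `select_ids_by_image` and `stop_vllm_containers` are hypothetical, and the sketch assumes `docker ps --format "{{.ID}} {{.Image}}"` prints one `ID IMAGE` pair per line.

```python
import subprocess


def select_ids_by_image(ps_output: str, needle: str) -> list[str]:
    """Return container IDs whose image name contains `needle`.

    `ps_output` is expected to hold one "<id> <image>" pair per line.
    """
    ids = []
    for line in ps_output.splitlines():
        parts = line.split(maxsplit=1)
        # Skip blank or malformed lines instead of failing the teardown.
        if len(parts) == 2 and needle in parts[1]:
            ids.append(parts[0])
    return ids


def stop_vllm_containers() -> None:
    """Stop and remove containers started from a vLLM image (requires Docker)."""
    listing = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.ID}} {{.Image}}"],
        capture_output=True,
        text=True,
        check=False,
    ).stdout
    ids = select_ids_by_image(listing, "vllm")
    if ids:
        # check=False mirrors the "|| true" style: cleanup is best effort.
        subprocess.run(["docker", "stop", *ids], check=False)
        subprocess.run(["docker", "rm", *ids], check=False)
```

Keeping the parsing in a pure function means the matching logic can be checked without a Docker daemon; only `stop_vllm_containers` touches the CLI.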

GitLab

Successful pipeline job run here.

GPU Telemetry Documentation

Added tags to the vLLM instructions so that they are included in the docs end-to-end CI tests.

Summary by CodeRabbit

  • Documentation

    • Expanded the GPU telemetry guide with new setup, health-check, and AIPerf benchmark sections; added closing markers and tips for the vLLM/DCGM and default-endpoint workflows.
  • Tests

    • Improved test teardown to more reliably stop/remove DCGM exporter across server types and added a short sequencing delay to reduce shutdown flakiness.

@coderabbitai

coderabbitai bot commented Oct 13, 2025

Walkthrough

Adds documentation block markers to the GPU telemetry tutorial and augments CI shutdown logic to explicitly stop/remove DCGM exporter containers across Dynamo, vLLM, and generic servers; adds a 3s sleep after Dynamo docker-compose down. No public APIs changed.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation markers for GPU telemetry tutorial: docs/tutorials/gpu-telemetry.md | Inserted setup, health-check, and AIPerf run block markers plus corresponding closing markers and tips. Editorial-only changes; no runtime logic. |
| CI test shutdown enhancements: tests/ci/test_docs_end_to_end/test_runner.py | Expanded shutdown sequences to explicitly stop/remove DCGM exporter containers for Dynamo, vLLM, and generic servers; changed vLLM container filtering to match image name containing "vllm"; added a 3s sleep after Dynamo docker-compose down. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant CI as CI Runner
  participant D as Docker
  participant DCGM as DCGM Exporter
  participant Dyn as Dynamo Server
  participant V as vLLM Server
  participant G as Generic Server

  rect rgba(230,245,255,0.5)
  note over CI: Dynamo shutdown path
  CI->>D: docker compose down (Dynamo)
  CI->>CI: sleep 3s
  CI->>D: stop/remove DCGM exporter
  end

  rect rgba(240,255,240,0.5)
  note over CI: vLLM shutdown path
  CI->>D: stop/remove DCGM exporter
  CI->>D: list containers where image name contains "vllm"
  CI->>D: stop/remove vLLM containers
  end

  rect rgba(255,245,230,0.5)
  note over CI: Generic server shutdown path
  CI->>D: server-specific stop/remove
  CI->>D: stop/remove DCGM exporter
  end
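
The three shutdown paths in the diagram can also be written down as ordered step lists. The sketch below only records the ordering. The step labels are copied from the diagram, while the `SHUTDOWN_STEPS` table, the `shutdown_steps` helper, the server-type keys, and the fallback to the generic path are assumptions made for illustration.

```python
# Step labels are descriptions taken from the diagram, not shell commands.
SHUTDOWN_STEPS = {
    "dynamo": [
        "docker compose down (Dynamo)",
        "sleep 3s",
        "stop/remove DCGM exporter",
    ],
    "vllm": [
        "stop/remove DCGM exporter",
        'list containers where image name contains "vllm"',
        "stop/remove vLLM containers",
    ],
    "generic": [
        "server-specific stop/remove",
        "stop/remove DCGM exporter",
    ],
}


def shutdown_steps(server_type: str) -> list[str]:
    """Return the ordered teardown steps for a server type.

    Unknown server types fall back to the generic path (an assumption).
    """
    steps = SHUTDOWN_STEPS.get(server_type, SHUTDOWN_STEPS["generic"])
    return list(steps)  # copy so callers cannot mutate the table
```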

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–30 minutes

Poem

I tap my paws on pipeline ground,
DCGM quieted, exporters found.
vLLM sleeps, Dynamo delays three,
Docs now show how GPUs speak to me.
Carrots and logs, tidy and neat. 🥕🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title succinctly describes the nature of the change by indicating a fix for container cleanup in the test_docs_end_to_end suite and explicitly states the goal of preventing port conflicts, directly reflecting the core issue and solution introduced by the PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@github-actions github-actions bot added the fix label Oct 13, 2025
@codecov

codecov bot commented Oct 13, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6c8be3 and 894ec63.

📒 Files selected for processing (2)
  • docs/tutorials/gpu-telemetry.md (5 hunks)
  • tests/ci/test_docs_end_to_end/test_runner.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build (ubuntu-latest, 3.10)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
tests/ci/test_docs_end_to_end/test_runner.py (1)

356-361: Good fix: removed -q to allow .Image in format.

The vLLM cleanup now correctly inspects image names and extracts IDs. This addresses the earlier template error and ensures containers are actually stopped/removed.

🧹 Nitpick comments (3)
tests/ci/test_docs_end_to_end/test_runner.py (3)

324-331: Avoid magic sleeps; poll for shutdown readiness instead.

A fixed sleep 3 can be flaky across runners. Prefer polling until resources are actually released (e.g., port 9401) or containers exit.

Apply this diff to replace the fixed sleep with a short readiness loop:

-                        sleep 3
+                        # Wait up to 15s for DCGM exporter to release port 9401
+                        for i in $(seq 1 15); do
+                          if ! ss -ltn | awk "{print \$4}" | grep -q ":9401$"; then
+                            break
+                          fi
+                          sleep 1
+                        done

Please confirm this resolves intermittent “port already allocated” flakes on slower CI runners.
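
Since test_runner.py is Python, the same polling idea can be sketched there as well. This is an illustrative variant, not the suggested diff: it swaps the `ss -ltn` check for a TCP connect probe, and the names `port_is_listening` and `wait_for_port_release`, the `127.0.0.1` host, and the default interval are assumptions. The 15-second ceiling mirrors the loop above.

```python
import socket
import time


def port_is_listening(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def wait_for_port_release(
    port: int, timeout_s: float = 15.0, interval_s: float = 1.0
) -> bool:
    """Poll until nothing listens on `port`; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if not port_is_listening(port):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A caller could run `wait_for_port_release(9401)` after `docker compose down` and fail the teardown loudly when it returns False, instead of letting the next test hit the port conflict.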


348-354: Make DCGM exporter detection robust; avoid ancestor wildcard.

--filter ancestor=*dcgm-exporter* may not match as intended: Docker’s ancestor filter expects an image name, ID, or digest rather than a glob pattern. Grep the image/name instead (as you did for vLLM).

Apply this diff:

-                        docker ps --filter ancestor=*dcgm-exporter* --format "{{.ID}}" | xargs -r docker stop 2>/dev/null || true
-                        docker ps -aq --filter ancestor=*dcgm-exporter* | xargs -r docker rm 2>/dev/null || true
+                        docker ps --format "{{.ID}} {{.Image}} {{.Names}}" | grep -i dcgm-exporter | awk "{print \$1}" | xargs -r docker stop 2>/dev/null || true
+                        docker ps -a --format "{{.ID}} {{.Image}} {{.Names}}" | grep -i dcgm-exporter | awk "{print \$1}" | xargs -r docker rm 2>/dev/null || true

If you prefer filters, --filter name=dcgm-exporter is another option (matches substrings). Please verify in your environment.


374-378: Generic shutdown: add image/name-based DCGM cleanup as fallback.

If the container name differs from dcgm-exporter, the stop/rm by name won’t catch it. Mirror the vLLM approach.

Apply this diff:

                         echo "Stopping DCGM containers..."
                         docker stop dcgm-exporter 2>/dev/null || true
                         docker rm dcgm-exporter 2>/dev/null || true
+                        docker ps --format "{{.ID}} {{.Image}} {{.Names}}" | grep -i dcgm-exporter | awk "{print \$1}" | xargs -r docker stop 2>/dev/null || true
+                        docker ps -a --format "{{.ID}} {{.Image}} {{.Names}}" | grep -i dcgm-exporter | awk "{print \$1}" | xargs -r docker rm 2>/dev/null || true
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 894ec63 and 83e65ad.

📒 Files selected for processing (1)
  • tests/ci/test_docs_end_to_end/test_runner.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build (ubuntu-latest, 3.10)

@ilana-n ilana-n force-pushed the ilana/fix-test-docs-end-to-end branch 2 times, most recently from 001c364 to 894ec63 Compare October 14, 2025 20:46
@ilana-n ilana-n force-pushed the ilana/fix-test-docs-end-to-end branch from 894ec63 to ef35e8c Compare October 14, 2025 20:48
@ganeshku1
Member

Added 3-second sleep after docker compose down to ensure full cleanup of all related services including DCGM before proceeding

@ilana-n Is this sufficient for all use cases? Do we envision this being an issue on different hardware?
Can you share any references used to settle on these 3 seconds?
