fix: Ensure proper container cleanup in test_docs_end_to_end to prevent port conflicts #355

ilana-n · 2025-10-13T22:51:20Z

Problem

When running multiple server tests sequentially in test_docs_end_to_end, DCGM containers on port 9401 and vLLM containers on port 8000 weren't being properly cleaned up, causing "port already allocated" errors when the next test tried to start.

Root Cause

- Dynamo's docker compose down returns before the containers have fully stopped.
- vLLM containers are assigned random names, so the wildcard stop by name never matched them and they were left running.

Solution

Dynamo cleanup:
- Added 3-second sleep after docker compose down to ensure full cleanup of all related services including DCGM before proceeding
vLLM cleanup:
- Find containers by grepping the image name for vllm (instead of matching on container name, which is randomly assigned)
- Explicitly stop dcgm-exporter by name
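
The image-based lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code in `test_runner.py`: the helper name `ids_by_image` is hypothetical, and it assumes the `docker ps` output is formatted as `{{.ID}} {{.Image}}`, one container per line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "ID IMAGE" lines on stdin and print the IDs
# whose image name contains the given substring (case-insensitive).
# Only the image column is inspected, so an ID that happens to contain
# the substring does not produce a false match.
ids_by_image() {
  local needle="$1"
  awk -v needle="$needle" 'index(tolower($2), tolower(needle)) { print $1 }'
}

# Intended use (requires a running Docker daemon):
#   docker ps --format '{{.ID}} {{.Image}}' | ids_by_image vllm | xargs -r docker stop
#   docker ps -a --format '{{.ID}} {{.Image}}' | ids_by_image vllm | xargs -r docker rm
```

Keeping the filter separate from the `docker` calls makes it possible to check the matching logic against canned input without starting any containers.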

Each test now properly cleans up after itself, ensuring the next test starts with a clean environment.
Also, the vLLM GPU Telemetry documentation now runs successfully. Dynamo GPU Telemetry is still being debugged as tracked in this Linear ticket.

Gitlab

Successful pipeline job run here.

GPU Telemetry Documentation

Added tags to the vLLM instructions so that they are included in the docs end-to-end CI tests.

Summary by CodeRabbit

Documentation
- Expanded GPU telemetry guide with new setup, health-check, and AIPerf benchmark sections; added closing markers and tips for vLLM/DCGM and default endpoint workflows.
Tests
- Improved test teardown to more reliably stop/remove DCGM exporter across server types and added a short sequencing delay to reduce shutdown flakiness.

coderabbitai · 2025-10-13T22:51:28Z

Walkthrough

Adds documentation block markers to the GPU telemetry tutorial and augments CI shutdown logic to explicitly stop/remove DCGM exporter containers across Dynamo, vLLM, and generic servers; adds a 3s sleep after Dynamo docker-compose down. No public APIs changed.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation markers for GPU telemetry tutorial: `docs/tutorials/gpu-telemetry.md` | Inserted setup, health-check, and AIPerf run block markers plus corresponding closing markers and tips. Editorial-only changes; no runtime logic. |
| CI test shutdown enhancements: `tests/ci/test_docs_end_to_end/test_runner.py` | Expanded shutdown sequences to explicitly stop/remove DCGM exporter containers for Dynamo, vLLM, and generic servers; changed vLLM container filtering to match image name containing `"vllm"`; added a 3s sleep after Dynamo docker-compose down. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant CI as CI Runner
  participant D as Docker
  participant DCGM as DCGM Exporter
  participant Dyn as Dynamo Server
  participant V as vLLM Server
  participant G as Generic Server

  rect rgba(230,245,255,0.5)
  note over CI: Dynamo shutdown path
  CI->>D: docker compose down (Dynamo)
  CI->>CI: sleep 3s
  CI->>D: stop/remove DCGM exporter
  end

  rect rgba(240,255,240,0.5)
  note over CI: vLLM shutdown path
  CI->>D: stop/remove DCGM exporter
  CI->>D: list containers where image name contains "vllm"
  CI->>D: stop/remove vLLM containers
  end

  rect rgba(255,245,230,0.5)
  note over CI: Generic server shutdown path
  CI->>D: server-specific stop/remove
  CI->>D: stop/remove DCGM exporter
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–30 minutes

Poem

I tap my paws on pipeline ground,
DCGM quieted, exporters found.
vLLM sleeps, Dynamo delays three,
Docs now show how GPUs speak to me.
Carrots and logs, tidy and neat. 🥕🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title succinctly describes the nature of the change by indicating a fix for container cleanup in the test_docs_end_to_end suite and explicitly states the goal of preventing port conflicts, directly reflecting the core issue and solution introduced by the PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

codecov · 2025-10-13T22:53:24Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6c8be3 and 894ec63.

📒 Files selected for processing (2)

docs/tutorials/gpu-telemetry.md (5 hunks)
tests/ci/test_docs_end_to_end/test_runner.py (3 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: build (ubuntu-latest, 3.10)

tests/ci/test_docs_end_to_end/test_runner.py

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

tests/ci/test_docs_end_to_end/test_runner.py (1)

356-361: Good fix: removed -q to allow .Image in format.

The vLLM cleanup now correctly inspects image names and extracts IDs. This addresses the earlier template error and ensures containers are actually stopped/removed.

🧹 Nitpick comments (3)

tests/ci/test_docs_end_to_end/test_runner.py (3)

324-331: Avoid magic sleeps; poll for shutdown readiness instead.

A fixed sleep 3 can be flaky across runners. Prefer polling until resources are actually released (e.g., port 9401) or containers exit.

Apply this diff to replace the fixed sleep with a short readiness loop:

-                        sleep 3
+                        # Wait up to 15s for DCGM exporter to release port 9401
+                        for i in $(seq 1 15); do
+                          if ! ss -ltn | awk "{print \$4}" | grep -q ":9401$"; then
+                            break
+                          fi
+                          sleep 1
+                        done

Please confirm this resolves intermittent “port already allocated” flakes on slower CI runners.
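
The polling idea can also be factored into a small reusable helper. The sketch below is an assumption-laden illustration rather than the code proposed in the diff: `wait_until` and `port_is_free` are hypothetical names, and `port_is_free` assumes `ss` (from iproute2) is installed on the runner.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a check command once per second until it
# succeeds or the timeout (in seconds) is reached.
# Returns 0 if the check succeeded, 1 if it timed out.
wait_until() {
  local timeout="$1"
  shift
  local elapsed=0
  while ! "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Hypothetical check: succeeds when no TCP listener is bound to the port.
# Caveat: if `ss` is missing, the pipeline finds no listener and the port
# is reported as free, so verify that `ss` exists on the target runner.
port_is_free() {
  ! ss -ltn | awk '{print $4}' | grep -q ":$1\$"
}

# Intended use after `docker compose down`:
#   wait_until 15 port_is_free 9401 || echo "port 9401 still in use" >&2
```

Compared with a fixed `sleep 3`, this returns as soon as the port is released and fails loudly if the timeout expires, at the cost of depending on a port-inspection tool being present.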

348-354: Make DCGM exporter detection robust; avoid ancestor wildcard.

--filter ancestor=*dcgm-exporter* may not match as intended: Docker’s ancestor filter expects an image name, tag, or ID rather than a glob pattern. Grep the image/name instead (as you did for vLLM).

Apply this diff:

-                        docker ps --filter ancestor=*dcgm-exporter* --format "{{.ID}}" | xargs -r docker stop 2>/dev/null || true
-                        docker ps -aq --filter ancestor=*dcgm-exporter* | xargs -r docker rm 2>/dev/null || true
+                        docker ps --format "{{.ID}} {{.Image}} {{.Names}}" | grep -i dcgm-exporter | awk "{print \$1}" | xargs -r docker stop 2>/dev/null || true
+                        docker ps -a --format "{{.ID}} {{.Image}} {{.Names}}" | grep -i dcgm-exporter | awk "{print \$1}" | xargs -r docker rm 2>/dev/null || true

If you prefer filters, --filter name=dcgm-exporter is another option (matches substrings). Please verify in your environment.

374-378: Generic shutdown: add image/name-based DCGM cleanup as fallback.

If the container name differs from dcgm-exporter, the stop/rm by name won’t catch it. Mirror the vLLM approach.

Apply this diff:

                         echo "Stopping DCGM containers..."
                         docker stop dcgm-exporter 2>/dev/null || true
                         docker rm dcgm-exporter 2>/dev/null || true
+                        docker ps --format "{{.ID}} {{.Image}} {{.Names}}" | grep -i dcgm-exporter | awk "{print \$1}" | xargs -r docker stop 2>/dev/null || true
+                        docker ps -a --format "{{.ID}} {{.Image}} {{.Names}}" | grep -i dcgm-exporter | awk "{print \$1}" | xargs -r docker rm 2>/dev/null || true
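
The name-based part of this teardown can be wrapped in an idempotent helper so that every shutdown path uses the same commands. This is a minimal sketch under stated assumptions: `stop_and_remove` is a hypothetical name, and it only covers containers whose exact names are known, so the image/name grep fallback above would still be needed for anything else.

```shell
#!/usr/bin/env bash
# Hypothetical helper: stop and then remove each named container.
# Errors are ignored so the helper is safe to call when a container
# does not exist or has already been removed.
stop_and_remove() {
  local name
  for name in "$@"; do
    docker stop "$name" >/dev/null 2>&1 || true
    docker rm "$name" >/dev/null 2>&1 || true
  done
}

# Intended use (requires a running Docker daemon):
#   stop_and_remove dcgm-exporter
```

Because failures are swallowed, the helper never aborts a teardown sequence, but it also cannot report a container that refused to stop; pairing it with a port or container check afterwards covers that gap.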

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 894ec63 and 83e65ad.

📒 Files selected for processing (1)

tests/ci/test_docs_end_to_end/test_runner.py (3 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: build (ubuntu-latest, 3.10)

ganeshku1 · 2025-10-17T19:18:17Z

Added 3-second sleep after docker compose down to ensure full cleanup of all related services including DCGM before proceeding

@ilana-n Is this sufficient for all use cases? Do we envision this being an issue on different hardware?
Can you share any references that informed the choice of 3 seconds?

github-actions bot added the fix label Oct 13, 2025

coderabbitai bot reviewed Oct 13, 2025

coderabbitai bot reviewed Oct 14, 2025

ilana-n force-pushed the ilana/fix-test-docs-end-to-end branch 2 times, most recently from 001c364 to 894ec63 on October 14, 2025, 20:46

fix: fixed container stopping problem and added tags for vllm in gpu telemetry documentation (ef35e8c)

ilana-n force-pushed the ilana/fix-test-docs-end-to-end branch from 894ec63 to ef35e8c on October 14, 2025, 20:48

ilana-n requested review from ganeshku1 and matthewkotila October 14, 2025 22:38
