-
Notifications
You must be signed in to change notification settings - Fork 3
fix: Ensure proper container cleanup in test_docs_end_to_end to prevent port conflicts #355
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughAdds documentation block markers to the GPU telemetry tutorial and augments CI shutdown logic to explicitly stop/remove DCGM exporter containers across Dynamo, vLLM, and generic servers; adds a 3s sleep after Dynamo docker-compose down. No public APIs changed. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant CI as CI Runner
participant D as Docker
participant DCGM as DCGM Exporter
participant Dyn as Dynamo Server
participant V as vLLM Server
participant G as Generic Server
rect rgba(230,245,255,0.5)
note over CI: Dynamo shutdown path
CI->>D: docker compose down (Dynamo)
CI->>CI: sleep 3s
CI->>D: stop/remove DCGM exporter
end
rect rgba(240,255,240,0.5)
note over CI: vLLM shutdown path
CI->>D: stop/remove DCGM exporter
CI->>D: list containers where image name contains "vllm"
CI->>D: stop/remove vLLM containers
end
rect rgba(255,245,230,0.5)
note over CI: Generic server shutdown path
CI->>D: server-specific stop/remove
CI->>D: stop/remove DCGM exporter
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20–30 minutes Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
docs/tutorials/gpu-telemetry.md(5 hunks)tests/ci/test_docs_end_to_end/test_runner.py(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build (ubuntu-latest, 3.10)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/ci/test_docs_end_to_end/test_runner.py (1)
356-361: Good fix: removed-qto allow.Imagein format.The vLLM cleanup now correctly inspects image names and extracts IDs. This addresses the earlier template error and ensures containers are actually stopped/removed.
🧹 Nitpick comments (3)
tests/ci/test_docs_end_to_end/test_runner.py (3)
324-331: Avoid magic sleeps; poll for shutdown readiness instead.A fixed sleep 3 can be flaky across runners. Prefer polling until resources are actually released (e.g., port 9401) or containers exit.
Apply this diff to replace the fixed sleep with a short readiness loop:
- sleep 3 + # Wait up to 15s for DCGM exporter to release port 9401 + for i in $(seq 1 15); do + if ! ss -ltn | awk "{print \$4}" | grep -q ":9401$"; then + break + fi + sleep 1 + donePlease confirm this resolves intermittent “port already allocated” flakes on slower CI runners.
348-354: Make DCGM exporter detection robust; avoid ancestor wildcard.
--filter ancestor=*dcgm-exporter*may not match as intended; Docker’s ancestor filter doesn’t support globs reliably. Grep image/name instead (like you did for vLLM).Apply this diff:
- docker ps --filter ancestor=*dcgm-exporter* --format "{{.ID}}" | xargs -r docker stop 2>/dev/null || true - docker ps -aq --filter ancestor=*dcgm-exporter* | xargs -r docker rm 2>/dev/null || true + docker ps --format "{{.ID}} {{.Image}} {{.Names}}" | grep -i dcgm-exporter | awk "{print \$1}" | xargs -r docker stop 2>/dev/null || true + docker ps -a --format "{{.ID}} {{.Image}} {{.Names}}" | grep -i dcgm-exporter | awk "{print \$1}" | xargs -r docker rm 2>/dev/null || trueIf you prefer filters,
--filter name=dcgm-exporteris another option (matches substrings). Please verify in your environment.
374-378: Generic shutdown: add image/name-based DCGM cleanup as fallback.If the container name differs from
dcgm-exporter, the stop/rm by name won’t catch it. Mirror the vLLM approach.Apply this diff:
echo "Stopping DCGM containers..." docker stop dcgm-exporter 2>/dev/null || true docker rm dcgm-exporter 2>/dev/null || true + docker ps --format "{{.ID}} {{.Image}} {{.Names}}" | grep -i dcgm-exporter | awk "{print \$1}" | xargs -r docker stop 2>/dev/null || true + docker ps -a --format "{{.ID}} {{.Image}} {{.Names}}" | grep -i dcgm-exporter | awk "{print \$1}" | xargs -r docker rm 2>/dev/null || true
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/ci/test_docs_end_to_end/test_runner.py(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build (ubuntu-latest, 3.10)
001c364 to
894ec63
Compare
…telemetry documentation
894ec63 to
ef35e8c
Compare
@ilana-n Is this sufficient for all use cases, do we envision this to be a issue on different hardware, |
Problem
When running multiple server tests sequentially in
test_docs_end_to_end, DCGM containers on port 9401 and vLLM containers on port 8000 weren't being properly cleaned up, causing "port already allocated" errors when the next test tried to start.Root Cause
docker compose downreturns before containers fully stopSolution
Dynamo cleanup:
docker compose downto ensure full cleanup of all related services including DCGM before proceedingvLLM cleanup:
grep vllmon image to find containers (instead of going by container name, which is randomly assigned)Each test now properly cleans up after itself, ensuring the next test starts with a clean environment.
Also, the vLLM GPU Telemetry documentation now runs successfully. Dynamo GPU Telemetry is still being debugged as tracked in this Linear ticket.
Gitlab
Successful pipeline job run here.
GPU Telemetry Documentation
Added tags to the vLLM instructions to be included in docs end-to-end CI tests.
Summary by CodeRabbit
Documentation
Tests