fix: prevent zombie supervisor processes on restart #2306
Conversation
Fixes #2274

When running `thv restart` on an already-running workload, the command would return early with "Container is already running" but would NOT stop the old supervisor process. This caused supervisor processes to accumulate over time.

Changes:
- restartContainerWorkload now always stops supervisor + container when the workload is already running
- restartRemoteWorkload applies the same logic for remote workloads
- Handles edge cases: dead supervisor, dead container, or both

The fix ensures restart always performs a proper stop→start cycle:
1. Stops old supervisor (kills proxy process)
2. Stops container if running
3. Cleans up client configurations
4. Starts fresh supervisor + container

The system is resilient: stopProcess gracefully handles missing PIDs, and the supervisor auto-exits when the container dies.

Added comprehensive unit tests and an E2E test to verify no process accumulation occurs on multiple restarts.
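The four-step stop→start cycle above can be sketched as ordered operations. This is a toy model only: the `workload` type and its methods are invented for illustration and are not taken from the toolhive codebase.

```go
package main

import "fmt"

// Toy model of the restart flow described above. The workload type and its
// methods are invented for illustration; toolhive's real manager differs.
type workload struct {
	steps []string
}

func (w *workload) stopSupervisor() { w.steps = append(w.steps, "stop-supervisor") } // kills proxy process
func (w *workload) stopContainer()  { w.steps = append(w.steps, "stop-container") }
func (w *workload) cleanupClients() { w.steps = append(w.steps, "cleanup-clients") }
func (w *workload) startFresh()     { w.steps = append(w.steps, "start-supervisor+container") }

// restart always runs the full cycle, even when the workload is already
// running, so no old supervisor process is left behind.
func (w *workload) restart() {
	w.stopSupervisor()
	w.stopContainer()
	w.cleanupClients()
	w.startFresh()
}

func main() {
	w := &workload{}
	w.restart()
	fmt.Println(w.steps)
}
```

The point of the ordering is that the supervisor dies before its container does, so it cannot observe the container's death and race the cleanup.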
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #2306      +/-   ##
==========================================
+ Coverage   54.01%   54.11%   +0.10%
==========================================
  Files         239      239
  Lines       23366    23400      +34
==========================================
+ Hits        12620    12664      +44
+ Misses       9556     9538      -18
- Partials     1190     1198       +8
@claude can you review this and ensure we don't have any regressions?
Claude finished @JAORMX's task - View job

Code Review for PR #2306

I've completed my review of this PR that fixes zombie supervisor processes on restart. Here's my comprehensive analysis:

✅ Overall Assessment: LGTM with minor observations

The fix correctly addresses the root cause and implements a proper stop→start cycle for the restart operation. I found no regressions and the solution is well-tested.

📋 Key Changes Analysis

1. Core Fix in
My understanding is that if the workload is already running, whether it is local or remote, we should not let it restart. For the restart to work properly, the workload needs to be stopped first using the UI or `thv stop`.
@JAORMX please correct me if I'm wrong.
This is an "IMHO" thing; I know it was discussed quite some time ago, but to me "restart" should mean... restart. Not "repair", or "restart but only if you think something's wrong". I feel this aligns better with user expectations of what a "restart" means, and it also lines up with the behavior of the same command in the docker/podman CLIs (which the thv CLI is largely modeled after). But I also understand that it was decided back then to keep it the way it was, and there might be knock-on effects to changing it.
This change modifies the restart behavior to be smarter about when it actually restarts a workload:

1. If the workload is running AND the supervisor is healthy (PID exists):
   - Return early without restarting (preserves old behavior)
   - This is a no-op, preventing unnecessary restarts
2. If the workload is running BUT the supervisor is dead (no PID):
   - Clean up and restart to fix the damaged state
   - This fixes the zombie supervisor process issue

Key changes:
- Added isSupervisorProcessAlive() to check if the supervisor PID exists
- Updated restartRemoteWorkload() to check supervisor health first
- Updated restartContainerWorkload() to check supervisor health first
- Updated unit tests to cover both healthy and damaged scenarios
- Renamed test to reflect new behavior (health check vs always restart)

This approach:
✓ Preserves old behavior (no restart when healthy)
✓ Fixes zombie process issue (cleanup when damaged)
✓ Makes stop and restart commands work in all scenarios
✓ All 24 unit tests passing

Addresses feedback in #2306 to preserve the old behavior while fixing the damaged proxy issue.
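The decision table described above (healthy → no-op, damaged → cleanup + restart) can be sketched as a small pure function. The `restartAction` helper and its action names are illustrative assumptions, not toolhive's actual API:

```go
package main

import "fmt"

// Sketch of the health-check decision described in the commit message above.
// restartAction and the action constants are illustrative, not toolhive API.
type action string

const (
	actionNoop           action = "no-op"           // running + healthy supervisor
	actionCleanupRestart action = "cleanup+restart" // running + dead supervisor (zombie state)
	actionStart          action = "start"           // not running at all
)

// restartAction decides what `thv restart` should do for a workload, given
// whether the container is running and whether the supervisor PID is alive.
func restartAction(containerRunning, supervisorAlive bool) action {
	if containerRunning && supervisorAlive {
		return actionNoop // preserve old behavior: nothing to fix
	}
	if containerRunning {
		return actionCleanupRestart // damaged state: fix the zombie supervisor
	}
	return actionStart
}

func main() {
	fmt.Println(restartAction(true, true))  // no-op
	fmt.Println(restartAction(true, false)) // cleanup+restart
	fmt.Println(restartAction(false, false))
}
```

Keeping the decision in one pure function like this makes the healthy/damaged split directly unit-testable without any process or container machinery.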
@danbarr yeah, we should fix restart to actually restart. But that involves reworking the codebase with that assumption and making sure we don't break existing functionality, which is a lot more involved.
- Remove trailing whitespace in manager.go
- Fix test case indentation in manager_test.go

These formatting issues were automatically detected and fixed by golangci-lint during the pre-push validation.
@amirejaz the code I have checked in is much more like the original behaviour, except with the bugs fixed. I have tested locally.

Behavior Comparison:

Thanks for being diligent!
Addresses @amirejaz's feedback to explicitly set the workload status to 'stopped' after cleanup completes but before restarting. This provides:

- Better state machine semantics (running → stopping → stopped → starting)
- Consistency with stopRemoteWorkload/stopContainerWorkload behavior
- Improved observability during restart operations
- Clearer indication that cleanup completed successfully

Changes:
- restartRemoteWorkload: add SetWorkloadStatus(stopped) after cleanup
- restartContainerWorkload: add SetWorkloadStatus(stopped) after cleanup

All tests pass; AnyTimes() expectations handle the new status call.
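The state machine semantics above can be sketched as a status log. The `status` constants and `setStatus` helper here are invented for illustration; toolhive's real status API (SetWorkloadStatus) lives in its manager code:

```go
package main

import "fmt"

// Sketch of the status transitions this change makes explicit. The status
// constants and setStatus helper are illustrative, not toolhive's API.
type status string

const (
	statusStopping status = "stopping"
	statusStopped  status = "stopped" // now set explicitly once cleanup completes
	statusStarting status = "starting"
	statusRunning  status = "running"
)

type workload struct {
	history []status
}

func (w *workload) setStatus(s status) { w.history = append(w.history, s) }

// restart walks the full state machine; the explicit "stopped" between
// cleanup and the fresh start is what this change adds, so observers can
// tell that cleanup finished before the new supervisor came up.
func (w *workload) restart() {
	w.setStatus(statusStopping)
	// ... stop supervisor, stop container, clean up client configs ...
	w.setStatus(statusStopped)
	w.setStatus(statusStarting)
	// ... start fresh supervisor + container ...
	w.setStatus(statusRunning)
}

func main() {
	w := &workload{}
	w.restart()
	fmt.Println(w.history)
}
```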
Description

Fixes #2274 and #2305 (as tested by @danbarr)

When running `thv restart` on an already-running workload, the command would return early with "Container is already running" but would NOT stop the old supervisor process. This caused supervisor processes to accumulate over time, leading to "zombie" processes.

Root Cause

The `restartContainerWorkload` and `restartRemoteWorkload` functions had early-return logic when the workload status was already "running". This meant:

- The old supervisor process (`thv restart <name> --foreground`) was never killed

Solution
The fix ensures `restart` always performs a proper stop→start cycle when the workload is already running.

Key Changes

- `restartContainerWorkload` now checks if the workload is running and stops it first
- `restartRemoteWorkload` applies the same logic for remote workloads
- `stopProcess` gracefully handles missing PIDs
Testing

✅ All unit tests pass (24/24)
✅ Added new unit test: `TestDefaultManager_restartLogicConsistency` verifies stop logic is called
✅ Added E2E test: `test/e2e/restart_zombie_test.go` verifies no process accumulation
✅ Linting passes (0 issues)
Recovery Scenarios
The fix handles all failure scenarios: a dead supervisor, a dead container, or both.
Breaking Changes

None - this is a bug fix that makes `restart` behave as expected.

Checklist