fix: prevent zombie supervisor processes on restart #2306
Conversation
Fixes #2274

When running `thv restart` on an already-running workload, the command would return early with "Container is already running" but would NOT stop the old supervisor process. This caused supervisor processes to accumulate over time.

Changes:
- restartContainerWorkload now always stops supervisor + container when the workload is already running
- restartRemoteWorkload applies the same logic for remote workloads
- Handles edge cases: dead supervisor, dead container, or both

The fix ensures restart always performs a proper stop→start cycle:
1. Stops old supervisor (kills proxy process)
2. Stops container if running
3. Cleans up client configurations
4. Starts fresh supervisor + container

The system is resilient: stopProcess gracefully handles missing PIDs, and the supervisor auto-exits when the container dies.

Added comprehensive unit tests and an E2E test to verify no process accumulation occurs on multiple restarts.
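The four-step stop→start cycle above can be sketched as ordered operations. This is a toy model only: the `workload` type and its methods are invented for illustration and are not taken from the toolhive codebase.

```go
package main

import "fmt"

// Toy model of the restart flow described above. The workload type and its
// methods are invented for illustration; toolhive's real manager differs.
type workload struct {
	steps []string
}

func (w *workload) stopSupervisor() { w.steps = append(w.steps, "stop-supervisor") } // kills proxy process
func (w *workload) stopContainer()  { w.steps = append(w.steps, "stop-container") }
func (w *workload) cleanupClients() { w.steps = append(w.steps, "cleanup-clients") }
func (w *workload) startFresh()     { w.steps = append(w.steps, "start-supervisor+container") }

// restart always runs the full cycle, even when the workload is already
// running, so no old supervisor process is left behind.
func (w *workload) restart() {
	w.stopSupervisor()
	w.stopContainer()
	w.cleanupClients()
	w.startFresh()
}

func main() {
	w := &workload{}
	w.restart()
	fmt.Println(w.steps)
}
```

The point of the ordering is that the supervisor dies before its container does, so it cannot observe the container's death and race the cleanup.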
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #2306      +/-   ##
==========================================
+ Coverage   54.01%   54.11%   +0.10%
==========================================
  Files         239      239
  Lines       23366    23400      +34
==========================================
+ Hits        12620    12664      +44
+ Misses       9556     9538      -18
- Partials     1190     1198       +8
@claude can you review this and ensure we don't have any regressions?
Claude finished @JAORMX's task - View job

Code Review for PR #2306

I've completed my review of this PR that fixes zombie supervisor processes on restart. Here's my comprehensive analysis:

✅ Overall Assessment: LGTM with minor observations

The fix correctly addresses the root cause and implements a proper stop→start cycle for the restart operation. I found no regressions and the solution is well-tested.

📋 Key Changes Analysis

1. Core Fix in
My understanding is that if the workload is already running, whether it is local or remote, we should not let it restart. For the restart to work properly, the workload needs to be stopped first using the UI or `thv stop`.
@JAORMX please correct me if I'm wrong.
This is an "IMHO" thing; I know it was discussed quite some time ago, but to me "restart" should mean... restart. Not "repair", or "restart but only if you think something's wrong". I feel this aligns better with user expectations of what a "restart" means, and it also lines up with the behavior of the same command in the docker/podman CLIs (which the thv CLI is largely modeled after). But I also understand that it was decided back then to keep it the way it was, and there might be knock-on effects to changing it.
This change modifies the restart behavior to be smarter about when it actually restarts a workload:

1. If the workload is running AND the supervisor is healthy (PID exists):
   - Return early without restarting (preserves old behavior)
   - This is a no-op, preventing unnecessary restarts
2. If the workload is running BUT the supervisor is dead (no PID):
   - Clean up and restart to fix the damaged state
   - This fixes the zombie supervisor process issue

Key changes:
- Added isSupervisorProcessAlive() to check if the supervisor PID exists
- Updated restartRemoteWorkload() to check supervisor health first
- Updated restartContainerWorkload() to check supervisor health first
- Updated unit tests to cover both healthy and damaged scenarios
- Renamed test to reflect new behavior (health check vs always restart)

This approach:
✓ Preserves old behavior (no restart when healthy)
✓ Fixes zombie process issue (cleanup when damaged)
✓ Makes stop and restart commands work in all scenarios
✓ All 24 unit tests passing

Addresses feedback in #2306 to preserve the old behavior while fixing the damaged proxy issue.
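The decision table described above (healthy → no-op, damaged → cleanup + restart) can be sketched as a small pure function. The `restartAction` helper and its action names are illustrative assumptions, not toolhive's actual API:

```go
package main

import "fmt"

// Sketch of the health-check decision described in the commit message above.
// restartAction and the action constants are illustrative, not toolhive API.
type action string

const (
	actionNoop           action = "no-op"           // running + healthy supervisor
	actionCleanupRestart action = "cleanup+restart" // running + dead supervisor (zombie state)
	actionStart          action = "start"           // not running at all
)

// restartAction decides what `thv restart` should do for a workload, given
// whether the container is running and whether the supervisor PID is alive.
func restartAction(containerRunning, supervisorAlive bool) action {
	if containerRunning && supervisorAlive {
		return actionNoop // preserve old behavior: nothing to fix
	}
	if containerRunning {
		return actionCleanupRestart // damaged state: fix the zombie supervisor
	}
	return actionStart
}

func main() {
	fmt.Println(restartAction(true, true))  // no-op
	fmt.Println(restartAction(true, false)) // cleanup+restart
	fmt.Println(restartAction(false, false))
}
```

Keeping the decision in one pure function like this makes the healthy/damaged split directly unit-testable without any process or container machinery.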
@danbarr yeah, we should fix restart to actually restart. But that involves reworking the codebase with that assumption and making sure we don't break existing functionality, which is a lot more involved.
- Remove trailing whitespace in manager.go
- Fix test case indentation in manager_test.go

These formatting issues were automatically detected and fixed by golangci-lint during the pre-push validation.
@amirejaz the code I have checked in is much more like the original behaviour, except with the bugs fixed. I have tested locally.

Behavior Comparison:

Thanks for being diligent!
Addresses @amirejaz's feedback to explicitly set the workload status to 'stopped' after cleanup completes but before restarting. This provides:

- Better state machine semantics (running → stopping → stopped → starting)
- Consistency with stopRemoteWorkload/stopContainerWorkload behavior
- Improved observability during restart operations
- Clearer indication that cleanup completed successfully

Changes:
- restartRemoteWorkload: add SetWorkloadStatus(stopped) after cleanup
- restartContainerWorkload: add SetWorkloadStatus(stopped) after cleanup

All tests pass; AnyTimes() expectations handle the new status call.
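The state machine semantics above can be sketched as a status log. The `status` constants and `setStatus` helper here are invented for illustration; toolhive's real status API (SetWorkloadStatus) lives in its manager code:

```go
package main

import "fmt"

// Sketch of the status transitions this change makes explicit. The status
// constants and setStatus helper are illustrative, not toolhive's API.
type status string

const (
	statusStopping status = "stopping"
	statusStopped  status = "stopped" // now set explicitly once cleanup completes
	statusStarting status = "starting"
	statusRunning  status = "running"
)

type workload struct {
	history []status
}

func (w *workload) setStatus(s status) { w.history = append(w.history, s) }

// restart walks the full state machine; the explicit "stopped" between
// cleanup and the fresh start is what this change adds, so observers can
// tell that cleanup finished before the new supervisor came up.
func (w *workload) restart() {
	w.setStatus(statusStopping)
	// ... stop supervisor, stop container, clean up client configs ...
	w.setStatus(statusStopped)
	w.setStatus(statusStarting)
	// ... start fresh supervisor + container ...
	w.setStatus(statusRunning)
}

func main() {
	w := &workload{}
	w.restart()
	fmt.Println(w.history)
}
```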
Description

Fixes #2274 and #2305 (as tested by @danbarr)

When running `thv restart` on an already-running workload, the command would return early with "Container is already running" but would NOT stop the old supervisor process. This caused supervisor processes to accumulate over time, leading to "zombie" processes.

Root Cause

The `restartContainerWorkload` and `restartRemoteWorkload` functions had early-return logic when the workload status was already "running". This meant:

- The old supervisor process (`thv restart <name> --foreground`) was never killed

Solution
The fix ensures `restart` always performs a proper stop→start cycle when the workload is already running.

Key Changes

- `restartContainerWorkload` now checks if the workload is running and stops it first
- `restartRemoteWorkload` applies the same logic for remote workloads
- `stopProcess` gracefully handles missing PIDs
Testing

✅ All unit tests pass (24/24)
✅ Added new unit test: `TestDefaultManager_restartLogicConsistency` verifies stop logic is called
✅ Added E2E test: `test/e2e/restart_zombie_test.go` verifies no process accumulation
✅ Linting passes (0 issues)
Recovery Scenarios
The fix handles all failure scenarios: a dead supervisor, a dead container, or both.
Breaking Changes

None - this is a bug fix that makes `restart` behave as expected.

Checklist