
Conversation


@therealnb therealnb commented Oct 23, 2025

Description

Fixes #2274 and #2305 (as tested by @danbarr)

When running thv restart on an already-running workload, the command would return early with "Container is already running" but would NOT stop the old supervisor process. This caused supervisor processes to accumulate over time, leading to "zombie" processes.

Root Cause

The restartContainerWorkload and restartRemoteWorkload functions had early return logic when the workload status was already "running":

if err == nil && workload.Status == rt.WorkloadStatusRunning {
    logger.Infof("Container %s is already running", containerName)
    return nil  // ❌ Early return - supervisor process not stopped!
}

This meant:

  • The old supervisor process (thv restart <name> --foreground) was never killed
  • A new supervisor process was never started
  • Multiple restart calls would leave multiple orphaned supervisor processes

Solution

The fix ensures restart always performs a proper stop→start cycle when the workload is already running (sketched after the list):

  1. Stop old supervisor (kills the proxy process that manages HTTP/SSE translation)
  2. Stop container if it's running
  3. Clean up client configurations
  4. Start fresh supervisor + container
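
A minimal sketch of that cycle is shown below; the helper names (stopSupervisor, stopContainer, cleanupClientConfigs, startWorkload) are placeholders for illustration, not the actual API in pkg/workloads/manager.go:

```go
package workloads

import (
	"context"
	"log"
)

// Placeholder hooks standing in for the real manager operations (hypothetical;
// the actual functions and signatures in pkg/workloads/manager.go differ).
var (
	stopSupervisor       = func(ctx context.Context, name string) error { return nil }
	stopContainer        = func(ctx context.Context, name string) error { return nil }
	cleanupClientConfigs = func(name string) error { return nil }
	startWorkload        = func(ctx context.Context, name string) error { return nil }
)

// restartRunningWorkload sketches the stop->start cycle described above.
func restartRunningWorkload(ctx context.Context, name string) error {
	// 1. Stop the old supervisor (proxy) first so it cannot be orphaned.
	if err := stopSupervisor(ctx, name); err != nil {
		log.Printf("supervisor for %s already gone: %v", name, err) // tolerated
	}
	// 2. Stop the container if it is still running.
	if err := stopContainer(ctx, name); err != nil {
		log.Printf("container for %s not running: %v", name, err) // tolerated
	}
	// 3. Clean up stale client configurations.
	if err := cleanupClientConfigs(name); err != nil {
		log.Printf("client config cleanup for %s failed: %v", name, err) // tolerated
	}
	// 4. Start a fresh supervisor + container.
	return startWorkload(ctx, name)
}
```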

Key Changes

  • restartContainerWorkload now checks if workload is running and stops it first
  • restartRemoteWorkload applies the same logic for remote workloads
  • Handles edge cases: dead supervisor, dead container, or both
  • The system is resilient: stopProcess gracefully handles missing PIDs (see the sketch after this list)
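
For illustration, the "missing PID" tolerance could look roughly like the following sketch (hypothetical helper and Unix-only; the real stopProcess/GetWorkloadPID code in ToolHive may differ):

```go
package workloads

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// stopSupervisorByPIDFile is a hypothetical helper: it reads a PID file and
// signals the process, treating "no PID file" and "no such process" as
// non-fatal so that restart can continue.
func stopSupervisorByPIDFile(pidFile string) error {
	data, err := os.ReadFile(pidFile)
	if errors.Is(err, os.ErrNotExist) {
		return nil // supervisor already gone; nothing to stop
	}
	if err != nil {
		return fmt.Errorf("reading %s: %w", pidFile, err)
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return fmt.Errorf("parsing PID from %s: %w", pidFile, err)
	}
	// SIGTERM lets the supervisor shut its proxy down cleanly; ESRCH means the
	// process is already dead, which is fine for a restart.
	if err := syscall.Kill(pid, syscall.SIGTERM); err != nil && !errors.Is(err, syscall.ESRCH) {
		return fmt.Errorf("signalling PID %d: %w", pid, err)
	}
	return nil
}
```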

Testing

✅ All unit tests pass (24/24)
✅ Added new unit test: TestDefaultManager_restartLogicConsistency verifies stop logic is called
✅ Added E2E test: test/e2e/restart_zombie_test.go verifies no process accumulation
✅ Linting passes (0 issues)

Recovery Scenarios

The fix handles all failure scenarios:

| Scenario | Recovery |
| --- | --- |
| Supervisor killed, container running | ✅ Detects running container, gracefully handles missing PID, stops container, starts fresh |
| Container killed, supervisor running | ✅ Supervisor auto-exits when container dies, restart starts fresh |
| Both killed | ✅ No cleanup needed, directly starts fresh |
| Normal restart | ✅ Our fix! Stops both cleanly, starts fresh |

Breaking Changes

None - this is a bug fix that makes restart behave as expected.

Checklist

  • Code changes
  • Unit tests added/updated
  • E2E tests added
  • Linting passes
  • All tests pass
  • Documentation (excluded from commit per instructions)

Fixes #2274

When running 'thv restart' on an already-running workload, the command
would return early with 'Container is already running' but would NOT
stop the old supervisor process. This caused supervisor processes to
accumulate over time.

Changes:
- restartContainerWorkload now always stops supervisor + container when
  workload is already running
- restartRemoteWorkload applies the same logic for remote workloads
- Handles edge cases: dead supervisor, dead container, or both

The fix ensures restart always performs a proper stop→start cycle:
1. Stops old supervisor (kills proxy process)
2. Stops container if running
3. Cleans up client configurations
4. Starts fresh supervisor + container

System is resilient - stopProcess gracefully handles missing PIDs,
and supervisor auto-exits when container dies.

Added comprehensive unit tests and E2E test to verify no process
accumulation occurs on multiple restarts.
@therealnb therealnb added the bug, proxy, and cli labels Oct 23, 2025
@therealnb therealnb requested a review from amirejaz October 23, 2025 17:44

codecov bot commented Oct 23, 2025

Codecov Report

❌ Patch coverage is 55.55556% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.11%. Comparing base (915c1e5) to head (8cec774).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/workloads/manager.go | 55.55% | 11 Missing and 9 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2306      +/-   ##
==========================================
+ Coverage   54.01%   54.11%   +0.10%     
==========================================
  Files         239      239              
  Lines       23366    23400      +34     
==========================================
+ Hits        12620    12664      +44     
+ Misses       9556     9538      -18     
- Partials     1190     1198       +8     

@therealnb therealnb mentioned this pull request Oct 23, 2025
@therealnb therealnb enabled auto-merge (squash) October 23, 2025 17:52
@therealnb therealnb requested review from JAORMX and danbarr October 23, 2025 17:55

JAORMX commented Oct 24, 2025

@claude can you review this and ensure we don't have any regressions?


claude bot commented Oct 24, 2025

Claude finished @JAORMX's task.


Code Review for PR #2306

I've completed my review of this PR that fixes zombie supervisor processes on restart. Here's my comprehensive analysis:


Overall Assessment: LGTM with minor observations

The fix correctly addresses the root cause and implements a proper stop→start cycle for the restart operation. I found no regressions and the solution is well-tested.


📋 Key Changes Analysis

1. Core Fix in pkg/workloads/manager.go

Lines 817-833 (restartRemoteWorkload):

// If workload is already running, we need to stop it first (including the supervisor process)
// This ensures restart performs a proper stop->start cycle and doesn't leave zombie supervisors
if err == nil && workload.Status == rt.WorkloadStatusRunning {
    logger.Infof("Remote workload %s is already running, stopping it before restart", name)
    
    // Set status to stopping
    if err := d.statuses.SetWorkloadStatus(ctx, name, rt.WorkloadStatusStopping, ""); err != nil {
        logger.Debugf("Failed to set workload %s status to stopping: %v", name, err)
    }
    
    // Stop the supervisor process (proxy) first
    // This is critical to avoid leaving orphaned supervisor processes
    d.stopProxyIfNeeded(ctx, name, runConfig.BaseName)
    
    // Clean up client configurations
    if err := removeClientConfigurations(name, false); err != nil {
        logger.Warnf("Warning: Failed to remove client configurations: %v", err)
    }
}

Excellent: The fix correctly stops the supervisor process first before restarting, preventing zombie accumulation.

Lines 883-924 (restartContainerWorkload):
Similar logic applied to container workloads with additional handling for inconsistent states:

// Check if we need to stop the workload before restarting
shouldStop := false
if err == nil && workload.Status == rt.WorkloadStatusRunning {
    // Workload status shows running
    shouldStop = true
} else if container.IsRunning() {
    // Container is running but status is not running (inconsistent state)
    shouldStop = true
}

Good: Handles edge case where container is running but status doesn't reflect it.


🔍 Regression Analysis

I checked for potential regressions in the following areas:

1. Restart Behavior Changes

  • Before: Early return when workload already running → zombie processes accumulate
  • After: Always performs stop→start cycle → proper cleanup
  • Risk: None. This is the intended behavior fix.

2. Error Handling

The code properly handles all failure scenarios:

  • Missing PID files: stopProcess handles them gracefully via GetWorkloadPID error checking (lines 617-620)
  • Container not running: warnings are logged, the operation doesn't fail (lines 261-263, 306-307)
  • Status update failures: Logged as warnings, operation continues

3. State Consistency

The fix maintains proper state transitions:

  1. Running → Stopping (status update)
  2. Stop supervisor process
  3. Stop container
  4. Clean up client configs
  5. Starting → Running (on restart)

4. Backward Compatibility

  • No changes to public API
  • No changes to command-line interface
  • Existing workloads can be restarted normally

🧪 Test Coverage Analysis

Unit Tests (pkg/workloads/manager_test.go)

New test added: TestDefaultManager_restartLogicConsistency (lines 726-806)

  • Verifies stop logic is called when workload is running
  • Tests both remote and container workloads
  • Uses mock expectations to verify correct method calls

Existing tests updated:

  • Lines 580-586: Updated to expect stop logic calls
  • Lines 664-669: Updated for container restart expectations
  • Proper use of .AnyTimes() for flexible mock expectations

E2E Test (test/e2e/restart_zombie_test.go)

Excellent E2E coverage:

  • Tests actual process counting using ps aux (see the sketch after this list)
  • Verifies exactly 1 supervisor process before restart
  • Verifies exactly 1 supervisor process after first restart
  • Verifies exactly 1 supervisor process after second restart
  • Validates server remains functional
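
A process-counting helper along these lines would be enough for such a check; this is a sketch assuming the supervisor's command line matches the thv restart <name> --foreground pattern mentioned in the PR description, and the actual helper in test/e2e/restart_zombie_test.go may differ:

```go
package e2e

import (
	"os/exec"
	"strings"
)

// countSupervisorProcesses runs `ps aux` and counts lines that look like a
// supervisor for the given server. Hypothetical sketch; the real test helper
// may parse the process table differently.
func countSupervisorProcesses(serverName string) int {
	out, err := exec.Command("ps", "aux").Output()
	if err != nil {
		return -1 // could not list processes
	}
	count := 0
	for _, line := range strings.Split(string(out), "\n") {
		if strings.Contains(line, "thv restart "+serverName) &&
			strings.Contains(line, "--foreground") {
			count++
		}
	}
	return count
}
```

The Eventually snippet suggested later in this review could then poll this helper instead of relying on fixed sleeps.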

Observation: The E2E test uses a 2-second sleep (lines 63, 79) to allow processes to stabilize. This is reasonable but could potentially be flaky in slow environments.


📊 Code Quality

Strengths:

  1. ✅ Clear inline comments explaining the fix (lines 815-816, 825-826, 894-895, 903-904)
  2. ✅ Proper error handling with graceful degradation
  3. ✅ Consistent logging for debugging
  4. ✅ Well-structured tests with clear expectations

Minor Observations:

1. Cyclomatic Complexity (line 855: //nolint:gocyclo)
The restartContainerWorkload function has high cyclomatic complexity. While justified by the comment, consider future refactoring to extract:

  • Status checking logic
  • Stop decision logic
  • Container vs workload name resolution

2. Code Duplication
Lines 817-833 (restartRemoteWorkload) and lines 895-924 (restartContainerWorkload) have similar stop logic. This is acceptable since they handle different workload types, but watch for future opportunities to extract common patterns.

3. Test Isolation
The E2E test at lines 63 and 79 uses time.Sleep(2 * time.Second). Consider whether polling with a timeout would be more robust:

Eventually(func() int {
    return countSupervisorProcesses(serverName)
}).WithTimeout(5*time.Second).WithPolling(500*time.Millisecond).Should(Equal(1))

🎯 Specific Code Checks

Checked for common issues:

Race conditions: None found. Operations are sequential within each restart call.

Resource leaks: Properly cleaned up:

  • Supervisor processes stopped
  • Client configs removed
  • Status updated

Error propagation: Correct. Failures in stop logic are logged but don't prevent restart attempt.

PID file handling: Safe. The code handles missing PIDs gracefully (lines 617-620).


📝 Coverage Report Note

Codecov shows 41.66% patch coverage with 14 lines missing. Looking at the changed lines, the uncovered lines are likely:

  • Error handling paths (logger.Warnf, logger.Debugf)
  • Edge case conditions (inconsistent state checks)

This is acceptable for this type of change. The critical path (stop→start cycle) is tested by the E2E test.


🚀 Recommendations

For this PR:

  1. Approve and merge - The fix is correct and well-tested
  2. 💡 Optional: Consider adding Eventually polling in the E2E test instead of a fixed sleep

For future:

  1. Consider extracting common stop logic to reduce duplication between remote and container restart
  2. Monitor for any edge cases in production
  3. Consider refactoring restartContainerWorkload to reduce complexity when touching this code again

Final Verdict

No regressions found. The PR correctly fixes the zombie supervisor process issue by ensuring restart always performs a proper stop→start cycle. The implementation is defensive, handles edge cases well, and includes comprehensive testing.

Recommendation: Approve and merge




@amirejaz amirejaz left a comment


My understanding is that if the workload is already running, whether local or remote, we should not let it restart. For restart to work properly, the workload needs to be stopped first via the UI or thv stop.

@JAORMX please correct me if I'm wrong.

@therealnb
Contributor Author

@amirejaz, you are correct, and we can probably fix some of the errors without restarting each time.
@danbarr had said that the new behaviour was preferred (there were issues previously reported).

We need to decide how we want it to work.

Previously, stop and restart were broken; only rm worked.


danbarr commented Oct 24, 2025

This is an "IMHO" thing; I know it was discussed quite some time ago, but to me "restart" should mean...restart. Not "repair", or "restart but only if you think something's wrong"?

I feel like this aligns better with user expectations of what a "restart" means, and it also lines up with the behavior of the same command in the docker/podman CLIs (which the thv CLI is largely modeled after).

But also understand that it was decided back then to keep it the way it was, and there might be knock-on effects to changing it.

This change modifies the restart behavior to be smarter about when it
actually restarts a workload:

1. If workload is running AND supervisor is healthy (PID exists):
   - Return early without restarting (preserves old behavior)
   - This is a no-op, preventing unnecessary restarts

2. If workload is running BUT supervisor is dead (no PID):
   - Clean up and restart to fix the damaged state
   - This fixes the zombie supervisor process issue

Key changes:
- Added isSupervisorProcessAlive() to check if supervisor PID exists
- Updated restartRemoteWorkload() to check supervisor health first
- Updated restartContainerWorkload() to check supervisor health first
- Updated unit tests to cover both healthy and damaged scenarios
- Renamed test to reflect new behavior (health check vs always restart)

This approach:
✓ Preserves old behavior (no restart when healthy)
✓ Fixes zombie process issue (cleanup when damaged)
✓ Makes stop and restart commands work in all scenarios
✓ All 24 unit tests passing

Addresses feedback in #2306 to preserve old behavior while fixing
the damaged proxy issue.
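
A minimal sketch of what such a supervisor health check might look like (illustrative and Unix-only; the actual isSupervisorProcessAlive() added in this PR may obtain and check the PID differently):

```go
package workloads

import (
	"os"
	"syscall"
)

// isSupervisorAlive is a hypothetical sketch: given a supervisor PID, check
// whether the process still exists by sending it signal 0.
func isSupervisorAlive(pid int) bool {
	if pid <= 0 {
		return false
	}
	proc, err := os.FindProcess(pid) // on Unix this always succeeds
	if err != nil {
		return false
	}
	// Signal 0 checks existence/permissions without delivering a signal.
	return proc.Signal(syscall.Signal(0)) == nil
}
```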

JAORMX commented Oct 24, 2025

@danbarr yeah, we should fix restart to actually restart. But that involves reworking the codebase with that assumption and making sure we don't break existing functionality. Which is a lot more involved.

- Remove trailing whitespace in manager.go
- Fix test case indentation in manager_test.go

These formatting issues were automatically detected and fixed by
golangci-lint during the pre-push validation.
@therealnb
Contributor Author

@amirejaz the code I have checked in is much more like the original behaviour, except with the bugs fixed. I have tested it locally.

Behavior Comparison (decision logic sketched after the table):

| State | Old (Broken) | PR Original | New (Fixed) |
| --- | --- | --- | --- |
| Healthy + restart | No-op ✅ | Always restart ❌ | No-op ✅ |
| Damaged + restart | No-op (zombie!) ❌ | Always restart ✅ | Cleanup + restart ✅ |
| Stopped + restart | Start ✅ | Start ✅ | Start ✅ |
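
For clarity, the decision logic in the last column boils down to something like this sketch (hypothetical function and parameter names, not the PR's actual code):

```go
package workloads

// restartDecision sketches the fixed behaviour: no-op when healthy, clean up
// and restart when damaged, plain start when stopped.
func restartDecision(isRunning, supervisorAlive bool, cleanup, start func() error) error {
	if isRunning && supervisorAlive {
		// Healthy: preserve the old no-op behaviour.
		return nil
	}
	if isRunning && !supervisorAlive {
		// Damaged: clean up the stale supervisor/container state first.
		if err := cleanup(); err != nil {
			return err
		}
	}
	// Stopped, or just cleaned up: start a fresh supervisor + container.
	return start()
}
```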

Thanks for being diligent!

@therealnb therealnb requested a review from amirejaz October 24, 2025 16:05
Addresses @amirejaz's feedback to explicitly set workload status to
'stopped' after cleanup completes but before restarting.

This provides:
- Better state machine semantics (running → stopping → stopped → starting)
- Consistency with stopRemoteWorkload/stopContainerWorkload behavior
- Improved observability during restart operations
- Clearer indication that cleanup completed successfully

Changes:
- restartRemoteWorkload: Add SetWorkloadStatus(stopped) after cleanup
- restartContainerWorkload: Add SetWorkloadStatus(stopped) after cleanup

All tests pass - AnyTimes() expectations handle the new status call.
@therealnb therealnb merged commit 3fffdcc into main Oct 24, 2025
24 checks passed
@therealnb therealnb deleted the fix/zombie-supervisor-processes branch October 24, 2025 16:49