
refactor(worker): land Pending start semantics and state machine #1102

Merged
slin1237 merged 8 commits into main from refactor/worker-pending-state-machine
Apr 12, 2026

Conversation


@slin1237 slin1237 commented Apr 12, 2026

Description

Problem

Workers start Ready (routable) immediately at registration, before any local health check has verified them. This causes the mesh-sync 500/503 bug where unverified workers receive traffic and fail. The WorkerStatus enum from #1093 defined four states but only two were actually used.

Solution

Workers with health checks enabled now start Pending (not routable). The health checker promotes them to Ready after success_threshold consecutive successful probes. The full set of state-machine transitions is implemented in check_health_async().

State Machine

             register()
                 │
        disable_health_check?
           /          \
        yes            no
         │              │
       Ready         Pending ──────┐
                        │          │
                        │          │ max_pending_probes (10× failure_threshold)
                        │          │ total attempts exceeded
               success_threshold   ▼
               consecutive       Failed
                    │
                    ▼
            ┌────Ready◄────────┐
            │      │           │
failure_threshold  │     success_threshold
  consecutive      │       consecutive
            │      ▼           │
            └─►NotReady────────┘
                    │
                    │ liveness_failure_threshold
                    │ (3× failure_threshold) consecutive
                    ▼
                 Failed
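The transitions in the diagram can be sketched as a plain synchronous function. This is an illustrative sketch only: HealthState, Thresholds, and on_probe are hypothetical names, the real check_health_async() is async and tracks counters with atomics, and counter-reset details may differ. Only WorkerStatus, the thresholds, and the transitions come from the diagram above.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum WorkerStatus {
    Pending,
    Ready,
    NotReady,
    Failed,
}

struct Thresholds {
    success_threshold: u32,
    failure_threshold: u32,
}

impl Thresholds {
    // Guard against misconfigured URLs lingering in Pending forever.
    fn max_pending_probes(&self) -> u32 {
        10 * self.failure_threshold
    }
    // Liveness is deliberately laxer than readiness (cf. K8s probes).
    fn liveness_failure_threshold(&self) -> u32 {
        3 * self.failure_threshold
    }
}

struct HealthState {
    status: WorkerStatus,
    consecutive_successes: u32,
    consecutive_failures: u32,
    total_pending_probes: u32,
}

fn on_probe(s: &mut HealthState, t: &Thresholds, healthy: bool) {
    if healthy {
        s.consecutive_successes += 1;
        s.consecutive_failures = 0;
    } else {
        s.consecutive_failures += 1;
        s.consecutive_successes = 0;
    }
    match s.status {
        WorkerStatus::Pending => {
            s.total_pending_probes += 1;
            if s.consecutive_successes >= t.success_threshold {
                s.status = WorkerStatus::Ready; // Pending -> Ready
            } else if s.total_pending_probes >= t.max_pending_probes() {
                s.status = WorkerStatus::Failed; // Pending -> Failed
                s.consecutive_failures = 0;
            }
        }
        WorkerStatus::Ready => {
            if s.consecutive_failures >= t.failure_threshold {
                s.status = WorkerStatus::NotReady; // Ready -> NotReady
            }
        }
        WorkerStatus::NotReady => {
            if s.consecutive_successes >= t.success_threshold {
                s.status = WorkerStatus::Ready; // NotReady -> Ready
            } else if s.consecutive_failures >= t.liveness_failure_threshold() {
                s.status = WorkerStatus::Failed; // NotReady -> Failed
            }
        }
        // Failed is terminal: the worker stays Failed until replaced.
        WorkerStatus::Failed => {}
    }
}

fn main() {
    let t = Thresholds { success_threshold: 2, failure_threshold: 3 };
    let mut s = HealthState {
        status: WorkerStatus::Pending,
        consecutive_successes: 0,
        consecutive_failures: 0,
        total_pending_probes: 0,
    };
    // Two consecutive successful probes promote Pending -> Ready.
    on_probe(&mut s, &t, true);
    on_probe(&mut s, &t, true);
    assert_eq!(s.status, WorkerStatus::Ready);
    // Three consecutive failures demote Ready -> NotReady.
    for _ in 0..3 {
        on_probe(&mut s, &t, false);
    }
    assert_eq!(s.status, WorkerStatus::NotReady);
    println!("transitions ok: {:?}", s.status);
}
```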

Changes

Production code:

  • worker.rs: added total_pending_probes counter on BasicWorker
  • builder.rs: initial status is Pending if health checks enabled, Ready otherwise
  • worker.rs: check_health_async() implements the full state machine (Pending→Ready, Ready→NotReady, NotReady→Failed, NotReady→Ready, Pending→Failed)
  • registry.rs: --remove-unhealthy-workers only removes Failed (not Pending or NotReady)
  • workflow/steps/shared/activate.rs: ActivateWorkersStep only activates workers with disable_health_check == true
  • workflow/steps/local/create_worker.rs & external/create_workers.rs: removed redundant set_healthy calls (builder handles initial status)

Thresholds:

  • max_pending_probes = 10 × failure_threshold — prevents misconfigured URLs from lingering in Pending forever
  • liveness_failure_threshold = 3 × failure_threshold — prevents aggressive removal of transiently unhealthy workers (analogous to K8s having separate readiness and liveness probes)

Tests:

  • Added no_health_check() helper in test modules
  • Updated ~80 test builders across policies/*, routers/http/router.rs, and integration tests to use disable_health_check: true
  • Integration test helpers (common/mod.rs, test_app.rs) default to disable_health_check: true
  • Added test_pending_worker_not_routable to verify new behavior
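The no_health_check() helper described above is tiny. A sketch, using a stand-in HealthCheckConfig struct for illustration (the real one lives in openai_protocol::worker and carries more fields):

```rust
// Stand-in HealthCheckConfig for illustration; the real struct lives in
// openai_protocol::worker and has additional fields.
#[derive(Default, Debug)]
struct HealthCheckConfig {
    disable_health_check: bool,
    check_interval_secs: u64,
    success_threshold: u32,
    failure_threshold: u32,
}

// Test helper: workers built with this config start Ready immediately,
// so routing tests don't have to wait out the Pending phase.
fn no_health_check() -> HealthCheckConfig {
    HealthCheckConfig {
        disable_health_check: true,
        ..Default::default()
    }
}

fn main() {
    let cfg = no_health_check();
    assert!(cfg.disable_health_check);
    println!("health checks disabled: {}", cfg.disable_health_check);
}
```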

What's NOT in this PR

  • StatusChanged events from the health path — deferred to PR 7 when WorkerManager owns the health loop
  • Mesh sync rolling upgrade handling — deferred to mesh v2 / WorkerSyncAdapter

Behavior changes

  • --remove-unhealthy-workers is now stricter: only Failed workers are removed. Workers in Pending or NotReady are left alone. This is a fix — the old behavior removed transiently unhealthy workers too aggressively.
  • Workers take check_interval_secs × success_threshold seconds to become routable after registration (default: ~30 seconds with 15s interval × 2 threshold). Workers with disable_health_check == true are Ready immediately.
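The routable delay in the second bullet is a simple product of interval and threshold. A quick sketch (the function name is hypothetical; exact timing also depends on when the first probe fires relative to registration):

```rust
// Approximate seconds until a freshly registered, health-checked worker
// becomes routable: it must pass success_threshold consecutive probes,
// spaced check_interval_secs apart.
fn secs_until_routable(check_interval_secs: u64, success_threshold: u64) -> u64 {
    check_interval_secs * success_threshold
}

fn main() {
    // The defaults cited in the PR: 15s interval x 2 successes = ~30s.
    assert_eq!(secs_until_routable(15, 2), 30);
    println!("~{}s until routable", secs_until_routable(15, 2));
}
```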

Test Plan

  • cargo check --package smg --package smg-golang — clean
  • cargo test --package smg --lib — 512 tests pass
  • cargo test --package smg --test api_tests — 100 tests pass
  • cargo test --package smg --test routing_tests — 92 tests pass
  • All other integration test suites pass

Refs: worker-module-deep-refactor plan (PR 6b)

Release constraint: This PR must roll forward directly into PR 7 (WorkerManager extraction). PR 6b is an implementation staging PR — `StatusChanged` events become authoritative in PR 7.

Checklist
  • Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • Chores
    • Worker startup/activation now follow health-check config: builders set initial readiness; activation only marks health-check-disabled workers ready; health-checked workers start pending.
    • Worker health lifecycle and tracking improved (clearer status transitions, pending-probe counting); replacements preserve prior runtime status and removals only occur for truly failed workers.
  • Tests
    • Many tests and helpers updated to disable health checks where needed to stabilize startup, routing, and reliability tests.

@github-actions github-actions bot added labels tests (Test changes) and model-gateway (Model gateway crate changes) on Apr 12, 2026
@coderabbitai

coderabbitai bot commented Apr 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Worker health model changed: builder now controls initial WorkerStatus (Pending/Ready) and exposes a status setter; BasicWorker tracks pending probe counts and revised health-check transitions; registry removal logic now only removes terminal Failed status; various workflow steps and many tests updated to disable health checks in test setups.

Changes

Cohort / File(s) Summary
Core worker implementation
model_gateway/src/worker/worker.rs, model_gateway/src/worker/builder.rs
Introduce WorkerStatus-aware builder (status(...)), builder-driven initial status (Pending vs Ready), add total_pending_probes, update health-check state machine and probe counting; public builder method added.
Registry & lifecycle
model_gateway/src/worker/registry.rs
Change removal criteria to only remove workers with terminal WorkerStatus::Failed; preserve explicit status on replace; add regression tests for status preservation.
Workflow & activation steps
model_gateway/src/workflow/steps/external/create_workers.rs, model_gateway/src/workflow/steps/local/create_worker.rs, model_gateway/src/workflow/steps/shared/activate.rs, model_gateway/src/workflow/steps/local/update_worker_properties.rs
Stop mutating worker health after build; delegate initial status to builder; Activate step only marks workers ready when health checks are disabled and worker not already healthy; update replace to preserve existing status.
Policy tests
model_gateway/src/policies/*/*.rs (multiple policy files: cache_aware.rs, consistent_hashing.rs, manual.rs, mod.rs, power_of_two.rs, prefix_hash.rs, random.rs, registry.rs, round_robin.rs)
Test modules updated to import HealthCheckConfig and add no_health_check() helper; all test-built workers now call .health_config(no_health_check()) to disable health checks in tests.
Router tests & HTTP router
model_gateway/src/routers/http/router.rs, model_gateway/tests/*, model_gateway/tests/common/*, model_gateway/tests/api/*, model_gateway/tests/routing/*, model_gateway/tests/reliability/*
Many integration/unit tests changed to build mutable RouterConfig and set health_check.disable_health_check = true; test helpers updated to create external/mock workers with disabled health checks.
Metrics & tests alignment
various unit tests updated across worker and builder tests
Adjust tests to new default Pending behavior and updated expectations for status and metrics initialization.

Sequence Diagram(s)

sequenceDiagram
    participant Workflow as Workflow/CreateWorker
    participant Builder as BasicWorkerBuilder
    participant Worker as BasicWorker
    participant Registry as WorkerRegistry
    participant Health as HealthChecker

    Workflow->>Builder: configure metadata (+health_config, optional status)
    Builder->>Worker: build() -> set initial WorkerStatus (Ready if disable_health_check, else Pending)
    Worker->>Registry: register(worker snapshot)
    par async health lifecycle
        Health->>Worker: check_health_async()
        alt health checks disabled
            Worker-->>Registry: remain Ready (no probes)
        else enabled
            Worker-->>Worker: increment total_pending_probes if Pending
            Worker-->>Registry: update status transitions (Pending->Ready, Ready->NotReady, NotReady->Failed)
        end
    end
    Note right of Registry: removal only when status == Failed

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • CatherineSue
  • key4ng
  • whybeyoung

Poem

"I hopped through code at break of day,
Builders set statuses my way,
Pending or Ready, I beep and cheer,
Probes counted—no more fear 🐇,
Tests now pass, batteries near!"

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The PR title accurately and concisely summarizes the primary change: implementing Pending start semantics and a state machine for worker health checking.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient (required threshold: 80.00%).


Comment @coderabbitai help to get the list of available commands and usage tips.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8aba682681

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


@claude claude bot left a comment


Clean state machine implementation. Two nits on the recovery path semantics — both are about comment accuracy and a minor counter-reset inconsistency, not correctness bugs.

Summary: 0 🔴 Important · 2 🟡 Nit · 0 🟣 Pre-existing


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request replaces the boolean health status for workers with a state machine consisting of Pending, Ready, NotReady, and Failed states. Workers with health checks enabled now initialize as Pending and transition to Ready only after meeting a success threshold. The update also introduces logic for handling failures, including a transition to Failed after exceeding failure or probe thresholds. Feedback was provided regarding the Failed state, noting that the current implementation allows recovery to Ready, which contradicts the documentation suggesting Failed should be a terminal state.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/policies/bucket.rs`:
- Around line 570-575: The test helper no_health_check() is duplicated across
test modules; extract it into a shared test utilities module (e.g., a new
tests::helpers or crate::test_utils) and update tests to import and call that
single function instead of their local copies. Move the function signature and
return value (HealthCheckConfig with disable_health_check: true and
Default::default() for the rest) into the shared helper, ensure the module is
annotated for test-only use (cfg(test) or in tests/), and update references in
policies::bucket tests to use the centralized no_health_check helper.

In `@model_gateway/src/policies/mod.rs`:
- Around line 190-195: The test helper no_health_check() that returns a
HealthCheckConfig with disable_health_check: true is duplicated across policy
test modules; extract and centralize it into a shared test utilities module
(e.g., tests::utils or a new policies::test_helpers) and update callers to
import and use the single no_health_check function instead of local copies,
ensuring the signature and returned type (HealthCheckConfig) remain unchanged so
functions referencing no_health_check() continue to compile.

In `@model_gateway/src/worker/registry.rs`:
- Around line 996-1001: The worker eviction currently checks both
worker.status() == WorkerStatus::Failed and the boolean gate *failed from
checked_workers, which prevents removing workers that transitioned to Failed
without an Err probe; update the eviction logic in registry.rs (the loop
iterating over checked_workers when job_queue.is_some()) to only rely on
worker.status() == WorkerStatus::Failed (i.e., drop the extra *failed condition)
so that any worker in the terminal Failed state is removed when
--remove-unhealthy-workers is enabled.

In `@model_gateway/src/worker/worker.rs`:
- Around line 654-661: The Pending→Failed branch doesn't reset the
consecutive_failures counter, unlike the other failure transitions; after
detecting total >= max_pending and calling self.set_status(WorkerStatus::Failed)
in the WorkerStatus::Pending branch, also reset the counter (e.g.,
self.consecutive_failures.store(0, Ordering::Relaxed)) so consecutive_failures
is cleared consistently with the Ready→NotReady and NotReady→Failed transitions;
locate this in the block that computes max_pending from
health_config.failure_threshold and reads self.total_pending_probes.

In `@model_gateway/tests/common/mod.rs`:
- Around line 206-207: The test helper is forcibly disabling health checks by
setting config.health_check.disable_health_check = true, which prevents callers
from enabling health checks; change this so only AppTestContext::new() defaults
to disabling health checks, but do not mutate the passed config in
new_with_config() or the lower-level context builders—respect the incoming
config.health_check values instead (update new_with_config(), related builders,
and any helper locations around the places called out: the blocks near lines
206, 416-419, 555-558, 693-696) so tests that pass an enabled health_check can
exercise Pending/Failed semantics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d7bd91a0-dc16-48a9-b8f1-b5843873c930

📥 Commits

Reviewing files that changed from the base of the PR and between ae14a9c and 8aba682.

📒 Files selected for processing (23)
  • model_gateway/src/policies/bucket.rs
  • model_gateway/src/policies/cache_aware.rs
  • model_gateway/src/policies/consistent_hashing.rs
  • model_gateway/src/policies/manual.rs
  • model_gateway/src/policies/mod.rs
  • model_gateway/src/policies/power_of_two.rs
  • model_gateway/src/policies/prefix_hash.rs
  • model_gateway/src/policies/random.rs
  • model_gateway/src/policies/registry.rs
  • model_gateway/src/policies/round_robin.rs
  • model_gateway/src/routers/http/router.rs
  • model_gateway/src/worker/builder.rs
  • model_gateway/src/worker/registry.rs
  • model_gateway/src/worker/worker.rs
  • model_gateway/src/workflow/steps/external/create_workers.rs
  • model_gateway/src/workflow/steps/local/create_worker.rs
  • model_gateway/src/workflow/steps/shared/activate.rs
  • model_gateway/tests/common/mod.rs
  • model_gateway/tests/common/test_app.rs
  • model_gateway/tests/load_guard_raii_test.rs
  • model_gateway/tests/otel_tracing_test.rs
  • model_gateway/tests/routing/cache_aware_backward_compat_test.rs
  • model_gateway/tests/routing/test_openai_routing.rs

What changed:

Production code:
- worker.rs: added `total_pending_probes: Arc<AtomicUsize>` to BasicWorker
  for tracking total probe attempts in Pending state
- builder.rs: initializes workers as `Pending` when health checks are
  enabled, `Ready` when `disable_health_check == true`. Previously all
  workers started Ready regardless of config.
- worker.rs: `check_health_async()` implements the full state machine:
    Pending → Ready on `success_threshold` consecutive passes
    Pending → Failed on `max_pending_probes` (10× failure_threshold) total attempts
    Ready → NotReady on `failure_threshold` consecutive failures
    NotReady → Ready on `success_threshold` consecutive passes
    NotReady → Failed on `liveness_failure_threshold` (3× failure_threshold)
- registry.rs: `start_health_checker` only removes `Failed` workers
  (not Pending or NotReady) when `--remove-unhealthy-workers` is set
- workflow/steps/shared/activate.rs: ActivateWorkersStep only activates
  workers with `disable_health_check == true`. Health-checked workers
  stay Pending — the health checker promotes them after passing probes.
- workflow/steps/local/create_worker.rs: removed redundant set_healthy
  call (builder now handles initial status)
- workflow/steps/external/create_workers.rs: same removal

Tests:
- worker.rs: added `no_health_check()` helper and `test_pending_worker_not_routable`
- builder.rs: updated assertions to match new Pending-by-default behavior
- policies/*.rs: added `.health_config(no_health_check())` to ~73 test
  builders so policy tests get routable workers (they don't test health
  lifecycle, they test routing decisions)
- routers/http/router.rs: same fix for router tests
- tests/common/mod.rs: `new_with_config` sets `disable_health_check: true`
  so all integration tests get routable workers immediately
- tests/common/test_app.rs: same for external worker helper
- tests/routing/cache_aware_backward_compat_test.rs: added no_health_check
  helper and applied to 8 builders
- tests/routing/test_openai_routing.rs: added health config
- tests/load_guard_raii_test.rs: added health config
- tests/otel_tracing_test.rs: set disable_health_check on test config

Why:
Workers that haven't been verified locally should not receive traffic.
Starting Pending instead of Ready prevents the mesh-sync 500/503 bug
where unverified workers were assumed healthy. The state machine
(Pending → Ready → NotReady → Failed) mirrors Kubernetes pod conditions
and enables proper liveness separation from readiness.

How:
- `max_pending_probes = 10 × failure_threshold` guards against misconfigured
  URLs staying in Pending forever
- `liveness_failure_threshold = 3 × failure_threshold` prevents aggressive
  removal of transiently unhealthy workers (analogous to K8s having
  separate readiness and liveness probes)
- `StatusChanged` events are NOT yet emitted from the health path — this
  is deferred to PR 7 when WorkerManager owns the health loop
- `--remove-unhealthy-workers` now only removes Failed workers, not
  NotReady (which may recover) or Pending (which hasn't proven itself).
  This is a behavior fix, not a breaking change — the old behavior was
  a bug that removed transiently unhealthy workers too aggressively.

All 512 lib tests, 100 api tests, and 200+ other integration tests pass.

Refs: worker-module-deep-refactor plan (PR 6b)
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
@slin1237 slin1237 force-pushed the refactor/worker-pending-state-machine branch from 8aba682 to 53e5233 on April 12, 2026 at 04:30

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 53e5233f00



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
model_gateway/src/worker/registry.rs (1)

1000-1000: ⚠️ Potential issue | 🟠 Major

Drop the extra *failed gate from terminal Failed eviction.

Line 1000 currently requires both WorkerStatus::Failed and *failed. A worker can already be terminal Failed while the latest probe result is not Err, which prevents eviction and leaves a non-recoverable worker registered.

🛠️ Proposed fix
-                            if worker.status() == WorkerStatus::Failed && *failed {
+                            if worker.status() == WorkerStatus::Failed {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/worker/registry.rs` at line 1000, The eviction gate
currently requires both worker.status() == WorkerStatus::Failed and *failed,
which prevents eviction when a worker is already terminal Failed but the latest
probe flag (failed) is false; remove the extra probe-based check by changing the
conditional that references worker.status() and *failed to only check
worker.status() == WorkerStatus::Failed so terminal Failed workers are evicted
regardless of the transient *failed probe flag (adjust any surrounding logic
that depended on the combined condition in the same block to keep semantics for
non-Failed states).
model_gateway/tests/common/mod.rs (1)

206-207: ⚠️ Potential issue | 🟠 Major

Don’t override caller-supplied health-check config in shared test helpers.

new_with_config() now forces config.health_check.disable_health_check = true, and the lower-level context builders hardcode the same override into every test worker. That means any test that explicitly enables health checks can no longer exercise the new Pending -> Ready/Failed startup behavior this PR is introducing. It’s fine for AppTestContext::new() to default to disabled health checks for convenience, but new_with_config() and create_test_context*() should preserve the incoming config.

Suggested direction
 pub fn new(
     worker_configs: Vec<MockWorkerConfig>,
 ) -> Pin<Box<dyn Future<Output = Self> + Send>> {
     Box::pin(async move {
-        let config = RouterConfig::builder()
+        let mut config = RouterConfig::builder()
             .regular_mode(vec![])
             .random_policy()
             .host("127.0.0.1")
             .port(3002)
             .max_payload_size(256 * 1024 * 1024)
             .request_timeout_secs(600)
             .worker_startup_timeout_secs(1)
             .worker_startup_check_interval_secs(1)
             .max_concurrent_requests(64)
             .queue_timeout_secs(60)
             .build_unchecked();
+        config.health_check.disable_health_check = true;

         Self::new_with_config(config, worker_configs).await
     })
 }

 pub fn new_with_config(
-    mut config: RouterConfig,
+    config: RouterConfig,
     worker_configs: Vec<MockWorkerConfig>,
 ) -> Pin<Box<dyn Future<Output = Self> + Send>> {
-    // Test workers don't need health checks — start Ready immediately
-    config.health_check.disable_health_check = true;
-
     Box::pin(async move {
         ...
     })
 }

And in each create_test_context*() helper, pass through the caller's setting instead of hardcoding it:

+let health_config = config.health_check.clone();
 ...
-    .health_config(openai_protocol::worker::HealthCheckConfig {
-        disable_health_check: true,
-        ..Default::default()
-    })
+    .health_config(health_config.clone())

Also applies to: 416-419, 555-558, 693-696

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/tests/common/mod.rs` around lines 206 - 207, new_with_config()
and the create_test_context*() helpers are mutating the caller-supplied config
by forcing config.health_check.disable_health_check = true, which prevents tests
from exercising the Pending->Ready/Failed startup flow; remove that hardcoded
override from new_with_config() and from the lower-level create_test_context*()
builders so they preserve the incoming config, leaving only
AppTestContext::new() to default-disable health checks for convenience if no
config is provided; ensure any place that currently sets
config.health_check.disable_health_check explicitly instead respects the
passed-in config value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/workflow/steps/shared/activate.rs`:
- Around line 28-30: The Ready transition is being re-applied for workers with
disable_health_check by calling worker.set_healthy(true) in the activation loop;
instead either remove that call or guard it so you don't mutate
already-initialized status—update the block around
worker.metadata().health_config.disable_health_check to simply increment
activated (or only call worker.set_healthy(true) when the worker's current state
is not already Ready) and avoid redundant state mutations from
set_healthy(true).

---

Duplicate comments:
In `@model_gateway/src/worker/registry.rs`:
- Line 1000: The eviction gate currently requires both worker.status() ==
WorkerStatus::Failed and *failed, which prevents eviction when a worker is
already terminal Failed but the latest probe flag (failed) is false; remove the
extra probe-based check by changing the conditional that references
worker.status() and *failed to only check worker.status() ==
WorkerStatus::Failed so terminal Failed workers are evicted regardless of the
transient *failed probe flag (adjust any surrounding logic that depended on the
combined condition in the same block to keep semantics for non-Failed states).

In `@model_gateway/tests/common/mod.rs`:
- Around line 206-207: new_with_config() and the create_test_context*() helpers
are mutating the caller-supplied config by forcing
config.health_check.disable_health_check = true, which prevents tests from
exercising the Pending->Ready/Failed startup flow; remove that hardcoded
override from new_with_config() and from the lower-level create_test_context*()
builders so they preserve the incoming config, leaving only
AppTestContext::new() to default-disable health checks for convenience if no
config is provided; ensure any place that currently sets
config.health_check.disable_health_check explicitly instead respects the
passed-in config value.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e122f3b6-20af-446b-9c25-1d39b0300802

📥 Commits

Reviewing files that changed from the base of the PR and between 8aba682 and 53e5233.

📒 Files selected for processing (23)
  • model_gateway/src/policies/bucket.rs
  • model_gateway/src/policies/cache_aware.rs
  • model_gateway/src/policies/consistent_hashing.rs
  • model_gateway/src/policies/manual.rs
  • model_gateway/src/policies/mod.rs
  • model_gateway/src/policies/power_of_two.rs
  • model_gateway/src/policies/prefix_hash.rs
  • model_gateway/src/policies/random.rs
  • model_gateway/src/policies/registry.rs
  • model_gateway/src/policies/round_robin.rs
  • model_gateway/src/routers/http/router.rs
  • model_gateway/src/worker/builder.rs
  • model_gateway/src/worker/registry.rs
  • model_gateway/src/worker/worker.rs
  • model_gateway/src/workflow/steps/external/create_workers.rs
  • model_gateway/src/workflow/steps/local/create_worker.rs
  • model_gateway/src/workflow/steps/shared/activate.rs
  • model_gateway/tests/common/mod.rs
  • model_gateway/tests/common/test_app.rs
  • model_gateway/tests/load_guard_raii_test.rs
  • model_gateway/tests/otel_tracing_test.rs
  • model_gateway/tests/routing/cache_aware_backward_compat_test.rs
  • model_gateway/tests/routing/test_openai_routing.rs

slin1237 added a commit that referenced this pull request Apr 12, 2026
Production fixes:

1. **Failed is terminal** (worker.rs): The success branch used
   `current_status != Ready` which also matched `Failed`, allowing
   silently reintroducing Failed workers via successful probes. Changed
   to `matches!(.., Pending | NotReady)` so Failed workers stay Failed
   until explicitly replaced.

2. **gRPC connect errors trigger Pending timeout** (worker.rs): probe
   results now use `unwrap_or(false)` instead of `?`, so transport-level
   errors (e.g. gRPC connection failures) increment counters and reach
   the Pending→Failed timeout path. Without this, misconfigured gRPC
   workers stayed Pending indefinitely.

3. **Drop `*failed` gate from removal** (registry.rs): Pending→Failed
   via `max_pending_probes` can happen on a probe that returned
   `Ok(false)` (not `Err`). The previous gate would skip removal in
   that case. Failed is terminal — any worker in Failed state is
   removed when `--remove-unhealthy-workers` is enabled.

4. **Preserve status on replace** (registry.rs): metadata-only updates
   (e.g. priority, labels) used to drop a healthy worker back to
   Pending because the builder always starts health-checked workers
   as Pending. This caused ~30s of avoidable 503s for single-worker
   models. `replace()` now copies the old worker's status to the new
   worker before insertion. Added regression test.

5. **`new_with_config` no longer mutates incoming config**
   (tests/common/mod.rs): Tests that need to exercise Pending/Failed
   semantics can now pass a config with health checks enabled.
   `AppTestContext::new()` and `TestRouterConfig::*` still default to
   `disable_health_check: true` for tests that don't care.

Nit fixes:

6. **Reset consecutive_failures on Pending→Failed** (worker.rs):
   consistency with other failure-path transitions.

7. **Guard set_healthy in ActivateWorkersStep** (activate.rs): only
   call set_healthy if not already Ready, to avoid extra state churn.

Test fixes:
- Updated `TestRouterConfig` to apply `disable_health_check: true`
  via a shared `apply_test_defaults()` helper across all 11 builder
  methods (eliminates duplication)
- Updated 16 inline `RouterConfig::builder()` configs across
  reliability, routing, api test files to set `disable_health_check`
- Added `test_replace_preserves_status` regression test

All 981 tests pass (513 lib + 468 integration).

Refs: PR #1102 review feedback
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
model_gateway/src/worker/worker.rs (1)

596-685: ⚠️ Potential issue | 🟠 Major

Pending/Failed still get flattened to NotReady in WorkerInfo.

This state machine now makes Pending and Failed real runtime states, but worker_to_info() still derives status from is_healthy(). Any caller of worker_to_info() will therefore never see Pending or Failed, which defeats the new lifecycle semantics outside the in-memory worker object.

♻️ Suggested fix
 pub fn worker_to_info(worker: &Arc<dyn Worker>) -> WorkerInfo {
     let metadata = worker.metadata();
     let spec = metadata.spec.clone();
     let is_healthy = worker.is_healthy();
+    let status = worker.status();

     WorkerInfo {
         id: worker.url().to_string(),
         model_id: spec.models.primary().map(|m| m.id.clone()),
         spec,
         is_healthy,
-        status: Some(if is_healthy {
-            WorkerStatus::Ready
-        } else {
-            WorkerStatus::NotReady
-        }),
+        status: Some(status),
         load: worker.load(),
         job_status: None,
     }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/worker/worker.rs` around lines 596 - 685, worker_to_info
currently derives the public status from is_healthy(), which collapses
Pending/Failed into NotReady; update worker_to_info to read the worker's actual
runtime status (use the Worker::status() method or the stored metadata.status
field) and map that WorkerStatus value directly into WorkerInfo.status so
Pending and Failed are preserved for callers; ensure any existing boolean
is_healthy logic is kept for health checks only and adjust the mapping code in
worker_to_info (and any helper that converts WorkerStatus→WorkerInfo status) to
accept/propagate the full WorkerStatus enum values instead of relying on
is_healthy().
♻️ Duplicate comments (1)
model_gateway/tests/common/mod.rs (1)

417-420: ⚠️ Potential issue | 🟠 Major

Respect caller-provided health-check config in lower-level OpenAI test context builders.

Line 417, Line 556, and Line 694 hardcode disable_health_check: true for external workers, which overrides explicit test configs and prevents OpenAI-mode tests from exercising Pending/Failed startup semantics when needed.

♻️ Proposed fix
-                        .health_config(openai_protocol::worker::HealthCheckConfig {
-                            disable_health_check: true,
-                            ..Default::default()
-                        })
+                        .health_config(config.health_check.clone())

Also applies to: 556-559, 694-697

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/tests/common/mod.rs` around lines 417 - 420, The builders are
forcing disable_health_check: true in calls to
health_config(openai_protocol::worker::HealthCheckConfig { disable_health_check:
true, ..Default::default() }) which overrides any caller-provided test config;
update those spots to respect the caller's HealthCheckConfig (or fall back to
Default::default()) instead of hardcoding disable_health_check, i.e., pass
through the provided HealthCheckConfig value into health_config or merge it with
defaults so disable_health_check reflects the caller's setting (affects the
occurrences that use openai_protocol::worker::HealthCheckConfig and the
health_config(...) calls).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1f4dfc2c-510e-4d2a-8824-b0b1e9469b22

📥 Commits

Reviewing files that changed from the base of the PR and between 53e5233 and c88c26e.

📒 Files selected for processing (15)
  • model_gateway/src/worker/registry.rs
  • model_gateway/src/worker/worker.rs
  • model_gateway/src/workflow/steps/shared/activate.rs
  • model_gateway/tests/api/api_endpoints_test.rs
  • model_gateway/tests/api/parser_endpoints_test.rs
  • model_gateway/tests/common/mod.rs
  • model_gateway/tests/common/test_config.rs
  • model_gateway/tests/reliability/circuit_breaker_test.rs
  • model_gateway/tests/reliability/rate_limiting_test.rs
  • model_gateway/tests/reliability/retries_test.rs
  • model_gateway/tests/routing/header_forwarding_test.rs
  • model_gateway/tests/routing/payload_size_test.rs
  • model_gateway/tests/routing/pd_routing_test.rs
  • model_gateway/tests/routing/power_of_two_test.rs
  • model_gateway/tests/routing/service_discovery_test.rs

@slin1237 slin1237 force-pushed the refactor/worker-pending-state-machine branch from c88c26e to e2c8b68 on April 12, 2026 12:56

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e2c8b68749

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
model_gateway/tests/common/mod.rs (1)

417-420: ⚠️ Potential issue | 🟠 Major

Respect caller health-check config in shared create_test_context* helpers.

At Line 417, Line 556, and Line 694, the helpers still hardcode disable_health_check: true for OpenAI-mode worker registration. That overrides caller intent and prevents tests using these lower-level builders from validating Pending/Failed startup semantics.

💡 Suggested fix
-                        .health_config(openai_protocol::worker::HealthCheckConfig {
-                            disable_health_check: true,
-                            ..Default::default()
-                        })
+                        .health_config(config.health_check.clone())

Apply the same replacement in all three helper functions:

  • create_test_context
  • create_test_context_with_parsers
  • create_test_context_with_mcp_config

Also applies to: 556-559, 694-697

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/tests/common/mod.rs` around lines 417 - 420, The helpers
create_test_context, create_test_context_with_parsers, and
create_test_context_with_mcp_config currently force
openai_protocol::worker::HealthCheckConfig { disable_health_check: true } when
registering the OpenAI-mode worker; update each helper to respect the
caller-provided HealthCheckConfig instead of hardcoding
disable_health_check=true — e.g., use the passed health_config (or merge with
Default::default()) when building the worker registration rather than overriding
disable_health_check, and remove the hardcoded disable_health_check assignment
so tests can control Pending/Failed startup semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/worker/registry.rs`:
- Around line 468-474: Currently the code unconditionally copies
old_worker.status() into new_worker
(new_worker.set_status(old_worker.status())), which preserves Failed or
Pending/NotReady states and breaks recovery; change this to copy the old status
only when it's Healthy (or when it's Pending/NotReady and the new worker will
still be health-checked). Concretely, replace the unconditional set_status call
with a conditional: if old_worker.status().is_healthy() then
new_worker.set_status(old_worker.status()); else if
old_worker.status().is_pending_or_not_ready() &&
!new_worker.config().disable_health_check {
new_worker.set_status(old_worker.status()); } otherwise leave the new worker at
its builder-initialized status so check_health_async() and explicit replacement
logic can recover Failed and handle disabled health checks correctly; reference
new_worker.set_status, old_worker.status(), check_health_async(), and
disable_health_check in your change.

In `@model_gateway/src/worker/worker.rs`:
- Around line 627-639: worker_to_info() currently computes WorkerInfo.status via
is_healthy(), which hides the actual WorkerStatus (Pending, NotReady, Failed,
Ready); change worker_to_info() to read the real status (the WorkerStatus stored
on the worker) and map that enum directly into the WorkerInfo.status field
instead of calling is_healthy(). Update any other serialization logic in the
same file (the block covering the later mapping around the 646-675 region) to
convert WorkerStatus -> WorkerInfo status values so Pending and Failed are
preserved when returned to API/TUI consumers.
- Around line 608-620: The Pending-probe cap is only enforced when a probe
fails, allowing a successful probe to bypass max_pending_probes; after computing
probe_result/health_result (from http_health_check()/grpc_health_check()) and
regardless of true/false, check whether current_status == WorkerStatus::Pending
and self.total_pending_probes.fetch_add(1, Ordering::Relaxed) (or the
already-incremented counter) has reached or exceeded self.max_pending_probes; if
so, transition the worker to Failed (same behavior as the failure arm) and
short-circuit further success/failure handling. Apply the same change to the
other probe blocks referenced (the regions around where health_result is
evaluated at 622-639 and 662-675) so the max_pending_probes check runs
unconditionally after each Pending probe result.
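
The flapping scenario this comment describes is easy to demonstrate in isolation: alternating fail/success probe results never accumulate `success_threshold` consecutive successes, so if the cap is checked only on the failure branch the pending counter grows without bound. A minimal, self-contained illustration (the names are placeholders, not the real counters):

```rust
// Why the max_pending_probes cap must be checked on every Pending probe,
// not just failing ones: a worker flapping F, S, F, S, ... never reaches
// SUCCESS_THRESHOLD consecutive successes and never transitions at all.
const SUCCESS_THRESHOLD: u32 = 2;

fn main() {
    let mut consecutive = 0u32;
    let mut total_pending_probes = 0u32;
    // Alternating probe results: F, S, F, S, ...
    for ok in [false, true].iter().cycle().take(100) {
        total_pending_probes += 1;
        consecutive = if *ok { consecutive + 1 } else { 0 };
        assert!(consecutive < SUCCESS_THRESHOLD); // never promoted to Ready
    }
    assert_eq!(total_pending_probes, 100); // counter grows unbounded
    println!("flapping worker stayed Pending for {total_pending_probes} probes");
}
```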


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 91ef9f22-2e0b-4d97-aea6-c8aa8c779a65

📥 Commits

Reviewing files that changed from the base of the PR and between c88c26e and e2c8b68.

📒 Files selected for processing (15)
  • model_gateway/src/worker/registry.rs
  • model_gateway/src/worker/worker.rs
  • model_gateway/src/workflow/steps/shared/activate.rs
  • model_gateway/tests/api/api_endpoints_test.rs
  • model_gateway/tests/api/parser_endpoints_test.rs
  • model_gateway/tests/common/mod.rs
  • model_gateway/tests/common/test_config.rs
  • model_gateway/tests/reliability/circuit_breaker_test.rs
  • model_gateway/tests/reliability/rate_limiting_test.rs
  • model_gateway/tests/reliability/retries_test.rs
  • model_gateway/tests/routing/header_forwarding_test.rs
  • model_gateway/tests/routing/payload_size_test.rs
  • model_gateway/tests/routing/pd_routing_test.rs
  • model_gateway/tests/routing/power_of_two_test.rs
  • model_gateway/tests/routing/service_discovery_test.rs

The previous fix put status-preservation logic inside `WorkerRegistry::replace()`,
which conflated two concerns: the registry's job is to manage the collection,
not to decide what status a new worker should have. It also meant callers
couldn't override — if a caller actually wanted the new worker to start
Pending (e.g. after a config change that should re-validate health), they
couldn't.

This commit moves the responsibility to the right layer:

1. **Builder gets a `.status()` setter** (worker/builder.rs):
   `BasicWorkerBuilder::status(WorkerStatus)` lets callers explicitly set
   the initial status. Without it, the default is Pending for health-checked
   workers and Ready for `disable_health_check == true` (unchanged).

2. **`UpdateWorkerPropertiesStep` passes the old worker's status**
   (workflow/steps/local/update_worker_properties.rs): metadata-only updates
   read `worker.status()` and pass it via `.status(...)` so the replacement
   inherits the prior state. No more 503s while a healthy worker re-proves
   itself after a label change.

3. **Removed status preservation from `WorkerRegistry::replace()`**
   (worker/registry.rs): the registry is back to being a pure collection.
   Callers control the worker's initial state via the builder.

Tests:
- Renamed `test_replace_preserves_status` → `test_builder_status_override_on_replace`
  to reflect that the test now exercises the builder's `.status()` setter.
- Added `test_builder_default_status_is_pending` to lock in the default.

Why this is better:
- Single responsibility: registry manages collection, builder constructs workers.
- Caller control: tests and call sites can opt into either Pending or any
  other status as needed.
- Standard builder pattern: defaults are sensible, overrides are explicit.

All 982 tests pass.

Refs: PR #1102 review feedback (chatgpt-codex-connector P1)
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
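
The builder-level default and override described in points 1 and 2 can be sketched as follows. `BasicWorkerBuilder` here is a stripped-down stand-in with only the fields this commit touches, not the real builder API.

```rust
// Sketch of the builder default/override: Pending for health-checked
// workers, Ready when checks are disabled, with an explicit .status()
// escape hatch for callers such as UpdateWorkerPropertiesStep.
#[derive(Clone, Copy, PartialEq, Debug)]
enum WorkerStatus { Pending, Ready }

struct BasicWorkerBuilder {
    disable_health_check: bool,
    status: Option<WorkerStatus>, // None => derive the default at build time
}

impl BasicWorkerBuilder {
    fn new(disable_health_check: bool) -> Self {
        Self { disable_health_check, status: None }
    }
    // Explicit override, used to carry the old worker's status across
    // a metadata-only replace.
    fn status(mut self, s: WorkerStatus) -> Self {
        self.status = Some(s);
        self
    }
    fn build_status(&self) -> WorkerStatus {
        self.status.unwrap_or(if self.disable_health_check {
            WorkerStatus::Ready   // never probed, so routable immediately
        } else {
            WorkerStatus::Pending // must earn Ready via health checks
        })
    }
}

fn main() {
    assert_eq!(BasicWorkerBuilder::new(false).build_status(), WorkerStatus::Pending);
    assert_eq!(BasicWorkerBuilder::new(true).build_status(), WorkerStatus::Ready);
    // Metadata-only update on a healthy worker: preserve Ready explicitly.
    let b = BasicWorkerBuilder::new(false).status(WorkerStatus::Ready);
    assert_eq!(b.build_status(), WorkerStatus::Ready);
}
```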

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/workflow/steps/local/update_worker_properties.rs`:
- Around line 90-91: Don't blindly copy worker.status() into the update builder;
instead compute a safe new status using openai_protocol::worker::WorkerStatus
and pass that to .status(...). Specifically, import WorkerStatus, read let prev
= worker.status(), then set let new_status = if prev == WorkerStatus::Failed {
WorkerStatus::NotReady } else if config.disable_health_check && (prev ==
WorkerStatus::Pending || prev == WorkerStatus::NotReady) { WorkerStatus::Ready }
else { prev }; and use .status(new_status) in place of .status(worker.status())
so replacements don't inherit terminal Failed or remain unroutable when health
checks are disabled.
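
Sketched as a pure function, the mapping this comment proposes looks like the following. This is the reviewer's suggestion, not necessarily what the PR finally shipped (a later commit simplifies the logic), and the names mirror the prompt above.

```rust
// Reviewer-suggested status mapping for metadata-only worker replaces.
#[derive(Clone, Copy, PartialEq, Debug)]
enum WorkerStatus { Pending, Ready, NotReady, Failed }

fn next_status(prev: WorkerStatus, disable_health_check: bool) -> WorkerStatus {
    if prev == WorkerStatus::Failed {
        // Don't inherit terminal Failed; give the replacement a recovery path.
        WorkerStatus::NotReady
    } else if disable_health_check
        && (prev == WorkerStatus::Pending || prev == WorkerStatus::NotReady)
    {
        // No checker will ever promote it, so it must start routable.
        WorkerStatus::Ready
    } else {
        prev
    }
}

fn main() {
    assert_eq!(next_status(WorkerStatus::Failed, false), WorkerStatus::NotReady);
    assert_eq!(next_status(WorkerStatus::Pending, true), WorkerStatus::Ready);
    assert_eq!(next_status(WorkerStatus::Ready, false), WorkerStatus::Ready);
}
```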

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8d0024a1-bbb3-4319-bd99-e9fe3da340b6

📥 Commits

Reviewing files that changed from the base of the PR and between e2c8b68 and cec40bf.

📒 Files selected for processing (3)
  • model_gateway/src/worker/builder.rs
  • model_gateway/src/worker/registry.rs
  • model_gateway/src/workflow/steps/local/update_worker_properties.rs


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cec40bf1d9


slin1237 added a commit that referenced this pull request Apr 12, 2026
The previous fix used `unwrap_or(false)` which silently discarded the
error details. Operators debugging why a worker won't come Ready had no
visibility into the underlying cause (connection refused, TLS failure,
DNS resolution failure, timeout, etc.).

Switched to `unwrap_or_else` which logs the error with the worker URL
at warn level before returning false. This preserves the existing
state-machine behavior (transport errors still count as failed probes
and feed the Pending timeout / NotReady→Failed paths) while making the
underlying cause discoverable from logs.

Refs: PR #1102 review feedback
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
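
The difference between the two combinators is small but operationally significant. A minimal sketch, with `probe_worker` standing in for the real HTTP/gRPC probe and `eprintln!` standing in for `tracing::warn!`:

```rust
// unwrap_or(false) discards the error; unwrap_or_else logs it first
// while still returning the same bool to the state machine.
fn probe_worker(url: &str) -> Result<bool, String> {
    Err(format!("connection refused: {url}")) // simulate a transport error
}

fn check(url: &str) -> bool {
    probe_worker(url).unwrap_or_else(|e| {
        // Previously `unwrap_or(false)`: same result, but `e` was discarded
        // and operators had no visibility into the underlying cause.
        eprintln!("health probe failed for worker {url}: {e}");
        false
    })
}

fn main() {
    // Transport errors still count as failed probes, feeding the
    // Pending timeout / NotReady -> Failed paths.
    assert!(!check("grpc://10.0.0.5:9000"));
}
```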
@slin1237 slin1237 force-pushed the refactor/worker-pending-state-machine branch from 1fa4b49 to cb307ad on April 12, 2026 13:32
slin1237 added a commit that referenced this pull request Apr 12, 2026
… rot

Self-review of PR 6b found two issues that needed fixing:

1. **Test rot**: `test_health_checker_removes_unhealthy_workers` was
   asserting `healthy_workers == 0` after failed probes. With the new
   Pending start semantics, that assertion is true from the moment of
   registration (Pending workers are not healthy by definition), so
   the test passed vacuously without ever exercising removal or any
   state transition. Replaced with two focused tests:
   - `test_pending_worker_stays_pending_under_failed_probes`: verifies
     that Pending is sticky during early failures (does not transition
     to NotReady).
   - `test_failed_worker_can_be_removed_from_registry`: verifies the
     removal API works on workers in Failed state, without needing
     the async health checker loop or a JobQueue.

   Also renamed `test_health_checker_keeps_unhealthy_workers_when_disabled`
   → `test_health_checker_keeps_workers_when_remove_unhealthy_disabled`
   to better describe what it tests.

2. **Wasted work on Failed workers**: `check_health_async` was running
   the full HTTP/gRPC probe and incrementing counters for workers in
   the terminal Failed state. Added an early return that skips the
   probe and returns Err immediately. The health checker loop's
   removal logic is unaffected — it still gates on
   `worker.status() == Failed`, so removal happens on the next tick.
   The Failed match arm is kept (empty) because the Rust compiler
   requires exhaustive matching, and removing it would make adding
   future variants silent.

All 983 tests pass.

Refs: PR #1102 self-review
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
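
The removal behavior the second test locks in can be shown with a toy registry. `Registry` here is a minimal stand-in, not the real `WorkerRegistry` API:

```rust
// Removal gates on the terminal Failed state only; Pending workers
// (still proving themselves) are left alone.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Debug)]
enum WorkerStatus { Pending, Failed }

struct Registry {
    workers: HashMap<String, WorkerStatus>,
}

impl Registry {
    fn remove_if_failed(&mut self, url: &str) -> bool {
        if self.workers.get(url) == Some(&WorkerStatus::Failed) {
            self.workers.remove(url);
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut r = Registry { workers: HashMap::new() };
    r.workers.insert("http://w1".into(), WorkerStatus::Failed);
    r.workers.insert("http://w2".into(), WorkerStatus::Pending);
    assert!(r.remove_if_failed("http://w1"));  // Failed workers are removable
    assert!(!r.remove_if_failed("http://w2")); // Pending is not removed
    assert_eq!(r.workers.len(), 1);
}
```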

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ddc0b41131


@slin1237 slin1237 force-pushed the refactor/worker-pending-state-machine branch from ddc0b41 to 60cca3f on April 12, 2026 14:11

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 60cca3f249


Four real bugs surfaced by the latest review round:

1. **`worker_to_info()` collapses Pending/Failed to NotReady**
   (worker.rs): API/TUI consumers were seeing `status: NotReady` for
   workers in Pending or Failed because the helper computed status from
   `is_healthy()` (which is `status == Ready`). Now reads `worker.status()`
   directly and exposes the real lifecycle state. `is_healthy` is still
   derived from the same snapshot for backwards compatibility, no TOCTOU.

2. **`max_pending_probes` not enforced on success branch** (worker.rs):
   `total_pending_probes` is incremented before every Pending probe,
   but the cap was only checked in the failure branch. A flapping
   pattern (F,S,F,S,...) never reaches `success_threshold`, never
   transitions, and grows the counter unbounded. Added the cap check
   to the success branch — if a Pending worker probes successfully but
   has exceeded `max_pending_probes` total attempts, it transitions to
   Failed.

3. **`update_worker_properties.rs` blindly preserves status**
   (workflow/steps/local/update_worker_properties.rs): Two real bugs:
   - **Failed workers stay Failed**: a metadata-only update on a Failed
     worker would inherit the terminal status. Now compute a fresh
     `Pending` for Failed workers (they get a chance to recover via
     the standard health check loop after the update).
   - **disable_health_check flip leaves stale status**: if the update
     flips `disable_health_check` from false → true while the worker
     was Pending/NotReady, the new worker would be Pending forever
     (the health checker skips disabled workers). Now force `Ready`
     when `disable_health_check == true`.

4. **`ActivateWorkersStep` doesn't activate when global health checker
   is off** (workflow/steps/shared/activate.rs): `server.rs` skips
   starting the health checker entirely when global
   `disable_health_check` is set. Workers created with per-worker
   `disable_health_check: false` would stay Pending forever because no
   background task probes them. Now reads
   `app_context.router_config.health_check.disable_health_check` and
   force-activates all workers when the global checker won't run.

Also fixed a stale comment in worker.rs that said Failed workers can
recover via `register_or_replace`. With the recent changes, `replace()`
preserves the old status, so the actual recovery path is removal +
re-registration.

All 983 tests pass, clippy clean.

Refs: PR #1102 review feedback (chatgpt-codex-connector P1, coderabbitai)
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
Problem: e2e tests fail with "no available workers" because workflow-
created workers stay Pending until the background health checker
eventually promotes them. With default config, that takes 120s
(check_interval=60 × success_threshold=2), longer than the 60s test
gateway timeout.

Root cause analysis: my previous ActivateWorkersStep guard (only
activate `disable_health_check` workers) was wrong. I missed that
the workflow already does a real health probe via DetectBackendStep
(HTTP /health GET or gRPC health_check). By the time ActivateWorkers
runs, the worker has already been proven reachable — it's safe (and
correct) to flip Pending → Ready immediately.

This restores the pre-state-machine workflow behavior at the trusted-
creation boundary, while keeping the state machine for ongoing
failures: the background health checker still demotes workers
(Ready → NotReady → Failed) if probes start failing later. Pending
remains the correct start state for paths that don't go through the
workflow's connectivity proof (e.g. future mesh-imported workers).

Changes:
- activate.rs: simplified to always call set_healthy(true). Added
  rationale comment explaining why this is correct (DetectBackendStep
  already proved reachability).
- update_worker_properties.rs: simplified next_status logic. A
  metadata-only update preserves the worker's prior status (the
  endpoint hasn't changed, only labels/priority/etc.). Kept the one
  exception: flipping disable_health_check to true forces Ready,
  because the health checker skips disabled workers and a stale
  Pending/NotReady would never recover.

All 983 tests pass, clippy clean.

Refs: PR #1102 e2e test failures
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
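
The latency arithmetic behind the e2e failure is worth spelling out. Values mirror the defaults quoted above (check_interval=60, success_threshold=2) and the 60s gateway timeout:

```rust
// Worst-case Pending -> Ready promotion latency with default config:
// one probe per check_interval, success_threshold consecutive successes
// required, which exceeds the e2e gateway startup timeout.
fn main() {
    let check_interval_secs = 60;
    let success_threshold = 2;
    let promotion_latency_secs = check_interval_secs * success_threshold;
    let e2e_gateway_timeout_secs = 60;

    assert_eq!(promotion_latency_secs, 120);
    assert!(promotion_latency_secs > e2e_gateway_timeout_secs);
    println!("worst-case Pending -> Ready: {promotion_latency_secs}s");
}
```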

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 042ab0ef0d


…health_check

Workers created through the workflow already had their connectivity
proven by DetectBackendStep (HTTP /health probe or gRPC health_check)
before reaching ActivateWorkersStep. Flipping them Pending → Ready
here is correct because the workflow's connectivity proof is functionally
equivalent to one successful state-machine probe.

Without this, workflow-created workers stayed Pending until the
background health checker promoted them after `check_interval ×
success_threshold` seconds (~120s with defaults), exceeding e2e
gateway startup timeouts.

The state machine still applies for ongoing failures: the background
health checker demotes Ready → NotReady → Failed on probe failures.
Pending remains the start state for paths that don't pre-validate
(e.g. mesh-imported workers in future PRs).

Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
@slin1237 slin1237 merged commit d954c14 into main Apr 12, 2026
79 of 81 checks passed
@slin1237 slin1237 deleted the refactor/worker-pending-state-machine branch April 12, 2026 20:30

Labels: model-gateway (Model gateway crate changes), tests (Test changes)
