
refactor(worker): land Pending start semantics and state machine #1102

Merged
slin1237 merged 8 commits into main from refactor/worker-pending-state-machine
Apr 12, 2026

Conversation


@slin1237 slin1237 commented Apr 12, 2026

Description

Problem

Workers start Ready (routable) immediately at registration, before any local health check has verified them. This causes the mesh-sync 500/503 bug where unverified workers receive traffic and fail. The WorkerStatus enum from #1093 defined four states but only two were actually used.

Solution

Workers with health checks enabled now start Pending (not routable). The health checker promotes them to Ready after success_threshold consecutive successful probes. The full set of state-machine transitions is implemented in check_health_async().

State Machine

             register()
                 │
        disable_health_check?
           /          \
        yes            no
         │              │
       Ready         Pending ──────┐
                        │          │
                        │          │ max_pending_probes (10× failure_threshold)
                        │          │ total attempts exceeded
               success_threshold   ▼
               consecutive       Failed
                    │
                    ▼
            ┌────Ready◄────────┐
            │      │           │
failure_threshold  │     success_threshold
  consecutive      │       consecutive
            │      ▼           │
            └─►NotReady────────┘
                    │
                    │ liveness_failure_threshold
                    │ (3× failure_threshold) consecutive
                    ▼
                 Failed
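The transitions in the diagram can be sketched as a plain synchronous function. This is an illustrative sketch only: HealthState, Thresholds, and on_probe are hypothetical names, the real check_health_async() is async and tracks counters with atomics, and counter-reset details may differ. Only WorkerStatus, the thresholds, and the transitions come from the diagram above.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum WorkerStatus {
    Pending,
    Ready,
    NotReady,
    Failed,
}

struct Thresholds {
    success_threshold: u32,
    failure_threshold: u32,
}

impl Thresholds {
    // Guard against misconfigured URLs lingering in Pending forever.
    fn max_pending_probes(&self) -> u32 {
        10 * self.failure_threshold
    }
    // Liveness is deliberately laxer than readiness (cf. K8s probes).
    fn liveness_failure_threshold(&self) -> u32 {
        3 * self.failure_threshold
    }
}

struct HealthState {
    status: WorkerStatus,
    consecutive_successes: u32,
    consecutive_failures: u32,
    total_pending_probes: u32,
}

fn on_probe(s: &mut HealthState, t: &Thresholds, healthy: bool) {
    if healthy {
        s.consecutive_successes += 1;
        s.consecutive_failures = 0;
    } else {
        s.consecutive_failures += 1;
        s.consecutive_successes = 0;
    }
    match s.status {
        WorkerStatus::Pending => {
            s.total_pending_probes += 1;
            if s.consecutive_successes >= t.success_threshold {
                s.status = WorkerStatus::Ready; // Pending -> Ready
            } else if s.total_pending_probes >= t.max_pending_probes() {
                s.status = WorkerStatus::Failed; // Pending -> Failed
                s.consecutive_failures = 0;
            }
        }
        WorkerStatus::Ready => {
            if s.consecutive_failures >= t.failure_threshold {
                s.status = WorkerStatus::NotReady; // Ready -> NotReady
            }
        }
        WorkerStatus::NotReady => {
            if s.consecutive_successes >= t.success_threshold {
                s.status = WorkerStatus::Ready; // NotReady -> Ready
            } else if s.consecutive_failures >= t.liveness_failure_threshold() {
                s.status = WorkerStatus::Failed; // NotReady -> Failed
            }
        }
        // Failed is terminal: the worker stays Failed until replaced.
        WorkerStatus::Failed => {}
    }
}

fn main() {
    let t = Thresholds { success_threshold: 2, failure_threshold: 3 };
    let mut s = HealthState {
        status: WorkerStatus::Pending,
        consecutive_successes: 0,
        consecutive_failures: 0,
        total_pending_probes: 0,
    };
    // Two consecutive successful probes promote Pending -> Ready.
    on_probe(&mut s, &t, true);
    on_probe(&mut s, &t, true);
    assert_eq!(s.status, WorkerStatus::Ready);
    // Three consecutive failures demote Ready -> NotReady.
    for _ in 0..3 {
        on_probe(&mut s, &t, false);
    }
    assert_eq!(s.status, WorkerStatus::NotReady);
    println!("transitions ok: {:?}", s.status);
}
```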

Changes

Production code:

  • worker.rs: added total_pending_probes counter on BasicWorker
  • builder.rs: initial status is Pending if health checks enabled, Ready otherwise
  • worker.rs: check_health_async() implements the full state machine (Pending→Ready, Ready→NotReady, NotReady→Failed, NotReady→Ready, Pending→Failed)
  • registry.rs: --remove-unhealthy-workers only removes Failed (not Pending or NotReady)
  • workflow/steps/shared/activate.rs: ActivateWorkersStep only activates workers with disable_health_check == true
  • workflow/steps/local/create_worker.rs & external/create_workers.rs: removed redundant set_healthy calls (builder handles initial status)

Thresholds:

  • max_pending_probes = 10 × failure_threshold — prevents misconfigured URLs from lingering in Pending forever
  • liveness_failure_threshold = 3 × failure_threshold — prevents aggressive removal of transiently unhealthy workers (analogous to K8s having separate readiness and liveness probes)

Tests:

  • Added no_health_check() helper in test modules
  • Updated ~80 test builders across policies/*, routers/http/router.rs, and integration tests to use disable_health_check: true
  • Integration test helpers (common/mod.rs, test_app.rs) default to disable_health_check: true
  • Added test_pending_worker_not_routable to verify new behavior
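The no_health_check() helper described above is tiny. A sketch, using a stand-in HealthCheckConfig struct for illustration (the real one lives in openai_protocol::worker and carries more fields):

```rust
// Stand-in HealthCheckConfig for illustration; the real struct lives in
// openai_protocol::worker and has additional fields.
#[derive(Default, Debug)]
struct HealthCheckConfig {
    disable_health_check: bool,
    check_interval_secs: u64,
    success_threshold: u32,
    failure_threshold: u32,
}

// Test helper: workers built with this config start Ready immediately,
// so routing tests don't have to wait out the Pending phase.
fn no_health_check() -> HealthCheckConfig {
    HealthCheckConfig {
        disable_health_check: true,
        ..Default::default()
    }
}

fn main() {
    let cfg = no_health_check();
    assert!(cfg.disable_health_check);
    println!("health checks disabled: {}", cfg.disable_health_check);
}
```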

What's NOT in this PR

  • StatusChanged events from the health path — deferred to PR 7 when WorkerManager owns the health loop
  • Mesh sync rolling upgrade handling — deferred to mesh v2 / WorkerSyncAdapter

Behavior changes

  • --remove-unhealthy-workers is now stricter: only Failed workers are removed. Workers in Pending or NotReady are left alone. This is a fix — the old behavior removed transiently unhealthy workers too aggressively.
  • Workers take check_interval_secs × success_threshold seconds to become routable after registration (default: ~30 seconds with 15s interval × 2 threshold). Workers with disable_health_check == true are Ready immediately.
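The routable delay in the second bullet is a simple product of interval and threshold. A quick sketch (the function name is hypothetical; exact timing also depends on when the first probe fires relative to registration):

```rust
// Approximate seconds until a freshly registered, health-checked worker
// becomes routable: it must pass success_threshold consecutive probes,
// spaced check_interval_secs apart.
fn secs_until_routable(check_interval_secs: u64, success_threshold: u64) -> u64 {
    check_interval_secs * success_threshold
}

fn main() {
    // The defaults cited in the PR: 15s interval x 2 successes = ~30s.
    assert_eq!(secs_until_routable(15, 2), 30);
    println!("~{}s until routable", secs_until_routable(15, 2));
}
```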

Test Plan

  • cargo check --package smg --package smg-golang — clean
  • cargo test --package smg --lib — 512 tests pass
  • cargo test --package smg --test api_tests — 100 tests pass
  • cargo test --package smg --test routing_tests — 92 tests pass
  • All other integration test suites pass

Refs: worker-module-deep-refactor plan (PR 6b)

Release constraint: This PR must roll forward directly into PR 7 (WorkerManager extraction). PR 6b is an implementation staging PR — `StatusChanged` events become authoritative in PR 7.

Checklist
  • Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • Chores
    • Worker startup/activation now follow health-check config: builders set initial readiness; activation only marks health-check-disabled workers ready; health-checked workers start pending.
    • Worker health lifecycle and tracking improved (clearer status transitions, pending-probe counting); replacements preserve prior runtime status and removals only occur for truly failed workers.
  • Tests
    • Many tests and helpers updated to disable health checks where needed to stabilize startup, routing, and reliability tests.

@github-actions github-actions bot added labels tests (Test changes) and model-gateway (Model gateway crate changes) on Apr 12, 2026
@coderabbitai

coderabbitai bot commented Apr 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Worker health model changed: builder now controls initial WorkerStatus (Pending/Ready) and exposes a status setter; BasicWorker tracks pending probe counts and revised health-check transitions; registry removal logic now only removes terminal Failed status; various workflow steps and many tests updated to disable health checks in test setups.

Changes

Cohort / File(s) Summary
Core worker implementation
model_gateway/src/worker/worker.rs, model_gateway/src/worker/builder.rs
Introduce WorkerStatus-aware builder (status(...)), builder-driven initial status (Pending vs Ready), add total_pending_probes, update health-check state machine and probe counting; public builder method added.
Registry & lifecycle
model_gateway/src/worker/registry.rs
Change removal criteria to only remove workers with terminal WorkerStatus::Failed; preserve explicit status on replace; add regression tests for status preservation.
Workflow & activation steps
model_gateway/src/workflow/steps/external/create_workers.rs, model_gateway/src/workflow/steps/local/create_worker.rs, model_gateway/src/workflow/steps/shared/activate.rs, model_gateway/src/workflow/steps/local/update_worker_properties.rs
Stop mutating worker health after build; delegate initial status to builder; Activate step only marks workers ready when health checks are disabled and worker not already healthy; update replace to preserve existing status.
Policy tests
model_gateway/src/policies/*/*.rs (multiple policy files: cache_aware.rs, consistent_hashing.rs, manual.rs, mod.rs, power_of_two.rs, prefix_hash.rs, random.rs, registry.rs, round_robin.rs)
Test modules updated to import HealthCheckConfig and add no_health_check() helper; all test-built workers now call .health_config(no_health_check()) to disable health checks in tests.
Router tests & HTTP router
model_gateway/src/routers/http/router.rs, model_gateway/tests/*, model_gateway/tests/common/*, model_gateway/tests/api/*, model_gateway/tests/routing/*, model_gateway/tests/reliability/*
Many integration/unit tests changed to build mutable RouterConfig and set health_check.disable_health_check = true; test helpers updated to create external/mock workers with disabled health checks.
Metrics & tests alignment
various unit tests updated across worker and builder tests
Adjust tests to new default Pending behavior and updated expectations for status and metrics initialization.

Sequence Diagram(s)

sequenceDiagram
    participant Workflow as Workflow/CreateWorker
    participant Builder as BasicWorkerBuilder
    participant Worker as BasicWorker
    participant Registry as WorkerRegistry
    participant Health as HealthChecker

    Workflow->>Builder: configure metadata (+health_config, optional status)
    Builder->>Worker: build() -> set initial WorkerStatus (Ready if disable_health_check, else Pending)
    Worker->>Registry: register(worker snapshot)
    par async health lifecycle
        Health->>Worker: check_health_async()
        alt health checks disabled
            Worker-->>Registry: remain Ready (no probes)
        else enabled
            Worker-->>Worker: increment total_pending_probes if Pending
            Worker-->>Registry: update status transitions (Pending->Ready, Ready->NotReady, NotReady->Failed)
        end
    end
    Note right of Registry: removal only when status == Failed

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • CatherineSue
  • key4ng
  • whybeyoung

Poem

"I hopped through code at break of day,
Builders set statuses my way,
Pending or Ready, I beep and cheer,
Probes counted—no more fear 🐇,
Tests now pass, batteries near!"

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The PR title accurately and concisely summarizes the primary change: implementing Pending start semantics and a state machine for worker health checking.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient (required threshold: 80.00%).


Comment @coderabbitai help to get the list of available commands and usage tips.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8aba682681

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


@claude claude bot left a comment


Clean state machine implementation. Two nits on the recovery path semantics — both are about comment accuracy and a minor counter-reset inconsistency, not correctness bugs.

Summary: 0 🔴 Important · 2 🟡 Nit · 0 🟣 Pre-existing


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request replaces the boolean health status for workers with a state machine consisting of Pending, Ready, NotReady, and Failed states. Workers with health checks enabled now initialize as Pending and transition to Ready only after meeting a success threshold. The update also introduces logic for handling failures, including a transition to Failed after exceeding failure or probe thresholds. Feedback was provided regarding the Failed state, noting that the current implementation allows recovery to Ready, which contradicts the documentation suggesting Failed should be a terminal state.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/policies/bucket.rs`:
- Around line 570-575: The test helper no_health_check() is duplicated across
test modules; extract it into a shared test utilities module (e.g., a new
tests::helpers or crate::test_utils) and update tests to import and call that
single function instead of their local copies. Move the function signature and
return value (HealthCheckConfig with disable_health_check: true and
Default::default() for the rest) into the shared helper, ensure the module is
annotated for test-only use (cfg(test) or in tests/), and update references in
policies::bucket tests to use the centralized no_health_check helper.

In `@model_gateway/src/policies/mod.rs`:
- Around line 190-195: The test helper no_health_check() that returns a
HealthCheckConfig with disable_health_check: true is duplicated across policy
test modules; extract and centralize it into a shared test utilities module
(e.g., tests::utils or a new policies::test_helpers) and update callers to
import and use the single no_health_check function instead of local copies,
ensuring the signature and returned type (HealthCheckConfig) remain unchanged so
functions referencing no_health_check() continue to compile.

In `@model_gateway/src/worker/registry.rs`:
- Around line 996-1001: The worker eviction currently checks both
worker.status() == WorkerStatus::Failed and the boolean gate *failed from
checked_workers, which prevents removing workers that transitioned to Failed
without an Err probe; update the eviction logic in registry.rs (the loop
iterating over checked_workers when job_queue.is_some()) to only rely on
worker.status() == WorkerStatus::Failed (i.e., drop the extra *failed condition)
so that any worker in the terminal Failed state is removed when
--remove-unhealthy-workers is enabled.

In `@model_gateway/src/worker/worker.rs`:
- Around line 654-661: The Pending→Failed branch doesn't reset the
consecutive_failures counter, unlike the other failure transitions; after
detecting total >= max_pending and calling self.set_status(WorkerStatus::Failed)
in the WorkerStatus::Pending branch, also reset the counter (e.g.,
self.consecutive_failures.store(0, Ordering::Relaxed)) so consecutive_failures
is cleared consistently with the Ready→NotReady and NotReady→Failed transitions;
locate this in the block that computes max_pending from
health_config.failure_threshold and reads self.total_pending_probes.

In `@model_gateway/tests/common/mod.rs`:
- Around line 206-207: The test helper is forcibly disabling health checks by
setting config.health_check.disable_health_check = true, which prevents callers
from enabling health checks; change this so only AppTestContext::new() defaults
to disabling health checks, but do not mutate the passed config in
new_with_config() or the lower-level context builders—respect the incoming
config.health_check values instead (update new_with_config(), related builders,
and any helper locations around the places called out: the blocks near lines
206, 416-419, 555-558, 693-696) so tests that pass an enabled health_check can
exercise Pending/Failed semantics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d7bd91a0-dc16-48a9-b8f1-b5843873c930

📥 Commits

Reviewing files that changed from the base of the PR and between ae14a9c and 8aba682.

📒 Files selected for processing (23)
  • model_gateway/src/policies/bucket.rs
  • model_gateway/src/policies/cache_aware.rs
  • model_gateway/src/policies/consistent_hashing.rs
  • model_gateway/src/policies/manual.rs
  • model_gateway/src/policies/mod.rs
  • model_gateway/src/policies/power_of_two.rs
  • model_gateway/src/policies/prefix_hash.rs
  • model_gateway/src/policies/random.rs
  • model_gateway/src/policies/registry.rs
  • model_gateway/src/policies/round_robin.rs
  • model_gateway/src/routers/http/router.rs
  • model_gateway/src/worker/builder.rs
  • model_gateway/src/worker/registry.rs
  • model_gateway/src/worker/worker.rs
  • model_gateway/src/workflow/steps/external/create_workers.rs
  • model_gateway/src/workflow/steps/local/create_worker.rs
  • model_gateway/src/workflow/steps/shared/activate.rs
  • model_gateway/tests/common/mod.rs
  • model_gateway/tests/common/test_app.rs
  • model_gateway/tests/load_guard_raii_test.rs
  • model_gateway/tests/otel_tracing_test.rs
  • model_gateway/tests/routing/cache_aware_backward_compat_test.rs
  • model_gateway/tests/routing/test_openai_routing.rs

What changed:

Production code:
- worker.rs: added `total_pending_probes: Arc<AtomicUsize>` to BasicWorker
  for tracking total probe attempts in Pending state
- builder.rs: initializes workers as `Pending` when health checks are
  enabled, `Ready` when `disable_health_check == true`. Previously all
  workers started Ready regardless of config.
- worker.rs: `check_health_async()` implements the full state machine:
    Pending → Ready on `success_threshold` consecutive passes
    Pending → Failed on `max_pending_probes` (10× failure_threshold) total attempts
    Ready → NotReady on `failure_threshold` consecutive failures
    NotReady → Ready on `success_threshold` consecutive passes
    NotReady → Failed on `liveness_failure_threshold` (3× failure_threshold)
- registry.rs: `start_health_checker` only removes `Failed` workers
  (not Pending or NotReady) when `--remove-unhealthy-workers` is set
- workflow/steps/shared/activate.rs: ActivateWorkersStep only activates
  workers with `disable_health_check == true`. Health-checked workers
  stay Pending — the health checker promotes them after passing probes.
- workflow/steps/local/create_worker.rs: removed redundant set_healthy
  call (builder now handles initial status)
- workflow/steps/external/create_workers.rs: same removal

Tests:
- worker.rs: added `no_health_check()` helper and `test_pending_worker_not_routable`
- builder.rs: updated assertions to match new Pending-by-default behavior
- policies/*.rs: added `.health_config(no_health_check())` to ~73 test
  builders so policy tests get routable workers (they don't test health
  lifecycle, they test routing decisions)
- routers/http/router.rs: same fix for router tests
- tests/common/mod.rs: `new_with_config` sets `disable_health_check: true`
  so all integration tests get routable workers immediately
- tests/common/test_app.rs: same for external worker helper
- tests/routing/cache_aware_backward_compat_test.rs: added no_health_check
  helper and applied to 8 builders
- tests/routing/test_openai_routing.rs: added health config
- tests/load_guard_raii_test.rs: added health config
- tests/otel_tracing_test.rs: set disable_health_check on test config

Why:
Workers that haven't been verified locally should not receive traffic.
Starting Pending instead of Ready prevents the mesh-sync 500/503 bug
where unverified workers were assumed healthy. The state machine
(Pending → Ready → NotReady → Failed) mirrors Kubernetes pod conditions
and enables proper liveness separation from readiness.

How:
- `max_pending_probes = 10 × failure_threshold` guards against misconfigured
  URLs staying in Pending forever
- `liveness_failure_threshold = 3 × failure_threshold` prevents aggressive
  removal of transiently unhealthy workers (analogous to K8s having
  separate readiness and liveness probes)
- `StatusChanged` events are NOT yet emitted from the health path — this
  is deferred to PR 7 when WorkerManager owns the health loop
- `--remove-unhealthy-workers` now only removes Failed workers, not
  NotReady (which may recover) or Pending (which hasn't proven itself).
  This is a behavior fix, not a breaking change — the old behavior was
  a bug that removed transiently unhealthy workers too aggressively.

All 512 lib tests, 100 api tests, and 200+ other integration tests pass.

Refs: worker-module-deep-refactor plan (PR 6b)
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
@slin1237 slin1237 force-pushed the refactor/worker-pending-state-machine branch from 8aba682 to 53e5233 on April 12, 2026 at 04:30

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 53e5233f00



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
model_gateway/src/worker/registry.rs (1)

1000-1000: ⚠️ Potential issue | 🟠 Major

Drop the extra *failed gate from terminal Failed eviction.

Line 1000 currently requires both WorkerStatus::Failed and *failed. A worker can already be terminal Failed while the latest probe result is not Err, which prevents eviction and leaves a non-recoverable worker registered.

🛠️ Proposed fix
-                            if worker.status() == WorkerStatus::Failed && *failed {
+                            if worker.status() == WorkerStatus::Failed {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/worker/registry.rs` at line 1000, The eviction gate
currently requires both worker.status() == WorkerStatus::Failed and *failed,
which prevents eviction when a worker is already terminal Failed but the latest
probe flag (failed) is false; remove the extra probe-based check by changing the
conditional that references worker.status() and *failed to only check
worker.status() == WorkerStatus::Failed so terminal Failed workers are evicted
regardless of the transient *failed probe flag (adjust any surrounding logic
that depended on the combined condition in the same block to keep semantics for
non-Failed states).
model_gateway/tests/common/mod.rs (1)

206-207: ⚠️ Potential issue | 🟠 Major

Don’t override caller-supplied health-check config in shared test helpers.

new_with_config() now forces config.health_check.disable_health_check = true, and the lower-level context builders hardcode the same override into every test worker. That means any test that explicitly enables health checks can no longer exercise the new Pending -> Ready/Failed startup behavior this PR is introducing. It’s fine for AppTestContext::new() to default to disabled health checks for convenience, but new_with_config() and create_test_context*() should preserve the incoming config.

Suggested direction
 pub fn new(
     worker_configs: Vec<MockWorkerConfig>,
 ) -> Pin<Box<dyn Future<Output = Self> + Send>> {
     Box::pin(async move {
-        let config = RouterConfig::builder()
+        let mut config = RouterConfig::builder()
             .regular_mode(vec![])
             .random_policy()
             .host("127.0.0.1")
             .port(3002)
             .max_payload_size(256 * 1024 * 1024)
             .request_timeout_secs(600)
             .worker_startup_timeout_secs(1)
             .worker_startup_check_interval_secs(1)
             .max_concurrent_requests(64)
             .queue_timeout_secs(60)
             .build_unchecked();
+        config.health_check.disable_health_check = true;

         Self::new_with_config(config, worker_configs).await
     })
 }

 pub fn new_with_config(
-    mut config: RouterConfig,
+    config: RouterConfig,
     worker_configs: Vec<MockWorkerConfig>,
 ) -> Pin<Box<dyn Future<Output = Self> + Send>> {
-    // Test workers don't need health checks — start Ready immediately
-    config.health_check.disable_health_check = true;
-
     Box::pin(async move {
         ...
     })
 }

And in each create_test_context*() helper, pass through the caller's setting instead of hardcoding it:

+let health_config = config.health_check.clone();
 ...
-    .health_config(openai_protocol::worker::HealthCheckConfig {
-        disable_health_check: true,
-        ..Default::default()
-    })
+    .health_config(health_config.clone())

Also applies to: 416-419, 555-558, 693-696

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/tests/common/mod.rs` around lines 206 - 207, new_with_config()
and the create_test_context*() helpers are mutating the caller-supplied config
by forcing config.health_check.disable_health_check = true, which prevents tests
from exercising the Pending->Ready/Failed startup flow; remove that hardcoded
override from new_with_config() and from the lower-level create_test_context*()
builders so they preserve the incoming config, leaving only
AppTestContext::new() to default-disable health checks for convenience if no
config is provided; ensure any place that currently sets
config.health_check.disable_health_check explicitly instead respects the
passed-in config value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/workflow/steps/shared/activate.rs`:
- Around line 28-30: The Ready transition is being re-applied for workers with
disable_health_check by calling worker.set_healthy(true) in the activation loop;
instead either remove that call or guard it so you don't mutate
already-initialized status—update the block around
worker.metadata().health_config.disable_health_check to simply increment
activated (or only call worker.set_healthy(true) when the worker's current state
is not already Ready) and avoid redundant state mutations from
set_healthy(true).

---

Duplicate comments:
In `@model_gateway/src/worker/registry.rs`:
- Line 1000: The eviction gate currently requires both worker.status() ==
WorkerStatus::Failed and *failed, which prevents eviction when a worker is
already terminal Failed but the latest probe flag (failed) is false; remove the
extra probe-based check by changing the conditional that references
worker.status() and *failed to only check worker.status() ==
WorkerStatus::Failed so terminal Failed workers are evicted regardless of the
transient *failed probe flag (adjust any surrounding logic that depended on the
combined condition in the same block to keep semantics for non-Failed states).

In `@model_gateway/tests/common/mod.rs`:
- Around line 206-207: new_with_config() and the create_test_context*() helpers
are mutating the caller-supplied config by forcing
config.health_check.disable_health_check = true, which prevents tests from
exercising the Pending->Ready/Failed startup flow; remove that hardcoded
override from new_with_config() and from the lower-level create_test_context*()
builders so they preserve the incoming config, leaving only
AppTestContext::new() to default-disable health checks for convenience if no
config is provided; ensure any place that currently sets
config.health_check.disable_health_check explicitly instead respects the
passed-in config value.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e122f3b6-20af-446b-9c25-1d39b0300802

📥 Commits

Reviewing files that changed from the base of the PR and between 8aba682 and 53e5233.

📒 Files selected for processing (23)
  • model_gateway/src/policies/bucket.rs
  • model_gateway/src/policies/cache_aware.rs
  • model_gateway/src/policies/consistent_hashing.rs
  • model_gateway/src/policies/manual.rs
  • model_gateway/src/policies/mod.rs
  • model_gateway/src/policies/power_of_two.rs
  • model_gateway/src/policies/prefix_hash.rs
  • model_gateway/src/policies/random.rs
  • model_gateway/src/policies/registry.rs
  • model_gateway/src/policies/round_robin.rs
  • model_gateway/src/routers/http/router.rs
  • model_gateway/src/worker/builder.rs
  • model_gateway/src/worker/registry.rs
  • model_gateway/src/worker/worker.rs
  • model_gateway/src/workflow/steps/external/create_workers.rs
  • model_gateway/src/workflow/steps/local/create_worker.rs
  • model_gateway/src/workflow/steps/shared/activate.rs
  • model_gateway/tests/common/mod.rs
  • model_gateway/tests/common/test_app.rs
  • model_gateway/tests/load_guard_raii_test.rs
  • model_gateway/tests/otel_tracing_test.rs
  • model_gateway/tests/routing/cache_aware_backward_compat_test.rs
  • model_gateway/tests/routing/test_openai_routing.rs

slin1237 added a commit that referenced this pull request Apr 12, 2026
Production fixes:

1. **Failed is terminal** (worker.rs): The success branch used
   `current_status != Ready` which also matched `Failed`, allowing
   silently reintroducing Failed workers via successful probes. Changed
   to `matches!(.., Pending | NotReady)` so Failed workers stay Failed
   until explicitly replaced.

2. **gRPC connect errors trigger Pending timeout** (worker.rs): probe
   results now use `unwrap_or(false)` instead of `?`, so transport-level
   errors (e.g. gRPC connection failures) increment counters and reach
   the Pending→Failed timeout path. Without this, misconfigured gRPC
   workers stayed Pending indefinitely.

3. **Drop `*failed` gate from removal** (registry.rs): Pending→Failed
   via `max_pending_probes` can happen on a probe that returned
   `Ok(false)` (not `Err`). The previous gate would skip removal in
   that case. Failed is terminal — any worker in Failed state is
   removed when `--remove-unhealthy-workers` is enabled.

4. **Preserve status on replace** (registry.rs): metadata-only updates
   (e.g. priority, labels) used to drop a healthy worker back to
   Pending because the builder always starts health-checked workers
   as Pending. This caused ~30s of avoidable 503s for single-worker
   models. `replace()` now copies the old worker's status to the new
   worker before insertion. Added regression test.

5. **`new_with_config` no longer mutates incoming config**
   (tests/common/mod.rs): Tests that need to exercise Pending/Failed
   semantics can now pass a config with health checks enabled.
   `AppTestContext::new()` and `TestRouterConfig::*` still default to
   `disable_health_check: true` for tests that don't care.

Nit fixes:

6. **Reset consecutive_failures on Pending→Failed** (worker.rs):
   consistency with other failure-path transitions.

7. **Guard set_healthy in ActivateWorkersStep** (activate.rs): only
   call set_healthy if not already Ready, to avoid extra state churn.

Test fixes:
- Updated `TestRouterConfig` to apply `disable_health_check: true`
  via a shared `apply_test_defaults()` helper across all 11 builder
  methods (eliminates duplication)
- Updated 16 inline `RouterConfig::builder()` configs across
  reliability, routing, api test files to set `disable_health_check`
- Added `test_replace_preserves_status` regression test

All 981 tests pass (513 lib + 468 integration).

Refs: PR #1102 review feedback
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
model_gateway/src/worker/worker.rs (1)

596-685: ⚠️ Potential issue | 🟠 Major

Pending/Failed still get flattened to NotReady in WorkerInfo.

This state machine now makes Pending and Failed real runtime states, but worker_to_info() still derives status from is_healthy(). Any caller of worker_to_info() will therefore never see Pending or Failed, which defeats the new lifecycle semantics outside the in-memory worker object.

♻️ Suggested fix
 pub fn worker_to_info(worker: &Arc<dyn Worker>) -> WorkerInfo {
     let metadata = worker.metadata();
     let spec = metadata.spec.clone();
     let is_healthy = worker.is_healthy();
+    let status = worker.status();

     WorkerInfo {
         id: worker.url().to_string(),
         model_id: spec.models.primary().map(|m| m.id.clone()),
         spec,
         is_healthy,
-        status: Some(if is_healthy {
-            WorkerStatus::Ready
-        } else {
-            WorkerStatus::NotReady
-        }),
+        status: Some(status),
         load: worker.load(),
         job_status: None,
     }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/src/worker/worker.rs` around lines 596 - 685, worker_to_info
currently derives the public status from is_healthy(), which collapses
Pending/Failed into NotReady; update worker_to_info to read the worker's actual
runtime status (use the Worker::status() method or the stored metadata.status
field) and map that WorkerStatus value directly into WorkerInfo.status so
Pending and Failed are preserved for callers; ensure any existing boolean
is_healthy logic is kept for health checks only and adjust the mapping code in
worker_to_info (and any helper that converts WorkerStatus→WorkerInfo status) to
accept/propagate the full WorkerStatus enum values instead of relying on
is_healthy().
♻️ Duplicate comments (1)
model_gateway/tests/common/mod.rs (1)

417-420: ⚠️ Potential issue | 🟠 Major

Respect caller-provided health-check config in lower-level OpenAI test context builders.

Line 417, Line 556, and Line 694 hardcode disable_health_check: true for external workers, which overrides explicit test configs and prevents OpenAI-mode tests from exercising Pending/Failed startup semantics when needed.

♻️ Proposed fix
-                        .health_config(openai_protocol::worker::HealthCheckConfig {
-                            disable_health_check: true,
-                            ..Default::default()
-                        })
+                        .health_config(config.health_check.clone())

Also applies to: 556-559, 694-697

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/tests/common/mod.rs` around lines 417 - 420, The builders are
forcing disable_health_check: true in calls to
health_config(openai_protocol::worker::HealthCheckConfig { disable_health_check:
true, ..Default::default() }) which overrides any caller-provided test config;
update those spots to respect the caller's HealthCheckConfig (or fall back to
Default::default()) instead of hardcoding disable_health_check, i.e., pass
through the provided HealthCheckConfig value into health_config or merge it with
defaults so disable_health_check reflects the caller's setting (affects the
occurrences that use openai_protocol::worker::HealthCheckConfig and the
health_config(...) calls).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1f4dfc2c-510e-4d2a-8824-b0b1e9469b22

📥 Commits

Reviewing files that changed from the base of the PR and between 53e5233 and c88c26e.

📒 Files selected for processing (15)
  • model_gateway/src/worker/registry.rs
  • model_gateway/src/worker/worker.rs
  • model_gateway/src/workflow/steps/shared/activate.rs
  • model_gateway/tests/api/api_endpoints_test.rs
  • model_gateway/tests/api/parser_endpoints_test.rs
  • model_gateway/tests/common/mod.rs
  • model_gateway/tests/common/test_config.rs
  • model_gateway/tests/reliability/circuit_breaker_test.rs
  • model_gateway/tests/reliability/rate_limiting_test.rs
  • model_gateway/tests/reliability/retries_test.rs
  • model_gateway/tests/routing/header_forwarding_test.rs
  • model_gateway/tests/routing/payload_size_test.rs
  • model_gateway/tests/routing/pd_routing_test.rs
  • model_gateway/tests/routing/power_of_two_test.rs
  • model_gateway/tests/routing/service_discovery_test.rs

@slin1237 slin1237 force-pushed the refactor/worker-pending-state-machine branch from c88c26e to e2c8b68 on April 12, 2026 12:56

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e2c8b68749

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
model_gateway/tests/common/mod.rs (1)

417-420: ⚠️ Potential issue | 🟠 Major

Respect caller health-check config in shared create_test_context* helpers.

At Line 417, Line 556, and Line 694, the helpers still hardcode disable_health_check: true for OpenAI-mode worker registration. That overrides caller intent and prevents tests using these lower-level builders from validating Pending/Failed startup semantics.

💡 Suggested fix
-                        .health_config(openai_protocol::worker::HealthCheckConfig {
-                            disable_health_check: true,
-                            ..Default::default()
-                        })
+                        .health_config(config.health_check.clone())

Apply the same replacement in all three helper functions:

  • create_test_context
  • create_test_context_with_parsers
  • create_test_context_with_mcp_config

Also applies to: 556-559, 694-697

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@model_gateway/tests/common/mod.rs` around lines 417 - 420, The helpers
create_test_context, create_test_context_with_parsers, and
create_test_context_with_mcp_config currently force
openai_protocol::worker::HealthCheckConfig { disable_health_check: true } when
registering the OpenAI-mode worker; update each helper to respect the
caller-provided HealthCheckConfig instead of hardcoding
disable_health_check=true — e.g., use the passed health_config (or merge with
Default::default()) when building the worker registration rather than overriding
disable_health_check, and remove the hardcoded disable_health_check assignment
so tests can control Pending/Failed startup semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/worker/registry.rs`:
- Around line 468-474: Currently the code unconditionally copies
old_worker.status() into new_worker
(new_worker.set_status(old_worker.status())), which preserves Failed or
Pending/NotReady states and breaks recovery; change this to copy the old status
only when it's Healthy (or when it's Pending/NotReady and the new worker will
still be health-checked). Concretely, replace the unconditional set_status call
with a conditional: if old_worker.status().is_healthy() then
new_worker.set_status(old_worker.status()); else if
old_worker.status().is_pending_or_not_ready() &&
!new_worker.config().disable_health_check {
new_worker.set_status(old_worker.status()); } otherwise leave the new worker at
its builder-initialized status so check_health_async() and explicit replacement
logic can recover Failed and handle disabled health checks correctly; reference
new_worker.set_status, old_worker.status(), check_health_async(), and
disable_health_check in your change.

In `@model_gateway/src/worker/worker.rs`:
- Around line 627-639: worker_to_info() currently computes WorkerInfo.status via
is_healthy(), which hides the actual WorkerStatus (Pending, NotReady, Failed,
Ready); change worker_to_info() to read the real status (the WorkerStatus stored
on the worker) and map that enum directly into the WorkerInfo.status field
instead of calling is_healthy(). Update any other serialization logic in the
same file (the block covering the later mapping around the 646-675 region) to
convert WorkerStatus -> WorkerInfo status values so Pending and Failed are
preserved when returned to API/TUI consumers.
- Around line 608-620: The Pending-probe cap is only enforced when a probe
fails, allowing a successful probe to bypass max_pending_probes; after computing
probe_result/health_result (from http_health_check()/grpc_health_check()) and
regardless of true/false, check whether current_status == WorkerStatus::Pending
and self.total_pending_probes.fetch_add(1, Ordering::Relaxed) (or the
already-incremented counter) has reached or exceeded self.max_pending_probes; if
so, transition the worker to Failed (same behavior as the failure arm) and
short-circuit further success/failure handling. Apply the same change to the
other probe blocks referenced (the regions around where health_result is
evaluated at 622-639 and 662-675) so the max_pending_probes check runs
unconditionally after each Pending probe result.
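
The flapping scenario this comment describes is easy to demonstrate in isolation: alternating fail/success probe results never accumulate `success_threshold` consecutive successes, so if the cap is checked only on the failure branch the pending counter grows without bound. A minimal, self-contained illustration (the names are placeholders, not the real counters):

```rust
// Why the max_pending_probes cap must be checked on every Pending probe,
// not just failing ones: a worker flapping F, S, F, S, ... never reaches
// SUCCESS_THRESHOLD consecutive successes and never transitions at all.
const SUCCESS_THRESHOLD: u32 = 2;

fn main() {
    let mut consecutive = 0u32;
    let mut total_pending_probes = 0u32;
    // Alternating probe results: F, S, F, S, ...
    for ok in [false, true].iter().cycle().take(100) {
        total_pending_probes += 1;
        consecutive = if *ok { consecutive + 1 } else { 0 };
        assert!(consecutive < SUCCESS_THRESHOLD); // never promoted to Ready
    }
    assert_eq!(total_pending_probes, 100); // counter grows unbounded
    println!("flapping worker stayed Pending for {total_pending_probes} probes");
}
```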


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 91ef9f22-2e0b-4d97-aea6-c8aa8c779a65

📥 Commits

Reviewing files that changed from the base of the PR and between c88c26e and e2c8b68.

📒 Files selected for processing (15)
  • model_gateway/src/worker/registry.rs
  • model_gateway/src/worker/worker.rs
  • model_gateway/src/workflow/steps/shared/activate.rs
  • model_gateway/tests/api/api_endpoints_test.rs
  • model_gateway/tests/api/parser_endpoints_test.rs
  • model_gateway/tests/common/mod.rs
  • model_gateway/tests/common/test_config.rs
  • model_gateway/tests/reliability/circuit_breaker_test.rs
  • model_gateway/tests/reliability/rate_limiting_test.rs
  • model_gateway/tests/reliability/retries_test.rs
  • model_gateway/tests/routing/header_forwarding_test.rs
  • model_gateway/tests/routing/payload_size_test.rs
  • model_gateway/tests/routing/pd_routing_test.rs
  • model_gateway/tests/routing/power_of_two_test.rs
  • model_gateway/tests/routing/service_discovery_test.rs

The previous fix put status-preservation logic inside `WorkerRegistry::replace()`,
which conflated two concerns: the registry's job is to manage the collection,
not to decide what status a new worker should have. It also meant callers
couldn't override — if a caller actually wanted the new worker to start
Pending (e.g. after a config change that should re-validate health), they
couldn't.

This commit moves the responsibility to the right layer:

1. **Builder gets a `.status()` setter** (worker/builder.rs):
   `BasicWorkerBuilder::status(WorkerStatus)` lets callers explicitly set
   the initial status. Without it, the default is Pending for health-checked
   workers and Ready for `disable_health_check == true` (unchanged).

2. **`UpdateWorkerPropertiesStep` passes the old worker's status**
   (workflow/steps/local/update_worker_properties.rs): metadata-only updates
   read `worker.status()` and pass it via `.status(...)` so the replacement
   inherits the prior state. No more 503s while a healthy worker re-proves
   itself after a label change.

3. **Removed status preservation from `WorkerRegistry::replace()`**
   (worker/registry.rs): the registry is back to being a pure collection.
   Callers control the worker's initial state via the builder.

Tests:
- Renamed `test_replace_preserves_status` → `test_builder_status_override_on_replace`
  to reflect that the test now exercises the builder's `.status()` setter.
- Added `test_builder_default_status_is_pending` to lock in the default.

Why this is better:
- Single responsibility: registry manages collection, builder constructs workers.
- Caller control: tests and call sites can opt into either Pending or any
  other status as needed.
- Standard builder pattern: defaults are sensible, overrides are explicit.

All 982 tests pass.

Refs: PR #1102 review feedback (chatgpt-codex-connector P1)
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
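
The builder-level default and override described in points 1 and 2 can be sketched as follows. `BasicWorkerBuilder` here is a stripped-down stand-in with only the fields this commit touches, not the real builder API.

```rust
// Sketch of the builder default/override: Pending for health-checked
// workers, Ready when checks are disabled, with an explicit .status()
// escape hatch for callers such as UpdateWorkerPropertiesStep.
#[derive(Clone, Copy, PartialEq, Debug)]
enum WorkerStatus { Pending, Ready }

struct BasicWorkerBuilder {
    disable_health_check: bool,
    status: Option<WorkerStatus>, // None => derive the default at build time
}

impl BasicWorkerBuilder {
    fn new(disable_health_check: bool) -> Self {
        Self { disable_health_check, status: None }
    }
    // Explicit override, used to carry the old worker's status across
    // a metadata-only replace.
    fn status(mut self, s: WorkerStatus) -> Self {
        self.status = Some(s);
        self
    }
    fn build_status(&self) -> WorkerStatus {
        self.status.unwrap_or(if self.disable_health_check {
            WorkerStatus::Ready   // never probed, so routable immediately
        } else {
            WorkerStatus::Pending // must earn Ready via health checks
        })
    }
}

fn main() {
    assert_eq!(BasicWorkerBuilder::new(false).build_status(), WorkerStatus::Pending);
    assert_eq!(BasicWorkerBuilder::new(true).build_status(), WorkerStatus::Ready);
    // Metadata-only update on a healthy worker: preserve Ready explicitly.
    let b = BasicWorkerBuilder::new(false).status(WorkerStatus::Ready);
    assert_eq!(b.build_status(), WorkerStatus::Ready);
}
```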

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@model_gateway/src/workflow/steps/local/update_worker_properties.rs`:
- Around line 90-91: Don't blindly copy worker.status() into the update builder;
instead compute a safe new status using openai_protocol::worker::WorkerStatus
and pass that to .status(...). Specifically, import WorkerStatus, read let prev
= worker.status(), then set let new_status = if prev == WorkerStatus::Failed {
WorkerStatus::NotReady } else if config.disable_health_check && (prev ==
WorkerStatus::Pending || prev == WorkerStatus::NotReady) { WorkerStatus::Ready }
else { prev }; and use .status(new_status) in place of .status(worker.status())
so replacements don't inherit terminal Failed or remain unroutable when health
checks are disabled.
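
Sketched as a pure function, the mapping this comment proposes looks like the following. This is the reviewer's suggestion, not necessarily what the PR finally shipped (a later commit simplifies the logic), and the names mirror the prompt above.

```rust
// Reviewer-suggested status mapping for metadata-only worker replaces.
#[derive(Clone, Copy, PartialEq, Debug)]
enum WorkerStatus { Pending, Ready, NotReady, Failed }

fn next_status(prev: WorkerStatus, disable_health_check: bool) -> WorkerStatus {
    if prev == WorkerStatus::Failed {
        // Don't inherit terminal Failed; give the replacement a recovery path.
        WorkerStatus::NotReady
    } else if disable_health_check
        && (prev == WorkerStatus::Pending || prev == WorkerStatus::NotReady)
    {
        // No checker will ever promote it, so it must start routable.
        WorkerStatus::Ready
    } else {
        prev
    }
}

fn main() {
    assert_eq!(next_status(WorkerStatus::Failed, false), WorkerStatus::NotReady);
    assert_eq!(next_status(WorkerStatus::Pending, true), WorkerStatus::Ready);
    assert_eq!(next_status(WorkerStatus::Ready, false), WorkerStatus::Ready);
}
```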

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8d0024a1-bbb3-4319-bd99-e9fe3da340b6

📥 Commits

Reviewing files that changed from the base of the PR and between e2c8b68 and cec40bf.

📒 Files selected for processing (3)
  • model_gateway/src/worker/builder.rs
  • model_gateway/src/worker/registry.rs
  • model_gateway/src/workflow/steps/local/update_worker_properties.rs


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cec40bf1d9


slin1237 added a commit that referenced this pull request Apr 12, 2026
The previous fix used `unwrap_or(false)` which silently discarded the
error details. Operators debugging why a worker won't come Ready had no
visibility into the underlying cause (connection refused, TLS failure,
DNS resolution failure, timeout, etc.).

Switched to `unwrap_or_else` which logs the error with the worker URL
at warn level before returning false. This preserves the existing
state-machine behavior (transport errors still count as failed probes
and feed the Pending timeout / NotReady→Failed paths) while making the
underlying cause discoverable from logs.

Refs: PR #1102 review feedback
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
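
The difference between the two combinators is small but operationally significant. A minimal sketch, with `probe_worker` standing in for the real HTTP/gRPC probe and `eprintln!` standing in for `tracing::warn!`:

```rust
// unwrap_or(false) discards the error; unwrap_or_else logs it first
// while still returning the same bool to the state machine.
fn probe_worker(url: &str) -> Result<bool, String> {
    Err(format!("connection refused: {url}")) // simulate a transport error
}

fn check(url: &str) -> bool {
    probe_worker(url).unwrap_or_else(|e| {
        // Previously `unwrap_or(false)`: same result, but `e` was discarded
        // and operators had no visibility into the underlying cause.
        eprintln!("health probe failed for worker {url}: {e}");
        false
    })
}

fn main() {
    // Transport errors still count as failed probes, feeding the
    // Pending timeout / NotReady -> Failed paths.
    assert!(!check("grpc://10.0.0.5:9000"));
}
```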
@slin1237 slin1237 force-pushed the refactor/worker-pending-state-machine branch from 1fa4b49 to cb307ad on April 12, 2026 13:32
slin1237 added a commit that referenced this pull request Apr 12, 2026
… rot

Self-review of PR 6b found two issues that needed fixing:

1. **Test rot**: `test_health_checker_removes_unhealthy_workers` was
   asserting `healthy_workers == 0` after failed probes. With the new
   Pending start semantics, that assertion is true from the moment of
   registration (Pending workers are not healthy by definition), so
   the test passed vacuously without ever exercising removal or any
   state transition. Replaced with two focused tests:
   - `test_pending_worker_stays_pending_under_failed_probes`: verifies
     that Pending is sticky during early failures (does not transition
     to NotReady).
   - `test_failed_worker_can_be_removed_from_registry`: verifies the
     removal API works on workers in Failed state, without needing
     the async health checker loop or a JobQueue.

   Also renamed `test_health_checker_keeps_unhealthy_workers_when_disabled`
   → `test_health_checker_keeps_workers_when_remove_unhealthy_disabled`
   to better describe what it tests.

2. **Wasted work on Failed workers**: `check_health_async` was running
   the full HTTP/gRPC probe and incrementing counters for workers in
   the terminal Failed state. Added an early return that skips the
   probe and returns Err immediately. The health checker loop's
   removal logic is unaffected — it still gates on
   `worker.status() == Failed`, so removal happens on the next tick.
   The Failed match arm is kept (empty) because the Rust compiler
   requires exhaustive matching, and removing it would make adding
   future variants silent.

All 983 tests pass.

Refs: PR #1102 self-review
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
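
The removal behavior the second test locks in can be shown with a toy registry. `Registry` here is a minimal stand-in, not the real `WorkerRegistry` API:

```rust
// Removal gates on the terminal Failed state only; Pending workers
// (still proving themselves) are left alone.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Debug)]
enum WorkerStatus { Pending, Failed }

struct Registry {
    workers: HashMap<String, WorkerStatus>,
}

impl Registry {
    fn remove_if_failed(&mut self, url: &str) -> bool {
        if self.workers.get(url) == Some(&WorkerStatus::Failed) {
            self.workers.remove(url);
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut r = Registry { workers: HashMap::new() };
    r.workers.insert("http://w1".into(), WorkerStatus::Failed);
    r.workers.insert("http://w2".into(), WorkerStatus::Pending);
    assert!(r.remove_if_failed("http://w1"));  // Failed workers are removable
    assert!(!r.remove_if_failed("http://w2")); // Pending is not removed
    assert_eq!(r.workers.len(), 1);
}
```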

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ddc0b41131


@slin1237 slin1237 force-pushed the refactor/worker-pending-state-machine branch from ddc0b41 to 60cca3f on April 12, 2026 14:11

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 60cca3f249


Four real bugs surfaced by the latest review round:

1. **`worker_to_info()` collapses Pending/Failed to NotReady**
   (worker.rs): API/TUI consumers were seeing `status: NotReady` for
   workers in Pending or Failed because the helper computed status from
   `is_healthy()` (which is `status == Ready`). Now reads `worker.status()`
   directly and exposes the real lifecycle state. `is_healthy` is still
   derived from the same snapshot for backwards compatibility, no TOCTOU.

2. **`max_pending_probes` not enforced on success branch** (worker.rs):
   `total_pending_probes` is incremented before every Pending probe,
   but the cap was only checked in the failure branch. A flapping
   pattern (F,S,F,S,...) never reaches `success_threshold`, never
   transitions, and grows the counter unbounded. Added the cap check
   to the success branch — if a Pending worker probes successfully but
   has exceeded `max_pending_probes` total attempts, it transitions to
   Failed.

3. **`update_worker_properties.rs` blindly preserves status**
   (workflow/steps/local/update_worker_properties.rs): Two real bugs:
   - **Failed workers stay Failed**: a metadata-only update on a Failed
     worker would inherit the terminal status. Now compute a fresh
     `Pending` for Failed workers (they get a chance to recover via
     the standard health check loop after the update).
   - **disable_health_check flip leaves stale status**: if the update
     flips `disable_health_check` from false → true while the worker
     was Pending/NotReady, the new worker would be Pending forever
     (the health checker skips disabled workers). Now force `Ready`
     when `disable_health_check == true`.

4. **`ActivateWorkersStep` doesn't activate when global health checker
   is off** (workflow/steps/shared/activate.rs): `server.rs` skips
   starting the health checker entirely when global
   `disable_health_check` is set. Workers created with per-worker
   `disable_health_check: false` would stay Pending forever because no
   background task probes them. Now reads
   `app_context.router_config.health_check.disable_health_check` and
   force-activates all workers when the global checker won't run.

Also fixed a stale comment in worker.rs that said Failed workers can
recover via `register_or_replace`. With the recent changes, `replace()`
preserves the old status, so the actual recovery path is removal +
re-registration.

All 983 tests pass, clippy clean.

Refs: PR #1102 review feedback (chatgpt-codex-connector P1, coderabbitai)
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
Problem: e2e tests fail with "no available workers" because workflow-
created workers stay Pending until the background health checker
eventually promotes them. With default config, that takes 120s
(check_interval=60 × success_threshold=2), longer than the 60s test
gateway timeout.

Root cause analysis: my previous ActivateWorkersStep guard (only
activate `disable_health_check` workers) was wrong. I missed that
the workflow already does a real health probe via DetectBackendStep
(HTTP /health GET or gRPC health_check). By the time ActivateWorkers
runs, the worker has already been proven reachable — it's safe (and
correct) to flip Pending → Ready immediately.

This restores the pre-state-machine workflow behavior at the trusted-
creation boundary, while keeping the state machine for ongoing
failures: the background health checker still demotes workers
(Ready → NotReady → Failed) if probes start failing later. Pending
remains the correct start state for paths that don't go through the
workflow's connectivity proof (e.g. future mesh-imported workers).

Changes:
- activate.rs: simplified to always call set_healthy(true). Added
  rationale comment explaining why this is correct (DetectBackendStep
  already proved reachability).
- update_worker_properties.rs: simplified next_status logic. A
  metadata-only update preserves the worker's prior status (the
  endpoint hasn't changed, only labels/priority/etc.). Kept the one
  exception: flipping disable_health_check to true forces Ready,
  because the health checker skips disabled workers and a stale
  Pending/NotReady would never recover.

All 983 tests pass, clippy clean.

Refs: PR #1102 e2e test failures
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
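
The latency arithmetic behind the e2e failure is worth spelling out. Values mirror the defaults quoted above (check_interval=60, success_threshold=2) and the 60s gateway timeout:

```rust
// Worst-case Pending -> Ready promotion latency with default config:
// one probe per check_interval, success_threshold consecutive successes
// required, which exceeds the e2e gateway startup timeout.
fn main() {
    let check_interval_secs = 60;
    let success_threshold = 2;
    let promotion_latency_secs = check_interval_secs * success_threshold;
    let e2e_gateway_timeout_secs = 60;

    assert_eq!(promotion_latency_secs, 120);
    assert!(promotion_latency_secs > e2e_gateway_timeout_secs);
    println!("worst-case Pending -> Ready: {promotion_latency_secs}s");
}
```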

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 042ab0ef0d


…health_check

Workers created through the workflow already had their connectivity
proven by DetectBackendStep (HTTP /health probe or gRPC health_check)
before reaching ActivateWorkersStep. Flipping them Pending → Ready
here is correct because the workflow's connectivity proof is functionally
equivalent to one successful state-machine probe.

Without this, workflow-created workers stayed Pending until the
background health checker promoted them after `check_interval ×
success_threshold` seconds (~120s with defaults), exceeding e2e
gateway startup timeouts.

The state machine still applies for ongoing failures: the background
health checker demotes Ready → NotReady → Failed on probe failures.
Pending remains the start state for paths that don't pre-validate
(e.g. mesh-imported workers in future PRs).

Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
@slin1237 slin1237 merged commit d954c14 into main Apr 12, 2026
79 of 81 checks passed
@slin1237 slin1237 deleted the refactor/worker-pending-state-machine branch April 12, 2026 20:30

Labels: model-gateway (Model gateway crate changes), tests (Test changes)
