refactor(worker): land Pending start semantics and state machine #1102
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

The worker health model changed: the builder now controls the initial WorkerStatus (Pending/Ready) and exposes a status setter; BasicWorker tracks pending probe counts and uses revised health-check transitions; registry removal logic now removes only workers in the terminal Failed status; various workflow steps and many tests were updated to disable health checks in test setups.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Workflow as Workflow/CreateWorker
    participant Builder as BasicWorkerBuilder
    participant Worker as BasicWorker
    participant Registry as WorkerRegistry
    participant Health as HealthChecker

    Workflow->>Builder: configure metadata (+health_config, optional status)
    Builder->>Worker: build() -> set initial WorkerStatus (Ready if disable_health_check, else Pending)
    Worker->>Registry: register(worker snapshot)
    par async health lifecycle
        Health->>Worker: check_health_async()
        alt health checks disabled
            Worker-->>Registry: remain Ready (no probes)
        else enabled
            Worker-->>Worker: increment total_pending_probes if Pending
            Worker-->>Registry: update status transitions (Pending->Ready, Ready->NotReady, NotReady->Failed)
        end
    end
    Note right of Registry: removal only when status == Failed
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8aba682681
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Code Review
This pull request replaces the boolean health status for workers with a state machine consisting of Pending, Ready, NotReady, and Failed states. Workers with health checks enabled now initialize as Pending and transition to Ready only after meeting a success threshold. The update also introduces logic for handling failures, including a transition to Failed after exceeding failure or probe thresholds. Feedback was provided regarding the Failed state, noting that the current implementation allows recovery to Ready, which contradicts the documentation suggesting Failed should be a terminal state.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model_gateway/src/policies/bucket.rs`:
- Around line 570-575: The test helper no_health_check() is duplicated across
test modules; extract it into a shared test utilities module (e.g., a new
tests::helpers or crate::test_utils) and update tests to import and call that
single function instead of their local copies. Move the function signature and
return value (HealthCheckConfig with disable_health_check: true and
Default::default() for the rest) into the shared helper, ensure the module is
annotated for test-only use (cfg(test) or in tests/), and update references in
policies::bucket tests to use the centralized no_health_check helper.
In `@model_gateway/src/policies/mod.rs`:
- Around line 190-195: The test helper no_health_check() that returns a
HealthCheckConfig with disable_health_check: true is duplicated across policy
test modules; extract and centralize it into a shared test utilities module
(e.g., tests::utils or a new policies::test_helpers) and update callers to
import and use the single no_health_check function instead of local copies,
ensuring the signature and returned type (HealthCheckConfig) remain unchanged so
functions referencing no_health_check() continue to compile.
In `@model_gateway/src/worker/registry.rs`:
- Around line 996-1001: The worker eviction currently checks both
worker.status() == WorkerStatus::Failed and the boolean gate *failed from
checked_workers, which prevents removing workers that transitioned to Failed
without an Err probe; update the eviction logic in registry.rs (the loop
iterating over checked_workers when job_queue.is_some()) to only rely on
worker.status() == WorkerStatus::Failed (i.e., drop the extra *failed condition)
so that any worker in the terminal Failed state is removed when
--remove-unhealthy-workers is enabled.
In `@model_gateway/src/worker/worker.rs`:
- Around line 654-661: The Pending→Failed branch doesn't reset the
consecutive_failures counter, unlike the other failure transitions; after
detecting total >= max_pending and calling self.set_status(WorkerStatus::Failed)
in the WorkerStatus::Pending branch, also reset the counter (e.g.,
self.consecutive_failures.store(0, Ordering::Relaxed)) so consecutive_failures
is cleared consistently with the Ready→NotReady and NotReady→Failed transitions;
locate this in the block that computes max_pending from
health_config.failure_threshold and reads self.total_pending_probes.
In `@model_gateway/tests/common/mod.rs`:
- Around line 206-207: The test helper is forcibly disabling health checks by
setting config.health_check.disable_health_check = true, which prevents callers
from enabling health checks; change this so only AppTestContext::new() defaults
to disabling health checks, but do not mutate the passed config in
new_with_config() or the lower-level context builders—respect the incoming
config.health_check values instead (update new_with_config(), related builders,
and any helper locations around the places called out: the blocks near lines
206, 416-419, 555-558, 693-696) so tests that pass an enabled health_check can
exercise Pending/Failed semantics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d7bd91a0-dc16-48a9-b8f1-b5843873c930
📒 Files selected for processing (23)
- model_gateway/src/policies/bucket.rs
- model_gateway/src/policies/cache_aware.rs
- model_gateway/src/policies/consistent_hashing.rs
- model_gateway/src/policies/manual.rs
- model_gateway/src/policies/mod.rs
- model_gateway/src/policies/power_of_two.rs
- model_gateway/src/policies/prefix_hash.rs
- model_gateway/src/policies/random.rs
- model_gateway/src/policies/registry.rs
- model_gateway/src/policies/round_robin.rs
- model_gateway/src/routers/http/router.rs
- model_gateway/src/worker/builder.rs
- model_gateway/src/worker/registry.rs
- model_gateway/src/worker/worker.rs
- model_gateway/src/workflow/steps/external/create_workers.rs
- model_gateway/src/workflow/steps/local/create_worker.rs
- model_gateway/src/workflow/steps/shared/activate.rs
- model_gateway/tests/common/mod.rs
- model_gateway/tests/common/test_app.rs
- model_gateway/tests/load_guard_raii_test.rs
- model_gateway/tests/otel_tracing_test.rs
- model_gateway/tests/routing/cache_aware_backward_compat_test.rs
- model_gateway/tests/routing/test_openai_routing.rs
What changed:
Production code:
- worker.rs: added `total_pending_probes: Arc<AtomicUsize>` to BasicWorker
for tracking total probe attempts in Pending state
- builder.rs: initializes workers as `Pending` when health checks are
enabled, `Ready` when `disable_health_check == true`. Previously all
workers started Ready regardless of config.
- worker.rs: `check_health_async()` implements the full state machine:
Pending → Ready on `success_threshold` consecutive passes
Pending → Failed on `max_pending_probes` (10× failure_threshold) total attempts
Ready → NotReady on `failure_threshold` consecutive failures
NotReady → Ready on `success_threshold` consecutive passes
NotReady → Failed on `liveness_failure_threshold` (3× failure_threshold)
- registry.rs: `start_health_checker` only removes `Failed` workers
(not Pending or NotReady) when `--remove-unhealthy-workers` is set
- workflow/steps/shared/activate.rs: ActivateWorkersStep only activates
workers with `disable_health_check == true`. Health-checked workers
stay Pending — the health checker promotes them after passing probes.
- workflow/steps/local/create_worker.rs: removed redundant set_healthy
call (builder now handles initial status)
- workflow/steps/external/create_workers.rs: same removal
Tests:
- worker.rs: added `no_health_check()` helper and `test_pending_worker_not_routable`
- builder.rs: updated assertions to match new Pending-by-default behavior
- policies/*.rs: added `.health_config(no_health_check())` to ~73 test
builders so policy tests get routable workers (they don't test health
lifecycle, they test routing decisions)
- routers/http/router.rs: same fix for router tests
- tests/common/mod.rs: `new_with_config` sets `disable_health_check: true`
so all integration tests get routable workers immediately
- tests/common/test_app.rs: same for external worker helper
- tests/routing/cache_aware_backward_compat_test.rs: added no_health_check
helper and applied to 8 builders
- tests/routing/test_openai_routing.rs: added health config
- tests/load_guard_raii_test.rs: added health config
- tests/otel_tracing_test.rs: set disable_health_check on test config
Why:
Workers that haven't been verified locally should not receive traffic.
Starting Pending instead of Ready prevents the mesh-sync 500/503 bug
where unverified workers were assumed healthy. The state machine
(Pending → Ready → NotReady → Failed) mirrors Kubernetes pod conditions
and enables proper liveness separation from readiness.
How:
- `max_pending_probes = 10 × failure_threshold` guards against misconfigured
URLs staying in Pending forever
- `liveness_failure_threshold = 3 × failure_threshold` prevents aggressive
removal of transiently unhealthy workers (analogous to K8s having
separate readiness and liveness probes)
- `StatusChanged` events are NOT yet emitted from the health path — this
is deferred to PR 7 when WorkerManager owns the health loop
- `--remove-unhealthy-workers` now only removes Failed workers, not
NotReady (which may recover) or Pending (which hasn't proven itself).
This is a behavior fix, not a breaking change — the old behavior was
a bug that removed transiently unhealthy workers too aggressively.
All 512 lib tests, 100 api tests, and 200+ other integration tests pass.
Refs: worker-module-deep-refactor plan (PR 6b)
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
Force-pushed 8aba682 to 53e5233
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 53e5233f00
Actionable comments posted: 1
♻️ Duplicate comments (2)
model_gateway/src/worker/registry.rs (1)
1000-1000: ⚠️ Potential issue | 🟠 Major

Drop the extra `*failed` gate from terminal Failed eviction. Line 1000 currently requires both `WorkerStatus::Failed` and `*failed`. A worker can already be terminal `Failed` while the latest probe result is not `Err`, which prevents eviction and leaves a non-recoverable worker registered.

🛠️ Proposed fix

```diff
-if worker.status() == WorkerStatus::Failed && *failed {
+if worker.status() == WorkerStatus::Failed {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/worker/registry.rs` at line 1000, the eviction gate currently requires both worker.status() == WorkerStatus::Failed and *failed, which prevents eviction when a worker is already terminal Failed but the latest probe flag (failed) is false; remove the extra probe-based check by changing the conditional that references worker.status() and *failed to only check worker.status() == WorkerStatus::Failed so terminal Failed workers are evicted regardless of the transient *failed probe flag (adjust any surrounding logic that depended on the combined condition in the same block to keep semantics for non-Failed states).

model_gateway/tests/common/mod.rs (1)
206-207: ⚠️ Potential issue | 🟠 Major

Don't override caller-supplied health-check config in shared test helpers. `new_with_config()` now forces `config.health_check.disable_health_check = true`, and the lower-level context builders hardcode the same override into every test worker. That means any test that explicitly enables health checks can no longer exercise the new `Pending -> Ready/Failed` startup behavior this PR is introducing. It's fine for `AppTestContext::new()` to default to disabled health checks for convenience, but `new_with_config()` and `create_test_context*()` should preserve the incoming config.

Suggested direction

```diff
 pub fn new(
     worker_configs: Vec<MockWorkerConfig>,
 ) -> Pin<Box<dyn Future<Output = Self> + Send>> {
     Box::pin(async move {
-        let config = RouterConfig::builder()
+        let mut config = RouterConfig::builder()
             .regular_mode(vec![])
             .random_policy()
             .host("127.0.0.1")
             .port(3002)
             .max_payload_size(256 * 1024 * 1024)
             .request_timeout_secs(600)
             .worker_startup_timeout_secs(1)
             .worker_startup_check_interval_secs(1)
             .max_concurrent_requests(64)
             .queue_timeout_secs(60)
             .build_unchecked();
+        config.health_check.disable_health_check = true;
         Self::new_with_config(config, worker_configs).await
     })
 }

 pub fn new_with_config(
-    mut config: RouterConfig,
+    config: RouterConfig,
     worker_configs: Vec<MockWorkerConfig>,
 ) -> Pin<Box<dyn Future<Output = Self> + Send>> {
-    // Test workers don't need health checks — start Ready immediately
-    config.health_check.disable_health_check = true;
-
     Box::pin(async move { ... })
 }
```

And in each `create_test_context*()` helper, pass through the caller's setting instead of hardcoding it:

```diff
+let health_config = config.health_check.clone();
 ...
-.health_config(openai_protocol::worker::HealthCheckConfig {
-    disable_health_check: true,
-    ..Default::default()
-})
+.health_config(health_config.clone())
```

Also applies to: 416-419, 555-558, 693-696
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/tests/common/mod.rs` around lines 206 - 207, new_with_config() and the create_test_context*() helpers are mutating the caller-supplied config by forcing config.health_check.disable_health_check = true, which prevents tests from exercising the Pending->Ready/Failed startup flow; remove that hardcoded override from new_with_config() and from the lower-level create_test_context*() builders so they preserve the incoming config, leaving only AppTestContext::new() to default-disable health checks for convenience if no config is provided; ensure any place that currently sets config.health_check.disable_health_check explicitly instead respects the passed-in config value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model_gateway/src/workflow/steps/shared/activate.rs`:
- Around line 28-30: The Ready transition is being re-applied for workers with
disable_health_check by calling worker.set_healthy(true) in the activation loop;
instead either remove that call or guard it so you don't mutate
already-initialized status—update the block around
worker.metadata().health_config.disable_health_check to simply increment
activated (or only call worker.set_healthy(true) when the worker's current state
is not already Ready) and avoid redundant state mutations from
set_healthy(true).
---
Duplicate comments:
In `@model_gateway/src/worker/registry.rs`:
- Line 1000: The eviction gate currently requires both worker.status() ==
WorkerStatus::Failed and *failed, which prevents eviction when a worker is
already terminal Failed but the latest probe flag (failed) is false; remove the
extra probe-based check by changing the conditional that references
worker.status() and *failed to only check worker.status() ==
WorkerStatus::Failed so terminal Failed workers are evicted regardless of the
transient *failed probe flag (adjust any surrounding logic that depended on the
combined condition in the same block to keep semantics for non-Failed states).
In `@model_gateway/tests/common/mod.rs`:
- Around line 206-207: new_with_config() and the create_test_context*() helpers
are mutating the caller-supplied config by forcing
config.health_check.disable_health_check = true, which prevents tests from
exercising the Pending->Ready/Failed startup flow; remove that hardcoded
override from new_with_config() and from the lower-level create_test_context*()
builders so they preserve the incoming config, leaving only
AppTestContext::new() to default-disable health checks for convenience if no
config is provided; ensure any place that currently sets
config.health_check.disable_health_check explicitly instead respects the
passed-in config value.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: e122f3b6-20af-446b-9c25-1d39b0300802
📒 Files selected for processing (23)
- model_gateway/src/policies/bucket.rs
- model_gateway/src/policies/cache_aware.rs
- model_gateway/src/policies/consistent_hashing.rs
- model_gateway/src/policies/manual.rs
- model_gateway/src/policies/mod.rs
- model_gateway/src/policies/power_of_two.rs
- model_gateway/src/policies/prefix_hash.rs
- model_gateway/src/policies/random.rs
- model_gateway/src/policies/registry.rs
- model_gateway/src/policies/round_robin.rs
- model_gateway/src/routers/http/router.rs
- model_gateway/src/worker/builder.rs
- model_gateway/src/worker/registry.rs
- model_gateway/src/worker/worker.rs
- model_gateway/src/workflow/steps/external/create_workers.rs
- model_gateway/src/workflow/steps/local/create_worker.rs
- model_gateway/src/workflow/steps/shared/activate.rs
- model_gateway/tests/common/mod.rs
- model_gateway/tests/common/test_app.rs
- model_gateway/tests/load_guard_raii_test.rs
- model_gateway/tests/otel_tracing_test.rs
- model_gateway/tests/routing/cache_aware_backward_compat_test.rs
- model_gateway/tests/routing/test_openai_routing.rs
Production fixes:
1. **Failed is terminal** (worker.rs): The success branch used `current_status != Ready`, which also matched `Failed`, allowing silently reintroducing Failed workers via successful probes. Changed to `matches!(.., Pending | NotReady)` so Failed workers stay Failed until explicitly replaced.
2. **gRPC connect errors trigger Pending timeout** (worker.rs): probe results now use `unwrap_or(false)` instead of `?`, so transport-level errors (e.g. gRPC connection failures) increment counters and reach the Pending→Failed timeout path. Without this, misconfigured gRPC workers stayed Pending indefinitely.
3. **Drop `*failed` gate from removal** (registry.rs): Pending→Failed via `max_pending_probes` can happen on a probe that returned `Ok(false)` (not `Err`). The previous gate would skip removal in that case. Failed is terminal — any worker in Failed state is removed when `--remove-unhealthy-workers` is enabled.
4. **Preserve status on replace** (registry.rs): metadata-only updates (e.g. priority, labels) used to drop a healthy worker back to Pending because the builder always starts health-checked workers as Pending. This caused ~30s of avoidable 503s for single-worker models. `replace()` now copies the old worker's status to the new worker before insertion. Added regression test.
5. **`new_with_config` no longer mutates incoming config** (tests/common/mod.rs): Tests that need to exercise Pending/Failed semantics can now pass a config with health checks enabled. `AppTestContext::new()` and `TestRouterConfig::*` still default to `disable_health_check: true` for tests that don't care.

Nit fixes:
6. **Reset consecutive_failures on Pending→Failed** (worker.rs): consistency with other failure-path transitions.
7. **Guard set_healthy in ActivateWorkersStep** (activate.rs): only call set_healthy if not already Ready, to avoid extra state churn.
Test fixes:
- Updated `TestRouterConfig` to apply `disable_health_check: true` via a shared `apply_test_defaults()` helper across all 11 builder methods (eliminates duplication)
- Updated 16 inline `RouterConfig::builder()` configs across reliability, routing, api test files to set `disable_health_check`
- Added `test_replace_preserves_status` regression test

All 981 tests pass (513 lib + 468 integration).

Refs: PR #1102 review feedback
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
model_gateway/src/worker/worker.rs (1)
596-685: ⚠️ Potential issue | 🟠 Major

`Pending`/`Failed` still get flattened to `NotReady` in `WorkerInfo`. This state machine now makes `Pending` and `Failed` real runtime states, but `worker_to_info()` still derives `status` from `is_healthy()`. Any caller of `worker_to_info()` will therefore never see `Pending` or `Failed`, which defeats the new lifecycle semantics outside the in-memory worker object.

♻️ Suggested fix

```diff
 pub fn worker_to_info(worker: &Arc<dyn Worker>) -> WorkerInfo {
     let metadata = worker.metadata();
     let spec = metadata.spec.clone();
     let is_healthy = worker.is_healthy();
+    let status = worker.status();

     WorkerInfo {
         id: worker.url().to_string(),
         model_id: spec.models.primary().map(|m| m.id.clone()),
         spec,
         is_healthy,
-        status: Some(if is_healthy {
-            WorkerStatus::Ready
-        } else {
-            WorkerStatus::NotReady
-        }),
+        status: Some(status),
         load: worker.load(),
         job_status: None,
     }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/src/worker/worker.rs` around lines 596 - 685, worker_to_info currently derives the public status from is_healthy(), which collapses Pending/Failed into NotReady; update worker_to_info to read the worker's actual runtime status (use the Worker::status() method or the stored metadata.status field) and map that WorkerStatus value directly into WorkerInfo.status so Pending and Failed are preserved for callers; ensure any existing boolean is_healthy logic is kept for health checks only and adjust the mapping code in worker_to_info (and any helper that converts WorkerStatus→WorkerInfo status) to accept/propagate the full WorkerStatus enum values instead of relying on is_healthy().
♻️ Duplicate comments (1)
model_gateway/tests/common/mod.rs (1)
417-420: ⚠️ Potential issue | 🟠 Major

Respect caller-provided health-check config in lower-level OpenAI test context builders. Line 417, Line 556, and Line 694 hardcode `disable_health_check: true` for external workers, which overrides explicit test configs and prevents OpenAI-mode tests from exercising Pending/Failed startup semantics when needed.

♻️ Proposed fix

```diff
-.health_config(openai_protocol::worker::HealthCheckConfig {
-    disable_health_check: true,
-    ..Default::default()
-})
+.health_config(config.health_check.clone())
```

Also applies to: 556-559, 694-697
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/tests/common/mod.rs` around lines 417 - 420, The builders are forcing disable_health_check: true in calls to health_config(openai_protocol::worker::HealthCheckConfig { disable_health_check: true, ..Default::default() }) which overrides any caller-provided test config; update those spots to respect the caller's HealthCheckConfig (or fall back to Default::default()) instead of hardcoding disable_health_check, i.e., pass through the provided HealthCheckConfig value into health_config or merge it with defaults so disable_health_check reflects the caller's setting (affects the occurrences that use openai_protocol::worker::HealthCheckConfig and the health_config(...) calls).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@model_gateway/src/worker/worker.rs`:
- Around line 596-685: worker_to_info currently derives the public status from
is_healthy(), which collapses Pending/Failed into NotReady; update
worker_to_info to read the worker's actual runtime status (use the
Worker::status() method or the stored metadata.status field) and map that
WorkerStatus value directly into WorkerInfo.status so Pending and Failed are
preserved for callers; ensure any existing boolean is_healthy logic is kept for
health checks only and adjust the mapping code in worker_to_info (and any helper
that converts WorkerStatus→WorkerInfo status) to accept/propagate the full
WorkerStatus enum values instead of relying on is_healthy().
---
Duplicate comments:
In `@model_gateway/tests/common/mod.rs`:
- Around line 417-420: The builders are forcing disable_health_check: true in
calls to health_config(openai_protocol::worker::HealthCheckConfig {
disable_health_check: true, ..Default::default() }) which overrides any
caller-provided test config; update those spots to respect the caller's
HealthCheckConfig (or fall back to Default::default()) instead of hardcoding
disable_health_check, i.e., pass through the provided HealthCheckConfig value
into health_config or merge it with defaults so disable_health_check reflects
the caller's setting (affects the occurrences that use
openai_protocol::worker::HealthCheckConfig and the health_config(...) calls).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1f4dfc2c-510e-4d2a-8824-b0b1e9469b22
📒 Files selected for processing (15)
- model_gateway/src/worker/registry.rs
- model_gateway/src/worker/worker.rs
- model_gateway/src/workflow/steps/shared/activate.rs
- model_gateway/tests/api/api_endpoints_test.rs
- model_gateway/tests/api/parser_endpoints_test.rs
- model_gateway/tests/common/mod.rs
- model_gateway/tests/common/test_config.rs
- model_gateway/tests/reliability/circuit_breaker_test.rs
- model_gateway/tests/reliability/rate_limiting_test.rs
- model_gateway/tests/reliability/retries_test.rs
- model_gateway/tests/routing/header_forwarding_test.rs
- model_gateway/tests/routing/payload_size_test.rs
- model_gateway/tests/routing/pd_routing_test.rs
- model_gateway/tests/routing/power_of_two_test.rs
- model_gateway/tests/routing/service_discovery_test.rs
Force-pushed c88c26e to e2c8b68
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e2c8b68749
Actionable comments posted: 3
♻️ Duplicate comments (1)
model_gateway/tests/common/mod.rs (1)
417-420: ⚠️ Potential issue | 🟠 Major

Respect caller health-check config in shared `create_test_context*` helpers. At Line 417, Line 556, and Line 694, the helpers still hardcode `disable_health_check: true` for OpenAI-mode worker registration. That overrides caller intent and prevents tests using these lower-level builders from validating Pending/Failed startup semantics.

💡 Suggested fix

```diff
-.health_config(openai_protocol::worker::HealthCheckConfig {
-    disable_health_check: true,
-    ..Default::default()
-})
+.health_config(config.health_check.clone())
```

Apply the same replacement in all three helper functions:

- create_test_context
- create_test_context_with_parsers
- create_test_context_with_mcp_config

Also applies to: 556-559, 694-697
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@model_gateway/tests/common/mod.rs` around lines 417 - 420, The helpers create_test_context, create_test_context_with_parsers, and create_test_context_with_mcp_config currently force openai_protocol::worker::HealthCheckConfig { disable_health_check: true } when registering the OpenAI-mode worker; update each helper to respect the caller-provided HealthCheckConfig instead of hardcoding disable_health_check=true — e.g., use the passed health_config (or merge with Default::default()) when building the worker registration rather than overriding disable_health_check, and remove the hardcoded disable_health_check assignment so tests can control Pending/Failed startup semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model_gateway/src/worker/registry.rs`:
- Around line 468-474: Currently the code unconditionally copies
old_worker.status() into new_worker
(new_worker.set_status(old_worker.status())), which preserves Failed or
Pending/NotReady states and breaks recovery; change this to copy the old status
only when it's Healthy (or when it's Pending/NotReady and the new worker will
still be health-checked). Concretely, replace the unconditional set_status call
with a conditional: if old_worker.status().is_healthy() then
new_worker.set_status(old_worker.status()); else if
old_worker.status().is_pending_or_not_ready() &&
!new_worker.config().disable_health_check {
new_worker.set_status(old_worker.status()); } otherwise leave the new worker at
its builder-initialized status so check_health_async() and explicit replacement
logic can recover Failed and handle disabled health checks correctly; reference
new_worker.set_status, old_worker.status(), check_health_async(), and
disable_health_check in your change.
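The conditional this prompt describes can be sketched as a pure decision function. This is a hedged illustration only: the names `carried_over_status` and `new_disables_health_check` are invented for the sketch, and the real code operates on `BasicWorker` and its config rather than a bare enum.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum WorkerStatus {
    Pending,
    Ready,
    NotReady,
    Failed,
}

/// Decide whether a replacement worker should inherit the old worker's status.
/// Returns None to keep the builder-initialized status, so Failed workers can
/// recover and workers with health checks disabled start Ready.
fn carried_over_status(old: WorkerStatus, new_disables_health_check: bool) -> Option<WorkerStatus> {
    match old {
        // Healthy: always safe to preserve.
        WorkerStatus::Ready => Some(old),
        // Mid-lifecycle states carry over only if the new worker will still be probed.
        WorkerStatus::Pending | WorkerStatus::NotReady if !new_disables_health_check => Some(old),
        // Failed, or checks now disabled: defer to the builder default.
        _ => None,
    }
}

fn main() {
    assert_eq!(carried_over_status(WorkerStatus::Ready, true), Some(WorkerStatus::Ready));
    assert_eq!(carried_over_status(WorkerStatus::Failed, false), None);
    assert_eq!(carried_over_status(WorkerStatus::Pending, true), None);
    assert_eq!(carried_over_status(WorkerStatus::NotReady, false), Some(WorkerStatus::NotReady));
}
```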
In `@model_gateway/src/worker/worker.rs`:
- Around line 627-639: worker_to_info() currently computes WorkerInfo.status via
is_healthy(), which hides the actual WorkerStatus (Pending, NotReady, Failed,
Ready); change worker_to_info() to read the real status (the WorkerStatus stored
on the worker) and map that enum directly into the WorkerInfo.status field
instead of calling is_healthy(). Update any other serialization logic in the
same file (the block covering the later mapping around the 646-675 region) to
convert WorkerStatus -> WorkerInfo status values so Pending and Failed are
preserved when returned to API/TUI consumers.
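A minimal sketch of the fix this prompt asks for: derive both serialized fields from a single status read so Pending and Failed survive into `WorkerInfo`. The struct shape and names here are simplified assumptions, not the project's actual types.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum WorkerStatus { Pending, Ready, NotReady, Failed }

struct WorkerInfoSketch {
    status: WorkerStatus,
    is_healthy: bool,
}

/// Take one snapshot of the real status and derive `is_healthy` from the same
/// value, instead of collapsing everything non-Ready into NotReady.
fn worker_to_info(status: WorkerStatus) -> WorkerInfoSketch {
    WorkerInfoSketch {
        status,
        is_healthy: status == WorkerStatus::Ready,
    }
}

fn main() {
    let info = worker_to_info(WorkerStatus::Pending);
    assert_eq!(info.status, WorkerStatus::Pending); // no longer reported as NotReady
    assert!(!info.is_healthy);
}
```

Deriving both fields from the same snapshot also avoids the TOCTOU concern: the status cannot change between the two reads.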
- Around line 608-620: The Pending-probe cap is only enforced when a probe
fails, allowing a successful probe to bypass max_pending_probes; after computing
probe_result/health_result (from http_health_check()/grpc_health_check()) and
regardless of true/false, check whether current_status == WorkerStatus::Pending
and self.total_pending_probes.fetch_add(1, Ordering::Relaxed) (or the
already-incremented counter) has reached or exceeded self.max_pending_probes; if
so, transition the worker to Failed (same behavior as the failure arm) and
short-circuit further success/failure handling. Apply the same change to the
other probe blocks referenced (the regions around where health_result is
evaluated at 622-639 and 662-675) so the max_pending_probes check runs
unconditionally after each Pending probe result.
---
Duplicate comments:
In `@model_gateway/tests/common/mod.rs`:
- Around line 417-420: The helpers create_test_context,
create_test_context_with_parsers, and create_test_context_with_mcp_config
currently force openai_protocol::worker::HealthCheckConfig {
disable_health_check: true } when registering the OpenAI-mode worker; update
each helper to respect the caller-provided HealthCheckConfig instead of
hardcoding disable_health_check=true — e.g., use the passed health_config (or
merge with Default::default()) when building the worker registration rather than
overriding disable_health_check, and remove the hardcoded disable_health_check
assignment so tests can control Pending/Failed startup semantics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 91ef9f22-2e0b-4d97-aea6-c8aa8c779a65
📒 Files selected for processing (15)
- model_gateway/src/worker/registry.rs
- model_gateway/src/worker/worker.rs
- model_gateway/src/workflow/steps/shared/activate.rs
- model_gateway/tests/api/api_endpoints_test.rs
- model_gateway/tests/api/parser_endpoints_test.rs
- model_gateway/tests/common/mod.rs
- model_gateway/tests/common/test_config.rs
- model_gateway/tests/reliability/circuit_breaker_test.rs
- model_gateway/tests/reliability/rate_limiting_test.rs
- model_gateway/tests/reliability/retries_test.rs
- model_gateway/tests/routing/header_forwarding_test.rs
- model_gateway/tests/routing/payload_size_test.rs
- model_gateway/tests/routing/pd_routing_test.rs
- model_gateway/tests/routing/power_of_two_test.rs
- model_gateway/tests/routing/service_discovery_test.rs
The previous fix put status-preservation logic inside `WorkerRegistry::replace()`, which conflated two concerns: the registry's job is to manage the collection, not to decide what status a new worker should have. It also meant callers couldn't override — if a caller actually wanted the new worker to start Pending (e.g. after a config change that should re-validate health), they couldn't.

This commit moves the responsibility to the right layer:

1. **Builder gets a `.status()` setter** (worker/builder.rs): `BasicWorkerBuilder::status(WorkerStatus)` lets callers explicitly set the initial status. Without it, the default is Pending for health-checked workers and Ready for `disable_health_check == true` (unchanged).
2. **`UpdateWorkerPropertiesStep` passes the old worker's status** (workflow/steps/local/update_worker_properties.rs): metadata-only updates read `worker.status()` and pass it via `.status(...)` so the replacement inherits the prior state. No more 503s while a healthy worker re-proves itself after a label change.
3. **Removed status preservation from `WorkerRegistry::replace()`** (worker/registry.rs): the registry is back to being a pure collection. Callers control the worker's initial state via the builder.

Tests:

- Renamed `test_replace_preserves_status` → `test_builder_status_override_on_replace` to reflect that the test now exercises the builder's `.status()` setter.
- Added `test_builder_default_status_is_pending` to lock in the default.

Why this is better:

- Single responsibility: registry manages collection, builder constructs workers.
- Caller control: tests and call sites can opt into either Pending or any other status as needed.
- Standard builder pattern: defaults are sensible, overrides are explicit.

All 982 tests pass.

Refs: PR #1102 review feedback (chatgpt-codex-connector P1)
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
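The builder semantics this commit describes can be sketched as follows. The struct and method names are simplified stand-ins for `BasicWorkerBuilder`; only the default/override behavior is taken from the commit message.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum WorkerStatus { Pending, Ready }

struct WorkerBuilderSketch {
    disable_health_check: bool,
    status_override: Option<WorkerStatus>,
}

impl WorkerBuilderSketch {
    fn new(disable_health_check: bool) -> Self {
        Self { disable_health_check, status_override: None }
    }

    /// Mirrors the new `.status()` setter: callers opt into an explicit
    /// initial status, e.g. to carry over a prior worker's state.
    fn status(mut self, s: WorkerStatus) -> Self {
        self.status_override = Some(s);
        self
    }

    /// Default: Pending for health-checked workers, Ready when checks are off.
    fn initial_status(&self) -> WorkerStatus {
        self.status_override.unwrap_or(if self.disable_health_check {
            WorkerStatus::Ready
        } else {
            WorkerStatus::Pending
        })
    }
}

fn main() {
    assert_eq!(WorkerBuilderSketch::new(false).initial_status(), WorkerStatus::Pending);
    assert_eq!(WorkerBuilderSketch::new(true).initial_status(), WorkerStatus::Ready);
    assert_eq!(
        WorkerBuilderSketch::new(false).status(WorkerStatus::Ready).initial_status(),
        WorkerStatus::Ready
    );
}
```

The explicit override is what lets `UpdateWorkerPropertiesStep` hand a healthy worker's status to its replacement without the registry having to guess.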
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@model_gateway/src/workflow/steps/local/update_worker_properties.rs`:
- Around line 90-91: Don't blindly copy worker.status() into the update builder;
instead compute a safe new status using openai_protocol::worker::WorkerStatus
and pass that to .status(...). Specifically, import WorkerStatus, read let prev
= worker.status(), then set let new_status = if prev == WorkerStatus::Failed {
WorkerStatus::NotReady } else if config.disable_health_check && (prev ==
WorkerStatus::Pending || prev == WorkerStatus::NotReady) { WorkerStatus::Ready }
else { prev }; and use .status(new_status) in place of .status(worker.status())
so replacements don't inherit terminal Failed or remain unroutable when health
checks are disabled.
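The decision rule spelled out in this prompt, transcribed as a standalone sketch (enum simplified; the real code lives in `update_worker_properties.rs`):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum WorkerStatus { Pending, Ready, NotReady, Failed }

/// Safe status for the replacement worker on a metadata-only update:
/// Failed gets a recovery path (NotReady), and flipping health checks off
/// cannot strand the worker in an unroutable Pending/NotReady state.
fn next_status(prev: WorkerStatus, disable_health_check: bool) -> WorkerStatus {
    if prev == WorkerStatus::Failed {
        WorkerStatus::NotReady
    } else if disable_health_check
        && (prev == WorkerStatus::Pending || prev == WorkerStatus::NotReady)
    {
        WorkerStatus::Ready
    } else {
        prev
    }
}

fn main() {
    assert_eq!(next_status(WorkerStatus::Failed, false), WorkerStatus::NotReady);
    assert_eq!(next_status(WorkerStatus::Pending, true), WorkerStatus::Ready);
    assert_eq!(next_status(WorkerStatus::Ready, false), WorkerStatus::Ready);
}
```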
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8d0024a1-bbb3-4319-bd99-e9fe3da340b6
📒 Files selected for processing (3)
- model_gateway/src/worker/builder.rs
- model_gateway/src/worker/registry.rs
- model_gateway/src/workflow/steps/local/update_worker_properties.rs
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cec40bf1d9
The previous fix used `unwrap_or(false)`, which silently discarded the error details. Operators debugging why a worker won't come Ready had no visibility into the underlying cause (connection refused, TLS failure, DNS resolution failure, timeout, etc.).

Switched to `unwrap_or_else`, which logs the error with the worker URL at warn level before returning false. This preserves the existing state-machine behavior (transport errors still count as failed probes and feed the Pending timeout / NotReady→Failed paths) while making the underlying cause discoverable from logs.

Refs: PR #1102 review feedback
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
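The `unwrap_or_else` pattern this commit describes looks roughly like the sketch below. `eprintln!` stands in for the warn-level log in the real code, and `probe_outcome` is an invented name for illustration.

```rust
/// Map a probe's transport result to pass/fail while surfacing the error.
/// Transport errors still count as failed probes, but the cause is now
/// visible in logs instead of being swallowed by `unwrap_or(false)`.
fn probe_outcome(worker_url: &str, result: Result<bool, String>) -> bool {
    result.unwrap_or_else(|err| {
        eprintln!("warn: health probe for {worker_url} failed: {err}");
        false
    })
}

fn main() {
    assert!(probe_outcome("http://worker:8000", Ok(true)));
    assert!(!probe_outcome("http://worker:8000", Err(String::from("connection refused"))));
}
```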
1fa4b49 to cb307ad
… rot
Self-review of PR 6b found two issues that needed fixing:
1. **Test rot**: `test_health_checker_removes_unhealthy_workers` was
asserting `healthy_workers == 0` after failed probes. With the new
Pending start semantics, that assertion is true from the moment of
registration (Pending workers are not healthy by definition), so
the test passed vacuously without ever exercising removal or any
state transition. Replaced with two focused tests:
- `test_pending_worker_stays_pending_under_failed_probes`: verifies
that Pending is sticky during early failures (does not transition
to NotReady).
- `test_failed_worker_can_be_removed_from_registry`: verifies the
removal API works on workers in Failed state, without needing
the async health checker loop or a JobQueue.
Also renamed `test_health_checker_keeps_unhealthy_workers_when_disabled`
→ `test_health_checker_keeps_workers_when_remove_unhealthy_disabled`
to better describe what it tests.
2. **Wasted work on Failed workers**: `check_health_async` was running
the full HTTP/gRPC probe and incrementing counters for workers in
the terminal Failed state. Added an early return that skips the
probe and returns Err immediately. The health checker loop's
removal logic is unaffected — it still gates on
`worker.status() == Failed`, so removal happens on the next tick.
The Failed match arm is kept (empty) because the Rust compiler
requires exhaustive matching, and removing it would make adding
future variants silent.
All 983 tests pass.
Refs: PR #1102 self-review
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
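The early return described in item 2 can be sketched as follows (function shape and error message are assumptions; the real `check_health_async` is async and runs the actual HTTP/gRPC probe):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum WorkerStatus { Pending, Ready, NotReady, Failed }

/// Skip the probe entirely for terminal Failed workers; the health-checker
/// loop still sees `Failed` on its next tick, so removal is unaffected.
fn check_health(status: WorkerStatus) -> Result<(), &'static str> {
    if status == WorkerStatus::Failed {
        return Err("worker is in terminal Failed state");
    }
    // ... run the HTTP/gRPC probe and update counters here ...
    Ok(())
}

fn main() {
    assert!(check_health(WorkerStatus::Failed).is_err());
    assert!(check_health(WorkerStatus::Pending).is_ok());
    assert!(check_health(WorkerStatus::Ready).is_ok());
}
```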
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ddc0b41131
ddc0b41 to 60cca3f
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 60cca3f249
Four real bugs surfaced by the latest review round:
1. **`worker_to_info()` collapses Pending/Failed to NotReady**
(worker.rs): API/TUI consumers were seeing `status: NotReady` for
workers in Pending or Failed because the helper computed status from
`is_healthy()` (which is `status == Ready`). Now reads `worker.status()`
directly and exposes the real lifecycle state. `is_healthy` is still
derived from the same snapshot for backwards compatibility, no TOCTOU.
2. **`max_pending_probes` not enforced on success branch** (worker.rs):
`total_pending_probes` is incremented before every Pending probe,
but the cap was only checked in the failure branch. A flapping
pattern (F,S,F,S,...) never reaches `success_threshold`, never
transitions, and grows the counter unbounded. Added the cap check
to the success branch — if a Pending worker probes successfully but
has exceeded `max_pending_probes` total attempts, it transitions to
Failed.
3. **`update_worker_properties.rs` blindly preserves status**
(workflow/steps/local/update_worker_properties.rs): Two real bugs:
- **Failed workers stay Failed**: a metadata-only update on a Failed
worker would inherit the terminal status. Now compute a fresh
`Pending` for Failed workers (they get a chance to recover via
the standard health check loop after the update).
- **disable_health_check flip leaves stale status**: if the update
flips `disable_health_check` from false → true while the worker
was Pending/NotReady, the new worker would be Pending forever
(the health checker skips disabled workers). Now force `Ready`
when `disable_health_check == true`.
4. **`ActivateWorkersStep` doesn't activate when global health checker
is off** (workflow/steps/shared/activate.rs): `server.rs` skips
starting the health checker entirely when global
`disable_health_check` is set. Workers created with per-worker
`disable_health_check: false` would stay Pending forever because no
background task probes them. Now reads
`app_context.router_config.health_check.disable_health_check` and
force-activates all workers when the global checker won't run.
Also fixed a stale comment in worker.rs that said Failed workers can
recover via `register_or_replace`. With the recent changes, `replace()`
preserves the old status, so the actual recovery path is removal +
re-registration.
All 983 tests pass, clippy clean.
Refs: PR #1102 review feedback (chatgpt-codex-connector P1, coderabbitai)
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
Problem: e2e tests fail with "no available workers" because workflow-created workers stay Pending until the background health checker eventually promotes them. With default config, that takes 120s (check_interval=60 × success_threshold=2), longer than the 60s test gateway timeout.

Root cause analysis: my previous ActivateWorkersStep guard (only activate `disable_health_check` workers) was wrong. I missed that the workflow already does a real health probe via DetectBackendStep (HTTP /health GET or gRPC health_check). By the time ActivateWorkers runs, the worker has already been proven reachable — it's safe (and correct) to flip Pending → Ready immediately.

This restores the pre-state-machine workflow behavior at the trusted-creation boundary, while keeping the state machine for ongoing failures: the background health checker still demotes workers (Ready → NotReady → Failed) if probes start failing later. Pending remains the correct start state for paths that don't go through the workflow's connectivity proof (e.g. future mesh-imported workers).

Changes:

- activate.rs: simplified to always call set_healthy(true). Added rationale comment explaining why this is correct (DetectBackendStep already proved reachability).
- update_worker_properties.rs: simplified next_status logic. A metadata-only update preserves the worker's prior status (the endpoint hasn't changed, only labels/priority/etc.). Kept the one exception: flipping disable_health_check to true forces Ready, because the health checker skips disabled workers and a stale Pending/NotReady would never recover.

All 983 tests pass, clippy clean.

Refs: PR #1102 e2e test failures
Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 042ab0ef0d
…health_check

Workers created through the workflow already had their connectivity proven by DetectBackendStep (HTTP /health probe or gRPC health_check) before reaching ActivateWorkersStep. Flipping them Pending → Ready here is correct because the workflow's connectivity proof is functionally equivalent to one successful state-machine probe.

Without this, workflow-created workers stayed Pending until the background health checker promoted them after `check_interval × success_threshold` seconds (~120s with defaults), exceeding e2e gateway startup timeouts.

The state machine still applies for ongoing failures: the background health checker demotes Ready → NotReady → Failed on probe failures. Pending remains the start state for paths that don't pre-validate (e.g. mesh-imported workers in future PRs).

Signed-off-by: Simo Lin <linsimo.mark@gmail.com>
Description
Problem

Workers start `Ready` (routable) immediately at registration, before any local health check has verified them. This causes the mesh-sync 500/503 bug where unverified workers receive traffic and fail. The `WorkerStatus` enum from #1093 defined four states but only two were actually used.

Solution

Workers with health checks enabled now start `Pending` (not routable). The health checker promotes them to `Ready` after `success_threshold` consecutive successful probes. The full set of state-machine transitions is implemented in `check_health_async()`.

State Machine
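A compact sketch of the transition logic described in this PR. Counter and parameter names are assumptions; the real `check_health_async()` also performs the probe itself and mutates atomics rather than taking a snapshot struct.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum WorkerStatus { Pending, Ready, NotReady, Failed }

/// Probe counters as seen after the current probe (names assumed).
struct Counters {
    consecutive_successes: u32,
    consecutive_failures: u32,
    total_pending_probes: u32,
}

/// One step of the health-check state machine: Pending→Ready, Pending→Failed,
/// Ready→NotReady, NotReady→Ready, NotReady→Failed.
fn step(status: WorkerStatus, probe_ok: bool, c: &Counters,
        success_threshold: u32, failure_threshold: u32,
        max_pending_probes: u32, liveness_failure_threshold: u32) -> WorkerStatus {
    use WorkerStatus::*;
    match status {
        Pending if c.total_pending_probes >= max_pending_probes => Failed,
        Pending if probe_ok && c.consecutive_successes >= success_threshold => Ready,
        Pending => Pending, // sticky during early failures
        Ready if !probe_ok && c.consecutive_failures >= failure_threshold => NotReady,
        Ready => Ready,
        NotReady if probe_ok && c.consecutive_successes >= success_threshold => Ready,
        NotReady if !probe_ok && c.consecutive_failures >= liveness_failure_threshold => Failed,
        NotReady => NotReady,
        Failed => Failed, // terminal
    }
}

fn main() {
    let up = Counters { consecutive_successes: 2, consecutive_failures: 0, total_pending_probes: 3 };
    assert_eq!(step(WorkerStatus::Pending, true, &up, 2, 3, 30, 9), WorkerStatus::Ready);
    let down = Counters { consecutive_successes: 0, consecutive_failures: 3, total_pending_probes: 0 };
    assert_eq!(step(WorkerStatus::Ready, false, &down, 2, 3, 30, 9), WorkerStatus::NotReady);
    // 3 failures < liveness threshold of 9: NotReady does not yet fail.
    assert_eq!(step(WorkerStatus::NotReady, false, &down, 2, 3, 30, 9), WorkerStatus::NotReady);
}
```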
Changes
Production code:
- `worker.rs`: added `total_pending_probes` counter on `BasicWorker`
- `builder.rs`: initial status is `Pending` if health checks enabled, `Ready` otherwise
- `worker.rs`: `check_health_async()` implements the full state machine (Pending→Ready, Ready→NotReady, NotReady→Failed, NotReady→Ready, Pending→Failed)
- `registry.rs`: `--remove-unhealthy-workers` only removes `Failed` (not `Pending` or `NotReady`)
- `workflow/steps/shared/activate.rs`: `ActivateWorkersStep` only activates workers with `disable_health_check == true`
- `workflow/steps/local/create_worker.rs` & `external/create_workers.rs`: removed redundant `set_healthy` calls (builder handles initial status)

Thresholds:
- `max_pending_probes = 10 × failure_threshold` — prevents misconfigured URLs from leaking in Pending forever
- `liveness_failure_threshold = 3 × failure_threshold` — prevents aggressive removal of transiently unhealthy workers (analogous to K8s having separate readiness and liveness probes)

Tests:
- Added a `no_health_check()` helper in test modules
- Updated `policies/*`, `routers/http/router.rs`, and integration tests to use `disable_health_check: true`
- Test harnesses (`common/mod.rs`, `test_app.rs`) default to `disable_health_check: true`
- Added `test_pending_worker_not_routable` to verify the new behavior

What's NOT in this PR
- `StatusChanged` events from the health path — deferred to PR 7 when WorkerManager owns the health loop

Behavior changes
- `--remove-unhealthy-workers` is now stricter: only `Failed` workers are removed. Workers in `Pending` or `NotReady` are left alone. This is a fix — the old behavior removed transiently unhealthy workers too aggressively.
- Workers now take up to `check_interval_secs × success_threshold` seconds to become routable after registration (default: ~30 seconds with 15s interval × 2 threshold). Workers with `disable_health_check == true` are Ready immediately.

Test Plan
- `cargo check --package smg --package smg-golang` — clean
- `cargo test --package smg --lib` — 512 tests pass
- `cargo test --package smg --test api_tests` — 100 tests pass
- `cargo test --package smg --test routing_tests` — 92 tests pass

Refs: worker-module-deep-refactor plan (PR 6b)
Release constraint: This PR must roll forward directly into PR 7 (WorkerManager extraction). PR 6b is an implementation staging PR — `StatusChanged` events become authoritative in PR 7.
Checklist
Summary by CodeRabbit