fix(init): recover from stale-step resume when workflow already advanced#949
betegon wants to merge 6 commits into main
Conversation
Codecov Results 📊

✅ 6954 passed | Total: 6954 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch

All tests are passing successfully. ✅ Patch coverage is 85.71%. Project has 14115 uncovered lines.
@@            Coverage Diff             @@
##             main      #PR      +/- ##
==========================================
+ Coverage   76.98%   77.05%     +0.07%
==========================================
  Files         320      320          —
  Lines       61442    61493        +51
  Branches        0        0          —
==========================================
+ Hits        47294    47378        +84
- Misses      14148    14115        -33
- Partials        0        0          —

Generated by Codecov Action
Two fixes from Cursor Bugbot review of PR #949:

1. Remove `retries: 0` — it was added to reduce Sentry event noise from the retry storm, but it also stripped the Mastra client's built-in retry safety net from `workflow.createRun()` and `run.startAsync()`, which have no custom retry wrapper; with it in place, a single transient error during startup failed immediately. Restoring the default (3 retries) keeps those calls protected; "not suspended" incidents still go from 12 events down to 4 (vs 1 before), which is still a significant improvement.
2. Wrap `workflow.runById` in `withTimeout` and add "result" to fields — `tryRecoverCurrentRunState` previously called `runById` without a timeout, so on an unreliable network it could hang for 30-120s. This is consistent with `resumeAsync` and `startAsync`, which both use `API_TIMEOUT_MS`. Also adds "result" to the requested fields so that if the workflow already completed with a non-zero `exitCode`, `handleFinalResult` maps it to the correct `EXIT` constant instead of the generic `EXIT.WIZARD` fallback.
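For context, a minimal sketch of the shape such a `withTimeout` helper could take. The helper itself is not shown in this PR excerpt, so the signature (promise, timeout in ms, label) is inferred from the call sites quoted later in the review:

```typescript
// Hypothetical sketch of a withTimeout helper matching call sites like
// withTimeout(workflow.runById(...), API_TIMEOUT_MS, "Run state recovery").
// Races the wrapped promise against a timer; the labelled rejection lets the
// caller tell a timeout apart from an API error.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Clear the timer whichever promise settles first, so a pending timeout
  // cannot keep the process alive after the API call resolves.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```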
When the CLI resumes a step (e.g. `select-features`), the server runs it and immediately starts the next step inline (`plan-codemods` runs its LLM agent, ~15s). If the HTTP response drops before reaching the CLI (network blip, CF timeout), the wizard's retry logic resumes the same step again — but Mastra now returns 500 "step X was not suspended" because the workflow has already moved on. Every retry hits the same 500, generating up to 12 Sentry events per incident.

Two fixes:

1. Pass `retries: 0` to MastraClient so the client's built-in retry logic doesn't stack on top of `resumeWithRetry`'s own retries (was 4×4 = 16 attempts; now 1×4 = 4).
2. When `resumeAsync` throws a "not suspended" error, call `workflow.runById()` to fetch the current run state and return it to the main loop. The loop then continues from whichever step is actually suspended — turning what was a fatal loop into a transparent recovery.

Fixes CLI-SERVER-9.
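In code, the recovery branch described here could look roughly like the following sketch. The helper names come from this commit message; the surrounding retry plumbing (backoff, spinner, run handle) is omitted:

```typescript
// Condensed sketch; the real resumeWithRetry threads backoff state, spinner,
// and run handles through this path.
function isStepAlreadyAdvancedError(err: unknown): boolean {
  return err instanceof Error && err.message.includes("not suspended");
}

async function resumeOrRecover<T>(
  resume: () => Promise<T>,
  runById: (runId: string, opts: { fields: string[] }) => Promise<T>,
  runId: string,
): Promise<T> {
  try {
    return await resume();
  } catch (err) {
    // Anything other than the stale-step 500 stays on the normal retry path.
    if (!isStepAlreadyAdvancedError(err)) throw err;
    // The workflow already advanced server-side: fetch its current state and
    // hand it back so the main loop resumes from the actually-suspended step.
    return runById(runId, { fields: ["steps", "activeStepsPath"] });
  }
}
```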
Increases line coverage from 80% to 92.5%. New tests cover:

- `resumeWithRetry` stale-step recovery: server already advanced (`runById` returns new state), recovery fails + normal retry takes over, and the `recoveryAttempted` guard preventing duplicate `runById` calls
- workflow exit code mapping (20→CONFIG, 30→DEPS, 40/41→CODEMOD, 50→VERIFY)
- start-request failure path ("Connection failed" spinner + `WizardError`)
- `assertWorkflowResult` with unrecognised status
- `assertSuspendPayload` with non-object payload
- `extractSuspendPayload` fallback loop (payload in alternate step)
- `setStep` completed/in_progress transitions across two steps
- detect-platform custom spinner label when `existingProject.platform` is set

Also adds the `recoveryAttempted` guard to `resumeWithRetry` so state recovery is only attempted once per call even when multiple retries hit the same "not suspended" error.
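As an illustration of the exit-code mapping coverage, a test along these lines would exercise the table. The `mapExitCode` helper and the numeric `EXIT` values are assumptions for the sketch; the real tests drive `handleFinalResult` in wizard-runner.ts:

```typescript
import { describe, expect, it } from "vitest";

// Assumed shape of the exit-code table; the real constants live in the
// wizard's EXIT namespace and EXIT.WIZARD's value is a placeholder here.
const EXIT = { CONFIG: 20, DEPS: 30, CODEMOD: 40, VERIFY: 50, WIZARD: 1 } as const;

// Hypothetical mapper mirroring what handleFinalResult is described as doing:
// known non-zero exitCodes map to their EXIT constant, everything else falls
// back to the generic EXIT.WIZARD.
function mapExitCode(exitCode: number): number {
  switch (exitCode) {
    case 20: return EXIT.CONFIG;
    case 30: return EXIT.DEPS;
    case 40:
    case 41: return EXIT.CODEMOD;
    case 50: return EXIT.VERIFY;
    default: return EXIT.WIZARD;
  }
}

describe("workflow exit code mapping", () => {
  it.each([
    [20, EXIT.CONFIG],
    [30, EXIT.DEPS],
    [40, EXIT.CODEMOD],
    [41, EXIT.CODEMOD],
    [50, EXIT.VERIFY],
  ])("maps exitCode %i to the matching EXIT constant", (code, expected) => {
    expect(mapExitCode(code)).toBe(expected);
  });
});
```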
The previous implementation fell through to normal retries after a failed
recovery attempt, contradicting its own comment ("Retrying the same step
will always 500"). The recoveryAttempted guard blocked the only useful
action (another runById call) while allowing useless ones (3 more retries
of the stale step with 2+4+8 = 14s of wasted backoff).
Instead, throw immediately after failed recovery. The step is confirmed
not suspended, so there is nothing a retry can accomplish. Transient
errors (network timeout, auth flap) continue to use normal backoff retry
since they never enter the isStepAlreadyAdvancedError branch.
Also removes the now-unnecessary recoveryAttempted flag.
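Sketched as code, the revised loop looks roughly like this (helper names follow the commit messages above; the argument plumbing is simplified):

```typescript
// Minimal sketch of the revised retry loop; 3 retries after the first attempt
// gives the 1×4 = 4 attempts and 2+4+8s backoff described in this PR.
const MAX_RETRIES = 3;
const BACKOFF_MS = 2_000;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function resumeWithRetry<T>(
  resumeStep: () => Promise<T>,
  tryRecoverCurrentRunState: () => Promise<T | null>,
  isStepAlreadyAdvancedError: (err: unknown) => boolean,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await resumeStep();
    } catch (err) {
      if (isStepAlreadyAdvancedError(err)) {
        // The server confirmed the step is no longer suspended, so retrying
        // the same resume can only 500 again: recover or fail right away.
        const recovered = await tryRecoverCurrentRunState();
        if (recovered !== null) return recovered;
        throw err; // recovery failed, nothing a retry can accomplish
      }
      if (attempt >= MAX_RETRIES) throw err;
      // Transient error (network timeout, auth flap): exponential backoff
      // of 2s, 4s, 8s before the next attempt.
      await sleep(BACKOFF_MS * 2 ** attempt);
    }
  }
}
```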
Two review recommendations from /review:

1. Tighten `isStepAlreadyAdvancedError` to match "was not suspended" instead of "not suspended" — both Mastra error variants use the more specific phrase, which reduces the false-positive surface area.
2. Show "Reconnecting..." spinner message before the `runById` recovery call so the user sees feedback during a potentially slow state fetch instead of watching a stale spinner message. Passes `spin` through `ResumeRetryArgs` to give `resumeWithRetry` access to the running spinner handle.
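A sketch of the tightened matcher, with the spinner update noted where it would sit; the spinner method name is an assumption:

```typescript
// Tightened matcher: both Mastra error variants contain the fuller phrase,
// so matching "was not suspended" excludes unrelated errors that merely
// mention suspension in some other context.
function isStepAlreadyAdvancedError(err: unknown): boolean {
  return err instanceof Error && err.message.includes("was not suspended");
}

// In resumeWithRetry, before the potentially slow runById fetch, the spinner
// handle threaded through ResumeRetryArgs gets a status update, e.g.:
//   spin.message("Reconnecting...");
// (the exact spinner API is not shown in this PR excerpt)
```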
Previously the recovery path had no observability — there was no way to tell from Sentry whether the fix was working in production.

- `addBreadcrumb` when recovery succeeds: appears in subsequent traces so we can confirm the recovery mechanism is actually being triggered and seamlessly continuing the wizard.
- `captureException` (warning) when recovery fails: creates a separate trackable event tagged `wizard.stale_step_recovery=failed` with the step ID and run ID so we can investigate edge cases where `runById` is unreachable or returns unexpected data.

The server-side CLI-SERVER-9 events (from `withWorkflowErrorCapture`) continue to fire on each "step not suspended" 500, but these new client-side events tell us the outcome of the recovery attempt.
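The two instrumentation points could be shaped like this sketch, assuming the standard `@sentry/node` API; the tag and field names mirror the description above:

```typescript
import * as Sentry from "@sentry/node";

// Breadcrumb on success: shows up attached to subsequent events/traces, so we
// can see that recovery fired and the wizard carried on.
function onRecoverySucceeded(stepId: string, runId: string): void {
  Sentry.addBreadcrumb({
    category: "wizard",
    message: `stale-step recovery succeeded for ${stepId}`,
    data: { runId },
    level: "info",
  });
}

// Separate warning event on failure, tagged so it is trackable on its own.
function onRecoveryFailed(err: unknown, stepId: string, runId: string): void {
  Sentry.captureException(err, {
    level: "warning",
    tags: { "wizard.stale_step_recovery": "failed" },
    extra: { stepId, runId },
  });
}
```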
Force-pushed ec71dc0 to a382a02
// From tryRecoverCurrentRunState (src/lib/init/wizard-runner.ts): the tail of
// the withTimeout(...) call that wraps runById, as shown in the diff view.
  workflow.runById(runId, {
    fields: ["steps", "activeStepsPath", "result"],
  }),
  API_TIMEOUT_MS,
  "Run state recovery"
);
return assertWorkflowResult(state);
Bug: The tryRecoverCurrentRunState function fails to request the status field from the API, causing assertWorkflowResult validation to throw and silently abort the recovery mechanism.
Severity: HIGH
Suggested Fix
Include "status" and "suspended" in the fields array passed to workflow.runById within the tryRecoverCurrentRunState function to ensure the recovery mechanism receives the necessary data for validation and state resumption. The updated fields should be ["status", "suspended", "steps", "activeStepsPath", "result"].
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: src/lib/init/wizard-runner.ts#L463-L469
Potential issue: The stale-step recovery mechanism in `tryRecoverCurrentRunState` is
likely to fail in production. It calls `workflow.runById` but omits the `status` field
from the requested `fields`. The returned state is then passed to
`assertWorkflowResult`, which validates the presence of `status`. Since `status` is
missing, the validation throws an error. This error is caught and `null` is returned,
causing the recovery to silently fail and the original error to be re-thrown, ultimately
crashing the wizard for the user.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit a382a02.
  // Threw immediately — runById tried once, resumeAsync tried once.
  expect(resumeCount).toBe(1);
  expect(runByIdMock).toHaveBeenCalledTimes(1);
});
Duplicate test with misleading comment describes wrong scenario
Low Severity
The comment on line 1020 says "Attempts 0 and 1 both get 'not suspended'; attempt 2 succeeds" but the test actually verifies the opposite — that recovery fails and the function throws immediately after a single attempt. The test is also a near-duplicate of the previous test at line 998; both configure resumeAsync to always reject with a stale-step error, runById to fail (only the error message differs), and assert identical expectations (resumeCount === 1, runByIdMock called once, throws WizardError). The different error messages don't exercise different code paths since tryRecoverCurrentRunState catches all errors uniformly.
Reviewed by Cursor Bugbot for commit a382a02.


When the CLI resumes a step (e.g. `select-features`), the server runs it and immediately executes the next step inline — `plan-codemods` kicks off its framework-detector LLM, adding ~15s to the total response time. If the HTTP connection drops before the response reaches the CLI (network blip, CF Worker response timeout), the CLI retries `resumeAsync` with the same step ID. But Mastra now returns 500: "This workflow step 'select-features' was not suspended. Available suspended steps: [plan-codemods]" — the workflow had already advanced and written to D1.

Every retry hits the same 500. With the Mastra client's built-in 3 retries stacked on the CLI's own 3 retries, that's 12 Sentry events per incident (CLI-SERVER-9, 53 total since April).

What changes

1. `retries: 0` on MastraClient — disables the client's built-in retry layer so `resumeWithRetry` is the sole retry owner. Cuts max attempts from 4×4 = 16 to 1×4 = 4 per incident.
2. Stale-step recovery — when `resumeAsync` throws with "not suspended" in the message, the new `isStepAlreadyAdvancedError` helper fires. `tryRecoverCurrentRunState` calls `workflow.runById(run.runId, { fields: ["steps", "activeStepsPath"] })` to fetch the current run state and returns it. The main suspend/resume loop receives the updated state — with `plan-codemods` now as the suspended step — and continues normally without the user noticing anything went wrong. If recovery fails (network still down, runById also 500s), the error falls through to the normal retry path.
Testing

- Mock `resumeAsync` to throw "not suspended" on attempt 0 and `runById` to return a valid `plan-codemods` suspended state — verify the runner returns that state without retrying (see the sketch after this section).

Fixes CLI-SERVER-9 (`wizard handler returned 500` on resume).
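A self-contained sketch of that test, using vitest mocks; the `resumeWithRecovery` harness below stands in for the real wizard-runner entry point, whose exact signature is not shown in this PR excerpt:

```typescript
import { expect, it, vi } from "vitest";

type RunState = { status: string; suspended: string[][] };

// Hypothetical harness standing in for resumeWithRetry's recovery branch, so
// the test is self-contained; the real test drives wizard-runner directly.
async function resumeWithRecovery(
  resume: () => Promise<RunState>,
  recover: () => Promise<RunState>,
): Promise<RunState> {
  try {
    return await resume();
  } catch {
    return recover();
  }
}

it("recovers the live run state instead of retrying the stale step", async () => {
  const resumeAsync = vi
    .fn()
    .mockRejectedValue(new Error("step 'select-features' was not suspended"));
  const runById = vi.fn().mockResolvedValue({
    status: "suspended",
    suspended: [["plan-codemods"]],
  });

  const state = await resumeWithRecovery(resumeAsync, runById);

  expect(resumeAsync).toHaveBeenCalledTimes(1); // no second resume attempt
  expect(runById).toHaveBeenCalledTimes(1);
  expect(state.suspended).toEqual([["plan-codemods"]]);
});
```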