fix(init): recover from stale-step resume when workflow already advanced #949

Open

betegon wants to merge 6 commits into main from fix/init-recover-from-stale-step-resume

Conversation

betegon (Member) commented May 11, 2026

When the CLI resumes a step (e.g. select-features), the server runs it and immediately executes the next step inline — plan-codemods kicks off its framework-detector LLM, adding ~15s to the total response time. If the HTTP connection drops before the response reaches the CLI (network blip, CF Worker response timeout), the CLI retries resumeAsync with the same step ID. But Mastra now returns 500: "This workflow step 'select-features' was not suspended. Available suspended steps: [plan-codemods]" — the workflow had already advanced and written to D1.

Every retry hits the same 500. With Mastra client's built-in 3 retries stacked on the CLI's own 3 retries, that's 12 Sentry events per incident (CLI-SERVER-9, 53 total since April).

What changes

retries: 0 on MastraClient — disables the client's built-in retry layer so resumeWithRetry is the sole retry owner. Cuts max attempts from 4×4 = 16 to 1×4 = 4 per incident.
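
For reference, the client construction would look roughly like this. This is a sketch assuming the `@mastra/client-js` constructor options; the `baseUrl` value is illustrative (and note a later commit in this PR restores the default retries to keep `createRun`/`startAsync` protected):

```ts
import { MastraClient } from "@mastra/client-js";

// Sketch: turn off the client-level retry layer so resumeWithRetry is the
// sole retry owner. The baseUrl value here is illustrative.
const client = new MastraClient({
  baseUrl: "https://wizard.example.dev",
  retries: 0, // the built-in default would otherwise stack on the CLI's own retries
});
```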

Stale-step recovery — when resumeAsync throws with "not suspended" in the message, the new isStepAlreadyAdvancedError helper fires. tryRecoverCurrentRunState calls workflow.runById(run.runId, { fields: ["steps", "activeStepsPath"] }) to fetch the current run state and returns it. The main suspend/resume loop receives the updated state — with plan-codemods now as the suspended step — and continues normally without the user noticing anything went wrong.

If recovery fails (network still down, runById also 500s), the error falls through to the normal retry path.
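
In code, the recovery path looks roughly like the sketch below. The helper names (`isStepAlreadyAdvancedError`, `tryRecoverCurrentRunState`, `withTimeout`, `assertWorkflowResult`) are from this PR, but the signatures, types, and the inline `withTimeout` stub are simplified assumptions, not the actual diff:

```ts
// Simplified sketch, not the actual diff: names are from this PR, shapes are assumed.
type RunState = Record<string, unknown>;
type WorkflowHandle = {
  runById(runId: string, opts: { fields: string[] }): Promise<RunState>;
};

const API_TIMEOUT_MS = 30_000; // placeholder value

// Minimal stand-in for the wizard's withTimeout helper.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  return Promise.race([
    promise,
    new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms)
    ),
  ]);
}

// Matches Mastra's 500 message: "This workflow step '<id>' was not suspended."
function isStepAlreadyAdvancedError(err: unknown): boolean {
  return err instanceof Error && err.message.includes("not suspended");
}

async function tryRecoverCurrentRunState(
  workflow: WorkflowHandle,
  runId: string
): Promise<RunState | null> {
  try {
    return await withTimeout(
      workflow.runById(runId, { fields: ["steps", "activeStepsPath"] }),
      API_TIMEOUT_MS,
      "Run state recovery"
    );
  } catch {
    return null; // recovery itself failed; let the caller fall back
  }
}

// Caller side, inside the resume attempt:
async function resumeOrRecover(
  resume: () => Promise<RunState>,
  workflow: WorkflowHandle,
  runId: string
): Promise<RunState> {
  try {
    return await resume();
  } catch (err) {
    if (isStepAlreadyAdvancedError(err)) {
      const recovered = await tryRecoverCurrentRunState(workflow, runId);
      if (recovered !== null) return recovered; // loop continues from the real step
    }
    throw err; // transient failures keep the normal backoff retry path
  }
}
```

Later commits in this PR adjust this shape: throwing immediately after a failed recovery, adding "result" to the fields list, and tightening the match to "was not suspended".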

Testing

  • Existing wizard-runner tests: 27/27 pass
  • To simulate: mock resumeAsync to throw "not suspended" on attempt 0 and mock runById to return a valid plan-codemods suspended state, then verify the runner returns that state without retrying (see the sketch below)
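
A minimal vitest-style sketch of that simulation, reusing the hypothetical `resumeOrRecover` shape from the sketch under "What changes" above; the real module exports and runner signature may differ:

```ts
import { describe, expect, it, vi } from "vitest";

describe("stale-step recovery", () => {
  it("returns the recovered state without retrying the stale step", async () => {
    const recovered = { status: "suspended", steps: { "plan-codemods": {} } };
    const resume = vi
      .fn()
      .mockRejectedValue(
        new Error("This workflow step 'select-features' was not suspended.")
      );
    const workflow = { runById: vi.fn().mockResolvedValue(recovered) };

    const result = await resumeOrRecover(resume, workflow, "run-123");

    expect(result).toEqual(recovered);
    expect(resume).toHaveBeenCalledTimes(1); // no blind retry of the stale step
    expect(workflow.runById).toHaveBeenCalledTimes(1);
  });
});
```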

Fixes CLI-SERVER-9 (wizard handler returned 500 on resume).

github-actions bot (Contributor) commented May 11, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://cli.sentry.dev/_preview/pr-949/

Built to branch gh-pages at 2026-05-12 20:30 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

github-actions bot (Contributor) commented May 11, 2026

Codecov Results 📊

6954 passed | Total: 6954 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch

| Metric | Change |
| --- | --- |
| Total Tests | 📈 +15 |
| Passed Tests | 📈 +15 |
| Failed Tests | |
| Skipped Tests | |

All tests are passing successfully.

✅ Patch coverage is 85.71%. Project has 14115 uncovered lines.
✅ Project coverage is 77.05%. Comparing base (base) to head (head).

Files with missing lines (1)
| File | Patch % | Lines |
| --- | --- | --- |
| src/lib/init/wizard-runner.ts | 85.71% | ⚠️ 7 Missing |
Coverage diff

```
@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
+ Coverage    76.98%    77.05%    +0.07%
==========================================
  Files          320       320         —
  Lines        61442     61493       +51
  Branches         0         0         —
==========================================
+ Hits         47294     47378       +84
- Misses       14148     14115       -33
- Partials         0         0         —
```
Generated by Codecov Action

betegon marked this pull request as ready for review May 11, 2026 16:17
betegon added a commit that referenced this pull request May 12, 2026
betegon added 6 commits May 12, 2026 22:29
When the CLI resumes a step (e.g. select-features), the server runs it
and immediately starts the next step inline (plan-codemods runs its LLM
agent, ~15s). If the HTTP response drops before reaching the CLI (network
blip, CF timeout), the wizard's retry logic resumes the same step again —
but Mastra now returns 500 "step X was not suspended" because the workflow
has already moved on. Every retry hits the same 500, generating up to 12
Sentry events per incident.

Two fixes:

1. Pass `retries: 0` to MastraClient so the client's built-in retry logic
   doesn't stack on top of resumeWithRetry's own retries (was 4×4 = 16
   attempts; now 1×4 = 4).

2. When resumeAsync throws a "not suspended" error, call
   workflow.runById() to fetch the current run state and return it to the
   main loop. The loop then continues from whichever step is actually
   suspended — turning what was a fatal loop into a transparent recovery.

Fixes CLI-SERVER-9.

Increases line coverage from 80% to 92.5%.

New tests cover:
- resumeWithRetry stale-step recovery: server already advanced (runById
  returns new state), recovery fails + normal retry takes over, and the
  recoveryAttempted guard preventing duplicate runById calls
- workflow exit code mapping (20→CONFIG, 30→DEPS, 40/41→CODEMOD, 50→VERIFY)
- start-request failure path (Connection failed spinner + WizardError)
- assertWorkflowResult with unrecognised status
- assertSuspendPayload with non-object payload
- extractSuspendPayload fallback loop (payload in alternate step)
- setStep completed/in_progress transitions across two steps
- detect-platform custom spinner label when existingProject.platform is set

Also adds the recoveryAttempted guard to resumeWithRetry so state recovery
is only attempted once per call even when multiple retries hit the same
"not suspended" error.
Two fixes from Cursor Bugbot review of PR #949:

1. Remove retries: 0 — it was added to reduce Sentry event noise from
   the retry storm, but it also stripped the Mastra client's built-in
   retry safety net from workflow.createRun() and run.startAsync(), which
   have no custom retry wrapper. A single transient error during startup
   now immediately fails. Restoring the default (3 retries) keeps those
   calls protected; "not suspended" incidents still go from 12 events
   down to 4 (vs 1 before), which is still a significant improvement.

2. Wrap workflow.runById in withTimeout and add "result" to fields —
   tryRecoverCurrentRunState previously called runById without a timeout,
   so on an unreliable network it could hang for 30-120s. Consistent with
   resumeAsync and startAsync which both use API_TIMEOUT_MS. Also adds
   "result" to the requested fields so that if the workflow already
   completed with a non-zero exitCode, handleFinalResult maps it to the
   correct EXIT constant instead of the generic EXIT.WIZARD fallback.

The previous implementation fell through to normal retries after a failed
recovery attempt, contradicting its own comment ("Retrying the same step
will always 500"). The recoveryAttempted guard blocked the only useful
action (another runById call) while allowing useless ones (3 more retries
of the stale step with 2+4+8 = 14s of wasted backoff).

Instead, throw immediately after failed recovery. The step is confirmed
not suspended, so there is nothing a retry can accomplish. Transient
errors (network timeout, auth flap) continue to use normal backoff retry
since they never enter the isStepAlreadyAdvancedError branch.

Also removes the now-unnecessary recoveryAttempted flag.

Two review recommendations from /review:

1. Tighten isStepAlreadyAdvancedError to match "was not suspended" instead
   of "not suspended" — both Mastra error variants use the more specific
   phrase, which reduces the false-positive surface area.

2. Show "Reconnecting..." spinner message before the runById recovery call
   so the user sees feedback during a potentially slow state fetch instead
   of watching a stale spinner message. Passes spin through ResumeRetryArgs
   to give resumeWithRetry access to the running spinner handle.

Previously the recovery path had no observability — there was no way
to tell from Sentry whether the fix was working in production.

- addBreadcrumb when recovery succeeds: appears in subsequent traces
  so we can confirm the recovery mechanism is actually being triggered
  and seamlessly continuing the wizard.

- captureException (warning) when recovery fails: creates a separate
  trackable event tagged wizard.stale_step_recovery=failed with the
  step ID and run ID so we can investigate edge cases where runById
  is unreachable or returns unexpected data.

The server-side CLI-SERVER-9 events (from withWorkflowErrorCapture)
continue to fire on each "step not suspended" 500, but these new
client-side events tell us the outcome of the recovery attempt.
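
A sketch of the client-side instrumentation described in the last commit above, assuming the standard `@sentry/node` API; the wrapper names, messages, and call sites are illustrative, and only the `wizard.stale_step_recovery=failed` tag is taken from the commit text:

```ts
import * as Sentry from "@sentry/node";

// Illustrative wrapper: the real call site lives in the wizard runner.
// On successful recovery, leave a breadcrumb that shows up in later traces.
export function noteRecoverySucceeded(recoveredStepId: string): void {
  Sentry.addBreadcrumb({
    category: "wizard",
    message: `Stale-step recovery succeeded; continuing from '${recoveredStepId}'`,
    level: "info",
  });
}

// On failed recovery, emit a separately trackable warning event,
// tagged so it can be filtered in the Sentry dashboard.
export function noteRecoveryFailed(err: unknown, stepId: string, runId: string): void {
  Sentry.captureException(err, {
    level: "warning",
    tags: { "wizard.stale_step_recovery": "failed" },
    extra: { stepId, runId },
  });
}
```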
betegon force-pushed the fix/init-recover-from-stale-step-resume branch from ec71dc0 to a382a02 on May 12, 2026 20:29
Comment on lines +463 to +469

```ts
    workflow.runById(runId, {
      fields: ["steps", "activeStepsPath", "result"],
    }),
    API_TIMEOUT_MS,
    "Run state recovery"
  );
  return assertWorkflowResult(state);
```

Bug: The tryRecoverCurrentRunState function fails to request the status field from the API, causing assertWorkflowResult validation to throw and silently abort the recovery mechanism.
Severity: HIGH

Suggested Fix

Include "status" and "suspended" in the fields array passed to workflow.runById within the tryRecoverCurrentRunState function to ensure the recovery mechanism receives the necessary data for validation and state resumption. The updated fields should be ["status", "suspended", "steps", "activeStepsPath", "result"].

Prompt for AI Agent

```
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: src/lib/init/wizard-runner.ts#L463-L469

Potential issue: The stale-step recovery mechanism in `tryRecoverCurrentRunState` is
likely to fail in production. It calls `workflow.runById` but omits the `status` field
from the requested `fields`. The returned state is then passed to
`assertWorkflowResult`, which validates the presence of `status`. Since `status` is
missing, the validation throws an error. This error is caught and `null` is returned,
causing the recovery to silently fail and the original error to be re-thrown, ultimately
crashing the wizard for the user.
```

cursor bot (Contributor) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit a382a02. Configure here.

```ts
  // Threw immediately — runById tried once, resumeAsync tried once.
  expect(resumeCount).toBe(1);
  expect(runByIdMock).toHaveBeenCalledTimes(1);
});
```
cursor bot (Contributor) commented

Duplicate test with misleading comment describes wrong scenario

Low Severity

The comment on line 1020 says "Attempts 0 and 1 both get 'not suspended'; attempt 2 succeeds" but the test actually verifies the opposite — that recovery fails and the function throws immediately after a single attempt. The test is also a near-duplicate of the previous test at line 998; both configure resumeAsync to always reject with a stale-step error, runById to fail (only the error message differs), and assert identical expectations (resumeCount === 1, runByIdMock called once, throws WizardError). The different error messages don't exercise different code paths since tryRecoverCurrentRunState catches all errors uniformly.

Additional Locations (1)

Reviewed by Cursor Bugbot for commit a382a02. Configure here.
