fix(init): recover from stale-step resume when workflow already advanced #949

Open

betegon wants to merge 6 commits into main from fix/init-recover-from-stale-step-resume

Conversation

betegon (Member) commented May 11, 2026

When the CLI resumes a step (e.g. select-features), the server runs it and immediately executes the next step inline — plan-codemods kicks off its framework-detector LLM, adding ~15s to the total response time. If the HTTP connection drops before the response reaches the CLI (network blip, CF Worker response timeout), the CLI retries resumeAsync with the same step ID. But Mastra now returns 500: "This workflow step 'select-features' was not suspended. Available suspended steps: [plan-codemods]" — the workflow had already advanced and written to D1.

Every retry hits the same 500. With Mastra client's built-in 3 retries stacked on the CLI's own 3 retries, that's 12 Sentry events per incident (CLI-SERVER-9, 53 total since April).

What changes

retries: 0 on MastraClient — disables the client's built-in retry layer so resumeWithRetry is the sole retry owner. Cuts max attempts from 4×4 = 16 to 1×4 = 4 per incident.
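
For reference, the client construction would look roughly like this. This is a sketch assuming the `@mastra/client-js` constructor options; the `baseUrl` value is illustrative (and note a later commit in this PR restores the default retries to keep `createRun`/`startAsync` protected):

```ts
import { MastraClient } from "@mastra/client-js";

// Sketch: turn off the client-level retry layer so resumeWithRetry is the
// sole retry owner. The baseUrl value here is illustrative.
const client = new MastraClient({
  baseUrl: "https://wizard.example.dev",
  retries: 0, // the built-in default would otherwise stack on the CLI's own retries
});
```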

Stale-step recovery — when resumeAsync throws with "not suspended" in the message, the new isStepAlreadyAdvancedError helper fires. tryRecoverCurrentRunState calls workflow.runById(run.runId, { fields: ["steps", "activeStepsPath"] }) to fetch the current run state and returns it. The main suspend/resume loop receives the updated state — with plan-codemods now as the suspended step — and continues normally without the user noticing anything went wrong.

If recovery fails (network still down, runById also 500s), the error falls through to the normal retry path.
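
In code, the recovery path looks roughly like the sketch below. The helper names (`isStepAlreadyAdvancedError`, `tryRecoverCurrentRunState`, `withTimeout`, `assertWorkflowResult`) are from this PR, but the signatures, types, and the inline `withTimeout` stub are simplified assumptions, not the actual diff:

```ts
// Simplified sketch, not the actual diff: names are from this PR, shapes are assumed.
type RunState = Record<string, unknown>;
type WorkflowHandle = {
  runById(runId: string, opts: { fields: string[] }): Promise<RunState>;
};

const API_TIMEOUT_MS = 30_000; // placeholder value

// Minimal stand-in for the wizard's withTimeout helper.
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  return Promise.race([
    promise,
    new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms)
    ),
  ]);
}

// Matches Mastra's 500 message: "This workflow step '<id>' was not suspended."
function isStepAlreadyAdvancedError(err: unknown): boolean {
  return err instanceof Error && err.message.includes("not suspended");
}

async function tryRecoverCurrentRunState(
  workflow: WorkflowHandle,
  runId: string
): Promise<RunState | null> {
  try {
    return await withTimeout(
      workflow.runById(runId, { fields: ["steps", "activeStepsPath"] }),
      API_TIMEOUT_MS,
      "Run state recovery"
    );
  } catch {
    return null; // recovery itself failed; let the caller fall back
  }
}

// Caller side, inside the resume attempt:
async function resumeOrRecover(
  resume: () => Promise<RunState>,
  workflow: WorkflowHandle,
  runId: string
): Promise<RunState> {
  try {
    return await resume();
  } catch (err) {
    if (isStepAlreadyAdvancedError(err)) {
      const recovered = await tryRecoverCurrentRunState(workflow, runId);
      if (recovered !== null) return recovered; // loop continues from the real step
    }
    throw err; // transient failures keep the normal backoff retry path
  }
}
```

Later commits in this PR adjust this shape: throwing immediately after a failed recovery, adding "result" to the fields list, and tightening the match to "was not suspended".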

Testing

  • Existing wizard-runner tests: 27/27 pass
  • To simulate: mock resumeAsync to throw "not suspended" on attempt 0 and mock runById to return a valid plan-codemods suspended state, then verify the runner returns that state without retrying (see the sketch below)
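
A minimal vitest-style sketch of that simulation, reusing the hypothetical `resumeOrRecover` shape from the sketch under "What changes" above; the real module exports and runner signature may differ:

```ts
import { describe, expect, it, vi } from "vitest";

describe("stale-step recovery", () => {
  it("returns the recovered state without retrying the stale step", async () => {
    const recovered = { status: "suspended", steps: { "plan-codemods": {} } };
    const resume = vi
      .fn()
      .mockRejectedValue(
        new Error("This workflow step 'select-features' was not suspended.")
      );
    const workflow = { runById: vi.fn().mockResolvedValue(recovered) };

    const result = await resumeOrRecover(resume, workflow, "run-123");

    expect(result).toEqual(recovered);
    expect(resume).toHaveBeenCalledTimes(1); // no blind retry of the stale step
    expect(workflow.runById).toHaveBeenCalledTimes(1);
  });
});
```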

Fixes CLI-SERVER-9 (wizard handler returned 500 on resume).

github-actions bot (Contributor) commented May 11, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://cli.sentry.dev/_preview/pr-949/

Built to branch gh-pages at 2026-05-12 20:30 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

github-actions bot (Contributor) commented May 11, 2026

Codecov Results 📊

6954 passed | Total: 6954 | Pass Rate: 100% | Execution Time: 0ms

📊 Comparison with Base Branch

| Metric | Change |
| --- | --- |
| Total Tests | 📈 +15 |
| Passed Tests | 📈 +15 |
| Failed Tests | |
| Skipped Tests | |

All tests are passing successfully.

✅ Patch coverage is 85.71%. Project has 14115 uncovered lines.
✅ Project coverage is 77.05%. Comparing base (base) to head (head).

Files with missing lines (1)
| File | Patch % | Lines |
| --- | --- | --- |
| src/lib/init/wizard-runner.ts | 85.71% | ⚠️ 7 Missing |
Coverage diff

```
@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
+ Coverage    76.98%    77.05%    +0.07%
==========================================
  Files          320       320         —
  Lines        61442     61493       +51
  Branches         0         0         —
==========================================
+ Hits         47294     47378       +84
- Misses       14148     14115       -33
- Partials         0         0         —
```
Generated by Codecov Action

betegon marked this pull request as ready for review May 11, 2026 16:17
betegon added a commit that referenced this pull request May 12, 2026
betegon added 6 commits May 12, 2026 22:29
When the CLI resumes a step (e.g. select-features), the server runs it
and immediately starts the next step inline (plan-codemods runs its LLM
agent, ~15s). If the HTTP response drops before reaching the CLI (network
blip, CF timeout), the wizard's retry logic resumes the same step again —
but Mastra now returns 500 "step X was not suspended" because the workflow
has already moved on. Every retry hits the same 500, generating up to 12
Sentry events per incident.

Two fixes:

1. Pass `retries: 0` to MastraClient so the client's built-in retry logic
   doesn't stack on top of resumeWithRetry's own retries (was 4×4 = 16
   attempts; now 1×4 = 4).

2. When resumeAsync throws a "not suspended" error, call
   workflow.runById() to fetch the current run state and return it to the
   main loop. The loop then continues from whichever step is actually
   suspended — turning what was a fatal loop into a transparent recovery.

Fixes CLI-SERVER-9.

Increases line coverage from 80% to 92.5%.

New tests cover:
- resumeWithRetry stale-step recovery: server already advanced (runById
  returns new state), recovery fails + normal retry takes over, and the
  recoveryAttempted guard preventing duplicate runById calls
- workflow exit code mapping (20→CONFIG, 30→DEPS, 40/41→CODEMOD, 50→VERIFY)
- start-request failure path (Connection failed spinner + WizardError)
- assertWorkflowResult with unrecognised status
- assertSuspendPayload with non-object payload
- extractSuspendPayload fallback loop (payload in alternate step)
- setStep completed/in_progress transitions across two steps
- detect-platform custom spinner label when existingProject.platform is set

Also adds the recoveryAttempted guard to resumeWithRetry so state recovery
is only attempted once per call even when multiple retries hit the same
"not suspended" error.
Two fixes from Cursor Bugbot review of PR #949:

1. Remove retries: 0 — it was added to reduce Sentry event noise from
   the retry storm, but it also stripped the Mastra client's built-in
   retry safety net from workflow.createRun() and run.startAsync(), which
   have no custom retry wrapper. A single transient error during startup
   now immediately fails. Restoring the default (3 retries) keeps those
   calls protected; "not suspended" incidents still go from 12 events
   down to 4 (vs 1 before), which is still a significant improvement.

2. Wrap workflow.runById in withTimeout and add "result" to fields —
   tryRecoverCurrentRunState previously called runById without a timeout,
   so on an unreliable network it could hang for 30-120s. Consistent with
   resumeAsync and startAsync which both use API_TIMEOUT_MS. Also adds
   "result" to the requested fields so that if the workflow already
   completed with a non-zero exitCode, handleFinalResult maps it to the
   correct EXIT constant instead of the generic EXIT.WIZARD fallback.

The previous implementation fell through to normal retries after a failed
recovery attempt, contradicting its own comment ("Retrying the same step
will always 500"). The recoveryAttempted guard blocked the only useful
action (another runById call) while allowing useless ones (3 more retries
of the stale step with 2+4+8 = 14s of wasted backoff).

Instead, throw immediately after failed recovery. The step is confirmed
not suspended, so there is nothing a retry can accomplish. Transient
errors (network timeout, auth flap) continue to use normal backoff retry
since they never enter the isStepAlreadyAdvancedError branch.

Also removes the now-unnecessary recoveryAttempted flag.

Two review recommendations from /review:

1. Tighten isStepAlreadyAdvancedError to match "was not suspended" instead
   of "not suspended" — both Mastra error variants use the more specific
   phrase, which reduces the false-positive surface area.

2. Show "Reconnecting..." spinner message before the runById recovery call
   so the user sees feedback during a potentially slow state fetch instead
   of watching a stale spinner message. Passes spin through ResumeRetryArgs
   to give resumeWithRetry access to the running spinner handle.

Previously the recovery path had no observability — there was no way
to tell from Sentry whether the fix was working in production.

- addBreadcrumb when recovery succeeds: appears in subsequent traces
  so we can confirm the recovery mechanism is actually being triggered
  and seamlessly continuing the wizard.

- captureException (warning) when recovery fails: creates a separate
  trackable event tagged wizard.stale_step_recovery=failed with the
  step ID and run ID so we can investigate edge cases where runById
  is unreachable or returns unexpected data.

The server-side CLI-SERVER-9 events (from withWorkflowErrorCapture)
continue to fire on each "step not suspended" 500, but these new
client-side events tell us the outcome of the recovery attempt.
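
A sketch of the client-side instrumentation described in the last commit above, assuming the standard `@sentry/node` API; the wrapper names, messages, and call sites are illustrative, and only the `wizard.stale_step_recovery=failed` tag is taken from the commit text:

```ts
import * as Sentry from "@sentry/node";

// Illustrative wrapper: the real call site lives in the wizard runner.
// On successful recovery, leave a breadcrumb that shows up in later traces.
export function noteRecoverySucceeded(recoveredStepId: string): void {
  Sentry.addBreadcrumb({
    category: "wizard",
    message: `Stale-step recovery succeeded; continuing from '${recoveredStepId}'`,
    level: "info",
  });
}

// On failed recovery, emit a separately trackable warning event,
// tagged so it can be filtered in the Sentry dashboard.
export function noteRecoveryFailed(err: unknown, stepId: string, runId: string): void {
  Sentry.captureException(err, {
    level: "warning",
    tags: { "wizard.stale_step_recovery": "failed" },
    extra: { stepId, runId },
  });
}
```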
betegon force-pushed the fix/init-recover-from-stale-step-resume branch from ec71dc0 to a382a02 on May 12, 2026 20:29
Comment on lines +463 to +469

```ts
    workflow.runById(runId, {
      fields: ["steps", "activeStepsPath", "result"],
    }),
    API_TIMEOUT_MS,
    "Run state recovery"
  );
  return assertWorkflowResult(state);
```

Bug: The tryRecoverCurrentRunState function fails to request the status field from the API, causing assertWorkflowResult validation to throw and silently abort the recovery mechanism.
Severity: HIGH

Suggested Fix

Include "status" and "suspended" in the fields array passed to workflow.runById within the tryRecoverCurrentRunState function to ensure the recovery mechanism receives the necessary data for validation and state resumption. The updated fields should be ["status", "suspended", "steps", "activeStepsPath", "result"].

Prompt for AI Agent

```
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: src/lib/init/wizard-runner.ts#L463-L469

Potential issue: The stale-step recovery mechanism in `tryRecoverCurrentRunState` is
likely to fail in production. It calls `workflow.runById` but omits the `status` field
from the requested `fields`. The returned state is then passed to
`assertWorkflowResult`, which validates the presence of `status`. Since `status` is
missing, the validation throws an error. This error is caught and `null` is returned,
causing the recovery to silently fail and the original error to be re-thrown, ultimately
crashing the wizard for the user.
```

cursor bot (Contributor) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit a382a02. Configure here.

```ts
  // Threw immediately — runById tried once, resumeAsync tried once.
  expect(resumeCount).toBe(1);
  expect(runByIdMock).toHaveBeenCalledTimes(1);
});
```
cursor bot (Contributor) commented

Duplicate test with misleading comment describes wrong scenario

Low Severity

The comment on line 1020 says "Attempts 0 and 1 both get 'not suspended'; attempt 2 succeeds" but the test actually verifies the opposite — that recovery fails and the function throws immediately after a single attempt. The test is also a near-duplicate of the previous test at line 998; both configure resumeAsync to always reject with a stale-step error, runById to fail (only the error message differs), and assert identical expectations (resumeCount === 1, runByIdMock called once, throws WizardError). The different error messages don't exercise different code paths since tryRecoverCurrentRunState catches all errors uniformly.

Additional Locations (1)

Reviewed by Cursor Bugbot for commit a382a02. Configure here.
