Problem Statement
The file-backed generate workflow has grown a strong user-facing contract, but that contract is spread across several shallow internal modules. A single run now involves deciding whether the source must be replayable, materializing standard input when needed, deriving the checkpoint location, staging fresh output/checkpoint artifacts, validating resume state, optionally building an ephemeral resume plan, opening the durable sink, and cleaning all of that up correctly on success or failure.
That behavior is important enough to deserve one deeper module boundary. Today, understanding or changing it requires following the same lifecycle through source handling, checkpoint storage, resume validation, runtime setup, scheduler handoff, CLI environment handling, and tests. The tests mirror that fragmentation: several of them assert private checkpoint path rules, private SQLite state, temporary planner file cleanup, and helper-level behavior rather than exercising the artifact lifecycle as one observable unit.
This makes the code harder to maintain because the riskiest invariants are cross-module:
- A fresh file-backed run must never clobber existing output or checkpoint artifacts until bootstrap succeeds.
- File-backed runs from standard input must stage a replayable source before checkpoint bootstrap and resume validation.
- Resume must validate that the visible output still contains rows for settled checkpoint items.
- Resume must use the same checkpoint directory decision as the original run.
- Planner and staged-source temporary files must be cleaned up even when setup, scheduling, persistence, or validation fails.
- Output rows must be written before checkpoint rows are marked settled, so crashes may duplicate work but should not silently lose rows.
The developer problem is not that these behaviors are missing; it is that the current structure makes the lifecycle harder than necessary to reason about, evolve, and test as a boundary.
Solution
Introduce a deeper internal "generate run artifact lifecycle" boundary that owns file-backed workflow setup, artifact staging, resume validation/planning, durable sink construction, and cleanup. The scheduler should receive a prepared run context rather than knowing how file-backed artifacts were derived or opened.
The new boundary should expose a small internal interface shaped around preparing and closing a generate run. It should hide the implementation details of checkpoint-path derivation, standard-input materialization, fresh-run staging, resume validation, resume-plan lifecycle, and persistence-sink construction.
The refactor should preserve all public behavior. Users should see the same CLI flags, output JSONL shape, checkpoint file behavior, resume semantics, progress/status messages where practical, and failure messages unless the current wording is accidentally tied to implementation details. No checkpoint schema migration is intended.
Commits
- Add a focused characterization test around the file-backed generate artifact lifecycle for a fresh JSONL run. The test should start from input rows and a fake generation client, then assert the visible output artifact and checkpoint artifact are created and agree on settled rows. Keep the existing implementation unchanged.
- Add a characterization test for the no-output/stdout path. It should assert that a stdout-only generate run does not create checkpoint artifacts or resume planner artifacts. This protects the distinction between file-backed and stream-only execution before moving lifecycle code.
- Add a characterization test for file-backed standard-input materialization. It should exercise a run with output configured but no prompt or input path, then assert the run succeeds and leaves no staged-source temporary file behind. The assertion should stay at the artifact boundary rather than reaching into implementation-specific helper calls.
- Add a characterization test for fresh-run bootstrap failure preserving existing artifacts. The test should start with an existing output/checkpoint pair, trigger a setup failure before bootstrap can complete, and assert the old artifacts are unchanged.
- Add a characterization test for resume with an explicit checkpoint directory. The test should create a prior run state, resume with the same checkpoint-directory setting, and assert only pending work is appended. It should also assert that resuming without the matching checkpoint-directory setting fails cleanly.
- Add a characterization test for cleanup when resume planning fails. The test should force a planner or validation failure and assert temporary planner artifacts are removed. Prefer observing the temp directory over asserting private cleanup functions.
- Introduce an internal lifecycle object that represents the prepared resources for one generate run. Initially, make it a thin wrapper around the existing setup return value so behavior is unchanged. It should own close/cleanup and expose only the scheduler-facing pieces needed to run work.
- Move cleanup responsibility fully behind the lifecycle object. Existing callers should close the lifecycle object, not close individual resources. Keep the existing cleanup order and "raise the first cleanup error" behavior.
- Move checkpoint path resolution and file-backed run detection into the lifecycle boundary. The caller should pass the raw generate-run options; the lifecycle boundary should decide whether an output/checkpoint pair is needed and whether standard input must be staged.
- Move standard-input materialization behind the lifecycle boundary. The rest of the workflow should only see an effective replayable source when one is required.
- Move fresh-run artifact staging behind the lifecycle boundary. Preserve the current behavior that checkpoint bootstrap occurs against temporary artifacts first, and visible artifacts are replaced only after bootstrap succeeds.
- Move resume validation and resume-plan creation behind the lifecycle boundary. The scheduler should not need to know whether the prepared source uses sequential resume or a precomputed resume plan.
- Move durable persistence-sink construction behind the lifecycle boundary. The scheduler should receive an optional sink-like dependency, but the rules for opening output and checkpoint files should stay inside the artifact lifecycle boundary.
- Collapse status-message emission for setup into the lifecycle boundary. Keep user-facing status text stable where possible, and make status emission a dependency of lifecycle preparation rather than ad hoc calls from multiple places.
- Rename internal types and methods around the new lifecycle boundary so their responsibilities are explicit. The names should describe artifact lifecycle and prepared run ownership, not low-level runtime setup.
- Update the scheduler call site so it consumes the prepared lifecycle context through a narrow interface. It should only need a preparer, optional persistence sink, resume/no-work indication, and close semantics.
- Reduce direct private-helper usage in workflow tests where boundary coverage now exists. Keep low-level tests only for logic that remains deliberately small and tricky, such as path derivation edge cases or SQLite schema validation, if those tests still justify their existence.
- Update test fakes to avoid manufacturing checkpoint state through private implementation details where a boundary helper or prior run can create the state naturally. Keep explicit checkpoint-state construction only for tests that intentionally exercise corrupted or historical artifacts.
- Run the workflow tests and fix any behavior-preserving regressions. At this step, keep fixes mechanical and scoped to the lifecycle refactor.
- Run the CLI tests that cover generate/resume behavior and fix any CLI integration regressions. Confirm that environment-variable checkpoint directory behavior and command-line override behavior remain unchanged.
- Run the full test suite. Any failures outside the generate workflow area should be investigated as accidental coupling and fixed without broadening the refactor.
- Update contributor-facing documentation only if module names, maintenance guidance, or internal workflow descriptions have changed. Do not change user-facing README or guide behavior descriptions unless the refactor exposes a real documentation drift.
Decision Document
- Build one deeper internal boundary for the generate run artifact lifecycle.
- Keep the public CLI contract unchanged.
- Keep the existing checkpoint schema unchanged.
- Keep output JSONL as the user-visible result artifact.
- Keep SQLite as the checkpoint artifact for this refactor.
- Keep current resume semantics: settled rows are skipped, pending rows are appended, duplicate source rows are tracked independently, and source additions/removals/deduplication before resume are rejected.
- Keep current crash-safety ordering: write the user-visible output row before marking the checkpoint item settled.
- Keep temporary resume planner databases as implementation detail.
- Keep standard-input staging as implementation detail for file-backed runs that need replayable input.
- Keep the scheduler focused on work admission and settlement, not artifact derivation or cleanup.
- Treat status reporting as an injectable side effect of lifecycle preparation.
- Treat persistence as a dependency prepared by the lifecycle boundary, not something the scheduler constructs.
- Prefer artifact-level tests over tests that assert internal checkpoint helper shapes.
- Allow low-level tests only where the behavior is genuinely algorithmic, defensive, or hard to verify through the boundary.
Testing Decisions
- Good tests for this refactor should assert external behavior through run inputs, output artifacts, checkpoint artifacts, resume outcomes, and failure messages. They should avoid asserting how private helper functions collaborate.
- The main boundary tests should exercise fresh file-backed runs, stdout-only runs, standard-input file-backed runs, resume with checkpoint directory overrides, setup failure preserving artifacts, resume validation failures, and temporary artifact cleanup.
- Existing workflow tests already provide strong prior art: they run the generate workflow with fake clients and inspect output JSONL plus SQLite checkpoint state. The refactor should lean into that pattern while reducing private-helper coupling where possible.
- Existing CLI tests provide prior art for environment-variable and flag precedence around checkpoint directories. These should remain as CLI adapter tests, while deeper artifact lifecycle behavior should move closer to the workflow boundary.
- Persistence failure tests should continue to verify the observable crash-safety contract: a failure while settling a row is surfaced and does not create silent row loss.
- Resume planner cleanup tests should observe temporary artifact absence after failure rather than monkeypatching cleanup internals unless fault injection is unavoidable.
- Full verification should include targeted workflow tests, targeted CLI generate/resume tests, and then the full Python test suite.
- The final pre-push quality gate remains the repository standard: run the full pre-commit suite before pushing.
Out of Scope
- No public CLI behavior changes.
- No new user-facing resume flags.
- No checkpoint schema migration.
- No switch from SQLite checkpoints to another state format.
- No change to output JSONL row shape.
- No broad scheduler rewrite.
- No changes to generation provider calls, rate limiting, retry logic, or result parsing.
- No attempt to refactor embedding or transcription workflows.
- No performance optimization beyond preserving current streaming and resume behavior.
- No change to mapper semantics or mapping fingerprint rules.
Further Notes
This should be treated as a behavior-preserving refactor. The payoff is a smaller internal interface around a high-risk lifecycle, fewer tests coupled to private helper shapes, and an easier path for future endpoint workflow parity.