diff --git a/docs/README.md b/docs/README.md
index b78e870..0224bf8 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -27,6 +27,7 @@
 - [Reservation Engine Plan](./reservation-engine-plan.md)
 - [Reservation Engine Semantics](./reservation-semantics.md)
 - [Reservation Runtime Seam Evaluation](./reservation-runtime-seam-evaluation.md)
+- [Runtime Extraction Roadmap](./runtime-extraction-roadmap.md)
 - [Revoke Safety Slice](./revoke-safety-slice.md)
 - [Operator Runbook](./operator-runbook.md)
 - [KubeVirt Jepsen Report](./kubevirt-jepsen-report.md)
diff --git a/docs/runtime-extraction-roadmap.md b/docs/runtime-extraction-roadmap.md
new file mode 100644
index 0000000..4cef94d
--- /dev/null
+++ b/docs/runtime-extraction-roadmap.md
@@ -0,0 +1,148 @@
+# Runtime Extraction Roadmap
+
+## Purpose
+
+This document defines the path from the current engine family to something that can honestly be
+called a general internal DB-building library.
+
+The current state is:
+
+- `allocdb-core`, `quota-core`, and `reservation-core` all exist on `main`
+- the engine thesis is proven strongly enough
+- a broad shared runtime is still premature
+- `retire_queue` is the first justified micro-extraction candidate
+
+The goal is not to market a framework early. The goal is to extract only the runtime substrate that
+has actually stabilized under multiple engines.
+
+## End State
+
+We should only call this a general internal DB-building library when all of the following are true:
+
+- more than one runtime module is shared cleanly across engines
+- the shared-vs-domain boundary is explicit and stable
+- a new engine or engine slice can be built with materially less copy-paste
+- extraction reduces maintenance cost more than it adds abstraction cost
+
+Until then, the honest description remains:
+
+- multiple deterministic engines
+- emerging shared runtime
+
+## Milestone Shape
+
+### M12: First Internal Runtime Extractions
+
+Goal:
+
+- extract the smallest runtime pieces that are already mechanically shared
+
+Scope:
+
+- `retire_queue`
+- `wal`
+- `wal_file`
+- `snapshot_file` only if the file-level discipline stays separable from snapshot schemas
+
+Non-goals:
+
+- no public framework story
+- no snapshot schema extraction
+- no recovery API extraction
+- no state-machine trait layer
+
+Exit criteria:
+
+- extracted modules are used by all applicable engines
+- behavior is unchanged
+- tests stay green without new abstraction leaks
+
+### M13: Internal Engine Authoring Contract
+
+Goal:
+
+- define the stable boundary between shared runtime and engine-local semantics
+
+Scope:
+
+- one internal runtime contract note
+- explicit ownership of:
+  - bounded collections
+  - durable frame/file helpers
+  - snapshot-file discipline
+  - recovery helper seams, if any
+- explicit non-ownership of:
+  - command schemas
+  - result surfaces
+  - snapshot schemas
+  - state-machine semantics
+
+Exit criteria:
+
+- the contract is clear enough that another engine authoring pass is constrained by it
+
+### M14: Fourth-Engine Or Reduced-Copy Proof
+
+Goal:
+
+- prove that the extracted substrate lowers authoring cost rather than only moving code around
+
+Acceptable proof shapes:
+
+- build a fourth engine against the extracted substrate, or
+- retrofit one substantial new engine slice against the extracted substrate with clearly reduced
+  copy-paste and no correctness regression
+
+Exit criteria:
+
+- one new engine or engine slice uses the extracted substrate directly
+- the reduction in duplicated runtime code is obvious
+- the authoring contract survives contact with real implementation work
+
+## Recommended Issue Shape
+
+### M12
+
+- `M12`: Extract the first internal shared runtime substrate from the three-engine family
+- `M12-T01`: Extract shared `retire_queue`
+- `M12-T02`: Extract shared `wal`
+- `M12-T03`: Extract shared `wal_file`
+- `M12-T04`: Evaluate and, if still clean, extract shared `snapshot_file`
+
+### M13
+
+- `M13`: Define the internal engine authoring boundary after the first extractions
+- `M13-T01`: Write the internal runtime-vs-engine contract
+- `M13-T02`: Reassess whether a fourth-engine proof is still required or whether the extracted
+  substrate already lowered authoring cost enough
+
+### M14
+
+- `M14`: Prove the extracted substrate lowers engine-authoring cost
+- `M14-T01`: Build one new engine or engine slice against the extracted substrate
+- `M14-T02`: Re-evaluate whether the repository can now honestly claim an internal DB-building
+  library
+
+## Execution Rules
+
+- extract smallest-first
+- after each micro-extraction, stop and verify before continuing
+- if one extraction introduces awkward generic plumbing, stop and reassess rather than force the
+  sequence
+- keep domain logic local even if runtime discipline is shared
+
+## Current Recommendation
+
+Do this next:
+
+1. `M12-T01` shared `retire_queue`
+2. `M12-T02` shared `wal`
+3. `M12-T03` shared `wal_file`
+4. only then decide whether `snapshot_file` is still clean enough to extract
+
+Do not do this next:
+
+- public framework branding
+- generic state-machine APIs
+- generic snapshot schemas
+- extracting recovery entry points before the lower layers stabilize
diff --git a/docs/status.md b/docs/status.md
index b2ef7a6..d545312 100644
--- a/docs/status.md
+++ b/docs/status.md
@@ -1,6 +1,6 @@
 # AllocDB Status
 ## Current State
-- Phase: replicated implementation with external Jepsen gate closed, M9 lease-kernel follow-on live-validated, M10 second-engine proof merged, and M11 third-engine proof merged
+- Phase: replicated implementation with external Jepsen gate closed, M9 lease-kernel follow-on live-validated, M10 second-engine proof merged, M11 third-engine proof merged, and M12 runtime-extraction roadmap staged
 - Planning IDs: tasks use `M#-T#`; spikes use `M#-S#`
 - Current milestone status:
   - `M0` semantics freeze: complete enough for core work
@@ -16,6 +16,7 @@
   - `M9` generic lease-kernel follow-on: implementation merged on `main`
   - `M10` second-engine proof: merged on `main`; shared runtime extraction deferred
   - `M11` third-engine proof: merged on `main`; broad shared runtime still deferred, first micro-extraction now justified
+  - `M12` first internal runtime extractions: planned
 - Latest completed implementation chunks:
   - `4156a80` `Bootstrap AllocDB core and docs`
   - `f84a641` `Add WAL file and snapshot recovery primitives`
@@ -212,9 +213,8 @@
   simulation coverage are now all in the mainline implementation
 - PR `#97` merged issue `#96`, extending Jepsen history generation and analysis for bundle
   reserve, revoke/reclaim, and stale-holder lease paths, then closing the loop with live KubeVirt
-  `lease_safety-control` and full `1800s` `lease_safety-crash-restart` evidence on `allocdb-a`,
-  both with `blockers=0`
+  `lease_safety-control` and full `1800s` `lease_safety-crash-restart` evidence on `allocdb-a` with `blockers=0`
 - the next recommended step remains downstream real-cluster e2e work such as `gpu_control_plane`, not more unplanned lease-kernel semantics work; the current deployment slice covers a first in-cluster `StatefulSet` shape, but bootstrap-primary routing, failover/rejoin orchestration, and background maintenance remain operator work, and the current staging unblock path is to publish `skel84/allocdb` from GitHub Actions rather than relying on the local Docker engine
-- PR `#107` merged the `M10` quota-engine proof on `main`: `quota-core` now proves a second deterministic engine in-repo with bounded `CreateBucket` / `Debit`, logical-slot refill, and snapshot/WAL recovery; the `M10-T05` seam evaluation still concludes that shared runtime extraction is premature, with `retire_queue` the closest candidate and the rest still engine-local
-- PRs `#116`, `#117`, and `#118` merged the full `M11` reservation-core chain on `main`: scaffold, deterministic hold lifecycle, logical-slot overdue expiry, and expiry/recovery proof are now in the mainline implementation
-- PR `#118` also closes the third-engine readout: `retire_queue` is now the first justified internal extraction candidate across all three engines, while a broad `dsm-runtime` or public DB-building library is still premature; `wal`, `wal_file`, and `snapshot_file` are the next likely internal seams only after that micro-extraction lands
+- PR `#107` merged the `M10` quota-engine proof on `main`, and PRs `#116`, `#117`, and `#118` merged the full `M11` reservation-core chain on `main`: the repository now has a second and third deterministic engine with bounded command sets, logical-slot refill/expiry, and snapshot/WAL recovery proofs
+- the `M10-T05` and `M11-T05` readouts still defer broad shared-runtime extraction: `retire_queue` is the first justified internal extraction candidate, while `wal`, `wal_file`, and `snapshot_file` remain the next likely seams only after that micro-extraction lands
+- the next roadmap is now explicit in `runtime-extraction-roadmap.md`: start with `retire_queue`, then `wal`, then `wal_file`, and only then decide whether `snapshot_file` is still clean enough to extract before defining the internal authoring contract and asking for a fourth-engine or reduced-copy proof