diff --git a/docs/README.md b/docs/README.md
index 0224bf8..a10306f 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -28,6 +28,7 @@
 - [Reservation Engine Semantics](./reservation-semantics.md)
 - [Reservation Runtime Seam Evaluation](./reservation-runtime-seam-evaluation.md)
 - [Runtime Extraction Roadmap](./runtime-extraction-roadmap.md)
+- [Snapshot File Seam Evaluation](./snapshot-file-seam-evaluation.md)
 - [Revoke Safety Slice](./revoke-safety-slice.md)
 - [Operator Runbook](./operator-runbook.md)
 - [KubeVirt Jepsen Report](./kubevirt-jepsen-report.md)
diff --git a/docs/runtime-extraction-roadmap.md b/docs/runtime-extraction-roadmap.md
index 4cef94d..7c9177a 100644
--- a/docs/runtime-extraction-roadmap.md
+++ b/docs/runtime-extraction-roadmap.md
@@ -42,7 +42,8 @@ Scope:
 - `retire_queue`
 - `wal`
 - `wal_file`
-- `snapshot_file` only if the file-level discipline stays separable from snapshot schemas
+- evaluate `snapshot_file`, but extract it only if the file-level discipline stays separable from
+  snapshot schemas across the full engine family
 
 Non-goals:
 
@@ -140,6 +141,13 @@ Do this next:
 3. `M12-T03` shared `wal_file`
 4. only then decide whether `snapshot_file` is still clean enough to extract
 
+Result:
+
+- `retire_queue`, `wal`, and `wal_file` were extracted successfully
+- `snapshot_file` was evaluated and deferred because the seam is still only clean inside the
+  `quota-core` / `reservation-core` pair, not across all three engines
+- the next correct move is now `M13`
+
 Do not do this next:
 
 - public framework branding
diff --git a/docs/snapshot-file-seam-evaluation.md b/docs/snapshot-file-seam-evaluation.md
new file mode 100644
index 0000000..d38063f
--- /dev/null
+++ b/docs/snapshot-file-seam-evaluation.md
@@ -0,0 +1,122 @@
+# Snapshot File Seam Evaluation
+
+## Purpose
+
+This document closes `M12-T04` by evaluating whether `snapshot_file` is now clean enough to
+extract as another shared internal runtime module after:
+
+- shared `retire_queue`
+- shared `wal`
+- shared `wal_file`
+
+The question is deliberately narrower than "can snapshot persistence be shared in theory?" The
+question is whether the current three-engine code on `main` justifies a real extraction now.
+
+## Decision
+
+Do not extract a shared `snapshot_file` crate yet.
+
+The seam is real only inside the smaller `quota-core` / `reservation-core` pair. It is not yet a
+clean three-engine runtime boundary.
+
+The correct outcome for this slice is:
+
+- record that `snapshot_file` is not ready for extraction
+- keep each engine's `snapshot_file` local
+- move on to `M13`, the internal engine authoring boundary
+
+## What Is Shared
+
+All three engines share the same high-level persistence discipline:
+
+- one snapshot file per engine
+- temp-file write, sync, rename, and parent-directory sync
+- snapshot bytes loaded before WAL replay
+- fail-closed behavior on decode or integrity errors
+
+That means there is still real family resemblance at the discipline level.
+
+## Where The Seam Breaks
+
+### `allocdb-core` uses a simpler file format
+
+`allocdb-core` still stores only encoded snapshot bytes:
+
+- no footer
+- no checksum
+- no explicit max-bytes bound
+- decode-time corruption detection only
+
+That is materially different from the newer engines.
+
+### `quota-core` and `reservation-core` share a stronger format
+
+`quota-core` and `reservation-core` both use the same stronger file-level discipline:
+
+- footer magic
+- persisted payload length
+- CRC32C checksum
+- explicit `max_snapshot_bytes`
+- oversize rejection before decode
+
+Those two modules are close enough to share helpers later, but that is not the same thing as a
+repository-wide extraction candidate.
+
+### The remaining commonality is below the current file wrapper
+
+The shared part is mostly:
+
+- temp-file naming
+- write, sync, rename, and parent-directory sync
+- footer read/write mechanics for the newer engines
+
+But the live module boundary still mixes those mechanics with engine-specific constructor and error
+surface choices:
+
+- `allocdb-core` has no size-bound constructor argument
+- `quota-core` and `reservation-core` expose integrity-specific error variants
+- the three wrappers are still tied to engine-local snapshot schemas and recovery expectations
+
+That makes a forced crate extraction likely to create awkward generic plumbing rather than reduce
+maintenance cost.
+
+## Why Extraction Is Premature
+
+Extracting now would create a misleading shared layer:
+
+- it would either erase the real allocdb-vs-quota/reservation format difference
+- or it would introduce configuration branches that mostly exist to paper over that difference
+
+That is the wrong direction for this roadmap. `M12` is about extracting only what is already
+mechanically shared, not about normalizing divergent modules by force.
+
+The current evidence supports:
+
+- shared `retire_queue`
+- shared `wal`
+- shared `wal_file`
+
+It does not yet support:
+
+- shared `snapshot_file`
+
+## What Would Change The Answer Later
+
+Revisit this seam only if one of these becomes true:
+
+- `allocdb-core` adopts the same footer/checksum/max-bytes discipline as the newer engines
+- repeated snapshot-file fixes land independently in multiple engines
+- a later authoring pass shows the snapshot-file helper boundary can stay below engine-local error
+  and schema surfaces
+
+Until then, local duplication is still cheaper than a fake shared abstraction.
+
+## Recommended Next Step
+
+Treat `M12` as complete after this readout.
+
+The next step is `M13`, not more extraction pressure:
+
+1. define the internal engine authoring boundary
+2. write the runtime-vs-engine contract
+3. reassess whether a fourth-engine or reduced-copy proof is still required
diff --git a/docs/status.md b/docs/status.md
index d545312..7a6563a 100644
--- a/docs/status.md
+++ b/docs/status.md
@@ -216,5 +216,5 @@
   `lease_safety-control` and full `1800s` `lease_safety-crash-restart` evidence on `allocdb-a` with `blockers=0`
 - the next recommended step remains downstream real-cluster e2e work such as `gpu_control_plane`, not more unplanned lease-kernel semantics work; the current deployment slice covers a first in-cluster `StatefulSet` shape, but bootstrap-primary routing, failover/rejoin orchestration, and background maintenance remain operator work, and the current staging unblock path is to publish `skel84/allocdb` from GitHub Actions rather than relying on the local Docker engine
 - PR `#107` merged the `M10` quota-engine proof on `main`, and PRs `#116`, `#117`, and `#118` merged the full `M11` reservation-core chain on `main`: the repository now has a second and third deterministic engine with bounded command sets, logical-slot refill/expiry, and snapshot/WAL recovery proofs
-- the `M10-T05` and `M11-T05` readouts still defer broad shared-runtime extraction: `retire_queue` is the first justified internal extraction candidate, while `wal`, `wal_file`, and `snapshot_file` remain the next likely seams only after that micro-extraction lands
-- the next roadmap is now explicit in `runtime-extraction-roadmap.md`: start with `retire_queue`, then `wal`, then `wal_file`, and only then decide whether `snapshot_file` is still clean enough to extract before defining the internal authoring contract and asking for a fourth-engine or reduced-copy proof
+- PRs `#132`, `#133`, and `#134` merged the first `M12` runtime extractions on `main`: `retire_queue`, `wal`, and `wal_file` are now shared internal substrate instead of copied engine-local modules, while `M12-T04` closed as a defer decision because `snapshot_file` is still only a clean seam inside the `quota-core` / `reservation-core` pair and `allocdb-core` keeps the simpler file format
+- the next roadmap step is now `M13`: define the internal engine authoring boundary in `runtime-extraction-roadmap.md` and stop extraction pressure until that contract is written down