1 change: 1 addition & 0 deletions docs/README.md
@@ -28,6 +28,7 @@
- [Reservation Engine Semantics](./reservation-semantics.md)
- [Reservation Runtime Seam Evaluation](./reservation-runtime-seam-evaluation.md)
- [Runtime Extraction Roadmap](./runtime-extraction-roadmap.md)
- [Snapshot File Seam Evaluation](./snapshot-file-seam-evaluation.md)
- [Revoke Safety Slice](./revoke-safety-slice.md)
- [Operator Runbook](./operator-runbook.md)
- [KubeVirt Jepsen Report](./kubevirt-jepsen-report.md)
10 changes: 9 additions & 1 deletion docs/runtime-extraction-roadmap.md
@@ -42,7 +42,8 @@ Scope:
- `retire_queue`
- `wal`
- `wal_file`
- `snapshot_file` only if the file-level discipline stays separable from snapshot schemas
- evaluate `snapshot_file`, but extract it only if the file-level discipline stays separable from
snapshot schemas across the full engine family

Non-goals:

@@ -140,6 +141,13 @@ Do this next:
3. `M12-T03` shared `wal_file`
4. only then decide whether `snapshot_file` is still clean enough to extract

Result:

- `retire_queue`, `wal`, and `wal_file` were extracted successfully
- `snapshot_file` was evaluated and deferred because the seam is still only clean inside the
`quota-core` / `reservation-core` pair, not across all three engines
- the next correct move is now `M13`

Do not do this next:

- public framework branding
122 changes: 122 additions & 0 deletions docs/snapshot-file-seam-evaluation.md
@@ -0,0 +1,122 @@
# Snapshot File Seam Evaluation

## Purpose

This document closes `M12-T04` by evaluating whether `snapshot_file` is now clean enough to
extract as another shared internal runtime module after:

- shared `retire_queue`
- shared `wal`
- shared `wal_file`

The question is deliberately narrower than "can snapshot persistence be shared in theory?" The
question is whether the current three-engine code on `main` justifies a real extraction now.

## Decision

Do not extract a shared `snapshot_file` crate yet.

The seam is real only inside the smaller `quota-core` / `reservation-core` pair. It is not yet a
clean three-engine runtime boundary.

The correct outcome for this slice is:

- record that `snapshot_file` is not ready for extraction
- keep each engine's `snapshot_file` local
- move on to `M13`, the internal engine authoring boundary

## What Is Shared

All three engines share the same high-level persistence discipline:

- one snapshot file per engine
- temp-file write, sync, rename, and parent-directory sync
- snapshot bytes loaded before WAL replay
- fail-closed behavior on decode or integrity errors

That means there is still real family resemblance at the discipline level.
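
The write discipline listed above can be sketched with the standard library alone. This is a minimal illustration, not the code of any of the three engines: the function names `write_snapshot_atomically` and `load_snapshot`, and the `.tmp` suffix, are assumptions made for this example. Syncing the parent directory by opening it as a file works on Unix-like systems and is assumed here.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: durably replace the file at `path` with `bytes`.
pub fn write_snapshot_atomically(path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Assumed temp-file naming scheme: "<file name>.tmp" in the same directory,
    // so the later rename stays on one filesystem.
    let file_name = path.file_name().ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidInput, "snapshot path has no file name")
    })?;
    let mut tmp_name = file_name.to_os_string();
    tmp_name.push(".tmp");
    let tmp_path = path.with_file_name(tmp_name);

    // 1. Write the temp file and sync its contents and metadata.
    let mut tmp = File::create(&tmp_path)?;
    tmp.write_all(bytes)?;
    tmp.sync_all()?;
    drop(tmp);

    // 2. Rename over the live snapshot.
    fs::rename(&tmp_path, path)?;

    // 3. Sync the parent directory so the rename itself is durable
    //    (assumes a Unix-like platform where a directory can be opened).
    let parent = match path.parent() {
        Some(p) if !p.as_os_str().is_empty() => p,
        _ => Path::new("."),
    };
    File::open(parent)?.sync_all()?;
    Ok(())
}

/// Hypothetical helper: load snapshot bytes, if any, before WAL replay starts.
/// A missing file is not an error; any other I/O failure is returned as-is.
pub fn load_snapshot(path: &Path) -> io::Result<Option<Vec<u8>>> {
    match fs::read(path) {
        Ok(bytes) => Ok(Some(bytes)),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
        Err(e) => Err(e),
    }
}
```

Nothing in this sketch depends on a snapshot schema, which is why the discipline looks shareable at this level even though the file wrappers around it differ.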

## Where The Seam Breaks

### `allocdb-core` uses a simpler file format

`allocdb-core` still stores only encoded snapshot bytes:

- no footer
- no checksum
- no explicit max-bytes bound
- decode-time corruption detection only

That is materially different from the newer engines.

### `quota-core` and `reservation-core` share a stronger format

`quota-core` and `reservation-core` both use the same stronger file-level discipline:

- footer magic
- persisted payload length
- CRC32C checksum
- explicit `max_snapshot_bytes`
- oversize rejection before decode

Those two modules are close enough to share helpers later, but that is not the same thing as a
repository-wide extraction candidate.
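
To make the stronger format concrete, here is a rough sketch of a footer with the five properties listed above. The byte layout, the magic value `SNPF`, the error names, and the choice to apply `max_snapshot_bytes` to the whole file are all assumptions for illustration; the real `quota-core` and `reservation-core` layouts may differ. CRC32C is computed bitwise here to keep the example dependency-free.

```rust
/// Assumed layout (hypothetical, not taken from the engines):
/// payload | magic (4 bytes) | payload length (u64 LE) | CRC32C of payload (u32 LE)
const FOOTER_MAGIC: [u8; 4] = *b"SNPF";
const FOOTER_LEN: usize = 4 + 8 + 4;

/// Hypothetical integrity-specific error variants.
#[derive(Debug, PartialEq, Eq)]
pub enum SnapshotFileError {
    TooLarge,
    Truncated,
    BadMagic,
    LengthMismatch,
    ChecksumMismatch,
}

/// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
pub fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = !0;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Append the footer to an already encoded snapshot payload.
pub fn append_footer(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() + FOOTER_LEN);
    out.extend_from_slice(payload);
    out.extend_from_slice(&FOOTER_MAGIC);
    out.extend_from_slice(&(payload.len() as u64).to_le_bytes());
    out.extend_from_slice(&crc32c(payload).to_le_bytes());
    out
}

/// Verify the footer and return the payload, rejecting oversize files
/// before any other check (and therefore before any schema decode).
pub fn verify_footer(file: &[u8], max_snapshot_bytes: usize) -> Result<&[u8], SnapshotFileError> {
    if file.len() > max_snapshot_bytes {
        return Err(SnapshotFileError::TooLarge);
    }
    if file.len() < FOOTER_LEN {
        return Err(SnapshotFileError::Truncated);
    }
    let (payload, footer) = file.split_at(file.len() - FOOTER_LEN);
    if footer[..4] != FOOTER_MAGIC {
        return Err(SnapshotFileError::BadMagic);
    }
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&footer[4..12]);
    if u64::from_le_bytes(len_bytes) != payload.len() as u64 {
        return Err(SnapshotFileError::LengthMismatch);
    }
    let mut crc_bytes = [0u8; 4];
    crc_bytes.copy_from_slice(&footer[12..16]);
    if u32::from_le_bytes(crc_bytes) != crc32c(payload) {
        return Err(SnapshotFileError::ChecksumMismatch);
    }
    Ok(payload)
}
```

A file in the simpler `allocdb-core` format has no footer at all, so a verifier like this would reject it with `Truncated` or `BadMagic` rather than hand its bytes to the decoder.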

### The remaining commonality is below the current file wrapper

The shared part is mostly:

- temp-file naming
- write, sync, rename, and parent-directory sync
- footer read/write mechanics for the newer engines

But the live module boundary still mixes those mechanics with engine-specific constructor and error
surface choices:

- `allocdb-core` has no size-bound constructor argument
- `quota-core` and `reservation-core` expose integrity-specific error variants
- the three wrappers are still tied to engine-local snapshot schemas and recovery expectations

That makes a forced crate extraction likely to create awkward generic plumbing rather than reduce
maintenance cost.
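
As a hedged illustration of that plumbing, a forced shared wrapper would need something like the following. Every name here (`SnapshotFormat`, `SharedSnapshotError`, `precheck`) is hypothetical and exists only to show the shape of the problem.

```rust
/// Hypothetical: the format switch a single shared wrapper would have to carry.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SnapshotFormat {
    /// `allocdb-core` style: raw encoded bytes, no size bound,
    /// corruption detected only at decode time.
    Raw,
    /// `quota-core` / `reservation-core` style: footer, checksum, size bound.
    Footered { max_snapshot_bytes: usize },
}

/// Hypothetical shared error type. Its only variant is meaningful for one
/// of the two formats, which is part of the awkwardness.
#[derive(Debug, PartialEq, Eq)]
pub enum SharedSnapshotError {
    TooLarge { len: usize, max: usize },
}

/// Size check before decode. For `Raw` this is a no-op branch that exists
/// only so both formats can pass through the same function.
pub fn precheck(format: SnapshotFormat, file_len: usize) -> Result<(), SharedSnapshotError> {
    match format {
        SnapshotFormat::Raw => Ok(()),
        SnapshotFormat::Footered { max_snapshot_bytes } if file_len > max_snapshot_bytes => {
            Err(SharedSnapshotError::TooLarge { len: file_len, max: max_snapshot_bytes })
        }
        SnapshotFormat::Footered { .. } => Ok(()),
    }
}
```

Each shared function would repeat this match, and each engine would still wrap the result in its own error and schema types, so little code would actually be removed.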

## Why Extraction Is Premature

Extracting now would create a misleading shared layer. It would have to do one of two things:

- erase the real allocdb-vs-quota/reservation format difference
- introduce configuration branches that mostly exist to paper over that difference

That is the wrong direction for this roadmap. `M12` is about extracting only what is already
mechanically shared, not about normalizing divergent modules by force.

The current evidence supports:

- shared `retire_queue`
- shared `wal`
- shared `wal_file`

It does not yet support:

- shared `snapshot_file`

## What Would Change The Answer Later

Revisit this seam only if one of these becomes true:

- `allocdb-core` adopts the same footer/checksum/max-bytes discipline as the newer engines
- repeated snapshot-file fixes land independently in multiple engines
- a later authoring pass shows the snapshot-file helper boundary can stay below engine-local error
and schema surfaces

Until then, local duplication is still cheaper than a fake shared abstraction.

## Recommended Next Step

Treat `M12` as complete after this readout.

The next step is `M13`, not more extraction pressure:

1. define the internal engine authoring boundary
2. write the runtime-vs-engine contract
3. reassess whether a fourth-engine or reduced-copy proof is still required
4 changes: 2 additions & 2 deletions docs/status.md
@@ -216,5 +216,5 @@
`lease_safety-control` and full `1800s` `lease_safety-crash-restart` evidence on `allocdb-a` with `blockers=0`
- the next recommended step remains downstream real-cluster e2e work such as `gpu_control_plane`, not more unplanned lease-kernel semantics work; the current deployment slice covers a first in-cluster `StatefulSet` shape, but bootstrap-primary routing, failover/rejoin orchestration, and background maintenance remain operator work, and the current staging unblock path is to publish `skel84/allocdb` from GitHub Actions rather than relying on the local Docker engine
- PR `#107` merged the `M10` quota-engine proof on `main`, and PRs `#116`, `#117`, and `#118` merged the full `M11` reservation-core chain on `main`: the repository now has a second and third deterministic engine with bounded command sets, logical-slot refill/expiry, and snapshot/WAL recovery proofs
- the `M10-T05` and `M11-T05` readouts still defer broad shared-runtime extraction: `retire_queue` is the first justified internal extraction candidate, while `wal`, `wal_file`, and `snapshot_file` remain the next likely seams only after that micro-extraction lands
- the next roadmap is now explicit in `runtime-extraction-roadmap.md`: start with `retire_queue`, then `wal`, then `wal_file`, and only then decide whether `snapshot_file` is still clean enough to extract before defining the internal authoring contract and asking for a fourth-engine or reduced-copy proof
- PRs `#132`, `#133`, and `#134` merged the first `M12` runtime extractions on `main`: `retire_queue`, `wal`, and `wal_file` are now shared internal substrate instead of copied engine-local modules, while `M12-T04` closed as a defer decision because `snapshot_file` is still only a clean seam inside the `quota-core` / `reservation-core` pair and `allocdb-core` keeps the simpler file format
- the next roadmap step is now `M13`: define the internal engine authoring boundary in `runtime-extraction-roadmap.md` and stop extraction pressure until that contract is written down