Problem
SegmentWriter::find_next_segment_number scans existing segment files and tracks the highest numeric suffix, but it initializes max_num to 0 and only advances when max_num > 0.
That means an archive containing only segment-000000000.notepack or segment-000000000.notepack.gz is treated the same as an empty archive. On restart, the writer can choose segment number 0 again. Because new segments are opened with File::create, that can truncate/overwrite the existing first segment.
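The truncation risk can be shown in isolation. The sketch below is not Pensieve code; `overwrite_with_create` is a hypothetical helper that only demonstrates the documented behavior of `File::create` when the target path already exists.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: writes `first` to `path`, reopens the same path with
/// `File::create`, writes `second`, and returns the final file contents.
fn overwrite_with_create(path: &Path, first: &[u8], second: &[u8]) -> io::Result<Vec<u8>> {
    fs::write(path, first)?;

    // File::create opens the file write-only and truncates it if it exists,
    // so the bytes written above are discarded here.
    let mut file = File::create(path)?;
    file.write_all(second)?;
    drop(file);

    fs::read(path)
}
```

If a restarted writer picks segment number 0 again, the existing first segment is in the position of `first` here: its contents are gone as soon as the file is reopened.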
Why this matters
Pensieve treats segment files as the canonical archive. Accidentally reusing segment number 0 is a potential archive data-loss bug, especially for small/new deployments, tests, or academic export batches generated in a fresh output directory.
Relevant code
crates/pensieve-ingest/src/pipeline/segment.rs
find_next_segment_number
segment_path
ensure_current_segment
Suggested fix
Use an explicit sentinel such as Option<u64> instead of 0 to distinguish "no existing segment" from "highest existing segment is 0".
Also consider opening new segment files with create-new semantics so accidental filename reuse fails loudly instead of truncating archive data.
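A minimal sketch of both suggestions, assuming segment names follow the segment-NNNNNNNNN.notepack / .notepack.gz pattern shown above. The helper names (`parse_segment_number`, `next_segment_number`, `sketch_segment_path`, `create_segment_file`) are hypothetical and do not mirror the actual code in segment.rs.

```rust
use std::fs::{self, File, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};

/// Extracts the numeric suffix from `segment-<digits>.notepack[.gz]`.
/// Returns None for any other file name.
fn parse_segment_number(file_name: &str) -> Option<u64> {
    let rest = file_name.strip_prefix("segment-")?;
    let digits = rest
        .strip_suffix(".notepack.gz")
        .or_else(|| rest.strip_suffix(".notepack"))?;
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse::<u64>().ok()
}

/// Returns the next unused segment number in `dir`.
/// `None` means "no segment seen yet", so an existing segment 0 is not
/// confused with an empty directory.
fn next_segment_number(dir: &Path) -> io::Result<u64> {
    let mut max_num: Option<u64> = None;
    for entry in fs::read_dir(dir)? {
        let name = entry?.file_name();
        if let Some(n) = name.to_str().and_then(parse_segment_number) {
            max_num = Some(max_num.map_or(n, |m| m.max(n)));
        }
    }
    // Empty directory starts at 0; otherwise continue after the highest number.
    Ok(max_num.map_or(0, |m| m + 1))
}

/// Assumed naming scheme: nine zero-padded digits, uncompressed extension.
fn sketch_segment_path(dir: &Path, number: u64) -> PathBuf {
    dir.join(format!("segment-{:09}.notepack", number))
}

/// Opens a new segment with create-new semantics: if the path already exists,
/// this returns an `AlreadyExists` error instead of truncating the file.
fn create_segment_file(path: &Path) -> io::Result<File> {
    OpenOptions::new().write(true).create_new(true).open(path)
}
```

With this shape, a directory holding only segment 0 yields 1 as the next number, and even if numbering were wrong for some other reason, `create_segment_file` would surface the collision as an error rather than overwrite archive data.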
Acceptance criteria
- A directory containing only segment-000000000.notepack starts the next writer at segment 1.
- A directory containing only segment-000000000.notepack.gz starts the next writer at segment 1.
- Existing segment files are not truncated on writer initialization.
- Segment writer unit tests cover the restart case.
- just precommit passes before merging.