
Fix segment numbering restart bug that can overwrite segment 0 #12

@erskingardner


Problem

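SegmentWriter::find_next_segment_number scans existing segment files and tracks the highest numeric suffix, but it initializes max_num to 0 and only advances to the next number when max_num > 0. As a result, 0 stands for both "no segments found" and "the highest existing segment is 0".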

That means an archive containing only segment-000000000.notepack or segment-000000000.notepack.gz is treated the same as an empty archive. On restart, the writer can choose segment number 0 again. Because new segments are opened with File::create, that can truncate/overwrite the existing first segment.
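The ambiguity can be illustrated with a small reconstruction. This is a sketch based only on the description above, not the actual code in segment.rs: parse_segment_number and buggy_next_segment_number are hypothetical names, and the sketch takes a list of file names instead of scanning a directory.

```rust
/// Hypothetical helper: extracts the numeric suffix from names such as
/// `segment-000000000.notepack` or `segment-000000000.notepack.gz`.
fn parse_segment_number(name: &str) -> Option<u64> {
    let rest = name.strip_prefix("segment-")?;
    let digits = rest
        .strip_suffix(".notepack.gz")
        .or_else(|| rest.strip_suffix(".notepack"))?;
    digits.parse().ok()
}

/// Reconstruction of the buggy pattern (an assumption, not the real source).
fn buggy_next_segment_number(names: &[&str]) -> u64 {
    let mut max_num = 0u64;
    for name in names {
        if let Some(n) = parse_segment_number(name) {
            max_num = max_num.max(n);
        }
    }
    // 0 doubles as "nothing found", so an archive holding only segment 0
    // looks empty and segment number 0 is handed out again.
    if max_num > 0 { max_num + 1 } else { 0 }
}
```

With this pattern, an empty list and a list containing only segment-000000000.notepack both yield 0, while any higher segment number is advanced correctly.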

Why this matters

Pensieve treats segment files as the canonical archive. Accidentally reusing segment number 0 is a potential archive data-loss bug, especially for small/new deployments, tests, or academic export batches generated in a fresh output directory.

Relevant code

  • crates/pensieve-ingest/src/pipeline/segment.rs
    • find_next_segment_number
    • segment_path
    • ensure_current_segment

Suggested fix

Use an explicit sentinel such as Option<u64> instead of 0 to distinguish "no existing segment" from "highest existing segment is 0".
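A minimal sketch of the Option<u64> approach follows. It assumes a free function that takes the archive directory, whereas the real find_next_segment_number is a method on SegmentWriter; parse_segment_number is a hypothetical helper, and the file-name pattern is inferred from the examples in this issue.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: extracts the numeric suffix from names such as
/// `segment-000000000.notepack` or `segment-000000000.notepack.gz`.
fn parse_segment_number(name: &str) -> Option<u64> {
    let rest = name.strip_prefix("segment-")?;
    let digits = rest
        .strip_suffix(".notepack.gz")
        .or_else(|| rest.strip_suffix(".notepack"))?;
    digits.parse().ok()
}

/// Sketch of the fix: `None` means "no segment found", which is now
/// distinct from `Some(0)`, "the highest existing segment is 0".
fn find_next_segment_number(dir: &Path) -> io::Result<u64> {
    let mut max_num: Option<u64> = None;
    for entry in fs::read_dir(dir)? {
        let name = entry?.file_name();
        if let Some(n) = name.to_str().and_then(parse_segment_number) {
            max_num = Some(max_num.map_or(n, |m| m.max(n)));
        }
    }
    // Empty archive starts at 0; otherwise continue after the highest segment.
    Ok(max_num.map_or(0, |m| m + 1))
}
```

Under these assumptions, an empty directory yields 0 and a directory containing only segment 0 yields 1, in either the plain or the .gz form.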

Also consider opening new segment files with create-new semantics so accidental filename reuse fails loudly instead of truncating archive data.
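Create-new semantics are available in the standard library through OpenOptions::create_new. The sketch below uses a hypothetical helper name, create_segment_file; how it would be wired into ensure_current_segment is left open.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::path::Path;

/// Hypothetical helper: opens a brand-new segment file for writing.
/// If the path already exists, the call fails with
/// `io::ErrorKind::AlreadyExists` and the existing file is left untouched.
fn create_segment_file(path: &Path) -> io::Result<File> {
    OpenOptions::new().write(true).create_new(true).open(path)
}
```

Unlike File::create, this never truncates an existing file, so a wrongly chosen segment number surfaces as an error at open time.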

Acceptance criteria

  • A directory containing only segment-000000000.notepack starts the next writer at segment 1.
  • A directory containing only segment-000000000.notepack.gz starts the next writer at segment 1.
  • Existing segment files are not truncated on writer initialization.
  • Segment writer unit tests cover the restart case.
  • just precommit passes before merging.
