Add segment inspection tool for counts, sizes, ranges, and corruption checks #15

@erskingardner

Problem

Pensieve does not currently have a lightweight way to inspect archive segment files without indexing them into ClickHouse or writing ad hoc scripts.

Researchers and operators need quick answers to questions like: how many records are in this segment, what kinds does it contain, what created_at range does it cover, how large are its records, and is the file structurally valid?

Why this matters

Before sharing or processing research archives, we need a fast sanity-check tool for segment batches. This makes archive handoff safer and debugging easier.

Suggested implementation

Add a CLI tool such as segment-inspect.

The tool should report at least:

  • segment path
  • gzip/uncompressed status
  • compressed bytes
  • decompressed bytes
  • record count
  • min/max/average record size
  • min/max created_at
  • kind histogram
  • first and last event id
  • structural validity status
  • first parse/truncation error with byte offset, if invalid
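As a rough illustration of the format-independent parts of that report, here is a minimal Python sketch. It is not a proposed implementation: the names `probe_container` and `SegmentSummary` are hypothetical, and decoding of the notepack records themselves is deliberately left out. The sketch assumes some decoder (not shown) yields each record's encoded size, `created_at`, kind, and event id. It also makes one assumption about reporting: for an uncompressed segment, "compressed bytes" is reported as `None`.

```python
import gzip
from collections import Counter
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip member (RFC 1952)


def probe_container(path: Path) -> dict:
    """Report gzip status plus compressed and decompressed byte counts."""
    file_bytes = path.stat().st_size
    with path.open("rb") as fh:
        is_gzip = fh.read(2) == GZIP_MAGIC
    if not is_gzip:
        # Assumption: an uncompressed segment has no "compressed" size.
        return {"path": str(path), "gzip": False,
                "compressed_bytes": None, "decompressed_bytes": file_bytes}
    decompressed = 0
    with gzip.open(path, "rb") as fh:
        # Stream in 1 MiB chunks so large segments are never fully in memory.
        while chunk := fh.read(1 << 20):
            decompressed += len(chunk)
    return {"path": str(path), "gzip": True,
            "compressed_bytes": file_bytes, "decompressed_bytes": decompressed}


@dataclass
class SegmentSummary:
    """Running statistics over records produced by a (hypothetical) decoder."""
    record_count: int = 0
    total_size: int = 0
    min_size: Optional[int] = None
    max_size: Optional[int] = None
    min_created_at: Optional[int] = None
    max_created_at: Optional[int] = None
    kinds: Counter = field(default_factory=Counter)
    first_id: Optional[str] = None
    last_id: Optional[str] = None

    def add(self, size: int, created_at: int, kind: int, event_id: str) -> None:
        """Fold one record into the summary, in file order."""
        self.record_count += 1
        self.total_size += size
        self.min_size = size if self.min_size is None else min(self.min_size, size)
        self.max_size = size if self.max_size is None else max(self.max_size, size)
        if self.min_created_at is None or created_at < self.min_created_at:
            self.min_created_at = created_at
        if self.max_created_at is None or created_at > self.max_created_at:
            self.max_created_at = created_at
        self.kinds[kind] += 1  # kind histogram
        if self.first_id is None:
            self.first_id = event_id
        self.last_id = event_id

    @property
    def avg_size(self) -> Optional[float]:
        """Average record size, or None for an empty segment."""
        if self.record_count == 0:
            return None
        return self.total_size / self.record_count
```

Keeping the summary as a streaming accumulator means the tool can report every listed field in a single pass over the segment, without holding records in memory.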

Acceptance criteria

  • Tool inspects .notepack files.
  • Tool inspects .notepack.gz files.
  • Tool reports record counts and byte sizes.
  • Tool reports created_at range and kind histogram.
  • Tool exits non-zero for malformed/truncated segments.
  • Tests cover valid and malformed segment inputs.
  • just precommit passes before merging.
