Problem
Pensieve does not currently have a lightweight way to inspect archive segment files without indexing them into ClickHouse or writing ad hoc scripts.
Researchers and operators need quick answers like: how many records are in this segment, what kinds does it contain, what created_at range does it cover, how large are records, and is the file structurally valid?
Why this matters
Before sharing or processing research archives, we need a fast sanity-check tool for segment batches. This makes archive handoff safer and debugging easier.
Suggested implementation
Add a CLI tool such as segment-inspect.
The tool should report at least:
- segment path
- gzip/uncompressed status
- compressed bytes
- decompressed bytes
- record count
- min/max/average record size
- min/max created_at
- kind histogram
- first and last event id
- structural validity status
- first parse/truncation error with byte offset, if invalid
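As a rough illustration of how the report fields above could be gathered in one streaming pass, here is a minimal Python sketch. It assumes some notepack reader (not shown) has already decoded each record into its size, created_at, kind, and event id. The names Record, SegmentReport, summarize, and is_gzip are hypothetical, and the actual tool may be written in a different language.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable, NamedTuple, Optional

GZIP_MAGIC = b"\x1f\x8b"  # leading bytes of a gzip stream (RFC 1952)


class Record(NamedTuple):
    """Hypothetical decoded record; the real notepack layout is not modeled."""
    size: int        # encoded record size in bytes
    created_at: int  # Unix timestamp in seconds
    kind: int
    event_id: str


@dataclass
class SegmentReport:
    record_count: int = 0
    total_bytes: int = 0
    min_size: Optional[int] = None
    max_size: Optional[int] = None
    min_created_at: Optional[int] = None
    max_created_at: Optional[int] = None
    first_id: Optional[str] = None
    last_id: Optional[str] = None
    kinds: Counter = field(default_factory=Counter)

    @property
    def avg_size(self) -> float:
        # Defined as 0.0 for an empty segment to avoid dividing by zero.
        return self.total_bytes / self.record_count if self.record_count else 0.0


def is_gzip(header: bytes) -> bool:
    """Detect gzip by magic bytes rather than by file extension."""
    return header[:2] == GZIP_MAGIC


def _lo(current: Optional[int], value: int) -> int:
    return value if current is None else min(current, value)


def _hi(current: Optional[int], value: int) -> int:
    return value if current is None else max(current, value)


def summarize(records: Iterable[Record]) -> SegmentReport:
    """Fold records into a report without holding the segment in memory."""
    report = SegmentReport()
    for rec in records:
        report.record_count += 1
        report.total_bytes += rec.size
        report.min_size = _lo(report.min_size, rec.size)
        report.max_size = _hi(report.max_size, rec.size)
        report.min_created_at = _lo(report.min_created_at, rec.created_at)
        report.max_created_at = _hi(report.max_created_at, rec.created_at)
        if report.first_id is None:
            report.first_id = rec.event_id  # first record in file order
        report.last_id = rec.event_id       # overwritten until the last record
        report.kinds[rec.kind] += 1
    return report
```

Keeping the summary as a fold over an iterator means the same code path can serve both plain and gzip-compressed segments, with only the byte source differing.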
Acceptance criteria
- Tool inspects .notepack files.
- Tool inspects .notepack.gz files.
- Tool reports record counts and byte sizes.
- Tool reports created_at range and kind histogram.
- Tool exits non-zero for malformed/truncated segments.
- Tests cover valid and malformed segment inputs.
- just precommit passes before merging.
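The non-zero exit criterion can be sketched for the gzip layer alone, using Python's standard gzip module. This covers only truncated or corrupt compression; record-level parse errors and their byte offsets depend on the notepack decoder and are not modeled here. The function name check_gzip_layer and the exit code value are hypothetical.

```python
import gzip
import zlib
from typing import Optional, Tuple

EXIT_OK = 0
EXIT_INVALID = 1  # hypothetical value; any non-zero status meets the criterion


def check_gzip_layer(data: bytes) -> Tuple[int, Optional[str]]:
    """Return (exit_code, error_message) for a gzip-compressed segment."""
    try:
        gzip.decompress(data)
    except (EOFError, OSError, zlib.error) as exc:
        # EOFError: the stream ended before the end-of-stream marker (truncation).
        # OSError: includes gzip.BadGzipFile for a bad header or CRC mismatch.
        # zlib.error: the deflate payload itself is corrupt.
        return EXIT_INVALID, f"{type(exc).__name__}: {exc}"
    return EXIT_OK, None
```

A test for the malformed case could truncate a known-good fixture and assert that the returned code is non-zero, which matches the valid and malformed coverage asked for above.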