Add segment-to-JSONL export tool for research access #14

@erskingardner

Problem

Pensieve segment files are the canonical archive, but each one is stored as a sequence of length-prefixed notepack payloads, optionally gzip-compressed. That format is efficient for storage and ingestion but inconvenient for academics and other researchers, who expect line-delimited Nostr JSON events (JSONL).

Why this matters

Researchers should not need to understand Pensieve internals or notepack framing just to use an archive snapshot. A JSONL export tool gives them a universal, inspectable format while preserving notepack as the canonical storage format.

Suggested implementation

Add a CLI tool, likely under crates/pensieve-ingest/src/bin/, such as segment-to-jsonl.

The tool should:

  • Accept one segment file or a directory of segments.
  • Read both .notepack and .notepack.gz files.
  • Decode each [u32 little-endian length][notepack bytes] record.
  • Emit one canonical Nostr JSON event per output line.
  • Preserve deterministic ordering by segment filename and record order.
  • Surface parse/truncation errors with file and byte offset context.
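
The record framing in the list above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the Pensieve implementation: `for_each_record`, `FrameError`, and `read_full` are hypothetical names, and gzip decompression and notepack-to-JSON conversion are assumed to happen outside this function. The reader passed in would already be decompressed, so offsets count bytes in the decompressed stream.

```rust
use std::fmt;
use std::io::{self, Read};

/// Hypothetical error type for the framing layer only.
#[derive(Debug)]
pub enum FrameError {
    /// The stream ended inside a length prefix or a payload.
    /// `offset` is the byte offset where the affected record starts.
    Truncated { offset: u64, needed: usize, got: usize },
    /// An underlying I/O failure while reading the record at `offset`.
    Io { offset: u64, source: io::Error },
}

impl fmt::Display for FrameError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FrameError::Truncated { offset, needed, got } => write!(
                f,
                "truncated record at byte offset {offset}: needed {needed} bytes, got {got}"
            ),
            FrameError::Io { offset, source } => {
                write!(f, "I/O error in record at byte offset {offset}: {source}")
            }
        }
    }
}

/// Read until `buf` is full or the stream ends; returns the bytes read.
fn read_full<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => break, // end of stream
            Ok(n) => filled += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(filled)
}

/// Walk `[u32 little-endian length][payload]` records in order, calling
/// `on_record(offset, payload)` for each. Returns the number of records.
pub fn for_each_record<R, F>(mut reader: R, mut on_record: F) -> Result<u64, FrameError>
where
    R: Read,
    F: FnMut(u64, &[u8]),
{
    let mut offset: u64 = 0;
    let mut count: u64 = 0;
    loop {
        let mut prefix = [0u8; 4];
        let got = read_full(&mut reader, &mut prefix)
            .map_err(|source| FrameError::Io { offset, source })?;
        if got == 0 {
            return Ok(count); // clean end of stream on a record boundary
        }
        if got < prefix.len() {
            return Err(FrameError::Truncated { offset, needed: prefix.len(), got });
        }
        let len = u32::from_le_bytes(prefix);

        // `take` bounds the read without pre-allocating `len` bytes up front,
        // so a corrupt length prefix cannot force a huge allocation.
        let mut payload = Vec::new();
        let got = reader
            .by_ref()
            .take(u64::from(len))
            .read_to_end(&mut payload)
            .map_err(|source| FrameError::Io { offset, source })?;
        if got < len as usize {
            return Err(FrameError::Truncated { offset, needed: len as usize, got });
        }

        on_record(offset, &payload);
        offset += 4 + u64::from(len);
        count += 1;
    }
}
```

Because the function accepts any `Read`, the same loop would serve a plain file and the output of a gzip decoder. The caller would pair the reported offset with the segment file name to produce the error context the list asks for.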

Acceptance criteria

  • A .notepack segment exports to valid JSONL.
  • A .notepack.gz segment exports to valid JSONL.
  • Output lines contain valid Nostr event fields: id, pubkey, created_at, kind, tags, content, sig.
  • Directory input processes segments in lexical filename order.
  • Tests cover uncompressed, gzip-compressed, and truncated segment cases.
  • just precommit passes before merging.
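
The directory-ordering criterion could be met along these lines. This is a sketch only: `collect_segments` and `is_segment` are hypothetical helpers, and "lexical" is assumed here to mean the platform's `OsStr` ordering of file names (bytewise on Unix), which the issue does not specify.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// True if the file name ends in `.notepack` or `.notepack.gz`.
fn is_segment(path: &Path) -> bool {
    path.file_name()
        .and_then(|name| name.to_str())
        .map(|name| name.ends_with(".notepack") || name.ends_with(".notepack.gz"))
        .unwrap_or(false)
}

/// Expand a file-or-directory argument into an ordered list of segment paths.
pub fn collect_segments(input: &Path) -> io::Result<Vec<PathBuf>> {
    // A single file is passed through as-is; the caller asked for it explicitly.
    if input.is_file() {
        return Ok(vec![input.to_path_buf()]);
    }

    let mut segments = Vec::new();
    for entry in fs::read_dir(input)? {
        let path = entry?.path();
        if path.is_file() && is_segment(&path) {
            segments.push(path);
        }
    }

    // `read_dir` makes no ordering guarantee, so sort explicitly by file name
    // to keep the export deterministic across platforms and filesystems.
    segments.sort_by(|a, b| a.file_name().cmp(&b.file_name()));
    Ok(segments)
}
```

Sorting explicitly matters because directory iteration order varies by filesystem. Without it, two exports of the same snapshot could differ line by line.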
