Add Parquet export path for research workflows #18

@erskingardner

Description

Problem

JSONL is the most widely supported exchange format, but many academic and data science workflows expect columnar formats for large-scale analysis. Pensieve currently provides no direct way to convert segment archives into Parquet.

Why this matters

A Parquet export can make large Nostr archive snapshots much cheaper to scan in DuckDB, Spark, Polars, Arrow, and similar tools, because columnar readers load only the columns a query touches instead of parsing every JSON line. This lowers the barrier for researchers working with full-network datasets.

Suggested implementation

Add a Parquet export path after the JSONL exporter and shared segment reader exist.

Possible shape:

  • Reuse common segment iteration code.
  • Write event fields to Arrow/Parquet with a documented schema.
  • Decide how to encode tags, likely as nested lists if supported cleanly, or as JSON strings if that keeps the first version simple.
  • Support partitioning by segment, kind, or created_at date in a later iteration if needed.
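To make the tag-encoding decision above concrete, here is a minimal sketch of the simpler "JSON string" option. It is not Pensieve code: `EVENT_COLUMNS`, `encode_tags`, and `event_to_row` are hypothetical names, the column order is an assumption, and the sample event uses placeholder values rather than real data. It only uses the standard library, so it says nothing yet about the Parquet writer itself.

```python
import json

# Canonical NIP-01 event fields. The column order is an assumption of this
# sketch, not an existing Pensieve schema.
EVENT_COLUMNS = ("id", "pubkey", "created_at", "kind", "tags", "content", "sig")


def encode_tags(tags):
    """Serialize tags as compact JSON.

    Tag order and element order are kept exactly as received, and the
    separators are fixed, so the same tags always produce the same string.
    """
    return json.dumps(tags, separators=(",", ":"), ensure_ascii=False)


def event_to_row(event):
    """Flatten one event dict into a row with tags stored as a JSON string."""
    missing = [name for name in EVENT_COLUMNS if name not in event]
    if missing:
        raise ValueError(f"event is missing fields: {missing}")
    row = {name: event[name] for name in EVENT_COLUMNS}
    row["tags"] = encode_tags(event["tags"])
    return row


# Placeholder event for illustration only (not a validly signed event).
example_event = {
    "id": "aa" * 32,
    "pubkey": "bb" * 32,
    "created_at": 1700000000,
    "kind": 1,
    "tags": [["e", "abc"], ["p", "def", "wss://relay.example"]],
    "content": "hello",
    "sig": "cc" * 64,
}

row = event_to_row(example_event)
decoded_tags = json.loads(row["tags"])
```

The trade-off is that a JSON string column is trivial to write and always round-trips, but queries on individual tags need a JSON function in the reader. Nested lists avoid that at the cost of a slightly more involved schema.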

Acceptance criteria

  • Segment files can be exported to Parquet.
  • The Parquet schema is documented.
  • Export preserves all canonical Nostr event fields (id, pubkey, created_at, kind, tags, content, sig).
  • Export handles tags deterministically.
  • A small exported Parquet file can be read by DuckDB or an Arrow-compatible reader in tests or docs.
  • just precommit passes before merging.
