Problem
JSONL is the most universal exchange format, but many academic and data science workflows expect columnar formats for large-scale analysis. Pensieve currently does not provide a direct way to convert segment archives into Parquet.
Why this matters
A Parquet export can make large Nostr archive snapshots much cheaper to scan in DuckDB, Spark, Polars, Arrow, and similar tools. This lowers the barrier for researchers working with full-network datasets.
Suggested implementation
Add a Parquet export path after the JSONL exporter and shared segment reader exist.
Possible shape:
- Reuse common segment iteration code.
- Write event fields to Arrow/Parquet with a documented schema.
- Decide how to encode `tags`, likely as nested lists if supported cleanly, or as JSON strings if that keeps the first version simple.
- Support partitioning by segment, kind, or created_at date in a later iteration if needed.
Acceptance criteria
- Segment files can be exported to Parquet.
- The Parquet schema is documented.
- Export preserves all canonical Nostr event fields (id, pubkey, created_at, kind, tags, content, sig).
- Export handles `tags` deterministically.
- A small exported Parquet file can be read by DuckDB or an Arrow-compatible reader in tests or docs.
- `just precommit` passes before merging.
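If the first version stores `tags` as JSON strings, "deterministic" could mean that the same tag list always produces the same bytes. A minimal sketch under that assumption, using only the Python standard library; `encode_tags` and `decode_tags` are hypothetical helper names, not existing Pensieve functions.

```python
import json


def encode_tags(tags):
    """Serialize a tag list to a compact JSON string.

    Tag order and element order are kept as-is, since both are part of the
    event. Fixed separators and ensure_ascii=False keep the output stable
    for a given input.
    """
    return json.dumps(tags, ensure_ascii=False, separators=(",", ":"))


def decode_tags(encoded):
    """Parse a string produced by encode_tags back into nested lists."""
    return json.loads(encoded)
```

A test for this criterion could then assert that encoding the same event twice yields identical strings and that decoding returns the original nested lists.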