Problem
Canonical segment files intentionally store only Nostr events. They do not include ingest timestamp, relay provenance, source coverage, relay set, collection window, or software version metadata.
That is a good property for the archive format, but research releases still need sidecar metadata so academics can understand dataset provenance and limitations.
Why this matters
Researchers need to know what a snapshot represents: when it was collected, which relay strategy produced it, what gaps or biases may exist, and what software version generated the archive. Without this, analyses are harder to reproduce and easier to misinterpret.
Suggested implementation
Create a release-level metadata file, likely emitted alongside manifests.
Candidate fields:
- dataset/release id
- generated_at
- collection_start and collection_end, if known
- segment range included
- source mode: live relay, negentropy, JSONL backfill, protobuf backfill, or mixed
- seed relay list or relay database summary
- max relay count and discovery settings
- dedupe policy
- validation policy
- Pensieve git revision
- notepack git revision
- known limitations
This should be sidecar metadata, not embedded into canonical event records.
Acceptance criteria
- Research release metadata can be generated for a segment batch.
- Metadata includes software revisions and generation timestamp.
- Metadata describes collection/source mode and known limitations.
- Metadata does not alter canonical segment records.
- Docs explain how to interpret the metadata.
just precommit passes before merging.
Problem
Canonical segment files intentionally store only Nostr events. They do not include ingest timestamp, relay provenance, source coverage, relay set, collection window, or software version metadata.
That is a good property for the archive format, but research releases still need sidecar metadata so academics can understand dataset provenance and limitations.
Why this matters
Researchers need to know what a snapshot represents: when it was collected, which relay strategy produced it, what gaps or biases may exist, and what software version generated the archive. Without this, analyses are harder to reproduce and easier to misinterpret.
Suggested implementation
Create a release-level metadata file, likely emitted alongside manifests.
Candidate fields:
This should be sidecar metadata, not embedded into canonical event records.
Acceptance criteria
just precommitpasses before merging.