Document the segment archive format and research access workflow #19

@erskingardner

Problem

The segment format is currently discoverable by reading code, but researchers need concise documentation explaining what the files are, how to verify them, and how to convert them into analysis-friendly formats.

Why this matters

Academic access should not depend on source-code spelunking. Clear documentation will reduce support load and help researchers cite and reproduce archive snapshots correctly.

Suggested documentation

Add a document such as docs/segment_archive_format.md or docs/research_access.md covering:

  • filename convention: segment-000000000.notepack(.gz)
  • gzip wrapping behavior
  • record framing: [u32 little-endian length][notepack bytes]
  • notepack event payload summary
  • archive ordering by ingestion time, not created_at
  • dedupe semantics by event id
  • absence of relay provenance in canonical segments
  • recommended tools: JSONL export, inspect, verify, manifest
  • integrity checks with SHA-256 manifests
  • example commands
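As a starting point for the docs, the framing and gzip bullets above could be illustrated with a small reader. This is only a sketch under stated assumptions: it assumes the `.gz` suffix is the sole signal for gzip wrapping and that a segment is nothing but back-to-back `[u32 little-endian length][notepack bytes]` records with no file header. `open_segment` and `iter_records` are hypothetical names, not part of the existing tooling, and the notepack payload is left undecoded.

```python
import gzip
import struct
from pathlib import Path
from typing import BinaryIO, Iterator


def open_segment(path: Path) -> BinaryIO:
    """Open a segment file, decompressing when the name ends in .gz (assumption)."""
    if path.suffix == ".gz":
        return gzip.open(path, "rb")
    return open(path, "rb")


def iter_records(stream: BinaryIO) -> Iterator[bytes]:
    """Yield raw notepack payloads from a length-prefixed record stream."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of file on a record boundary
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        # "<I" is an unsigned 32-bit little-endian integer.
        (length,) = struct.unpack("<I", header)
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("truncated record payload")
        yield payload
```

Usage would look like `with open_segment(Path("segment-000000000.notepack.gz")) as f: count = sum(1 for _ in iter_records(f))`. Raising on a short read rather than silently stopping is a deliberate choice here, since a truncated segment is something a researcher should know about.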

Acceptance criteria

  • Docs describe the segment file format precisely.
  • Docs explain what metadata is not included.
  • Docs include example conversion and verification commands.
  • Docs explain how manifests should be used.
  • Docs are linked from the README or another discoverable docs index.
  • just precommit passes before merging.
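For the manifest criterion, the docs could show what an independent check looks like. The sketch below assumes a `sha256sum`-style manifest (one `<hex digest>  <filename>` pair per line, paths relative to the manifest); the actual manifest format is not stated in this issue, so treat that layout and the `verify_manifest` name as hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large segments do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path) -> list[str]:
    """Return the names of listed files whose SHA-256 digest does not match.

    Assumes a sha256sum-style manifest; a missing file raises FileNotFoundError.
    """
    mismatched = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with a leading '*'
        if sha256_file(manifest.parent / name) != expected.lower():
            mismatched.append(name)
    return mismatched
```

An empty return value means every listed segment matched, which is the condition a researcher would want to record alongside the snapshot they cite.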
