Stream ClickHouse segment indexing in bounded batches #21

@erskingardner

Problem

ClickHouseIndexer::read_segment currently reads and parses an entire segment into a Vec<EventRow> before insertion. Large compressed segments can decompress to hundreds of megabytes, and materializing every event row on top of that pushes peak memory higher still.

The ClickHouseConfig::batch_size field exists, but it does not bound memory today: indexing writes all parsed rows through a single insert flow only after the full segment has been loaded.

Why this matters

The archive is intended to scale to very large segment files and research-sized datasets. Segment processing tools and the indexer should share one bounded-memory reader so that memory use depends on the batch size rather than on the segment size.

Relevant code

  • crates/pensieve-ingest/src/pipeline/clickhouse.rs
    • index_segment
    • read_segment
    • index_segment_file
  • docs/stability_fixes_plan.md already calls out this memory issue.

Suggested implementation

Introduce a streaming segment reader that yields records/events one at a time or in configured batches.

Then update ClickHouse indexing to:

  • read records incrementally
  • insert batches of batch_size
  • avoid retaining the entire segment in memory
  • reuse the same reader for research export tools where practical
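
As a rough illustration of the reader described above, here is a minimal sketch. It assumes one record per line, which may not match the real segment framing. `BatchReader` is a hypothetical name, not an existing type in pensieve-ingest. Records are kept as raw `String`s here; parsing into `EventRow` would happen per batch.

```rust
use std::io::{self, BufRead};

/// Hypothetical streaming reader that yields at most `batch_size` records
/// per iteration. Assumes newline-delimited records (an assumption; the
/// actual segment format may use different framing).
pub struct BatchReader<R: BufRead> {
    lines: io::Lines<R>,
    batch_size: usize,
}

impl<R: BufRead> BatchReader<R> {
    pub fn new(reader: R, batch_size: usize) -> Self {
        // Clamp to 1 so a zero batch size cannot stall iteration.
        Self {
            lines: reader.lines(),
            batch_size: batch_size.max(1),
        }
    }
}

impl<R: BufRead> Iterator for BatchReader<R> {
    type Item = io::Result<Vec<String>>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut batch = Vec::new();
        while batch.len() < self.batch_size {
            match self.lines.next() {
                // Skip blank lines; keep everything else as a raw record.
                Some(Ok(line)) => {
                    if !line.is_empty() {
                        batch.push(line);
                    }
                }
                // Surface I/O errors immediately; the partial batch is dropped.
                Some(Err(e)) => return Some(Err(e)),
                None => break,
            }
        }
        if batch.is_empty() {
            None
        } else {
            Some(Ok(batch))
        }
    }
}
```

Because the reader only needs a `BufRead`, it can wrap a decompressing reader, and export or inspect tools could iterate the same type without depending on the ClickHouse code.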

Acceptance criteria

  • ClickHouse indexing respects ClickHouseConfig::batch_size.
  • Indexing does not materialize the full segment as one Vec<EventRow>.
  • Existing indexing behavior is preserved for valid segments.
  • Tests cover multi-batch segment indexing or the streaming reader directly.
  • Shared segment-reading code can be reused by export/inspect tools.
  • just precommit passes before merging.
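
One possible shape for the multi-batch test is sketched below. It assumes the insert path can be put behind a small trait so a test double can record batch sizes. `RowSink`, `index_in_batches`, and `RecordingSink` are hypothetical names used only for illustration, not existing APIs in the crate.

```rust
/// Hypothetical abstraction over the ClickHouse insert path.
pub trait RowSink<T> {
    type Error;
    fn insert_batch(&mut self, rows: &[T]) -> Result<(), Self::Error>;
}

/// Drains `rows` into `sink`, flushing whenever `batch_size` rows are
/// buffered. At most `batch_size` rows are held in memory at once.
/// Returns the total number of rows written.
pub fn index_in_batches<T, I, S>(
    rows: I,
    batch_size: usize,
    sink: &mut S,
) -> Result<usize, S::Error>
where
    I: IntoIterator<Item = T>,
    S: RowSink<T>,
{
    let batch_size = batch_size.max(1);
    let mut buf: Vec<T> = Vec::new();
    let mut total = 0;
    for row in rows {
        buf.push(row);
        if buf.len() == batch_size {
            sink.insert_batch(&buf)?;
            total += buf.len();
            buf.clear();
        }
    }
    // Flush the final partial batch, if any.
    if !buf.is_empty() {
        sink.insert_batch(&buf)?;
        total += buf.len();
    }
    Ok(total)
}

/// Test double that records the size of each batch it receives.
#[derive(Default)]
pub struct RecordingSink {
    pub batch_sizes: Vec<usize>,
}

impl<T> RowSink<T> for RecordingSink {
    type Error = std::convert::Infallible;

    fn insert_batch(&mut self, rows: &[T]) -> Result<(), Self::Error> {
        self.batch_sizes.push(rows.len());
        Ok(())
    }
}
```

A test could then feed five rows with a batch size of two and assert that the sink saw batches of 2, 2, and 1. That checks both that batch_size is respected and that no single insert receives the whole segment.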
