Problem
ClickHouseIndexer::read_segment currently reads and parses an entire segment into a Vec<EventRow> before insertion. Large compressed segments can decompress to hundreds of megabytes, and materializing all event rows can push memory much higher.
The ClickHouseConfig::batch_size field exists, but indexing currently writes all parsed rows through one insert flow after the full segment is loaded.
Why this matters
The archive is intended to scale to very large segment files and research-sized datasets. Segment processing tools and indexing should share a bounded-memory reader so large archives can be processed safely.
Relevant code
- crates/pensieve-ingest/src/pipeline/clickhouse.rs
  - index_segment
  - read_segment
  - index_segment_file
- docs/stability_fixes_plan.md already calls out this memory issue.
Suggested implementation
Introduce a streaming segment reader that yields records/events one at a time or in configured batches.
Then update ClickHouse indexing to:
- read records incrementally
- insert batches of batch_size
- avoid retaining the entire segment in memory
- reuse the same reader for research export tools where practical
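The steps above could be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: the EventRow struct here is a stand-in with a single field, the line-per-record format in rows_from_lines is an assumption made only so the example is self-contained, and BatchedRows, rows_from_lines, and index_streaming are hypothetical names. The insert callback stands in for whatever ClickHouse insert call the indexer uses.

```rust
use std::io::{self, BufRead};

/// Hypothetical stand-in for the real EventRow type.
#[derive(Debug, PartialEq)]
struct EventRow {
    id: String,
}

/// Iterator adaptor that groups a lazy stream of rows into batches of at
/// most `batch_size`, so only one batch is held in memory at a time.
struct BatchedRows<I> {
    rows: I,
    batch_size: usize,
}

impl<I> Iterator for BatchedRows<I>
where
    I: Iterator<Item = io::Result<EventRow>>,
{
    type Item = io::Result<Vec<EventRow>>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut batch = Vec::with_capacity(self.batch_size);
        // Pull at most `batch_size` rows from the underlying reader.
        for row in self.rows.by_ref().take(self.batch_size) {
            match row {
                Ok(r) => batch.push(r),
                // Surface read/parse errors instead of silently dropping rows.
                Err(e) => return Some(Err(e)),
            }
        }
        if batch.is_empty() {
            None
        } else {
            Some(Ok(batch))
        }
    }
}

/// Assumed record format for this sketch only: one record per line.
/// The real segment format would need its own decoder here.
fn rows_from_lines<R: BufRead>(reader: R) -> impl Iterator<Item = io::Result<EventRow>> {
    reader.lines().map(|line| line.map(|id| EventRow { id }))
}

/// Streams rows from `reader` and hands each batch to `insert`.
/// Returns the total number of rows passed to `insert`.
fn index_streaming<R: BufRead>(
    reader: R,
    batch_size: usize,
    mut insert: impl FnMut(&[EventRow]) -> io::Result<()>,
) -> io::Result<usize> {
    let batches = BatchedRows {
        rows: rows_from_lines(reader),
        // Guard against a zero batch size, which would yield no batches.
        batch_size: batch_size.max(1),
    };
    let mut total = 0;
    for batch in batches {
        let batch = batch?;
        insert(&batch)?;
        total += batch.len();
    }
    Ok(total)
}
```

Because the batching lives in an iterator adaptor rather than inside the indexer, export or inspect tools could consume the same row stream without going through ClickHouse. Peak memory is then bounded by one batch plus the decoder's buffers, assuming the underlying decompressor is also read incrementally.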
Acceptance criteria
- ClickHouse indexing respects ClickHouseConfig::batch_size.
- Indexing does not materialize the full segment as one Vec<EventRow>.
- Existing indexing behavior is preserved for valid segments.
- Tests cover multi-batch segment indexing or the streaming reader directly.
- Shared segment-reading code can be reused by export/inspect tools.
- just precommit passes before merging.