
Rewrite backfill chunk planning with multi-strategy smart chunking#108

Open
KeKs0r wants to merge 18 commits into main from marc/rebase-smart-chunking

Conversation

@KeKs0r
Contributor

@KeKs0r commented Apr 5, 2026

Summary

  • Replaces the old fixed-width time chunking with a multi-strategy planner that introspects partition layout, sort key distribution, and row estimates from ClickHouse metadata to produce better-sized chunks.
  • Adds six splitting strategies: equal-width, quantile ranges, temporal bucketing, string prefix splitting, group-by-key, and a hot-key refinement pass — selected automatically based on sort key type and data distribution.
  • Introduces service abstractions (metadata-source, distribution-source, row-probe) for testability and a boundary-codec for JSON-safe persistence of chunk ranges.
  • Moves programmatic chunking APIs to a dedicated sdk entry point (@chkit/plugin-backfill/sdk), keeping the main export surface clean.
  • Includes comprehensive integration tests, E2E test scaffolding, and playground benchmarks used during development.

Test plan

  • bun verify passes (typecheck, lint, test, build — 33/33)
  • E2E smart chunking tests against live ClickHouse Cloud
  • Manual plan command test on a real dataset

🤖 Generated with Claude Code

KeKs0r and others added 17 commits April 2, 2026 00:13
ClickHouse Cloud enables parallel replicas by default, which inflates
count() results by the replica count (observed 35x over-count). Add
SETTINGS enable_parallel_replicas=0 to all count queries used during
chunk planning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
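The workaround above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the constant name matches the one the commits mention, but `countQuery` and its parameters are hypothetical.

```typescript
// ClickHouse Cloud can enable parallel replicas by default, which was
// observed to inflate count() results by the replica count (up to 35x).
// Appending this setting to planner queries disables that behavior.
const DISABLE_PARALLEL_REPLICAS = "SETTINGS enable_parallel_replicas = 0";

// Illustrative builder for a chunk-planning count query.
function countQuery(table: string, where: string): string {
  return `SELECT count() AS c FROM ${table} WHERE ${where} ${DISABLE_PARALLEL_REPLICAS}`;
}
```

A later commit in this PR moves the setting out of the SQL string and into per-query settings on the executor, which avoids string concatenation entirely.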
Extend the fix to min/max, GROUP BY prefix, and GROUP BY temporal
bucket queries. Tested on ObsessionDB: GROUP BY counts are inflated
by the replica count (16-35x), and min/max queries are ~5x slower
with replicas on. Extract a shared DISABLE_PARALLEL_REPLICAS constant
with a note that this is an ObsessionDB workaround.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend ClickHouseExecutor.query() to accept per-query settings and
thread querySettings through PlannerContext. The ObsessionDB workaround
(enable_parallel_replicas=0) is now set once at the plugin call site
instead of being appended to each SQL string.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
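The shape of that change might look like the sketch below. All names (`Executor`, `PlannerContext`, `runCount`) are assumptions for illustration; the point is that settings are attached once to the context rather than appended to each SQL string.

```typescript
// Per-query ClickHouse settings, e.g. { enable_parallel_replicas: 0 }.
type QuerySettings = Record<string, string | number>;

// Executor accepts optional settings alongside the SQL text.
interface Executor {
  query(sql: string, settings?: QuerySettings): Promise<unknown[]>;
}

// The planner context carries settings set once at the plugin call site.
interface PlannerContext {
  executor: Executor;
  querySettings?: QuerySettings;
}

// Every planner query threads the context's settings through unchanged.
function runCount(ctx: PlannerContext, sql: string): Promise<unknown[]> {
  return ctx.executor.query(sql, ctx.querySettings);
}
```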
The backfill moves uncompressed data, so chunk sizing should be based on
uncompressed bytes rather than compressed. With ~8x compression ratios,
using compressed bytes produced chunks ~8x larger than intended. All size
comparisons, merge budgets, and row-target calculations now use
bytesUncompressed. Test maxChunkBytes values doubled to match the 2x
compression ratio in the test fixture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
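The sizing change reduces to which byte count the planner divides by. A minimal sketch (field and function names are illustrative, not the PR's):

```typescript
// Byte sizes as reported by ClickHouse part metadata.
interface PartSizes {
  bytesCompressed: number;
  bytesUncompressed: number;
}

// The backfill moves uncompressed data, so the chunk count must be
// derived from uncompressed bytes. With an ~8x compression ratio,
// dividing by compressed bytes would produce chunks ~8x too large.
function chunkCount(sizes: PartSizes, maxChunkBytes: number): number {
  return Math.max(1, Math.ceil(sizes.bytesUncompressed / maxChunkBytes));
}
```

With 1 GB compressed / 8 GB uncompressed and a 1 GB chunk target, compressed-byte sizing would plan a single 8 GB chunk; uncompressed-byte sizing plans eight.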
Analysis showed 3x oversampling is sufficient for equal-width range
splitting while reducing the number of estimation queries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
At root level (depth 0), use equal-width EXPLAIN ESTIMATE for the
initial split — fast metadata-only probes instead of a full GROUP BY
scan. Oversized children re-enter at depth 1+ and get GROUP BY prefix
refinement on their narrowed sub-ranges.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
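The boundary math for an equal-width root split over a numeric sort-key range can be sketched as below; in the PR each candidate range is then probed with a metadata-only `EXPLAIN ESTIMATE` rather than a full scan. The function name and rounding choice are assumptions.

```typescript
// Split [min, max] into `parts` equal-width ranges and return the
// interior boundaries. Each resulting range would then be probed with
// EXPLAIN ESTIMATE to check its row count against the chunk target.
function equalWidthBoundaries(min: number, max: number, parts: number): number[] {
  const width = (max - min) / parts;
  const bounds: number[] = [];
  for (let i = 1; i < parts; i++) {
    bounds.push(min + Math.round(i * width));
  }
  return bounds;
}
```

Oversized children produced by this cheap root split re-enter the planner at depth 1+ and get the more expensive GROUP BY prefix refinement, as the commit describes.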
Replace prefix-based hot key discovery with a direct GROUP BY key
approach for sub-ranges with ≤100 distinct values. One query gives
exact per-key counts instead of recursive depth drilling (1→4 chars).

When a sub-range contains a single key, narrow the range to an exact
match and re-enter dispatch so focusedValue propagates to subsequent
dimension splitting (e.g. temporal buckets on the hot key).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
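Once a single GROUP BY over a low-cardinality sub-range has produced exact per-key counts, hot-key selection is a simple threshold pass. A sketch under assumed names — the 20% threshold here is illustrative, not the PR's actual cutoff:

```typescript
// Given exact per-key row counts for a sub-range (one GROUP BY query),
// return the keys whose share of rows exceeds `hotFraction`. These keys
// are split out for further refinement on a secondary dimension.
function findHotKeys(
  counts: Map<string, number>,
  totalRows: number,
  hotFraction = 0.2, // illustrative threshold
): string[] {
  const threshold = totalRows * hotFraction;
  return [...counts.entries()]
    .filter(([, n]) => n >= threshold)
    .map(([key]) => key);
}
```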
The BigInt string boundary computation used a hardcoded 8-byte width,
silently truncating sort key values longer than 8 characters. This caused
chunk boundaries to miss rows at the end of the value range — e.g. a
partition with tenants "mega-corp" (9 chars) through "tenant-0199" (11
chars) would lose all rows past the truncated upper bound "tenant-0".

Replace the fixed width with a dynamic width derived from the actual
input range strings (max of rangeFrom/rangeTo length). Also strip
trailing null bytes from bigIntToStr output to avoid inflating boundary
strings beyond their original length, while preserving semantically
meaningful nulls via a minLength parameter.

Adds E2E test infrastructure: a seed script to populate ClickHouse with
controlled datasets, and an E2E test for the skewed power-law scenario
(80/20 tenant distribution) that verifies hot key detection, cross-
dimension splitting, estimate accuracy, and full row coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
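The truncation bug and its fix can be sketched as follows. Function names (`strToBigInt`, `bigIntToStr`, `widthFor`) follow the commit's wording but the implementations are illustrative: the key points are that the byte width comes from the actual range endpoints instead of a fixed 8, and that trailing NUL padding is stripped subject to a `minLength` floor.

```typescript
// Map a string to a fixed-width big-endian BigInt for boundary math.
// With a fixed width of 8, an 11-char value like "tenant-0199" would be
// silently truncated to "tenant-0", losing rows past that bound.
function strToBigInt(s: string, width: number): bigint {
  let v = 0n;
  for (let i = 0; i < width; i++) {
    // Pad short strings with 0x00 so they sort before longer siblings.
    v = (v << 8n) | BigInt(i < s.length ? s.charCodeAt(i) & 0xff : 0);
  }
  return v;
}

// Inverse mapping: strip trailing NULs so boundary strings are not
// inflated past their original length, but keep at least minLength
// characters to preserve semantically meaningful nulls.
function bigIntToStr(v: bigint, width: number, minLength = 0): string {
  const bytes: number[] = [];
  for (let i = width - 1; i >= 0; i--) {
    bytes.push(Number((v >> BigInt(8 * i)) & 0xffn));
  }
  let s = String.fromCharCode(...bytes);
  while (s.length > minLength && s.endsWith("\0")) s = s.slice(0, -1);
  return s;
}

// Dynamic width: derived from the actual input range strings instead
// of a hardcoded 8 bytes.
function widthFor(rangeFrom: string, rangeTo: string): number {
  return Math.max(rangeFrom.length, rangeTo.length);
}
```

With `widthFor("mega-corp", "tenant-0199")` the width is 11, so the upper bound round-trips exactly instead of collapsing to the truncated prefix "tenant-0".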
Adds scenario 2: three tenants at ~30% each with 10% long tail, verifying
that multiple hot keys are independently detected and split on the
secondary dimension. Renames seed script to .script.ts so bun test does
not re-execute it on every test run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Development artifacts: benchmark scripts for testing chunking strategies
against live ClickHouse, query traces, and E2E scenario planning notes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
