
Rewrite backfill chunk planning with multi-strategy smart chunking#108

Open
KeKs0r wants to merge 18 commits into main from marc/rebase-smart-chunking

Conversation

@KeKs0r
Contributor

@KeKs0r commented Apr 5, 2026

Summary

  • Replaces the old fixed-width time chunking with a multi-strategy planner that introspects partition layout, sort key distribution, and row estimates from ClickHouse metadata to produce better-sized chunks.
  • Adds six splitting strategies: equal-width, quantile ranges, temporal bucketing, string prefix splitting, group-by-key, and a hot-key refinement pass — selected automatically based on sort key type and data distribution.
  • Introduces service abstractions (metadata-source, distribution-source, row-probe) for testability and a boundary-codec for JSON-safe persistence of chunk ranges.
  • Moves programmatic chunking APIs to a dedicated sdk entry point (@chkit/plugin-backfill/sdk), keeping the main export surface clean.
  • Includes comprehensive integration tests, E2E test scaffolding, and playground benchmarks used during development.

Test plan

  • bun verify passes (typecheck, lint, test, build — 33/33)
  • E2E smart chunking tests against live ClickHouse Cloud
  • Manual plan command test on a real dataset

🤖 Generated with Claude Code

KeKs0r and others added 17 commits April 2, 2026 00:13
ClickHouse Cloud enables parallel replicas by default, which inflates
count() results by the replica count (observed 35x over-count). Add
SETTINGS enable_parallel_replicas=0 to all count queries used during
chunk planning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
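The workaround above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the constant name matches the one the commits mention, but `countQuery` and its parameters are hypothetical.

```typescript
// ClickHouse Cloud can enable parallel replicas by default, which was
// observed to inflate count() results by the replica count (up to 35x).
// Appending this setting to planner queries disables that behavior.
const DISABLE_PARALLEL_REPLICAS = "SETTINGS enable_parallel_replicas = 0";

// Illustrative builder for a chunk-planning count query.
function countQuery(table: string, where: string): string {
  return `SELECT count() AS c FROM ${table} WHERE ${where} ${DISABLE_PARALLEL_REPLICAS}`;
}
```

A later commit in this PR moves the setting out of the SQL string and into per-query settings on the executor, which avoids string concatenation entirely.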
Extend the fix to min/max, GROUP BY prefix, and GROUP BY temporal
bucket queries. Tested on ObsessionDB: GROUP BY counts are inflated
by the replica count (16-35x), and min/max queries are ~5x slower
with replicas on. Extract a shared DISABLE_PARALLEL_REPLICAS constant
with a note that this is an ObsessionDB workaround.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend ClickHouseExecutor.query() to accept per-query settings and
thread querySettings through PlannerContext. The ObsessionDB workaround
(enable_parallel_replicas=0) is now set once at the plugin call site
instead of being appended to each SQL string.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
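The shape of that change might look like the sketch below. All names (`Executor`, `PlannerContext`, `runCount`) are assumptions for illustration; the point is that settings are attached once to the context rather than appended to each SQL string.

```typescript
// Per-query ClickHouse settings, e.g. { enable_parallel_replicas: 0 }.
type QuerySettings = Record<string, string | number>;

// Executor accepts optional settings alongside the SQL text.
interface Executor {
  query(sql: string, settings?: QuerySettings): Promise<unknown[]>;
}

// The planner context carries settings set once at the plugin call site.
interface PlannerContext {
  executor: Executor;
  querySettings?: QuerySettings;
}

// Every planner query threads the context's settings through unchanged.
function runCount(ctx: PlannerContext, sql: string): Promise<unknown[]> {
  return ctx.executor.query(sql, ctx.querySettings);
}
```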
The backfill moves uncompressed data, so chunk sizing should be based on
uncompressed bytes rather than compressed. With ~8x compression ratios,
using compressed bytes produced chunks ~8x larger than intended. All size
comparisons, merge budgets, and row-target calculations now use
bytesUncompressed. Test maxChunkBytes values doubled to match the 2x
compression ratio in the test fixture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
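The sizing change reduces to which byte count the planner divides by. A minimal sketch (field and function names are illustrative, not the PR's):

```typescript
// Byte sizes as reported by ClickHouse part metadata.
interface PartSizes {
  bytesCompressed: number;
  bytesUncompressed: number;
}

// The backfill moves uncompressed data, so the chunk count must be
// derived from uncompressed bytes. With an ~8x compression ratio,
// dividing by compressed bytes would produce chunks ~8x too large.
function chunkCount(sizes: PartSizes, maxChunkBytes: number): number {
  return Math.max(1, Math.ceil(sizes.bytesUncompressed / maxChunkBytes));
}
```

With 1 GB compressed / 8 GB uncompressed and a 1 GB chunk target, compressed-byte sizing would plan a single 8 GB chunk; uncompressed-byte sizing plans eight.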
Analysis showed 3x oversampling is sufficient for equal-width range
splitting while reducing the number of estimation queries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
At root level (depth 0), use equal-width EXPLAIN ESTIMATE for the
initial split — fast metadata-only probes instead of a full GROUP BY
scan. Oversized children re-enter at depth 1+ and get GROUP BY prefix
refinement on their narrowed sub-ranges.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
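The boundary math for an equal-width root split over a numeric sort-key range can be sketched as below; in the PR each candidate range is then probed with a metadata-only `EXPLAIN ESTIMATE` rather than a full scan. The function name and rounding choice are assumptions.

```typescript
// Split [min, max] into `parts` equal-width ranges and return the
// interior boundaries. Each resulting range would then be probed with
// EXPLAIN ESTIMATE to check its row count against the chunk target.
function equalWidthBoundaries(min: number, max: number, parts: number): number[] {
  const width = (max - min) / parts;
  const bounds: number[] = [];
  for (let i = 1; i < parts; i++) {
    bounds.push(min + Math.round(i * width));
  }
  return bounds;
}
```

Oversized children produced by this cheap root split re-enter the planner at depth 1+ and get the more expensive GROUP BY prefix refinement, as the commit describes.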
Replace prefix-based hot key discovery with a direct GROUP BY key
approach for sub-ranges with ≤100 distinct values. One query gives
exact per-key counts instead of recursive depth drilling (1→4 chars).

When a sub-range contains a single key, narrow the range to an exact
match and re-enter dispatch so focusedValue propagates to subsequent
dimension splitting (e.g. temporal buckets on the hot key).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
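Once a single GROUP BY over a low-cardinality sub-range has produced exact per-key counts, hot-key selection is a simple threshold pass. A sketch under assumed names — the 20% threshold here is illustrative, not the PR's actual cutoff:

```typescript
// Given exact per-key row counts for a sub-range (one GROUP BY query),
// return the keys whose share of rows exceeds `hotFraction`. These keys
// are split out for further refinement on a secondary dimension.
function findHotKeys(
  counts: Map<string, number>,
  totalRows: number,
  hotFraction = 0.2, // illustrative threshold
): string[] {
  const threshold = totalRows * hotFraction;
  return [...counts.entries()]
    .filter(([, n]) => n >= threshold)
    .map(([key]) => key);
}
```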
The BigInt string boundary computation used a hardcoded 8-byte width,
silently truncating sort key values longer than 8 characters. This caused
chunk boundaries to miss rows at the end of the value range — e.g. a
partition with tenants "mega-corp" (9 chars) through "tenant-0199" (11
chars) would lose all rows past the truncated upper bound "tenant-0".

Replace the fixed width with a dynamic width derived from the actual
input range strings (max of rangeFrom/rangeTo length). Also strip
trailing null bytes from bigIntToStr output to avoid inflating boundary
strings beyond their original length, while preserving semantically
meaningful nulls via a minLength parameter.

Adds E2E test infrastructure: a seed script to populate ClickHouse with
controlled datasets, and an E2E test for the skewed power-law scenario
(80/20 tenant distribution) that verifies hot key detection, cross-
dimension splitting, estimate accuracy, and full row coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
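The truncation bug and its fix can be sketched as follows. Function names (`strToBigInt`, `bigIntToStr`, `widthFor`) follow the commit's wording but the implementations are illustrative: the key points are that the byte width comes from the actual range endpoints instead of a fixed 8, and that trailing NUL padding is stripped subject to a `minLength` floor.

```typescript
// Map a string to a fixed-width big-endian BigInt for boundary math.
// With a fixed width of 8, an 11-char value like "tenant-0199" would be
// silently truncated to "tenant-0", losing rows past that bound.
function strToBigInt(s: string, width: number): bigint {
  let v = 0n;
  for (let i = 0; i < width; i++) {
    // Pad short strings with 0x00 so they sort before longer siblings.
    v = (v << 8n) | BigInt(i < s.length ? s.charCodeAt(i) & 0xff : 0);
  }
  return v;
}

// Inverse mapping: strip trailing NULs so boundary strings are not
// inflated past their original length, but keep at least minLength
// characters to preserve semantically meaningful nulls.
function bigIntToStr(v: bigint, width: number, minLength = 0): string {
  const bytes: number[] = [];
  for (let i = width - 1; i >= 0; i--) {
    bytes.push(Number((v >> BigInt(8 * i)) & 0xffn));
  }
  let s = String.fromCharCode(...bytes);
  while (s.length > minLength && s.endsWith("\0")) s = s.slice(0, -1);
  return s;
}

// Dynamic width: derived from the actual input range strings instead
// of a hardcoded 8 bytes.
function widthFor(rangeFrom: string, rangeTo: string): number {
  return Math.max(rangeFrom.length, rangeTo.length);
}
```

With `widthFor("mega-corp", "tenant-0199")` the width is 11, so the upper bound round-trips exactly instead of collapsing to the truncated prefix "tenant-0".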
Adds scenario 2: three tenants at ~30% each with 10% long tail, verifying
that multiple hot keys are independently detected and split on the
secondary dimension. Renames seed script to .script.ts so bun test does
not re-execute it on every test run.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Development artifacts: benchmark scripts for testing chunking strategies
against live ClickHouse, query traces, and E2E scenario planning notes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
