Rewrite backfill chunk planning with multi-strategy smart chunking #108
Open
ClickHouse Cloud enables parallel replicas by default, which inflates count() results by the replica count (observed 35x over-count). Add SETTINGS enable_parallel_replicas=0 to all count queries used during chunk planning. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend the fix to min/max, GROUP BY prefix, and GROUP BY temporal bucket queries. Tested on ObsessionDB: GROUP BY counts are inflated by the replica count (16-35x), and min/max queries are ~5x slower with replicas on. Extract a shared DISABLE_PARALLEL_REPLICAS constant with a note that this is an ObsessionDB workaround. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
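A minimal sketch of the workaround described in these two commits, as it looked before the settings were threaded through the executor; the constant and helper names are illustrative, not the actual module's exports:

```typescript
// ObsessionDB workaround (assumed sketch): ClickHouse Cloud enables
// parallel replicas by default, which inflates count() and GROUP BY
// counts by the replica count, so planning queries disable them.
const DISABLE_PARALLEL_REPLICAS = "SETTINGS enable_parallel_replicas = 0";

function countQuery(table: string, where: string): string {
  return `SELECT count() FROM ${table} WHERE ${where} ${DISABLE_PARALLEL_REPLICAS}`;
}

function minMaxQuery(table: string, column: string): string {
  return `SELECT min(${column}), max(${column}) FROM ${table} ${DISABLE_PARALLEL_REPLICAS}`;
}
```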
Extend ClickHouseExecutor.query() to accept per-query settings and thread querySettings through PlannerContext. The ObsessionDB workaround (enable_parallel_replicas=0) is now set once at the plugin call site instead of being appended to each SQL string. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
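One way the per-query settings threading might look (a sketch; the real `ClickHouseExecutor` API may differ). ClickHouse's HTTP interface accepts settings as URL parameters, which is what lets the workaround move out of the SQL strings and into a single plugin call site:

```typescript
type QuerySettings = Record<string, string | number>;

// Settings travel as HTTP query parameters rather than SQL suffixes,
// so the call site can pass { enable_parallel_replicas: 0 } once
// instead of appending it to every planning query.
function buildQueryUrl(
  baseUrl: string,
  sql: string,
  settings: QuerySettings = {},
): string {
  const params = new URLSearchParams({ query: sql });
  for (const [key, value] of Object.entries(settings)) {
    params.set(key, String(value));
  }
  return `${baseUrl}/?${params.toString()}`;
}
```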
The backfill moves uncompressed data, so chunk sizing should be based on uncompressed bytes rather than compressed. With ~8x compression ratios, using compressed bytes produced chunks ~8x larger than intended. All size comparisons, merge budgets, and row-target calculations now use bytesUncompressed. Test maxChunkBytes values doubled to match the 2x compression ratio in the test fixture. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
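The row-target arithmetic reduces to a small helper; the field names here are assumptions, not the plugin's actual types:

```typescript
// Assumed stats shape for a part or partition.
interface PartStats {
  rows: number;
  bytesUncompressed: number; // sizing basis: the backfill moves uncompressed data
}

// Rows per chunk such that a chunk's uncompressed size stays under maxChunkBytes.
function rowTarget(stats: PartStats, maxChunkBytes: number): number {
  const avgRowBytes = stats.bytesUncompressed / stats.rows;
  return Math.max(1, Math.floor(maxChunkBytes / avgRowBytes));
}
```

With an ~8x compression ratio, feeding compressed bytes into `avgRowBytes` would make it ~8x smaller and the resulting chunks ~8x larger than intended, which is the bug this commit fixes.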
Analysis showed 3x oversampling is sufficient for equal-width range splitting while reducing the number of estimation queries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
At root level (depth 0), use equal-width EXPLAIN ESTIMATE for the initial split — fast metadata-only probes instead of a full GROUP BY scan. Oversized children re-enter at depth 1+ and get GROUP BY prefix refinement on their narrowed sub-ranges. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
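The equal-width root split can be sketched as pure boundary arithmetic (names assumed); each resulting sub-range would then be probed with a metadata-only EXPLAIN ESTIMATE rather than a full GROUP BY scan:

```typescript
interface KeyRange {
  from: bigint; // inclusive
  to: bigint;   // exclusive
}

// Split a numeric sort-key range into `parts` equal-width sub-ranges.
function equalWidthSplit(range: KeyRange, parts: number): KeyRange[] {
  const span = range.to - range.from;
  const n = BigInt(parts);
  const out: KeyRange[] = [];
  for (let i = 0n; i < n; i++) {
    const from = range.from + (span * i) / n;
    const to = i === n - 1n ? range.to : range.from + (span * (i + 1n)) / n;
    if (to > from) out.push({ from, to }); // drop empty ranges on tiny spans
  }
  return out;
}
```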
Replace prefix-based hot key discovery with a direct GROUP BY key approach for sub-ranges with ≤100 distinct values. One query gives exact per-key counts instead of recursive depth drilling (1→4 chars). When a sub-range contains a single key, narrow the range to an exact match and re-enter dispatch so focusedValue propagates to subsequent dimension splitting (e.g. temporal buckets on the hot key). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
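A sketch of the classification step after the single GROUP BY query returns exact per-key counts (types and threshold handling are assumptions):

```typescript
interface KeyCount {
  key: string;
  rows: number;
}

// Keys whose row count exceeds the chunk budget need their own range
// (and further splitting on a secondary dimension such as a temporal
// bucket); the rest can be packed into shared chunks.
function classifyHotKeys(
  counts: KeyCount[],
  maxRowsPerChunk: number,
): { hot: KeyCount[]; cold: KeyCount[] } {
  const hot: KeyCount[] = [];
  const cold: KeyCount[] = [];
  for (const kc of counts) {
    (kc.rows > maxRowsPerChunk ? hot : cold).push(kc);
  }
  return { hot, cold };
}
```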
The BigInt string boundary computation used a hardcoded 8-byte width, silently truncating sort key values longer than 8 characters. This caused chunk boundaries to miss rows at the end of the value range — e.g. a partition with tenants "mega-corp" (9 chars) through "tenant-0199" (11 chars) would lose all rows past the truncated upper bound "tenant-0". Replace the fixed width with a dynamic width derived from the actual input range strings (max of rangeFrom/rangeTo length). Also strip trailing null bytes from bigIntToStr output to avoid inflating boundary strings beyond their original length, while preserving semantically meaningful nulls via a minLength parameter. Adds E2E test infrastructure: a seed script to populate ClickHouse with controlled datasets, and an E2E test for the skewed power-law scenario (80/20 tenant distribution) that verifies hot key detection, cross-dimension splitting, estimate accuracy, and full row coverage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
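A sketch of the corrected boundary math (helper names assumed to mirror the ones described above). The width comes from the actual range strings instead of a fixed 8, and trailing NUL padding is stripped on the way back, subject to a minLength floor:

```typescript
// Map a string to a big-endian byte integer, padding to `width` with NULs.
function strToBigInt(s: string, width: number): bigint {
  let acc = 0n;
  for (let i = 0; i < width; i++) {
    acc = (acc << 8n) | BigInt(i < s.length ? s.charCodeAt(i) : 0);
  }
  return acc;
}

// Invert strToBigInt, stripping trailing NUL padding but keeping at
// least `minLength` characters (preserves semantically meaningful NULs).
function bigIntToStr(v: bigint, width: number, minLength = 0): string {
  let s = "";
  for (let i = width - 1; i >= 0; i--) {
    s += String.fromCharCode(Number((v >> BigInt(8 * i)) & 0xffn));
  }
  let end = s.length;
  while (end > minLength && s.charCodeAt(end - 1) === 0) end--;
  return s.slice(0, end);
}

// Width derived from the real range strings, not hardcoded:
const rangeFrom = "mega-corp";   // 9 chars
const rangeTo = "tenant-0199";   // 11 chars
const width = Math.max(rangeFrom.length, rangeTo.length); // 11
```

With the old fixed width of 8, `strToBigInt` would read only "tenant-0" out of "tenant-0199", so every key sorting above that truncated upper bound fell outside all chunks.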
Adds scenario 2: three tenants at ~30% each with 10% long tail, verifying that multiple hot keys are independently detected and split on the secondary dimension. Renames seed script to .script.ts so bun test does not re-execute it on every test run. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Development artifacts: benchmark scripts for testing chunking strategies against live ClickHouse, query traces, and E2E scenario planning notes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary

- sdk entry point (@chkit/plugin-backfill/sdk), keeping the main export surface clean.

Test plan

- bun verify passes (typecheck, lint, test, build — 33/33)
- plan command test on a real dataset

🤖 Generated with Claude Code