[Enhancement] Route range-distribution OLAP tables by per-index distribution expressions #74753

Merged
xiangguangyxg merged 2 commits into StarRocks:main from xiangguangyxg:range-per-index-distribution-routing-p1a on Jun 18, 2026

@xiangguangyxg xiangguangyxg commented Jun 13, 2026

Contributor

Why I'm doing:

Range-distribution (shared-data) tables route rows to tablets by per-tablet boundaries stored in sort-key space, but the OLAP table sink could only carry a single partition-level distribution-column set. It therefore could not route different rows to materialized indexes that live in different key spaces. This is the missing sink piece for two future features — the K-tablet shadow-index rewrite job (key-column schema change) and range-distribution rollup — both of which need a base index and a new-key index to coexist in one partition and be routed independently.

What I'm doing:

Add per-index distribution routing to the sink:

  • thrift: new TOlapTableIndexSchema.distributed_exprs (field 9) carrying per-index routing expression trees, evaluated at the sink sender. Sender-only: POlapTableIndexSchema (proto) is unchanged, so remote write channels never route by it.
  • FE: OlapTableSink.createSchema fills distributed_exprs for range-distribution tables with slot-refs over each index's range sort-key columns, gated to the OLAP write-sink path (dictionary / non-write callers do not emit it). For today's base-only range tables this resolves to exactly the columns the partition-level path already used, so routing is behavior-preserving. Also adds an optional targetWriteIndexId filter (write only one index; schema, partition and loaded-index lists stay 1:1 by meta id).
  • BE: OlapTableSchemaParam parses distributed_exprs into per-index ExprContexts (prepare/open/close lifecycle); the range sink sender evaluates them once per chunk per index and routes via RangeRouter. RangeRouter::init validates routing-key types against the boundary types; a new route_chunk_rows overload routes from pre-evaluated columns; an empty distributed_exprs (K=1) routes to the single tablet. When an index has no distributed_exprs, routing falls back to the partition-level path unchanged.
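
The routing rules in the list above can be sketched in a few lines. This is a minimal illustration, not the BE implementation: `route_rows`, `route_chunk`, the dict-based index layout, and the choice that a key equal to a boundary belongs to the tablet on its right are all assumptions made for the example.

```python
from bisect import bisect_right


def route_rows(keys, boundaries):
    """Return a tablet ordinal for each routing key.

    `boundaries` holds the K-1 sorted split points of K range tablets.
    Assumption: a key equal to a boundary belongs to the tablet on its right.
    """
    if not boundaries:
        # K=1: every row goes to the single tablet.
        return [0] * len(keys)
    return [bisect_right(boundaries, key) for key in keys]


def route_chunk(chunk, index, partition_columns):
    """Route one chunk (a list of dict rows) for one index.

    `index` is a hypothetical dict with "boundaries" and an optional
    "distributed_exprs" list of callables (row -> value).
    """
    exprs = index.get("distributed_exprs")
    if exprs is None:
        # Field not set: fall back to the partition-level column set.
        keys = [tuple(row[c] for c in partition_columns) for row in chunk]
    else:
        # Evaluate the per-index expressions once for the whole chunk.
        keys = [tuple(e(row) for e in exprs) for row in chunk]
    return route_rows(keys, index["boundaries"])
```

With two indexes that live in different key spaces, the same chunk is routed twice, once per index, and each call consults only that index's boundaries.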

No version gate is needed: StarRocks upgrades BE/CN before FE, so a newly-upgraded FE (the only one that emits the field) never runs against a BE that does not understand it, and an old FE never emits it. Non-range tables and any unset field are byte-for-byte unchanged. This is prerequisite-only: the new capability is dormant for existing tables and consumed by future work.

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
    • This pr needs auto generate documentation
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.1
    • 4.0
    • 3.5

🤖 Generated with Claude Code

…ibution expressions

Range-distribution tables route rows to tablets by per-tablet boundaries in
sort-key space, but the sink could only carry one partition-level
distribution-column set, so it could not route different rows to indexes that
live in different key spaces. This blocks the future K-tablet shadow-index
rewrite job and range-distribution rollup, both of which need a base index and a
new-key index to coexist in one partition.

Add per-index distribution routing:
- thrift: TOlapTableIndexSchema.distributed_exprs (field 9) carries per-index
  routing expression trees, evaluated at the sink sender. Sender-only:
  POlapTableIndexSchema is unchanged, so remote write channels never route by it.
- FE: OlapTableSink.createSchema fills distributed_exprs for range-distribution
  tables (slot-refs over each index's range sort-key columns), gated to the OLAP
  write-sink path (dictionary/non-write callers pass emitDistributedExprs=false).
  For today's base-only range tables this resolves to the same columns the
  partition-level path used, so routing is behavior-preserving. Adds an optional
  targetWriteIndexId filter (write only one index; schema, partition, and
  loaded-index lists stay 1:1 by meta id).
- BE: OlapTableSchemaParam parses distributed_exprs into per-index ExprContexts
  (prepare/open/close lifecycle); the range sink sender evaluates them once per
  chunk per index and routes via RangeRouter. RangeRouter::init validates the
  routing-key types against the boundary types; a new route_chunk_rows overload
  routes from pre-evaluated columns; an empty distributed_exprs (K=1) routes to
  the single tablet. When an index has no distributed_exprs, routing falls back
  to the partition-level path unchanged.
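
The key-type validation mentioned for RangeRouter::init could look roughly like the check below. It is only a sketch: the function name, the use of plain type-name strings, and the rule that arity and every positional type must match exactly are assumptions, not something the commit message specifies.

```python
def validate_routing_key_types(expr_types, boundary_types):
    """Raise if per-index routing keys cannot be compared with tablet boundaries.

    Both arguments are lists of type names, one per key column.
    Assumption: arity and each positional type must match exactly.
    """
    if len(expr_types) != len(boundary_types):
        raise ValueError(
            f"routing key has {len(expr_types)} columns, "
            f"boundaries have {len(boundary_types)}"
        )
    for pos, (got, want) in enumerate(zip(expr_types, boundary_types)):
        if got != want:
            raise TypeError(
                f"column {pos}: routing key type {got} != boundary type {want}"
            )
```

Failing at init time rather than per row keeps a type mismatch from silently sending rows to the wrong tablet.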

No version gate is needed: StarRocks upgrades BE/CN before FE, so a new FE never
runs against a BE that does not understand the field, and an old FE never emits
it. Non-range tables and any unset field are byte-for-byte unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wanpengfei-git wanpengfei-git requested a review from a team June 13, 2026 01:42
@CelerData-Reviewer

@codex review

@github-actions github-actions Bot requested review from meegoo and srlch June 13, 2026 01:46
@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Nice work!

Reviewed commit: b17d19954f

@github-actions

Contributor

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions

Contributor

[FE Incremental Coverage Report]

pass : 42 / 45 (93.33%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| 🔵 com/starrocks/planner/OlapTableSink.java | 39 | 42 | 92.86% | [289, 290, 645] |
| 🔵 com/starrocks/service/FrontendServiceImpl.java | 2 | 2 | 100.00% | [] |
| 🔵 com/starrocks/planner/DictionaryCacheSink.java | 1 | 1 | 100.00% | [] |

@github-actions

Contributor

[BE Incremental Coverage Report]

pass : 75 / 80 (93.75%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
| --- | --- | --- | --- | --- |
| 🔵 be/src/exec/data_sinks/range_tablet_sink_sender.cpp | 34 | 38 | 89.47% | [44, 111, 112, 114] |
| 🔵 be/src/exec/data_sinks/range_router.cpp | 29 | 30 | 96.67% | [50] |
| 🔵 be/src/exec/tablet_info.cpp | 12 | 12 | 100.00% | [] |

@xiangguangyxg

Contributor Author

Real-cluster E2E validation (shared-data)

Built this PR on our internal test platform and validated on a fresh shared-data cluster (1 FE + 3 CN, this PR's build).

What's validated: the per-index distributed_exprs routing path. For each materialized index of a range-distribution table, rows must be routed to the correct range tablet by that index's key. Oracle per case: load a deterministic key pattern → split into K range tablets → write new rows that span the tablet boundaries → assert that (a) data is conserved (COUNT/SUM match the analytically expected values), (b) every row sits inside the tablet whose range covers it (0 misplaced, checked via the TABLET() hint), and (c) per-tablet counts are exact and sum to the total.
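
The three assertions of the oracle can be expressed as a small checker. This is a hedged sketch rather than the test harness used on the cluster: `check_placement`, the `(tablet, key, value)` row shape, and the lower-inclusive boundary convention are assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter


def check_placement(rows, boundaries, expected_count, expected_sum):
    """Check observed rows against the three oracle conditions.

    `rows` is a list of (tablet_ordinal, key, value) triples, where
    tablet_ordinal is the tablet the row was actually read from.
    Assumption: a key equal to a boundary belongs to the tablet on its right.
    """
    # (a) conservation: total count and sum match the analytic expectation.
    conserved = (
        len(rows) == expected_count
        and sum(value for _, _, value in rows) == expected_sum
    )
    # (b) placement: count rows whose key falls outside their tablet's range.
    misplaced = sum(
        1 for tablet, key, _ in rows if bisect_right(boundaries, key) != tablet
    )
    # (c) per-tablet counts, which must add up to the total.
    per_tablet = Counter(tablet for tablet, _, _ in rows)
    return conserved, misplaced, dict(per_tablet)
```

A passing case returns `conserved` as true, `misplaced` as 0, and per-tablet counts that sum to `expected_count`.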

| Case | Key type | Tablets | Checks | Result |
| --- | --- | --- | --- | --- |
| DUPLICATE | INT (k1,k2) | 2 | conservation + post-split write routing (0 misplaced) | |
| PRIMARY KEY | INT (k1,k2) | 3 | dedup + conservation + routing (0/0/0) + dedup update | |
| DUPLICATE | VARCHAR(20) leading key | 5 | byte-ordered routing (0×5) + conservation | |
| AGGREGATE | INT (k1,k2), v1 SUM | 1 | aggregate-on-write correctness | |

Each "post-split write" exercises the new sink per-index routing into multiple range tablets. Example (PK): the table was split at boundaries (39322,7) / (78644,3) into 3 tablets, then 100k new rows spanning all three were written: 0 rows misplaced, per-tablet counts exact, total conserved at 1.1M rows; a re-write of existing keys confirmed PK upsert. The VARCHAR table, split into 5 tablets at zero-padded boundaries (026215, 052429, 078644, 098305), confirmed byte-ordered routing with 0 misplaced.
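
To make the two orderings concrete, the snippet below routes a few keys against the boundaries quoted above. It is illustrative only: `tablet_of` and `pad` are hypothetical helpers, and both the 6-character zero-padding of the VARCHAR keys and the lower-inclusive boundary convention are assumptions.

```python
from bisect import bisect_right

# Composite INT keys compare column by column, like Python tuples.
pk_boundaries = [(39322, 7), (78644, 3)]

# VARCHAR keys compare byte by byte; zero-padding keeps that order numeric.
varchar_boundaries = [b"026215", b"052429", b"078644", b"098305"]


def tablet_of(key, boundaries):
    """Ordinal of the range tablet covering `key`.

    Assumption: a key equal to a boundary belongs to the tablet on its right.
    """
    return bisect_right(boundaries, key)


def pad(n, width=6):
    """Encode an integer as a zero-padded ASCII key (assumed key format)."""
    return str(n).zfill(width).encode("ascii")
```

Under these assumptions `tablet_of((39322, 6), pk_boundaries)` is 0 while `tablet_of((39322, 7), pk_boundaries)` is 1, so the second key column decides placement when the first one ties. Without padding, byte order diverges from numeric order (`b"9" > b"10"`), which is why the boundaries are zero-padded.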

Range-distribution routing is behavior-preserving for base-only tables; these results confirm loads / queries / splits stay correct with the new per-index routing path active (no regression).

Guard sanity (existing range sort-key guards still hold on this build):

  • MODIFY COLUMN k1 BIGINT on a range sort-key → rejected
  • MODIFY COLUMN k1 VARCHAR(40) (widen) → allowed
  • ADD ROLLUP on a range table → rejected

Unit tests: FE OlapTableSinkTest (26) and BE RangeRouterTest + TabletSinkSenderRange* + OlapTablePartitionParamTest (incl. per-index routing, K=1 single-tablet, key-type validation) all pass.

@github-actions github-actions Bot added the 4.1 label Jun 13, 2026
@xiangguangyxg xiangguangyxg requested a review from kevincai June 13, 2026 09:01
@kevincai

Contributor

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Comment thread fe/fe-core/src/main/java/com/starrocks/planner/OlapTableSink.java
@xiangguangyxg xiangguangyxg requested a review from srlch June 18, 2026 08:43
@xiangguangyxg xiangguangyxg enabled auto-merge (squash) June 18, 2026 09:19
@xiangguangyxg xiangguangyxg merged commit 60a751b into StarRocks:main Jun 18, 2026
103 of 106 checks passed
@github-actions

Contributor

@Mergifyio backport branch-4.1

@github-actions github-actions Bot removed the 4.1 label Jun 18, 2026
@mergify

mergify Bot commented Jun 18, 2026

Contributor

backport branch-4.1

✅ Backports have been created

Details

Cherry-pick of 60a751b has failed:

On branch mergify/bp/branch-4.1/pr-74753
Your branch is up to date with 'origin/branch-4.1'.

You are currently cherry-picking commit 60a751bdaa.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   be/src/exec/range_router.cpp
	modified:   be/src/exec/range_router.h
	modified:   be/src/exec/range_tablet_sink_sender.cpp
	modified:   be/src/exec/range_tablet_sink_sender.h
	modified:   be/src/exec/tablet_info.h
	modified:   be/test/exec/range_router_test.cpp
	modified:   be/test/exec/tablet_info_test.cpp
	modified:   fe/fe-core/src/main/java/com/starrocks/planner/DictionaryCacheSink.java
	modified:   fe/fe-core/src/main/java/com/starrocks/planner/OlapTableSink.java
	modified:   fe/fe-core/src/main/java/com/starrocks/service/FrontendServiceImpl.java
	modified:   fe/fe-core/src/test/java/com/starrocks/planner/OlapTableSinkTest.java
	modified:   fe/fe-core/src/test/java/com/starrocks/planner/OlapTableSinkTest2.java
	modified:   gensrc/thrift/Descriptors.thrift

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   be/src/exec/tablet_info.cpp
	both modified:   be/test/exec/tablet_sink_sender_range_test.cpp

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@xiangguangyxg xiangguangyxg deleted the range-per-index-distribution-routing-p1a branch June 18, 2026 11:27
xiangguangyxg added a commit that referenced this pull request Jun 18, 2026
…ibution expressions (#74753)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: wanpengfei-git <wanpengfei91@163.com>
(cherry picked from commit 60a751b)
wanpengfei-git added a commit that referenced this pull request Jun 18, 2026
…ibution expressions (backport #74753) (#75013)

Co-authored-by: xiangguangyxg <110401425+xiangguangyxg@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: wanpengfei-git <wanpengfei91@163.com>
6 participants