Add selection vector repartitioning #15420

Dandandan · 2025-03-25T13:49:49Z

Is your feature request related to a problem or challenge?

Add a mode to RepartitionExec that outputs selection vectors (for now, let's use dense boolean arrays so they can be added to a RecordBatch). The array holds true for each row where hash % total_partition == current_partition, and false otherwise.
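
A minimal sketch of the idea in plain Rust, assuming the row hashes are already computed. `selection_vector` is a hypothetical helper, not a DataFusion API, and a `Vec<bool>` stands in for the dense Arrow boolean array:

```rust
/// Hypothetical helper (not a DataFusion API): builds the dense boolean
/// selection vector for one output partition from precomputed row hashes.
fn selection_vector(hashes: &[u64], total_partitions: u64, current_partition: u64) -> Vec<bool> {
    assert!(total_partitions > 0, "need at least one partition");
    hashes
        .iter()
        // A row is selected when its hash maps to the current partition.
        .map(|h| h % total_partitions == current_partition)
        .collect()
}

fn main() {
    // Made-up hash values, for illustration only.
    let hashes = [14u64, 7, 8, 9];
    let mask = selection_vector(&hashes, 3, 2);
    println!("{:?}", mask);
}
```

The input rows are not copied or reordered here; only one mask per output partition is produced.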

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

acking-you · 2025-03-25T16:41:18Z

After reviewing the related issues, I'm very excited about these features. I'd like to keep a close eye on their implementation, as I feel I can learn a lot from them 🥰

goldmedal · 2025-03-25T17:33:12Z

> The array outputs true for each row that has hash % partition == 0 (and false if not).

I don't understand why the formula is hash % partition == 0. IMO, hash % total_partition is the index of the partition the row belongs to. Maybe the formula should be hash % total_partition == current_partition?

Given the following data:

col1 | col2 | ... | hash % total_partition
-------------------------
data | data | ... | 2
data | data | ... | 1
data | data | ... | 2
data | data | ... | 0

Partition 0 will get:

col1 | col2 | ... | selection
-------------------------
data | data | ... | false
data | data | ... | false
data | data | ... | false
data | data | ... | true

Partition 1 will get:

col1 | col2 | ... | selection
-------------------------
data | data | ... | false
data | data | ... | true
data | data | ... | false
data | data | ... | false

Partition 2 will get:

col1 | col2 | ... | selection
-------------------------
data | data | ... | true
data | data | ... | false
data | data | ... | true
data | data | ... | false

Then the downstream plan can aggregate or join only the records whose selection is true.
Does that make sense?
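
A rough sketch of how a downstream operator might consume this, in plain Rust. The `values` column and the `sum_selected` helper are made up for illustration (the tables above only say "data"), and the partition indices are the ones from the example:

```rust
/// Hypothetical consumer (not a DataFusion API): sums only the values
/// whose selection flag is true, leaving the input column untouched.
fn sum_selected(values: &[i64], selection: &[bool]) -> i64 {
    assert_eq!(values.len(), selection.len());
    values
        .iter()
        .zip(selection)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .sum()
}

fn main() {
    // Assumed column values; hash % total_partition per row as in the example.
    let values = [10i64, 20, 30, 40];
    let partition_of_row = [2usize, 1, 2, 0];
    let total_partitions = 3;

    for current in 0..total_partitions {
        // Every partition sees the same rows but gets its own mask.
        let selection: Vec<bool> = partition_of_row.iter().map(|p| *p == current).collect();
        let sum = sum_selected(&values, &selection);
        println!("partition {current}: selection={selection:?}, sum={sum}");
    }
}
```

Each row is true in exactly one partition's mask, so the per-partition results together cover every row once.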

goldmedal · 2025-03-25T17:49:08Z

I'm not sure, but I created a draft #15423 for my idea.

Dandandan · 2025-03-25T18:50:42Z

Oh, you're right @goldmedal, the formula is hash % total_partition == current_partition.

Dandandan added the enhancement label Mar 25, 2025

This was referenced Mar 25, 2025

Support zero copy hash repartitioning for Hash Join #15382

Open

Support zero copy hash repartitioning for Hash Aggregate #15383

Open

goldmedal linked a pull request Mar 25, 2025 that will close this issue

Introduce selection vector repartitioning #15423

Open
