Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add selection vector repartitioning #15420

Open
Tracked by #15382 ...
Dandandan opened this issue Mar 25, 2025 · 4 comments · May be fixed by #15423
Open
Tracked by #15382 ...

Add selection vector repartitioning #15420

Dandandan opened this issue Mar 25, 2025 · 4 comments · May be fixed by #15423
Labels
enhancement New feature or request

Comments

@Dandandan
Copy link
Contributor

Dandandan commented Mar 25, 2025

Is your feature request related to a problem or challenge?

Add a mode that outputs selection vectors (for now let's use dense boolean arrays so it can be added to RecordBatch) in RepartitionExec. The array outputs true for each row that has hash % total_partition == current_partition (and false if not).

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@acking-you
Copy link
Contributor

After reviewing the related issues, I'm very excited about these features. I'd like to keep a close eye on their implementation, as I feel I can learn a lot from them 🥰

@goldmedal
Copy link
Contributor

The array outputs true for each row that has hash % partition == 0 (and false if not).

I don't understand why the formula is hash % partition == 0? IMO, hash % total_partition is the number of the portion it belongs. Maybe the formula should be hash % total_partition == current_partition?

Given the following data:

col1 | col2 | ... | hash % total_partition
-------------------------
data | data | ... | 2
data | data | ... | 1
data | data | ... | 2
data | data | ... | 0

The 0 partition will get

col1 | col2 | ... | selection
-------------------------
data | data | ... | false
data | data | ... | false
data | data | ... | false
data | data | ... | true

The 1 partition will get

col1 | col2 | ... | selection
-------------------------
data | data | ... | false
data | data | ... | true
data | data | ... | false
data | data | ... | false

The 2 partition will get

col1 | col2 | ... | selection
-------------------------
data | data | ... | true
data | data | ... | false
data | data | ... | true
data | data | ... | false

Then, the following plan can aggregate or join the record which selection is true.
Does it make sense?

@goldmedal goldmedal linked a pull request Mar 25, 2025 that will close this issue
@goldmedal
Copy link
Contributor

I'm not sure, but I created a draft #15423 for my idea.

@Dandandan
Copy link
Contributor Author

Oh you're right @goldmedal , the formula is hash % total_partition == current_partition

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants