Skip to content

Conversation

@kumarUjjawal
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Sort-merge joins assumed both inputs had the same width when applying filter results. With different column counts (like right-anti joins) we built batches that no longer matched the output schema, so Arrow panicked during concat_batches.

What changes are included in this PR?

  • Fix the null-row and filtered-row builders in SortMergeJoinStream to slice the streamed vs buffered columns using the correct widths, keeping every batch consistent with self.schema.
  • Merge the semi/anti projection branch so it always projects the streamed-side columns expected in the output.

Are these changes tested?

Manual testing and ran previous tests.

Screenshot 2025-11-18 at 5 04 20 PM

Are there any user-facing changes?

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Nov 18, 2025
@kumarUjjawal
Copy link
Contributor Author

hmm, why is the test failing?

@rluvaton
Copy link
Member

Looks like this include fixes created by, let's wait for it to be merged

Also, once merged can you update the fuzz tests for join:
similar to what he added here but add the for loop to the other tests as well:
test_right_semi_join_1k_filtered, test_right_anti_join_1k_filtered, test_right_anti_join_1k_binary, test_right_anti_join_1k_binary_filtered
https://github.com/apache/datafusion/pull/18772/files#diff-f8330244c576c142337679ecbeeb99e9016c2c7c54d7bae492144374e138a0b7

@rluvaton
Copy link
Member

also, when fixing a bug, we require tests to be added that failed on main and pass on this pr

@kumarUjjawal
Copy link
Contributor Author

also, when fixing a bug, we require tests to be added that failed on main and pass on this pr

Sure will do.

@github-actions github-actions bot added the core Core DataFusion crate label Nov 19, 2025
.columns()
.iter()
.skip(right_columns_length)
.take(left_columns_length)
Copy link
Member

@rluvaton rluvaton Nov 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should add it as we can lose columns or hide bugs if we the rest of the columns are not left

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, should I Drop the .take(left_columns_length) and add a sanity check like assert_eq_or_internal_err!(..., left_columns_length, "...")) before extending right_columns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Panic when having right anti join with filter in sort_merge_join and different number of columns between each side

2 participants