AsymmetricJoinSizer passes wrong buildSize to JoinInfo #12354
base: branch-25.04
Conversation
Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
build
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffledSizedHashJoinExec.scala
Some tests failed
Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
build
@firestarman CI fixed, please take another look.
@binmahone can we add a bit more detail to the description of this PR? It's not obvious how the code change closes the issue.
@abellina PR description refined, please take another look.
// We should provide the actual buildSize instead of the truncated one.
// By calling fetchProbeTargetSize again, we'll move all batches to the queue, and
// at the same time we'll get the actual buildSize.
val (_, remainingBytes) = closeOnExcept(buildQueue) { _ =>
I am not a big fan of this solution, which moves all the remaining batches to the queue after a call to `setupForJoin`, since it changes the queue in a way `setupForJoin` cannot observe. This then forces a constraint on `setupForJoin` that it must refer to the input queue directly. Shall we file a follow-up issue to eliminate this limitation?
Hi @firestarman, the reason I'm not draining the iterator after `setupForJoin` is that the iterator returned by `setupForJoin` is unspillable, whereas what we have in the queue is still spillable. The current implementation, as you pointed out, might be a little tricky and imposes some constraints on how we should use the queue, but it saves the cost of registering the batches again. I can add some comments to remind people to be careful about the queue, what do you think?
Filed #12355 for this.
Note that the queue here is not spillable (`HostHostAsymmetricJoinSizer`). It is host memory, afaik, but it still adds another place where we need to go back later and try to control the memory usage.
OK, I think we need to resolve that these are not spillable for the host. Kudo isn't going to be on by default in 25.04, and this change will be on by default for all users.

I think we are going to have to solve the issue of the `queue` not holding spillable elements. We will need to inspect each batch as we pop it from the iterator and decide: if it is a host batch, we need it to be a `SpillableHostColumnarBatch`; if it is a GPU batch, we need to hold a `SpillableColumnarBatch`. If it is Kudo, we do that differently, but it should be consistent: all `ColumnarBatch` instances in that case would be Kudo. (See the sketch below.)

If we don't solve the above, we run the risk of running out of memory on the host where we didn't before.
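A minimal, self-contained sketch of that per-batch decision. `BatchLike` and the wrapper classes below are hypothetical stand-ins for `ColumnarBatch`, `SpillableHostColumnarBatch`, and `SpillableColumnarBatch`, not the real spark-rapids API:

```scala
// Hypothetical batch types standing in for host vs. device ColumnarBatch.
sealed trait BatchLike
final case class HostBatch(sizeBytes: Long) extends BatchLike
final case class GpuBatch(sizeBytes: Long) extends BatchLike

// Hypothetical spillable holders standing in for SpillableHostColumnarBatch
// and SpillableColumnarBatch.
sealed trait SpillableBatch
final case class SpillableHostBatch(b: HostBatch) extends SpillableBatch
final case class SpillableGpuBatch(b: GpuBatch) extends SpillableBatch

// Inspect each batch as it is popped from the iterator and wrap it in the
// matching spillable holder before it goes on the queue.
def makeSpillable(batch: BatchLike): SpillableBatch = batch match {
  case h: HostBatch => SpillableHostBatch(h) // host memory: host-side holder
  case g: GpuBatch  => SpillableGpuBatch(g)  // device memory: device-side holder
  // A Kudo-serialized batch would take a third, Kudo-specific path; in that
  // mode every batch in the queue is expected to be Kudo-serialized.
}
```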
LGTM, but it'd be better to get more reviews.
Signed-off-by: Hongbin Ma (Mahone) <[email protected]>
more comments added
LGTM
build
  fetchProbeTargetSize(probeStreamIter, streamQueue, gpuBatchSizeBytes)
}
val streamIter = setupForJoin(streamQueue, rawStreamIter, exprs.streamTypes,
  gpuBatchSizeBytes, metrics)
- if (streamRows <= Int.MaxValue && streamSize <= gpuBatchSizeBytes) {
+ if (mayTruncatedStreamRows <= Int.MaxValue && mayTruncatedStreamSize <= gpuBatchSizeBytes) {
  assert(!probeStreamIter.hasNext, "stream side not exhausted")
I think if we have a stream side that is exactly `gpuBatchSizeBytes` or exactly `Int.MaxValue`, then this assert will trigger. Not specific to your PR, but the question is: is that possible, and what happens in that case? It seems our metric (`BUILD_DATA_SIZE`) would be wrong, at least.
I'm afraid not. In the case you described, `probeStreamIter` will already be drained in `fetchProbeTargetSize`, so the assert won't fire.
The changes look okay to me, but I don't see any benchmark results. An artificial benchmark would be fine.

The reason I am asking is that this code change makes it much more likely that we are going to spill the build side of the join twice. It limits that spilling to twice, but I fear the common case may be much more expensive than before.

If I remember correctly, we have two different ways of partitioning the build side of the join:

- `BuildSidePartitioner` assumes it knows the size up front, partitions the data according to that size, and does one pass through the build/stream side to partition the data.
- `GpuSubPartitionPairIterator` assumes it has no knowledge about the size of the build table and partitions it a configurable 16 ways recursively. In theory it could do multiple passes through the data, but in practice it will do one pass plus possibly some partial passes.

If we don't fully know the build size, I think `GpuSubPartitionPairIterator` is the better choice.

Long term I would like a single partitioning class that optionally takes a build-side size and uses that information to make proper decisions about how to do the join; see the sketch after this comment. That might even let us play games with AQE data to try to get the right partition number without ever touching the data. In the short term I am fine with something less clean, but I want to see some numbers about how this impacts the common case.

Please correct me if I am wrong about this.
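A hedged sketch of the single partitioning class suggested above; the class name and shape are invented for illustration and do not exist in spark-rapids today:

```scala
// Hypothetical sketch of a unified partitioner: one class that optionally
// takes a build-side size estimate and picks its strategy accordingly.
class UnifiedBuildPartitioner(
    targetBatchSizeBytes: Long,
    knownBuildSizeBytes: Option[Long]) {

  // Fixed fan-out used when the build size is unknown, mirroring the
  // configurable 16-way recursive sub-partitioning mentioned above.
  private val defaultFanOut = 16

  /** Number of partitions to use for the build side. */
  def partitionCount: Int = knownBuildSizeBytes match {
    // Size known up front (BuildSidePartitioner-style): size partitions so
    // each fits in one target batch, enabling a single pass over the data.
    case Some(size) =>
      math.max(1, math.ceil(size.toDouble / targetBatchSizeBytes).toInt)
    // Size unknown (GpuSubPartitionPairIterator-style): fixed fan-out,
    // recursing later on partitions that are still too large.
    case None => defaultFanOut
  }
}
```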
Hi @revans2, a few questions:
I was thinking of something closer to a worst case, something where we would hit this code, because I doubt NDS will ever exercise it. Scale factor 3k is just too small.

I am fine if we put this in as is, but then we need a follow-on issue early in 25.06 to look at the performance of it.
There are lots of cases where it is possible. We could have two cutoffs for pulling in data: right now we go up to one target batch size per side, but it might be better to pull in up to 2x the target batch size, because that covers something like 99% of all the join cases we see; then we would know the exact size (see the toy sketch below). Another possibility is to use metrics from AQE shuffles to estimate the size of the build table. Then we would have some "knowledge" even if it is not perfect.
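A toy illustration of the two-cutoff idea; the 2x factor and the helper name come only from the comment above, not from the code:

```scala
// Toy sketch: keep fetching until 2x the target batch size, so that in the
// ~99% of joins that fit within that bound we know the exact side size
// instead of a truncated estimate.
def shouldKeepFetching(fetchedBytes: Long, targetBatchSizeBytes: Long): Boolean =
  fetchedBytes < 2 * targetBatchSizeBytes
```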
Sorry, I forgot about the comment from @abellina. I'm not sure how I feel about us pulling the entire build side into host memory, especially in the case where we have no limits on the amount of host memory being used. I am going to be out for a few days, but if you can convince @abellina that the fix is okay, then I am fine with it.
The follow-up issue is #12387.

Hi @abellina, need your input on this.

@binmahone please see #12354 (comment)
This PR closes #12353.

Today, `buildSize` is inaccurate because `fetchProbeTargetSize` stops fetching once the accumulated size already exceeds the GPU batch size. We add a parameter named `truncateIfNecessary` to `fetchProbeTargetSize`, so that when it is false, the method exhausts the input `iter` and puts all of the retrieved batches into the input `queue`; the byte size of the remaining batches can then be calculated, returned, and added to the original `buildSize` to correct it. When we are sure one side is going to be used as the build side, and we also know for sure that side is going to be large, we call `fetchProbeTargetSize` with `truncateIfNecessary = false`, so that `buildSize` is accurate.
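A simplified, hypothetical sketch of the behavior described above; `SizedBatch` and this signature are stand-ins, not the actual method in `GpuShuffledSizedHashJoinExec.scala`:

```scala
import scala.collection.mutable

// Illustrative stand-in for the batches flowing through the sizer; only the
// size in bytes matters here.
final case class SizedBatch(sizeBytes: Long)

// Simplified model of fetchProbeTargetSize with the new truncateIfNecessary
// flag. When the flag is false, the iterator is fully drained into the queue,
// so the returned byte count covers everything fetched by this call rather
// than a truncated prefix.
def fetchProbeTargetSize(
    iter: Iterator[SizedBatch],
    queue: mutable.Queue[SizedBatch],
    targetSizeBytes: Long,
    truncateIfNecessary: Boolean): Long = {
  var fetchedBytes = 0L
  while (iter.hasNext && (!truncateIfNecessary || fetchedBytes < targetSizeBytes)) {
    val batch = iter.next()
    queue.enqueue(batch) // the real code would keep these batches spillable
    fetchedBytes += batch.sizeBytes
  }
  fetchedBytes
}
```

Under this model, a first call with `truncateIfNecessary = true` returns the possibly truncated size, and a second call on the same iterator with `truncateIfNecessary = false` drains the remainder into the queue and returns the remaining bytes, which the caller adds to the original `buildSize`.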