Hash join buffering on probe side #19761
Conversation
run benchmarks

🤖 Hi @gabotechs, thanks for the request (#19761 (comment)).

run benchmarks

🤖: Benchmark completed
run benchmark tpcds tpch10

🤖 Benchmark script failed with exit code 1.

run benchmark tpch10

🤖: Benchmark completed
run benchmark tpch

🤖: Benchmark completed

run benchmark tpcds

🤖 Benchmark script failed with exit code 1.
🤔 the tpcds benchmark command seems broken

Interesting idea, do you have some insights on the memory usage vs not doing this "eager execution"?

🤖: Benchmark completed
For tpcds, it seems mostly speedups but also some (4x) slowdowns. Any way we could avoid those?

This definitely has an impact on memory consumption, as it holds record batches in memory until the hash join decides to start consuming them. This is why it's important to put a limit on how much memory is buffered (currently configurable). With the setup reported in the benchmarks, it will buffer at most 1 MB per partition (can be configured with …).

I would not trust the benchmarks reported in the PR description too much, for better or for worse: I've seen the same query take 300 ms or 2500 ms depending on whatever my specific laptop decides to be doing while the benchmark runs. I'd like to run the TPC-DS benchmarks using robot Andrew, which I assume runs in a more stable environment than my laptop.
Thanks for the explanations!

run benchmark tpcds

🤖 Benchmark script failed with exit code 1.
That would be easy to do; however, I fear that it could very easily end up in deadlocks. For example, if partition 0 exhausts all the memory budget, polling any other partition will block until someone pulls something out of partition 0, which might never happen, as whoever could potentially poll partition 0 is too busy deadlocked on partition 1. A healthier behavior IMO would be to have a per-partition memory budget and just set the limit lower: rather than having a global 10 MB limit, just have a per-partition 1 MB limit.
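As a rough illustration of what I mean by a per-partition budget, here is a minimal sketch (a hypothetical buffer_partition helper counting items rather than bytes, not the actual BufferExec code):

```rust
use futures::stream::{Stream, StreamExt};
use tokio::sync::mpsc;
use tokio_stream::wrappers::ReceiverStream;

/// Hypothetical helper: eagerly drive `input` on a background task while the
/// build side is still running, holding at most `capacity` items in memory
/// for this one partition. The consumer drains them later through the
/// returned stream.
fn buffer_partition<T: Send + 'static>(
    input: impl Stream<Item = T> + Send + 'static,
    capacity: usize, // per-partition budget (item count here; bytes in practice)
) -> impl Stream<Item = T> {
    let (tx, rx) = mpsc::channel(capacity);
    tokio::spawn(async move {
        let mut input = Box::pin(input);
        while let Some(item) = input.next().await {
            // `send` waits once this partition's buffer is full, so the
            // partition can never exceed its own budget and cannot starve
            // other partitions the way a shared global budget could.
            if tx.send(item).await.is_err() {
                break; // consumer dropped the output stream
            }
        }
    });
    ReceiverStream::new(rx)
}
```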
🤔 that's interesting, we might be able to react appropriately to …
run benchmark tpcds

🤖: Benchmark completed

run benchmark tpcds

🤖: Benchmark completed
force-pushed from 3e4660b to cdc6ad1
It does seem that some queries get a significant slowdown... I think this needs further investigation.
Which issue does this PR close?
It does not close any issue, but it's related to:
Rationale for this change
This is a PR from a batch of PRs that attempt to improve performance in hash joins:
It adds the new BufferExec node at the top of the probe side of hash joins so that some work is eagerly performed before the build side of the hash join is completely finished.
Why should this speed up joins?
In order to better understand the impact of this PR, it's useful to understand how streams work in Rust: creating a stream does not perform any work; progress is only made when the stream gets polled.
This means that whenever we call .execute() on an ExecutionPlan (like the probe side of a join), nothing happens: not even the most basic TCP connections or system calls are performed. Instead, all of this work is delayed as long as possible until the first poll is made to the stream, losing the opportunity to make some early progress.
This gets worse when multiple hash joins are chained together: they get executed in cascade as if they were domino pieces, which has the benefit of leaving a small memory footprint but underutilizes the resources of the machine that could execute the query faster.
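As a minimal, self-contained sketch of that laziness (plain futures/tokio code, not DataFusion's actual stream types):

```rust
use futures::stream::{self, StreamExt};

#[tokio::main]
async fn main() {
    // Building the stream does no work: the closure below has not run yet.
    let batches = stream::iter(0..3).map(|i| {
        println!("producing batch {i}"); // only runs when the stream is polled
        i
    });
    println!("stream created, nothing produced yet");

    // Work happens only as the consumer polls the stream.
    let collected: Vec<_> = batches.collect().await;
    println!("collected: {collected:?}");
}
```

Until the stream is polled, nothing runs; the point of buffering the probe side is to start that polling early instead of waiting for the build side to finish.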
NOTE: I still don't know if this improves the benchmarks; just experimenting for now.
What changes are included in this PR?
Adds a new HashJoinBuffering physical optimizer rule that will idempotently place BufferExec nodes on the probe side of hash joins:
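Roughly, the resulting plan shape looks like the sketch below (operator names and indentation are illustrative, not actual EXPLAIN output):

```text
HashJoinExec
  AggregateExec        -- build side, used to build the hash table
  BufferExec           -- inserted by HashJoinBuffering (bounded, per partition)
    ParquetExec        -- probe side, now polled eagerly while the build side runs
```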
Are these changes tested?
Yes, by existing tests.
Are there any user-facing changes?
Yes, users will see a new BufferExec being placed at the top of the probe side of each hash join. (Still unsure about whether this should be enabled by default.)
Results
Warning
I'm very skeptical about these benchmarks run on my laptop; take them with a grain of salt. They should be run in a more controlled environment.