Improve parallel speedup on pipelined benchmark #761
base: main
Conversation
Looks good, no comments on the general idea. Just some smaller comments to resolve.
-- TODO: the thresholds for doing merge work should be different for each level,
-- and ideally all-pairs co-prime.
Should the thresholds also have some relationship with the size of update batches? Even if the thresholds are co-prime, if the update batch is large enough then we could hit all thresholds at the same time.
That's true of course. But the update batch size is only known dynamically and it can change.
Yes, ideally doing a big batch of updates would not re-synchronise the counters relative to their thresholds.
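To make the failure mode concrete, here is a small illustration (made-up thresholds, not the real ones). With co-prime thresholds, single-step updates spread merge work across the levels, but one large batch crosses every threshold at once:

```haskell
-- A level does merge work whenever its update counter crosses a
-- multiple of its threshold. This counts how many multiples a step
-- from 'before' to 'after' crosses.
crossings :: Int -> Int -> Int -> Int
crossings threshold before after =
  after `div` threshold - before `div` threshold

-- With co-prime thresholds [3,5,7], a single update triggers at most
-- some of the levels:
--   map (\t -> crossings t 20 21) [3,5,7]  ==  [1,0,1]
-- but one batch of 21 updates triggers all of them simultaneously:
--   map (\t -> crossings t 0 21) [3,5,7]   ==  [7,4,3]
```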
Force-pushed from c3ee744 to 2d55f19.
But don't yet actually change the serialisation format. This is partly just to demonstrate to ourselves how to do it, so there's a pattern to follow in future. Doing this highlights that we cannot generally match on the version, and should only do so in places where the format is actually different between versions. Otherwise we would have to duplicate too much code.
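The pattern might look roughly like this (a sketch with invented field encodings and names, not the actual snapshot decoders): the version is threaded through everywhere, but only pattern-matched at the one point where the formats actually differ, so the shared structure is written once.

```haskell
import Text.Read (readMaybe)

data SnapshotVersion = V0 | V1 deriving (Eq, Show)

data Snapshot = Snapshot { snapName :: String, snapCount :: Int }
  deriving Show

-- The overall shape is shared: the version is threaded through
-- but not inspected here.
decodeSnapshot :: SnapshotVersion -> [String] -> Maybe Snapshot
decodeSnapshot v (name : rest) = Snapshot name <$> decodeCount v rest
decodeSnapshot _ _             = Nothing

-- Only this field's encoding differs between versions, so only
-- this decoder matches on the version.
decodeCount :: SnapshotVersion -> [String] -> Maybe Int
decodeCount V0 [n]      = readMaybe n   -- V0: bare field
decodeCount V1 ["c", n] = readMaybe n   -- V1: tagged field
decodeCount _  _        = Nothing
```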
Force-pushed from 25a5a16 to a5ee0ba.
Previously it was hard-coded to be the same as the write buffer size. Document what it means as a new tunable parameter. Setting this low (1) is important for getting good parallel work balance on the pipelined WP8 benchmark. It is a crucial change that makes the pipelined version actually improve performance. Previously it would only get about a 5 to 10% improvement.
And add MergeBatchSize to TableConfigOverride.
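As a rough sketch of the idea (a hypothetical function, not the library's actual code): merge credits get spent in chunks of at most the merge batch size, so a batch size of 1 yields the finest-grained units of merge work, which balance best across cores, while a batch size equal to the write buffer size recovers the old behaviour of one big chunk.

```haskell
-- Split a credit total into work items of at most 'batchSize' each
-- (assumes batchSize >= 1).
supplyMergeCredits :: Int -> Int -> [Int]
supplyMergeCredits batchSize credits =
  replicate (credits `div` batchSize) batchSize
    ++ [r | let r = credits `rem` batchSize, r > 0]

-- ghci> supplyMergeCredits 1 5   -- batch size 1: five small work items
-- [1,1,1,1,1]
-- ghci> supplyMergeCredits 5 5   -- batch size = buffer size: one big item
-- [5]
```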
This now gets real parallel speedups on the WP8 benchmark in pipelined mode. On my laptop, we get:
* non-pipelined mode: 86.5k
* before: pipelined mode (2 cores): 92.2k
* after: pipelined mode (2 cores): 120.0k

In part this is because pipelined mode on 1 core is a regression (70.1k): it has to do strictly more work, and it avoids doing any batching, which normally improves performance.
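In concrete terms that is 120.0k / 86.5k ≈ 1.39× over the non-pipelined baseline, where the previous pipelined figure was only 92.2k / 86.5k ≈ 1.07×.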
Force-pushed from a5ee0ba to f2f42c3.
LGTM!
I have a slightly altered version of this branch on jdral/wp8-bench-pipelined-3 that maybe you could look at. The change I make there is that we keep the golden files for both V0 and V1 around, so that at some point we could test backwards compatibility of versioned decoders. See the second commit. Since our implementation only encodes in the current snapshot version, we'd need those golden files to check backwards compatibility. If you agree with this change, then we could port those commits to this branch.
This would provide a minimal diff between your branch and mine:
git diff origin/dcoutts/wp8-bench-pipelined-3 origin/jdral/wp8-bench-pipelined-3
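A minimal sketch of what such a backwards-compatibility test could look like (file paths and the decoder stand-in are assumptions, not the actual test suite): since the encoder only ever writes the current version, the checked-in golden files are the only source of old-format inputs.

```haskell
import qualified Data.ByteString as BS

data SnapshotVersion = V0 | V1
  deriving (Show, Enum, Bounded)

-- Hypothetical locations of the checked-in golden files.
goldenFile :: SnapshotVersion -> FilePath
goldenFile V0 = "test/golden/snapshot-v0.bin"
goldenFile V1 = "test/golden/snapshot-v1.bin"

-- Stand-in for the real versioned decoder.
decodeAtVersion :: SnapshotVersion -> BS.ByteString -> Either String ()
decodeAtVersion _ bytes
  | BS.null bytes = Left "empty snapshot"
  | otherwise     = Right ()

-- Check that the current decoders still accept every historical format.
main :: IO ()
main = mapM_ check [minBound .. maxBound]
  where
    check v = do
      bytes <- BS.readFile (goldenFile v)
      case decodeAtVersion v bytes of
        Right () -> putStrLn (show v ++ ": decodes ok")
        Left err -> fail (show v ++ ": " ++ err)
```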
The crucial thing is minimising batching of merge work, so that we get better parallel work balance. To do this we expose a new MergeBatchSize in the TableConfig and allow overriding it in the TableConfigOverride.
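In outline, the new tunable might be wired up like this (a sketch; the real field names and types in TableConfig differ):

```haskell
import Data.Maybe (fromMaybe)

data TableConfig = TableConfig
  { confWriteBufferSize :: Int
  , confMergeBatchSize  :: Int  -- previously implicitly the write buffer size
  }

newtype TableConfigOverride = TableConfigOverride
  { overrideMergeBatchSize :: Maybe Int }

applyOverride :: TableConfigOverride -> TableConfig -> TableConfig
applyOverride o conf =
  conf { confMergeBatchSize =
           fromMaybe (confMergeBatchSize conf) (overrideMergeBatchSize o) }
```

The pipelined benchmark can then override the merge batch size down to 1 to get the finest-grained merge work.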