Improve parallel speedup on pipelined benchmark #761
base: main
Conversation
Looks good, no comments on the general idea. Just some smaller comments to resolve.
-- TODO: the thresholds for doing merge work should be different for each level,
-- and ideally all-pairs co-prime.
Should the thresholds also have some relationship with the size of update batches? Even if the thresholds are co-prime, if the update batch is large enough then we could hit all thresholds at the same time.
That's true of course. But the update batch size is only known dynamically and it can change.
Yes, ideally doing a big batch of updates would not re-synchronise the counters relative to their thresholds.
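To make the failure mode concrete, here is a small illustration (made-up thresholds, not the real ones). With co-prime thresholds, single-step updates spread merge work across the levels, but one large batch crosses every threshold at once:

```haskell
-- A level does merge work whenever its update counter crosses a
-- multiple of its threshold. This counts how many multiples a step
-- from 'before' to 'after' crosses.
crossings :: Int -> Int -> Int -> Int
crossings threshold before after =
  after `div` threshold - before `div` threshold

-- With co-prime thresholds [3,5,7], a single update triggers at most
-- some of the levels:
--   map (\t -> crossings t 20 21) [3,5,7]  ==  [1,0,1]
-- but one batch of 21 updates triggers all of them simultaneously:
--   map (\t -> crossings t 0 21) [3,5,7]   ==  [7,4,3]
```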
Force-pushed from c3ee744 to 2d55f19.
But don't yet actually change the serialisation format. This is partly just to demonstrate to ourselves how to do it, so there's a pattern to follow in future. Doing this highlights that we cannot generally match on the version, and should only do so in places where the format is actually different between versions. Otherwise we would have to duplicate too much code.
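The pattern might look roughly like this (a sketch with invented field encodings and names, not the actual snapshot decoders): the version is threaded through everywhere, but only pattern-matched at the one point where the formats actually differ, so the shared structure is written once.

```haskell
import Text.Read (readMaybe)

data SnapshotVersion = V0 | V1 deriving (Eq, Show)

data Snapshot = Snapshot { snapName :: String, snapCount :: Int }
  deriving Show

-- The overall shape is shared: the version is threaded through
-- but not inspected here.
decodeSnapshot :: SnapshotVersion -> [String] -> Maybe Snapshot
decodeSnapshot v (name : rest) = Snapshot name <$> decodeCount v rest
decodeSnapshot _ _             = Nothing

-- Only this field's encoding differs between versions, so only
-- this decoder matches on the version.
decodeCount :: SnapshotVersion -> [String] -> Maybe Int
decodeCount V0 [n]      = readMaybe n   -- V0: bare field
decodeCount V1 ["c", n] = readMaybe n   -- V1: tagged field
decodeCount _  _        = Nothing
```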
Force-pushed from 25a5a16 to a5ee0ba.
Previously it was hard-coded to be the same as the write buffer size. Document what it means as a new tunable parameter. Setting this low (1) is important for getting good parallel work balance on the pipelined WP8 benchmark. It is a crucial change that makes the pipelined version actually improve performance. Previously it would only get about a 5 to 10% improvement.
And add MergeBatchSize to TableConfigOverride.
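As a rough sketch of the idea (a hypothetical function, not the library's actual code): merge credits get spent in chunks of at most the merge batch size, so a batch size of 1 yields the finest-grained units of merge work, which balance best across cores, while a batch size equal to the write buffer size recovers the old behaviour of one big chunk.

```haskell
-- Split a credit total into work items of at most 'batchSize' each
-- (assumes batchSize >= 1).
supplyMergeCredits :: Int -> Int -> [Int]
supplyMergeCredits batchSize credits =
  replicate (credits `div` batchSize) batchSize
    ++ [r | let r = credits `rem` batchSize, r > 0]

-- ghci> supplyMergeCredits 1 5   -- batch size 1: five small work items
-- [1,1,1,1,1]
-- ghci> supplyMergeCredits 5 5   -- batch size = buffer size: one big item
-- [5]
```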
This now gets real parallel speedups on the WP8 benchmark in pipelined mode. On my laptop, we get:
* non-pipelined mode: 86.5k
* before: pipelined mode (2 cores): 92.2k
* after: pipelined mode (2 cores): 120.0k

In part this is because pipelined mode on 1 core is a regression (70.1k): it has to do strictly more work, and it avoids doing any batching, which normally improves performance.
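In concrete terms that is 120.0k / 86.5k ≈ 1.39× over the non-pipelined baseline, where the previous pipelined figure was only 92.2k / 86.5k ≈ 1.07×.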
Force-pushed from a5ee0ba to f2f42c3.
LGTM!
I have a slightly altered version of this branch on jdral/wp8-bench-pipelined-3 that maybe you could look at. The change I make there is that we keep the golden files for both V0 and V1 around, so that at some point we could test backwards compatibility of versioned decoders. See the second commit. Since our implementation only encodes in the current snapshot version, we'd need those golden files to check backwards compatibility. If you agree with this change, then we could port those commits to this branch.
This would provide a minimal diff between your branch and mine:
git diff origin/dcoutts/wp8-bench-pipelined-3 origin/jdral/wp8-bench-pipelined-3
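A minimal sketch of what such a backwards-compatibility test could look like (file paths and the decoder stand-in are assumptions, not the actual test suite): since the encoder only ever writes the current version, the checked-in golden files are the only source of old-format inputs.

```haskell
import qualified Data.ByteString as BS

data SnapshotVersion = V0 | V1
  deriving (Show, Enum, Bounded)

-- Hypothetical locations of the checked-in golden files.
goldenFile :: SnapshotVersion -> FilePath
goldenFile V0 = "test/golden/snapshot-v0.bin"
goldenFile V1 = "test/golden/snapshot-v1.bin"

-- Stand-in for the real versioned decoder.
decodeAtVersion :: SnapshotVersion -> BS.ByteString -> Either String ()
decodeAtVersion _ bytes
  | BS.null bytes = Left "empty snapshot"
  | otherwise     = Right ()

-- Check that the current decoders still accept every historical format.
main :: IO ()
main = mapM_ check [minBound .. maxBound]
  where
    check v = do
      bytes <- BS.readFile (goldenFile v)
      case decodeAtVersion v bytes of
        Right () -> putStrLn (show v ++ ": decodes ok")
        Left err -> fail (show v ++ ": " ++ err)
```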
The crucial thing is minimising batching of merge work, so that we get better parallel work balance. To do this we expose a new MergeBatchSize in the TableConfig and allow overriding it in the TableConfigOverride.
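In outline, the new tunable might be wired up like this (a sketch; the real field names and types in TableConfig differ):

```haskell
import Data.Maybe (fromMaybe)

data TableConfig = TableConfig
  { confWriteBufferSize :: Int
  , confMergeBatchSize  :: Int  -- previously implicitly the write buffer size
  }

newtype TableConfigOverride = TableConfigOverride
  { overrideMergeBatchSize :: Maybe Int }

applyOverride :: TableConfigOverride -> TableConfig -> TableConfig
applyOverride o conf =
  conf { confMergeBatchSize =
           fromMaybe (confMergeBatchSize conf) (overrideMergeBatchSize o) }
```

The pipelined benchmark can then override the merge batch size down to 1 to get the finest-grained merge work.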