
Conversation

@pcd1193182
Contributor

Sponsored by: [Wasabi Technology, Inc.; Klara, Inc.]

Motivation and Context

Currently, the segment-based weighting algorithm is used by default to select metaslabs for allocations. This algorithm uses the size of the largest bucket of free segments to estimate whether an allocation can be satisfied. This works reasonably well for selecting metaslabs to load, since they should always be able to satisfy at least one allocation. However, there are times when the metaslab with the most segments of the largest size is not the best choice; for example, if one metaslab has a single 64 MiB segment while another has 100 free 32 MiB segments, the latter will be able to satisfy many more allocations in practice, as long as the allocations are smaller than 32 MiB. The space-based weighting algorithm will sometimes make this decision correctly, but it has its own downsides (chiefly, it has no good way of telling whether a metaslab will be able to satisfy any allocations at all).

Description

This patch introduces a new weighting algorithm that tries to combine the strengths of both current algorithms. It is primarily space-based, but instead of taking the total free space and applying the fragmentation to it, it weights each bucket of free segments, with larger segment-size buckets receiving a higher weight to convey that larger free segments are useful both for having more space and for satisfying larger individual allocations. It also includes information about the bucket of the largest free segment, which allows us to make better decisions about whether the metaslab is suitable for a given allocation.
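As a rough illustration of the idea, here is a minimal sketch of the bucket-weighted sum; the function name, the histogram layout, and the exact exponent here are hypothetical, not the code in this patch:

```c
#include <stdint.h>

/*
 * Minimal sketch (hypothetical names and layout, not the actual patch).
 * histogram[i] counts free segments whose size falls in power-of-two
 * bucket i; larger buckets get a larger per-segment weight so that they
 * count for more than their raw space. Assumes num_buckets <= 32 so the
 * shifts stay well-defined.
 */
static uint64_t
bucket_weight_sketch(const uint64_t *histogram, int num_buckets,
    int *largest_bucket)
{
	uint64_t weight = 0;

	*largest_bucket = -1;
	for (int i = 0; i < num_buckets; i++) {
		if (histogram[i] == 0)
			continue;
		/*
		 * Each segment in bucket i contributes 2^(2i) to the
		 * weight. A real implementation also has to deal with
		 * this sum overflowing the usable weight bits, which is
		 * discussed further down in this thread.
		 */
		weight += histogram[i] << (2 * i);
		*largest_bucket = i;
	}
	return (weight);
}
```

The largest non-empty bucket is what lets the allocator judge up front whether a metaslab can satisfy an allocation of a given size; how that index is actually encoded alongside the weight is a detail of the patch itself.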

I don't love the way the weighting algorithm is currently selected; it may be time to implement something more like the control for the metaslab_allocators tunables. I did not include that in this version, but that could be changed if people feel it is important.

How Has This Been Tested?

In addition to correctness testing by running the test suite with segment weighting disabled and v2 space weighting enabled, extensive performance testing was also completed. The setup used for the tests was a pool that is 90% full with 60% fragmentation, which keeps the testing focused primarily on the algorithms' ability to select metaslabs for loading. Testing used random writes with large sector sizes to further stress the allocation code, and each run lasted 20 minutes. A few key metrics were considered: bandwidth (self-explanatory), gang blocks and multi-level ganging, and TXG sync times.

| Weighting algorithm | TXG mean (ms) | TXG stddev (ms) | TXG 99%ile (ms) | Gang blocks | Gang multiblocks | BW (MB/s) |
|---|---|---|---|---|---|---|
| Space | 111250.5 | 9366.88 | 36404.9 | 138748 | 110295 | 162.8 |
| Segment | 11624.4 | 10096.3 | 36914.4 | 147258 | 121747 | 163.3 |
| New | 11743.1 | 9799.49 | 37313.4 | 129805 | 104044 | 160.333 |

While the changes in most categories are small (~1% in average and 99th%ile TXG sync times and bandwidth, within the margin of error), the new algorithm does seem to offer a reduction in stddev of sync times over the segment-based algorithm, and a reduction in ganging compared to both existing algorithms. Only the ganging reduction is statistically significant (95% confidence, 10 runs each).

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

This patch introduces a new weighting algorithm that tries to combine
the strengths of both current algorithms. It is primarily space-based,
but instead of taking the total free space and applying the fragmentation
to it, it weights each bucket of segments (larger segment-size buckets
get a higher weighting, to convey that larger free segments are worth
more than just their larger size). It also includes information about
the bucket of the largest free segment, which allows us to make better
decisions about whether the metaslab is suitable for a given allocation.

Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara, Inc.
Signed-off-by: Paul Dagnelie <[email protected]>
@shodanshok
Contributor

How about CPU load, especially when running on an NVMe pool? Thanks.

@amotin
Member

amotin commented Nov 18, 2025

How about CPU load, especially when running on an NVMe pool? Thanks.

I don't think it matters, unless you mean CPU load from metaslab loads/unloads, which I think could be a useful metric for comparison.

@amotin
Member

amotin commented Nov 18, 2025

Addition of the new algorithm is routine, but selection of one is not so much. I remember when we selected the new fragmentation metric there was some logic behind it, based on typical allocation sizes, etc. Here I think the doubling of the powers could be motivated better. I understand the goal that bigger segments are more valuable, but I am not sure I like the easy possibility of overflow. Overflows make different metaslabs look the same. I am also not sure how it may affect fragmentation growth, since I think allocating from the biggest possible segments may give smaller ones time to coalesce with frees. And with further disk capacity growth I think we may start seeing the overflows too often.

@amotin added the labels Status: Code Review Needed (ready for review and testing) and Status: Design Review Needed (architecture or design is under discussion) on Nov 18, 2025
@pcd1193182
Contributor Author

I don't think it matters, unless you mean CPU load from metaslab loads/unloads, which I think could be a useful metric for comparison.

I'll generate those metrics and update the PR with them.

Here I think the doubling of the powers could be motivated better. I understand the goal that bigger segments are more valuable, but I am not sure I like the easy possibility of overflow.

The motivation behind the 2^{2i} is somewhat fuzzy, I'll admit. I considered other weightings, like 2^{1.5*i} or i^2 * 2^i, but they didn't seem to scale the way I wanted; is a single 64 MiB segment 40% or 8% better than two 32 MiB segments? Part of the problem with the whole weighting system is that how good a metaslab is depends on the workload we're dealing with, which makes it very hard to reason about these things in the general case. But I could see 1.5x scaling being better than 2x.
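To spell out where those percentages come from (a back-of-the-envelope reading on my part, taking i as the size-class index of a 32 MiB segment, so a 64 MiB segment lands in bucket i+1):

$$
\frac{w(\text{one 64 MiB segment})}{w(\text{two 32 MiB segments})} =
\begin{cases}
2^{2(i+1)} / (2 \cdot 2^{2i}) = 2 & \text{for } 2^{2i} \\
2^{1.5(i+1)} / (2 \cdot 2^{1.5i}) = \sqrt{2} \approx 1.41 & \text{for } 2^{1.5i} \\
(i+1)^2 2^{i+1} / (2 \cdot i^2 2^i) = \left(\tfrac{i+1}{i}\right)^2 \approx 1.08 & \text{for } i^2 \cdot 2^i \text{ at } i \approx 25
\end{cases}
$$

So 2x scaling says the single larger segment is twice as valuable, 1.5x says about 40% more valuable, and i^2 * 2^i says only about 8% more.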

Overflows make different metaslabs look the same.

From a loading perspective, any metaslab big enough to overflow is pretty much the same. Even with ashift=9, a metaslab that manages to overflow needs at least 2^26 free sectors, or 32 GiB of free space. And that's if it's a single segment of that size; if the segments are smaller, there has to be more free space, exponentially more as the segment size goes down. If your segments are only 32 MiB, you need 2^20 of them, or 32 TiB of free space in that metaslab, to overflow.

I am also not sure how it may affect fragmentation growth, since I think allocating from the biggest possible segments may give smaller ones time to coalesce with frees. And with further disk capacity growth I think we may start seeing the overflows too often.

In the current vdev code, a ~4 PB vdev with 512-byte sectors would be the smallest that could overflow. With raidz3 or draid that is certainly possible, but any metaslab with significant allocations at all will quickly stop overflowing and become distinguishable, which is what matters. It would take a very large vdev indeed before you could no longer tell "these metaslabs are just fine to use" apart from "these metaslabs are too fragmented". But that said, it could happen. Changing to 1.5x would make that take much longer; instead of 2^26 sectors in a segment, it would be 2^35, or a 16 TiB segment at ashift=9.
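For reference, here is the arithmetic behind those thresholds, assuming ashift=9, i as the base-2 log of the segment size in sectors, and roughly 52 usable weight bits (that last number is an inference from the figures above; the high bits of the 64-bit weight are reserved for flags):

$$
\begin{aligned}
2^{2i} \text{ scaling:} \quad & 2i \ge 52 \Rightarrow i \ge 26, \text{ i.e. a single } 2^{26}\text{-sector (32 GiB) segment} \\
\text{smaller segments:} \quad & 32\text{ MiB} = 2^{16} \text{ sectors} \Rightarrow 2^{32} \text{ per segment, so } 2^{20} \text{ segments} = 32\text{ TiB free} \\
2^{1.5i} \text{ scaling:} \quad & 1.5i \ge 52 \Rightarrow i \ge 35, \text{ i.e. a single } 2^{35}\text{-sector (16 TiB) segment}
\end{aligned}
$$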

I can rerun the tests with 1.5x and see if that affects the results significantly; probably not. I'll also generate those new metrics at the same time.
