
Conversation

@pcd1193182
Contributor

Sponsored by: [Wasabi Technology, Inc.; Klara, Inc.]

Motivation and Context

Currently, the segment-based weighting algorithm is used by default to select metaslabs for allocations. This algorithm uses the size of the largest bucket of free segments to estimate whether an allocation can be satisfied. This works reasonably well for selecting metaslabs to load, since they should always be able to satisfy at least one allocation. However, there are times when the metaslab with the most segments of the largest size is not the best choice; for example, if one metaslab has a single 64 MiB segment while another has 100 free 32 MiB segments, the latter will be able to satisfy many more allocations in practice, as long as the allocations are smaller than 32 MiB. The space-based weighting algorithm will sometimes make this decision correctly, but it has its own downsides (chiefly, it has no good way of telling whether a metaslab will be able to satisfy any allocations at all).

Description

This patch introduces a new weighting algorithm that tries to combine the strengths of both current algorithms. It is primarily space-based, but instead of taking the total free space and applying the fragmentation to it, it weights each bucket of free segments, with larger segment-size buckets receiving a higher weight to convey that larger free segments are useful both for having more space and for satisfying larger individual allocations. It also includes information about the bucket of the largest free segment, which allows us to make better decisions about whether the metaslab is suitable for a given allocation.
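As a rough illustration of the idea, here is a minimal sketch of the bucket-weighted sum; the function name, the histogram layout, and the exact exponent here are hypothetical, not the code in this patch:

```c
#include <stdint.h>

/*
 * Minimal sketch (hypothetical names and layout, not the actual patch).
 * histogram[i] counts free segments whose size falls in power-of-two
 * bucket i; larger buckets get a larger per-segment weight so that they
 * count for more than their raw space. Assumes num_buckets <= 32 so the
 * shifts stay well-defined.
 */
static uint64_t
bucket_weight_sketch(const uint64_t *histogram, int num_buckets,
    int *largest_bucket)
{
	uint64_t weight = 0;

	*largest_bucket = -1;
	for (int i = 0; i < num_buckets; i++) {
		if (histogram[i] == 0)
			continue;
		/*
		 * Each segment in bucket i contributes 2^(2i) to the
		 * weight. A real implementation also has to deal with
		 * this sum overflowing the usable weight bits, which is
		 * discussed further down in this thread.
		 */
		weight += histogram[i] << (2 * i);
		*largest_bucket = i;
	}
	return (weight);
}
```

The largest non-empty bucket is what lets the allocator judge up front whether a metaslab can satisfy an allocation of a given size; how that index is actually encoded alongside the weight is a detail of the patch itself.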

I don't love the way the weighting algorithm is currently selected; it may be time to implement something more like the control for the metaslab_allocators tunables. I did not include that in this version, but that could be changed if people feel it is important.

How Has This Been Tested?

In addition to correctness testing by running the test suite with segment weighting disabled and v2 space weighting enabled, extensive performance testing was also completed. The setup used for the tests was a pool that is 90% full with 60% fragmentation, which keeps the testing focused primarily on the algorithms' ability to select metaslabs for loading. Testing used random writes with large sector sizes to further stress the allocation code, and each run lasted 20 minutes. A few key metrics were considered: bandwidth (self-explanatory), gang blocks and multi-level ganging, and TXG sync times.

| Weighting algorithm | TXG mean (ms) | TXG stddev (ms) | TXG 99%ile (ms) | Gang blocks | Gang multiblocks | BW (MB/s) |
|---|---|---|---|---|---|---|
| Space | 111250.5 | 9366.88 | 36404.9 | 138748 | 110295 | 162.8 |
| Segment | 11624.4 | 10096.3 | 36914.4 | 147258 | 121747 | 163.3 |
| New | 11743.1 | 9799.49 | 37313.4 | 129805 | 104044 | 160.333 |

While the changes in most categories are small (~1% in average and 99th%ile TXG sync times and bandwidth, within the margin of error), the new algorithm does seem to offer a reduction in stddev of sync times over the segment-based algorithm, and a reduction in ganging compared to both existing algorithms. Only the ganging reduction is statistically significant (95% confidence, 10 runs each).

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

This patch introduces a new weighting algorithm that tries to combine
the strengths of both current algorithms. It is primarily space-based,
but instead of taking the total free space and applying the fragmentation
to it, it weights each bucket of segments (larger segment-size buckets
get a higher weighting, to convey that larger free segments are worth
more than just their larger size). It also includes information about
the bucket of the largest free segment, which allows us to make better
decisions about whether the metaslab is suitable for a given allocation.

Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara, Inc.
Signed-off-by: Paul Dagnelie <[email protected]>
@shodanshok
Contributor

How about CPU load, especially when running on an NVMe pool? Thanks.

@amotin
Member

amotin commented Nov 18, 2025

How about CPU load, especially when running on an NVMe pool? Thanks.

I don't think it matters, unless you mean CPU load from metaslab loads/unloads, which I think could be a useful metric for comparison.

@amotin
Member

amotin commented Nov 18, 2025

Addition of the new algorithm is routine, but selection of one is not so much. I remember when we selected the new fragmentation metric there was some logic behind it, based on typical allocation sizes, etc. Here I think the doubling of the powers could be motivated better. I understand the goal that bigger segments are more valuable, but I am not sure I like the easy possibility of overflow. Overflows make different metaslabs look the same. I am also not sure how it may affect fragmentation growth, since I think allocating from the biggest possible segments may give smaller ones time to coalesce with frees. And with further disk capacity growth I think we may start seeing the overflows too often.

@amotin added the labels Status: Code Review Needed (ready for review and testing) and Status: Design Review Needed (architecture or design is under discussion) on Nov 18, 2025
@pcd1193182
Contributor Author

I don't think it matters, unless you mean CPU load from metaslab loads/unloads, which I think could be a useful metric for comparison.

I'll generate those metrics and update the PR with them.

Here I think the doubling of the powers could be motivated better. I understand the goal that bigger segments are more valuable, but I am not sure I like the easy possibility of overflow.

The motivation behind the 2^{2i} is somewhat fuzzy, I'll admit. I considered other weightings, like 2^{1.5*i} or i^2 * 2^i, but they didn't seem to scale the way I wanted; is a single 64 MiB segment 40% or 8% better than two 32 MiB segments? Part of the problem with the whole weighting system is that how good a metaslab is depends on the workload we're dealing with, which makes it very hard to reason about these things in the general case. But I could see 1.5x scaling being better than 2x.
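To spell out where those percentages come from (a back-of-the-envelope reading on my part, taking i as the size-class index of a 32 MiB segment, so a 64 MiB segment lands in bucket i+1):

$$
\frac{w(\text{one 64 MiB segment})}{w(\text{two 32 MiB segments})} =
\begin{cases}
2^{2(i+1)} / (2 \cdot 2^{2i}) = 2 & \text{for } 2^{2i} \\
2^{1.5(i+1)} / (2 \cdot 2^{1.5i}) = \sqrt{2} \approx 1.41 & \text{for } 2^{1.5i} \\
(i+1)^2 2^{i+1} / (2 \cdot i^2 2^i) = \left(\tfrac{i+1}{i}\right)^2 \approx 1.08 & \text{for } i^2 \cdot 2^i \text{ at } i \approx 25
\end{cases}
$$

So 2x scaling says the single larger segment is twice as valuable, 1.5x says about 40% more valuable, and i^2 * 2^i says only about 8% more.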

Overflows make different metaslabs look the same.

From a loading perspective, any metaslab big enough to overflow is pretty much the same. Even with ashift=9, a metaslab that manages to overflow needs at least 2^26 free sectors, or 32 GiB of free space. And that's if it's a single segment of that size; if the segments are smaller, there has to be more free space, exponentially more as the segment size goes down. If your segments are only 32 MiB, you need 2^20 of them, or 32 TiB of free space in that metaslab, to overflow.

I am also not sure how it may affect fragmentation growth, since I think allocating from the biggest possible segments may give smaller ones time to coalesce with frees. And with further disk capacity growth I think we may start seeing the overflows too often.

In the current vdev code, a ~4 PB vdev with 512-byte sectors would be the smallest that could overflow. With raidz3 or draid that is certainly possible, but any metaslab with significant allocations at all will quickly stop overflowing and become distinguishable, which is what matters. It would take a very large vdev indeed before you could no longer tell "these metaslabs are just fine to use" apart from "these metaslabs are too fragmented". But that said, it could happen. Changing to 1.5x would make that take much longer; instead of 2^26 sectors in a segment, it would be 2^35, or a 16 TiB segment at ashift=9.
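For reference, here is the arithmetic behind those thresholds, assuming ashift=9, i as the base-2 log of the segment size in sectors, and roughly 52 usable weight bits (that last number is an inference from the figures above; the high bits of the 64-bit weight are reserved for flags):

$$
\begin{aligned}
2^{2i} \text{ scaling:} \quad & 2i \ge 52 \Rightarrow i \ge 26, \text{ i.e. a single } 2^{26}\text{-sector (32 GiB) segment} \\
\text{smaller segments:} \quad & 32\text{ MiB} = 2^{16} \text{ sectors} \Rightarrow 2^{32} \text{ per segment, so } 2^{20} \text{ segments} = 32\text{ TiB free} \\
2^{1.5i} \text{ scaling:} \quad & 1.5i \ge 52 \Rightarrow i \ge 35, \text{ i.e. a single } 2^{35}\text{-sector (16 TiB) segment}
\end{aligned}
$$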

I can rerun the tests with 1.5x and see if that affects the results significantly; probably not. I'll also generate those new metrics at the same time.
