Add new version of space-based weighting algorithm #17944
base: master
Conversation
This patch introduces a new weighting algorithm that tries to combine the strengths of both current algorithms. It is primarily space-based, but instead of taking the total free space and applying the fragmentation to it, it weights each bucket of segments (larger segment-size buckets get a higher weighting, to convey that larger free segments are worth more than just their larger size). It also includes information about the bucket of the largest free segment, which allows us to make better decisions about whether the metaslab is suitable for a given allocation.
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara, Inc.
Signed-off-by: Paul Dagnelie <[email protected]>
How about CPU load, especially when running on an NVMe pool? Thanks.
I don't think it matters, unless you mean CPU load from metaslab loads/unloads. Those loads/unloads, I think, could be a useful metric for comparison.
Adding the new algorithm is routine, but selecting one is not so much. I remember when we selected the new fragmentation metric, and there was some logic behind it, based on typical allocation sizes, etc. Here, I think the doubling of the powers could be motivated better. I understand the goal that bigger segments are more valuable, but I am not sure I like the easy prospect of overflow. Overflows make different metaslabs look the same. I am not sure how it may affect fragmentation growth, since I think allocating from the biggest possible segments may give smaller ones time to coalesce with frees. And with further disk capacity growth, I think we may start seeing the overflows too often.
I'll generate those metrics and update the PR with them.
The motivation behind the
From a loading perspective, any metaslab big enough to overflow is pretty close to the same. Even with ashift=9, a metaslab that manages to overflow needs to have at least 2^26 free sectors, or 32 GiB of free space. And that's if it's a single segment of that size; if the segments are smaller, there has to be more free space. Exponentially more, as the segment size goes down. If your segments are only 32 MiB, then you have to have 2^20 of them, or 32 TiB of free space in that metaslab, to overflow.
In the current vdev code, a ~4PB vdev with 512-byte sectors would be the smallest that could overflow. With raidz3 or draid that is certainly possible, but any metaslabs with significant allocations at all will quickly stop overflowing and become distinguishable, which is what matters. It would take a very large vdev indeed before you would stop being able to tell "these metaslabs are just fine to use" apart from "these metaslabs are too fragmented". But that said, it could happen. Changing to 1.5x would make that take much longer; instead of 2^26 sectors in a segment it would be 2^35, or a 16 TiB segment at ashift=9. I can rerun the tests with 1.5x and see if that affects the results significantly; probably not. I'll also generate those new metrics at the same time.
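For reference, a tiny standalone check of the unit conversions above. The 2^26 and 2^35 single-segment figures come from the comment; the program only does the sector-to-byte arithmetic and is not code from this patch.

```c
/* Unit-conversion check only; the sector counts are taken from the comment above. */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	const int ashift = 9;				/* 512-byte sectors */
	const uint64_t sectors_2x = 1ULL << 26;		/* single-segment overflow point, 2x weighting */
	const uint64_t sectors_1_5x = 1ULL << 35;	/* single-segment overflow point, 1.5x weighting */

	printf("2x:   2^26 sectors = %llu GiB\n",
	    (unsigned long long)((sectors_2x << ashift) >> 30));	/* prints 32 */
	printf("1.5x: 2^35 sectors = %llu TiB\n",
	    (unsigned long long)((sectors_1_5x << ashift) >> 40));	/* prints 16 */
	return (0);
}
```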
Sponsored by: [Wasabi Technology, Inc.; Klara, Inc.]
Motivation and Context
Currently, the segment-based weighting algorithm is used by default to select metaslabs for allocations. This algorithm uses the size of the largest bucket of free segments to estimate whether an allocation can be satisfied. This works reasonably well for selecting metaslabs to load, since they should always be able to satisfy at least one allocation. However, there are times when the best metaslab to select is not the one with the most segments of the largest size; for example, if one metaslab has a single 64 MiB segment while another has 100 32 MiB free segments, the latter will be able to satisfy many more allocations in practice, as long as the allocations are smaller than 32 MiB. The space-based weighting algorithm will sometimes make this decision correctly, but it also has its own downsides (mostly not having any good idea whether a metaslab will be able to satisfy any allocations at all).
Description
This patch introduces a new weighting algorithm that tries to combine the strengths of both current algorithms. It is primarily space-based, but instead of taking the total free space and applying the fragmentation to it, it weights each bucket of segments (larger segment-size buckets get a higher weighting, to convey that larger free segments are useful for both having more space and satisfying larger individual allocations). It also includes information about the bucket of the largest free segment, which allows us to make better decisions about whether the metaslab is suitable for a given allocation.
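To make the description concrete, here is a minimal, self-contained sketch of the idea (not the code from this patch): the function name, HIST_SIZE, the byte-granularity buckets, the 2^i per-bucket multiplier, and the separate max_bucket output are all illustrative assumptions, not the actual encoding used in the PR.

```c
/*
 * Illustrative sketch only -- not the patch's implementation. Assumes a
 * free-segment histogram where hist[i] counts segments of roughly 2^i bytes.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define	HIST_SIZE	32

static uint64_t
example_space_weight_v2(const uint64_t hist[HIST_SIZE], int *max_bucket)
{
	uint64_t weight = 0;

	*max_bucket = -1;
	for (int i = 0; i < HIST_SIZE; i++) {
		if (hist[i] == 0)
			continue;
		/*
		 * Space in this bucket (count * 2^i bytes), scaled up by a
		 * further 2^i so that larger segments count for more than
		 * just the space they contain.
		 */
		weight += (hist[i] << i) << i;
		*max_bucket = i;	/* largest non-empty bucket so far */
	}
	return (weight);
}

int
main(void)
{
	/* One 64 MiB segment vs. one hundred 32 MiB segments. */
	uint64_t a[HIST_SIZE] = { 0 }, b[HIST_SIZE] = { 0 };
	int max_a, max_b;

	a[26] = 1;	/* 2^26 bytes == 64 MiB */
	b[25] = 100;	/* 2^25 bytes == 32 MiB */

	printf("A: weight=%" PRIu64 " max_bucket=%d\n",
	    example_space_weight_v2(a, &max_a), max_a);
	printf("B: weight=%" PRIu64 " max_bucket=%d\n",
	    example_space_weight_v2(b, &max_b), max_b);
	return (0);
}
```

In this toy version, B (many 32 MiB segments) ends up with the higher weight, while A's max_bucket still records that only A can satisfy a single 64 MiB allocation, which is the trade-off described in the Motivation section above.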
I don't love the way the weighting algorithm is currently selected; it may be time to implement something more like the control for the metaslab_allocators tunables. I did not include that in this version, but that could be changed if people feel it is important.
How Has This Been Tested?
In addition to correctness testing by running the test suite with segment weighting disabled and v2 space weighting enabled, extensive performance testing was also completed. The setup used for the tests is a pool that is 90% full with 60% fragmentation; this allows the testing to focus primarily on the algorithms' ability to select metaslabs for loading. Testing used random writes with large sector sizes to further stress the allocation code, and each test ran for 20 minutes. A few key metrics were considered: bandwidth (self-explanatory), gang blocks and multi-level ganging, and TXG sync times.
While the changes in most categories are small (~1% in average and 99th%ile TXG sync times and bandwidth, within the margin of error), the new algorithm does seem to offer a reduction in stddev of sync times over the segment-based algorithm, and a reduction in ganging compared to both existing algorithms. Only the ganging reduction is statistically significant (95% confidence, 10 runs each).
Types of changes
Checklist:
Signed-off-by.