Skip to content

zvrba/SortingNetworks

Repository files navigation

Sorting networks

Playground for exploring implementation techniques for sorting networks. These can sort small arrays much faster than Array.Sort(); depending on the size (4-32) and pattern, the speedup is 3-6X. See benchmarks below. The generated assembly in Release mode is lean and mean, and seems comparable with what would have been generated by a C++ compiler.

Changes since v1

  • Implemented Fisher-Yates shuffle; it is now used in benchmarks for more reliable validation of sorters.
  • Support for arbitrary length arrays, i.e., not just lengths that are power of 2.
  • Exhaustive validation now checks sorters for all lengths 4-32.
  • Added (32-bit) float sorter.

Project structure

The projects are developed with Visual Studio 2019 and target netcore3.1. The solution consists of two projects.

SNBenchmark

This project dependes on BenchmarkDotNet. It contains validation code, benchmarks and demonstrates the use of sorting methods. The main program must be run with a single argument: VI, VF or B.

When run with VI, it runs an exhaustive validation of integer networks for element counts of up to 32. When run with VF it runs an exhaustive validation of float networks for element counts of up to 32. Larger sizes are infeasible, as 2^N zero/one inputs would have to be tested.

When run with "B", it passes the rest of the arguments to BenchmarkDotNet. Without any additional arguments, it will present a menu. All benchmarks call Environment.FailFast if the result is found to be unsorted so that this can be detected in the logs.

SortingNetworks

SortingNetworks project is the main code and has no dependencies. The high-performance public types use unsafe code and can only be used from unsafe methods. The code depends on AVX2 instruction set. In addition, AESRand class depends on AES-NI instruction set.

Sorting

The main interface is UnsafeSort<T> class which exposes a static factory function. The actual sorting code is in IntSorter and FloatSorter classes. You are not expected to understand how it works without studying references. The code to handle lengths that are not power of two introduces some overhead even for small arrays, so PeriodicInt class is provided with methods for sorting arrays of lengths 4, 8, 16 and 32; see benchmarks below.

UnsafeSort<T> and PeriodicInt classes have no mutable internal state, so it is recommended to use a single (non-static) instance throughout the program (see remark about statics below).

Directory Attic contains the (failed) experiment with expression trees and earlier (less performant) iterations of the periodic network.

Random numbers

This subsystem consists of three classes: and abstract UnsafeRandom class and two concrete classes: AESRand and MWC1616Rand. These can be instantiated directly. NB! The correctness of the code and the quality of random numbers has not been verified! Benchmarks use MWC1616Rand with a fixed seed as AESRand seemed to generate some obvious patterns.

Lessons learned

These were learned by inspecting the generated assembly code in Release mode.

Accessing static data has more overhead than accessing instance data: extraneous CALL instructions into the runtime are generated. My guess is that these ensure thread-safe, once-only static initialization semantics.

Accessing ref parameters as in Periodics16Branchless generates a lot of load/store instructions. It is much more efficient to load ref parameters into locals at the beginning of the procedure and store results at the end, as in PeriodicInt.

Periodic16Expr demonstrates how to build a sorter with expression trees. The generated assembly is OK, save for the long prologue/epilogue sequences This makes the overhead of calling a lambda compiled at run-time way too big for this application.

unsafe is not viral: Method A is allowed to call unsafe method B without A having to be marked unsafe as well. Also, it is allowed to assign an unsafe method to a non-unsafe delegate variable.

System.Random does not have consistent timing: when used in the baseline benchmark, the results almost always contained a warning about it having a bimodal distribution. This makes it rather unusable in baseline benchmarks. Therefore UnsafeRandom, AESRand and MWC1616Rand classes were implemented. Of these, only MWC is being used.

Generics suck for numeric code. I couldn't figure out how to write a generic bool IsSorted(T[]) method that'd work for any numeric type. Adding where T : unmanaged doesn't help as the compiler doesn't know that unmanaged types are comparable with less-than and equal. Nor does it seem possible to write void Iota(T[] data) that'd fill data with numbers from 0 .. Length-1. This is apparently being actively worked on for new versions of .NET and C#.

I attempted to make concrete benchmark classes sealed, but that makes BenchmarkDotNet fail because it apparently needs to derive from the benchmark class.

RyuJIT has some impressive optimizations: despite branches in "block" methods in PeriodicInt, it manages to generate branchless code when constants that branches depend on are known at compile-time. It also elides unnecessary loads and stores to/from ref variables and inlines impressively. The generated machine code, however, is huge: 32-sorter is > 1kB in size. If considering larger sorters, inlining should be forced.

Benchmarks

Raw benchmark data with excel file used to generate the report are in BenchmarkResults. Main results with comments are presented here (PDF) with additional comments below.

I couldn't figure out how to coerce BenchmarkDotNet into treating the baseline as additive overhead instead of, well, baseline. (Actually, that's what [IterationSetup] and [IterationCleanup] are for, but they come with a warning that they could spoil results of microbenchmarks.) The analysis presents results after subtracting the additive overhead.

General observations

Even for small sizes, UnsafeSort<int> is slightly slower than PeriodicInt which works for fixed-length arrays only (compare "IntBenchmark" with "Specialized" ). For example, PeriodicInt takes ~22ns to sort 16 elements, whereas UnsafeSort<int> takes ~38ns. Even though the additional logic to handle all sizes below 16 is relatively simple, it shows in running times.

Sorting floating point numbers seems to be slightly slower than integers ("Int vs Float").

Invocation: direct vs delegate vs compiled expression

This project was initially started to investigate manual code generation using expression trees, but it turns out that these are unsuitable for high-performance scenarios as the prologue/epilogue in the generated code has way too high overhead (see ExpressionInvocationBenchmark):

Method Mean Error StdDev
DirectInvoke 45.51 ns 0.934 ns 2.147 ns
ExpressionInvoke 124.08 ns 2.512 ns 6.747 ns

On the other hand, there is no substantial difference between directly invoking an instance method, or invoking it through an abstract base method. Thus there is no penalty in using the more convenient UnsafeSort class as opposed to directly calling methods on an instance of PeriodicInt:

Method Mean Error StdDev
AbstractInvoke 23.80 ns 0.421 ns 0.603 ns
ConcreteInvoke 23.28 ns 0.310 ns 0.290 ns

NB! The results between the two benchmarks are not directly comparable as they run different algorithms. TODO: same for float. Re-export analysis.

References

D. E. Knuth, The Art of Computer Programming, vol. 3, section 5.3.4 for basic exposition. The ""periodic" network as implemented here appears in TAOCP exercise 53, but has first been described by Dowd et al.: "The Periodic Balanced Sorting Network", JACM Vol. 36, No. 4, October 1989, pp. 738-757.

Other references appear in code comments.

About

Fast SIMD sorting routines for int and float arrays

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages