Playground for exploring implementation techniques for sorting networks. These can sort small arrays much faster
than Array.Sort()
; depending on the size (4-32) and pattern, the speedup is 3-6X. See benchmarks below.
The generated assembly in Release mode is lean and mean, and seems comparable with what would have been generated by
a C++ compiler.
- Implemented Fisher-Yates shuffle; it is now used in benchmarks for more reliable validation of sorters.
- Support for arbitrary length arrays, i.e., not just lengths that are power of 2.
- Exhaustive validation now checks sorters for all lengths 4-32.
- Added (32-bit) float sorter.
The projects are developed with Visual Studio 2019 and target netcore3.1. The solution consists of two projects.
This project dependes on BenchmarkDotNet. It contains validation code, benchmarks and demonstrates the use of sorting methods.
The main program must be run with a single argument: VI
, VF
or B
.
When run with VI
, it runs an exhaustive validation of integer networks for element counts of up to 32. When run with VF
it runs an exhaustive validation of float networks for element counts of up to 32. Larger sizes are infeasible, as 2^N
zero/one inputs would have to be tested.
When run with "B", it passes the rest of the arguments to BenchmarkDotNet. Without any additional arguments, it will present a menu.
All benchmarks call Environment.FailFast
if the result is found to be unsorted so that this can be detected in the logs.
SortingNetworks
project is the main code and has no dependencies. The high-performance public types use unsafe
code and can only be used from unsafe
methods. The code depends on AVX2 instruction set. In addition, AESRand
class depends on AES-NI instruction set.
The main interface is UnsafeSort<T>
class which exposes a static factory function. The actual sorting code is in
IntSorter
and FloatSorter
classes. You are not expected to understand how it works without studying references.
The code to handle lengths that are not power of two introduces some overhead even for small arrays, so PeriodicInt
class is
provided with methods for sorting arrays of lengths 4, 8, 16 and 32; see benchmarks below.
UnsafeSort<T>
and PeriodicInt
classes have no mutable internal state, so it is recommended to use a single (non-static) instance
throughout the program (see remark about statics below).
Directory Attic
contains the (failed) experiment with expression trees and earlier (less performant) iterations of the
periodic network.
This subsystem consists of three classes: and abstract UnsafeRandom
class and two concrete classes: AESRand
and MWC1616Rand
.
These can be instantiated directly. NB! The correctness of the code and the quality of random numbers has not been verified!
Benchmarks use MWC1616Rand
with a fixed seed as AESRand
seemed to generate some obvious patterns.
These were learned by inspecting the generated assembly code in Release mode.
Accessing static data has more overhead than accessing instance data: extraneous CALL instructions into the runtime are generated. My guess is that these ensure thread-safe, once-only static initialization semantics.
Accessing ref
parameters as in Periodics16Branchless
generates a lot of load/store instructions.
It is much more efficient to load ref parameters into locals at the beginning of the procedure and store
results at the end, as in PeriodicInt
.
Periodic16Expr
demonstrates how to build a sorter with expression trees. The generated assembly is OK,
save for the long prologue/epilogue sequences This makes the overhead of calling a lambda compiled at run-time
way too big for this application.
unsafe
is not viral: Method A
is allowed to call unsafe
method B
without A
having to be marked
unsafe as well. Also, it is allowed to assign an unsafe
method to a non-unsafe delegate variable.
System.Random
does not have consistent timing: when used in the baseline benchmark, the results almost always
contained a warning about it having a bimodal distribution. This makes it rather unusable in baseline benchmarks.
Therefore UnsafeRandom
, AESRand
and MWC1616Rand
classes were implemented. Of these, only MWC is being used.
Generics suck for numeric code. I couldn't figure out how to write a generic bool IsSorted(T[])
method that'd
work for any numeric type. Adding where T : unmanaged
doesn't help as the compiler doesn't know that unmanaged
types are comparable with less-than and equal. Nor does it seem possible to write void Iota(T[] data)
that'd
fill data
with numbers from 0 .. Length-1
. This is apparently being actively worked on for new versions
of .NET and C#.
I attempted to make concrete benchmark classes sealed
, but that makes BenchmarkDotNet fail because it apparently
needs to derive from the benchmark class.
RyuJIT has some impressive optimizations: despite branches in "block" methods in PeriodicInt
, it manages to generate
branchless code when constants that branches depend on are known at compile-time. It also elides unnecessary loads and
stores to/from ref variables and inlines impressively. The generated machine code, however, is huge: 32-sorter is > 1kB
in size. If considering larger sorters, inlining should be forced.
Raw benchmark data with excel file used to generate the report are in BenchmarkResults. Main results with comments are presented here (PDF) with additional comments below.
I couldn't figure out how to coerce BenchmarkDotNet into treating the baseline as additive overhead instead of, well, baseline.
(Actually, that's what [IterationSetup]
and [IterationCleanup]
are for, but they come with a warning that they could spoil results
of microbenchmarks.) The analysis presents results after subtracting the additive overhead.
Even for small sizes, UnsafeSort<int>
is slightly slower than PeriodicInt
which works for fixed-length arrays only
(compare "IntBenchmark" with "Specialized" ). For example, PeriodicInt
takes ~22ns to sort 16 elements, whereas
UnsafeSort<int>
takes ~38ns. Even though the additional logic to handle all sizes below 16 is relatively simple, it
shows in running times.
Sorting floating point numbers seems to be slightly slower than integers ("Int vs Float").
This project was initially started to investigate manual code generation using expression trees, but it turns out that
these are unsuitable for high-performance scenarios as the prologue/epilogue in the generated code has way too high overhead
(see ExpressionInvocationBenchmark
):
Method | Mean | Error | StdDev |
---|---|---|---|
DirectInvoke | 45.51 ns | 0.934 ns | 2.147 ns |
ExpressionInvoke | 124.08 ns | 2.512 ns | 6.747 ns |
On the other hand, there is no substantial difference between directly invoking an instance method, or invoking it through an
abstract base method. Thus there is no penalty in using the more convenient UnsafeSort
class as opposed to directly calling
methods on an instance of PeriodicInt
:
Method | Mean | Error | StdDev |
---|---|---|---|
AbstractInvoke | 23.80 ns | 0.421 ns | 0.603 ns |
ConcreteInvoke | 23.28 ns | 0.310 ns | 0.290 ns |
NB! The results between the two benchmarks are not directly comparable as they run different algorithms. TODO: same for float. Re-export analysis.
D. E. Knuth, The Art of Computer Programming, vol. 3, section 5.3.4 for basic exposition. The ""periodic" network as implemented here appears in TAOCP exercise 53, but has first been described by Dowd et al.: "The Periodic Balanced Sorting Network", JACM Vol. 36, No. 4, October 1989, pp. 738-757.
Other references appear in code comments.