Immutable collections are "all the rage" these days, for good reasons. Their story in .NET, however, is very fragmented:
IReadOnlyCollection
,ReadOnlyCollection
wrappers (not thread-safe)System.Collections.Immutable
(thread-safe, but slow and memory-hungry)- and, int .NET8, "frozen collections", with clumsy semantics (https://devblogs.microsoft.com/premier-developer/immutable-collections-with-mutable-performance/)
However, there are no mutable collections with cheap "copy on write" semantics.
Experiments with abstract statics in interfaces to implement traits-like design as commonly seen in C++. The technique is applied to model binary search trees with reduced overheads (see benchmarks below).
The solution consists of three assemblies:
Pfm.Collections
contains various data structure implementationsPfm.Test
has (somewhat messy) correctness testsPfm.Benchmark
has benchmarks.
This project has been inspired by "Persistence for the Masses" paper, hence also the name.
Collection namespaces are divided by data structure and implementation technique:
Trie.DenseTrie
implements persistent vector supporting direct access and one-sided push/pop operations (akin to Clojure vectors)TreeSet
implements persistent joinable balanced trees, in AVL and weight-balanced variants. In addition to usual sorted set operations, it also provides iteratos for forward and backward navigation, fast (logarithmic) access to n'th element in sorted order, and user-defined monoidal "augmentations" that can be used to implement, for example, an interval tree.
Tests include cases that cover correctness of copy on write semantics. Benchmarks attempt to cover the cost of COW semantics.
I have for a long time been annoyed by the fact that the standard Dictionary<K,V>
does not support set
operations (intersection, difference, union), even though it is an ISet<K>
. TreeSet
is a building block
that can be used to implement a MergeableDictionary<K,V>
that would support set operations with
configurable merging of equivalent values. Efficient bulk-build from a sorted sequence would also be possible
to implement.
Sparse trie - Bagwell's HAMT.
Method:
- Build the solution in Release mode
- Run
Pfm.Benchmark.exe
from a console elevated to admin (needed to collect HW perf counters)
Environment:
BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22621.963) 12th Gen Intel Core i7-12700, 1 CPU, 20 logical and 12 physical cores .NET SDK=7.0.100 [Host] : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT AVX2 DefaultJob : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT AVX2
SortedSet
is used as a reference implementation. The underlying implementation is a RB-tree which does less
work for insertions and deletions, but results in deeper trees.
This benchmark tries to find the "worst" sequence of insertions and deletions for a given tree implementation. Results (not shown) indicate that random/random is among the top 5 of 49 combinations, so it is used throughout the other benchmarks.
The following benchmarks use a random sequence of 8197 elements.
This benchmark inserts a random sequence into the tree, then removes the elements in reverse order of insertion.
Method | Mean | Error | StdDev | BranchInstructions/Op | InstructionRetired/Op | CacheMisses/Op | Gen0 | Gen1 | Allocated |
---|---|---|---|---|---|---|---|---|---|
AvlTreeSet | 2.806 ms | 0.0203 ms | 0.0190 ms | 5,132,036 | 26,197,461 | 5,660 | 27.3438 | 7.8125 | 384.29 KB |
WBTreeSet | 2.389 ms | 0.0068 ms | 0.0060 ms | 4,860,134 | 27,351,823 | 5,278 | 27.3438 | 7.8125 | 384.29 KB |
AvlCOWSet | 2.949 ms | 0.0085 ms | 0.0079 ms | 5,216,734 | 26,441,406 | 8,294 | 62.5000 | 39.0625 | 833.37 KB |
WBCOWSet | 2.584 ms | 0.0139 ms | 0.0123 ms | 4,965,466 | 27,795,312 | 8,067 | 62.5000 | 31.2500 | 827.98 KB |
SortedSet | 2.206 ms | 0.0084 ms | 0.0078 ms | 3,708,473 | 11,000,260 | 4,551 | 23.4375 | 7.8125 | 320.24 KB |
ImmutableSet | 6.636 ms | 0.0581 ms | 0.0485 ms | 12,854,511 | 54,083,333 | 43,002 | 757.8125 | 460.9375 | 9713.33 KB |
"COW" benchmarks attempt to show the cost of COW semantics where the whole tree is rebuilt twice. These benchmarks proceed as follows:
- First, only even numbers from the sequence are inserted into the tree.
- Then, a COW copy of the tree is made and all odd numbers are inserted into the copy.
- Then, another COW copy is made and all elements are removed in reverse order of insertion.
This benchmark inserts a random sequence into the tree, then searches for each inserted element in increasing order (0, 1, ..., max).
Method | Mean | Error | StdDev | CacheMisses/Op | BranchInstructions/Op | InstructionRetired/Op |
---|---|---|---|---|---|---|
AvlTreeSet | 298.3 us | 0.87 us | 0.81 us | 87 | 375,178 | 1,208,659 |
WBTreeSet | 300.1 us | 0.75 us | 0.66 us | 79 | 383,311 | 1,229,199 |
SortedSet | 474.3 us | 2.20 us | 2.06 us | 128 | 1,050,059 | 2,901,164 |
ImmutableSet | 465.4 us | 1.92 us | 1.70 us | 103 | 1,169,875 | 2,992,643 |
Join tree has even better lookup performance than standard SortedSet
and ImmutableSet
.
This benchmark creates a vector of 16384 elements and adds 1 to each element.
Method | Mean | Error | StdDev | BranchInstructions/Op | InstructionRetired/Op | CacheMisses/Op |
---|---|---|---|---|---|---|
List | 13.12 us | 0.028 us | 0.027 us | 49,407 | 280,485 | 2 |
ImmutableList | 5,138.95 us | 20.923 us | 19.571 us | 9,529,344 | 43,057,812 | 57,847 |
DenseTrie | 392.71 us | 0.823 us | 0.729 us | 1,332,262 | 6,028,385 | 54 |
As expected, the built-in mutable list has the best performance. Still, COW DenseTrie
has significantly better performance
than the built-in immutable list.
TreeSet
has significantly better lookup performance than SortedSet
. I ascribe this to two factors:
- Sorted set uses internally a red-black tree, which is on average deeper than WB or AVL trees.
- Inspection of the generated assembly shows that JIT is able to inline key comparison method when implemented by
a abstract static interface method. This is not the case for
SortedSet
where the comparison method is a delegate.
I have attempted to convert recursive tree algorithms to iterative algorithms, using TreeIterator
as a manually
maintained stack. Surprisingly, the result was slower due to frequent calls to CORINFO_HELP_ASSIGN_REF
, which
doesn't happen when the reference is pushed onto the stack during recursion.
See dotnet/runtime#59031 This is also the reason for using ulong
for the node's transient
tag instead of object
(as is suggested in the papers).