My goal with this experiment was to identify whether we could improve the performance of the BCL's generic hash containers using vectorization.
Rationale
Existing containers like F14 and Swiss Tables demonstrate that it's possible to parallelize hashtable search operations using wide registers and/or vector instructions. There's also a body of research showing that you can pair cache line optimizations with this work to reduce the number of cache misses during lookups.
We previously replaced the widely used GHashTable in the mono runtime with an F14-style vectorized hashtable, dn_simdhash, and the good results there provided motivation for this experiment.
S.C.G.Dictionary is an extensively used container that shows up on many hot paths, so I started there.
What was done
Two different vectorized versions of S.C.G.Dictionary were prototyped out-of-tree, one unordered and one ordered. The unordered version was then integrated into the BCL as a replacement for S.C.G.Dictionary in order to run tests and take measurements. For multiple reasons it was deemed unsatisfactory, so a new ordered vectorized version was written in-place, using the original scalar Dictionary implementation as a starting point, and then measurements were taken.
The ordered dictionary passed all relevant tests with the exception of a few specific ones that did things like modify internals via reflection or try to engineer hash collisions.
A "Container Design Notes" document was drafted containing takeaways from this research to simplify further work on this container.
High-level design
Both the unordered and ordered dictionaries used a design inspired by F14, where buckets were 16-byte vectors with a single hash-derived 'suffix byte' assigned to each entry, and spare bytes were used to store metadata.
In the unordered dictionary, the keys and values lived next to the suffixes in each bucket, removing the need for an Entries array.
In the ordered dictionary, the suffixes were accompanied by an indices array, where each index points into the Entries array, just as in the existing scalar Dictionary.
Where a scalar dictionary looks up a single bucket chosen from the key's hashcode, immediately leaps into the Entries array, and follows a linked list, an F14-style dictionary locates a bucket and then scans all the suffixes inside it for one that matches the hash. This allows small numbers of collisions to be handled efficiently, with fewer hash comparisons and less linked-list traversal, assuming you have a quality hash function.
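To make the probe concrete, here is a minimal sketch of an F14-style bucket scan in pure C#, assuming a bucket whose 16-byte vector holds 14 hash suffixes plus 2 metadata bytes. All names here (F14ProbeSketch, FindInBucket, fullKeyEquals) are illustrative and are not the prototype's actual API.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

static class F14ProbeSketch
{
    const int EntriesPerBucket = 14;

    // The low byte of the hashcode becomes the suffix; the remaining bits select the bucket.
    static byte GetSuffix(uint hashCode) => (byte)hashCode;

    // Compare the search suffix against every byte of the bucket in one vector
    // operation, then verify only the candidate slots with a full key comparison.
    static int FindInBucket(Vector128<byte> suffixes, byte suffix, Func<int, bool> fullKeyEquals)
    {
        Vector128<byte> matchVec = Vector128.Equals(suffixes, Vector128.Create(suffix));
        uint matches = Vector128.ExtractMostSignificantBits(matchVec);
        matches &= (1u << EntriesPerBucket) - 1; // ignore the two metadata bytes

        while (matches != 0)
        {
            int slot = BitOperations.TrailingZeroCount(matches);
            if (fullKeyEquals(slot)) // ordered design: slot -> indices[] -> Entries[]
                return slot;
            matches &= matches - 1; // clear the lowest set bit; try the next candidate
        }
        return -1; // no suffix matched: the key is not in this bucket
    }
}
```

The point of the suffix scan is that a false positive costs one extra key comparison, while a true negative is established for all 14 slots with a single vector compare and no pointer chasing.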
Measurements
The unordered dictionary was (depending on scenario, with variation across Intel and AMD) 10-75% faster than the existing scalar S.C.G.Dictionary. See dotnet/runtime#108098 for more detailed numbers.
The ordered vectorized dictionary was (depending on scenario) 5-50% slower than the existing scalar S.C.G.Dictionary.
Key obstacles to the unordered dictionary
1. Our existing Dictionary has an unstated ordering guarantee, unlike similar containers in many other standard libraries. Real applications depend on this guarantee, so any replacement has to provide it.
2. The unordered dictionary suffered from being authored from scratch out-of-tree, which caused many subtle binary compatibility issues once it was integrated, e.g. Values being a struct instead of a class. Source compatibility is not enough, even for prototyping: it breaks tests and benchmarks.
1 & 2 alone were enough to disqualify the unordered prototype from further investigation, so I moved on to the ordered one and wrote it in-tree.
Conclusions and next steps
The unordered prototype shows that it's possible to deliver better performance than a scalar Dictionary in pure C# without too much work, provided your constraints allow for it. The language, JIT, and standard library were not obstacles here and performed admirably. The failure to deliver better performance in the end was mostly due to those constraints, in particular the unstated ordering guarantee.
It's possible that HashSet or other containers in the BCL could still benefit from vectorization, even though Dictionary did not in practice, so that is an area for future investigation.
The existing dotnet/performance benchmark suite does not appear to have many "worst-case" measurements for specific types of engineered bad hashes. It may be worthwhile to add measurements like this when doing future work on containers.
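As an illustration of the kind of worst-case measurement this suggests, here is a sketch of a key type whose hash collapses every key into a single collision chain. The type and the timing loop are hypothetical examples, not part of the dotnet/performance suite.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// An engineered bad hash: every key collides, so each lookup or insert
// degenerates into a linear scan of the whole chain.
readonly struct PathologicalKey : IEquatable<PathologicalKey>
{
    public readonly int Value;
    public PathologicalKey(int value) => Value = value;
    public bool Equals(PathologicalKey other) => Value == other.Value;
    public override bool Equals(object? obj) => obj is PathologicalKey k && Equals(k);
    public override int GetHashCode() => 0; // deliberately constant
}

class WorstCaseBenchmark
{
    static void Main()
    {
        var dict = new Dictionary<PathologicalKey, int>();
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 10_000; i++)
            dict[new PathologicalKey(i)] = i; // each insert walks the full chain
        Console.WriteLine($"10k fully-colliding inserts: {sw.ElapsedMilliseconds} ms");
    }
}
```

Measurements like this would expose how a container's collision-handling strategy degrades, which average-case benchmarks with well-distributed hashes cannot show.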