port BitArray.And, Or, Xor and Not to Vector<T> #72471

adamsitnik · 2022-07-19T16:57:07Z

I started the learning process by porting very simple BitArray APIs: And, Or, Xor and Not.

To my surprise, there is a notable regression:

x64

BenchmarkDotNet=v0.13.1.1799-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT

Method	Toolchain	Size	Mean	Ratio
BitArrayNot	\PR\corerun.exe	4	0.8868 ns	1.52
BitArrayNot	\baseline\corerun.exe	4	0.5938 ns	1.00

BitArrayAnd	\PR\corerun.exe	4	1.8233 ns	0.95
BitArrayAnd	\baseline\corerun.exe	4	1.9173 ns	1.00

BitArrayOr	\PR\corerun.exe	4	1.8242 ns	0.92
BitArrayOr	\baseline\corerun.exe	4	1.9889 ns	1.00

BitArrayXor	\PR\corerun.exe	4	1.8429 ns	0.93
BitArrayXor	\baseline\corerun.exe	4	1.9822 ns	1.00

BitArrayNot	\PR\corerun.exe	512	11.4026 ns	1.80
BitArrayNot	\baseline\corerun.exe	512	6.3584 ns	1.00

BitArrayAnd	\PR\corerun.exe	512	14.6244 ns	1.57
BitArrayAnd	\baseline\corerun.exe	512	9.2924 ns	1.00

BitArrayOr	\PR\corerun.exe	512	14.7221 ns	1.58
BitArrayOr	\baseline\corerun.exe	512	9.3151 ns	1.00

BitArrayXor	\PR\corerun.exe	512	14.7987 ns	1.59
BitArrayXor	\baseline\corerun.exe	512	9.2756 ns	1.00

The assembly diff for Not can be found here

arm64

BenchmarkDotNet=v0.13.1.1786-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22368.24
  [Host]     : .NET 7.0.0 (7.0.22.36704), Arm64 RyuJIT

Method	Toolchain	Size	Mean	Ratio
BitArrayNot	/PR/corerun	4	6.358 ns	0.88
BitArrayNot	/baseline/corerun	4	7.202 ns	1.00

BitArrayAnd	/PR/corerun	4	11.824 ns	1.00
BitArrayAnd	/baseline/corerun	4	11.820 ns	1.00

BitArrayOr	/PR/corerun	4	11.889 ns	1.01
BitArrayOr	/baseline/corerun	4	11.816 ns	1.00

BitArrayXor	/PR/corerun	4	11.815 ns	1.01
BitArrayXor	/baseline/corerun	4	11.708 ns	1.00

BitArrayNot	/PR/corerun	512	75.451 ns	1.94
BitArrayNot	/baseline/corerun	512	39.403 ns	1.00

BitArrayAnd	/PR/corerun	512	95.542 ns	1.84
BitArrayAnd	/baseline/corerun	512	51.936 ns	1.00

BitArrayOr	/PR/corerun	512	95.537 ns	1.86
BitArrayOr	/baseline/corerun	512	51.505 ns	1.00

BitArrayXor	/PR/corerun	512	94.904 ns	1.83
BitArrayXor	/baseline/corerun	512	51.989 ns	1.00

@tannergooding I don't want to merge this PR as is, but rather find out whether I am doing everything right and what could be done to close the gap.

cc @stephentoub

ghost · 2022-07-19T16:57:31Z

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

Issue Details

I started the learning process by porting very simple BitArray APIs: And, Or, Xor and Not.

To my surprise, there is a notable regression:

x64

BenchmarkDotNet=v0.13.1.1799-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT

Method	Toolchain	Size	Mean	Ratio
BitArrayNot	\PR\corerun.exe	4	0.8868 ns	1.52
BitArrayNot	\baseline\corerun.exe	4	0.5938 ns	1.00

BitArrayAnd	\PR\corerun.exe	4	1.8233 ns	0.95
BitArrayAnd	\baseline\corerun.exe	4	1.9173 ns	1.00

BitArrayOr	\PR\corerun.exe	4	1.8242 ns	0.92
BitArrayOr	\baseline\corerun.exe	4	1.9889 ns	1.00

BitArrayXor	\PR\corerun.exe	4	1.8429 ns	0.93
BitArrayXor	\baseline\corerun.exe	4	1.9822 ns	1.00

BitArrayNot	\PR\corerun.exe	512	11.4026 ns	1.80
BitArrayNot	\baseline\corerun.exe	512	6.3584 ns	1.00

BitArrayAnd	\PR\corerun.exe	512	14.6244 ns	1.57
BitArrayAnd	\baseline\corerun.exe	512	9.2924 ns	1.00

BitArrayOr	\PR\corerun.exe	512	14.7221 ns	1.58
BitArrayOr	\baseline\corerun.exe	512	9.3151 ns	1.00

BitArrayXor	\PR\corerun.exe	512	14.7987 ns	1.59
BitArrayXor	\baseline\corerun.exe	512	9.2756 ns	1.00

The assembly diff for Not can be found here

arm64

BenchmarkDotNet=v0.13.1.1786-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22368.24
  [Host]     : .NET 7.0.0 (7.0.22.36704), Arm64 RyuJIT

Method	Toolchain	Size	Mean	Ratio
BitArrayNot	/PR/corerun	4	6.358 ns	0.88
BitArrayNot	/baseline/corerun	4	7.202 ns	1.00

BitArrayAnd	/PR/corerun	4	11.824 ns	1.00
BitArrayAnd	/baseline/corerun	4	11.820 ns	1.00

BitArrayOr	/PR/corerun	4	11.889 ns	1.01
BitArrayOr	/baseline/corerun	4	11.816 ns	1.00

BitArrayXor	/PR/corerun	4	11.815 ns	1.01
BitArrayXor	/baseline/corerun	4	11.708 ns	1.00

BitArrayNot	/PR/corerun	512	75.451 ns	1.94
BitArrayNot	/baseline/corerun	512	39.403 ns	1.00

BitArrayAnd	/PR/corerun	512	95.542 ns	1.84
BitArrayAnd	/baseline/corerun	512	51.936 ns	1.00

BitArrayOr	/PR/corerun	512	95.537 ns	1.86
BitArrayOr	/baseline/corerun	512	51.505 ns	1.00

BitArrayXor	/PR/corerun	512	94.904 ns	1.83
BitArrayXor	/baseline/corerun	512	51.989 ns	1.00

@tannergooding I don't want to merge this PR as is, but rather find out whether I am doing everything right and what could be done to close the gap.

cc @stephentoub

Author:	adamsitnik
Assignees:	-
Labels:	`area-System.Collections`
Milestone:	-

tannergooding · 2022-07-19T17:04:11Z

src/libraries/System.Collections/src/System/Collections/BitArray.cs

-                }
-            }
-            else if (Vector128.IsHardwareAccelerated)
+            if (Vector.IsHardwareAccelerated)


Using Vector<T> is likely a non-starter right now.

Not only can it not be used with R2R (crossgen) code today since it is variable sized and therefore forces the code to be jitted, but it also is missing certain APIs that are available to Vector128/256<T>. A few of these functions, like ExtractMostSignificantBits can't be exposed on Vector<T>, others like LoadUnsafe could be but aren't today.

There is a design being considered (dotnet/designs#268) where we can extend Vector<T> to better work with such scenarios and better enable its usage in other scenarios, but that isn't available at the moment and its not 100% clear what shape that will end up as.

For this case in particular, Vector<T> would likely be similar in perf if we had LoadUnsafe and StoreUnsafe APIs available.

However, it would still come with the restriction that it could not participate in R2R (and therefore forces jitting on first use) and it may regress scenarios where the backing data is typically "small".

For example, Vector<T> on most modern x64 hardware is equivalent to only having the Vector256<T> path. On Arm64, its equivalent to only having the Vector128<T> path. For x64, this means that inputs less than 32-bytes (and potentially 64-bytes in the future) will behave "worse" than the equivalent on Arm64 as they'll execute as "scalar" rather than as "vector". Likewise, depending on data layout, alignment, and processor, it may behave "worse" for inputs up to ~256 bytes as well.

As indicated, these are scenarios that are being looked at and considered, but its not something that we can easily do today.

adamsitnik · 2022-07-29T14:43:49Z

I was able to reduce the gap using Unsafe APIs, but it's still not acceptable.

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2

Method	Job	Size	Mean	Ratio
BitArrayNot	PR	512	7.6331 ns	1.32
BitArrayNot	main	512	5.8448 ns	1.00

BitArrayAnd	PR	512	12.5956 ns	1.30
BitArrayAnd	main	512	9.7358 ns	1.00

BitArrayOr	PR	512	12.7886 ns	1.36
BitArrayOr	main	512	9.4089 ns	1.00

BitArrayXor	PR	512	12.7228 ns	1.36
BitArrayXor	main	512	9.3926 ns	1.00

port BitArray.And, Or and Xor to Vector<T>

ac0a2a9

adamsitnik requested a review from tannergooding July 19, 2022 16:57

dotnet-issue-labeler bot added the area-System.Collections label Jul 19, 2022

ghost assigned adamsitnik Jul 19, 2022

tannergooding reviewed Jul 19, 2022

View reviewed changes

use Unsafe APIs to read and store Vector<T> to reduce the regression

e400ae1

adamsitnik closed this Jul 29, 2022

ghost locked as resolved and limited conversation to collaborators Aug 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

port BitArray.And, Or, Xor and Not to Vector<T> #72471

port BitArray.And, Or, Xor and Not to Vector<T> #72471

adamsitnik commented Jul 19, 2022

ghost commented Jul 19, 2022

x64

arm64

tannergooding Jul 19, 2022

tannergooding Jul 19, 2022

adamsitnik commented Jul 29, 2022

port BitArray.And, Or, Xor and Not to Vector<T> #72471

port BitArray.And, Or, Xor and Not to Vector<T> #72471

Conversation

adamsitnik commented Jul 19, 2022

x64

arm64

ghost commented Jul 19, 2022

x64

arm64

tannergooding Jul 19, 2022

Choose a reason for hiding this comment

tannergooding Jul 19, 2022

Choose a reason for hiding this comment

adamsitnik commented Jul 29, 2022