Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

port BitArray.And, Or, Xor and Not to Vector<T> #72471

Closed
wants to merge 2 commits into from

Conversation

adamsitnik
Copy link
Member

I started the learning process by porting very simple BitArray APIs: And, Or, Xor and Not.

To my surprise, there is a notable regression:

x64

BenchmarkDotNet=v0.13.1.1799-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT
Method Toolchain Size Mean Ratio
BitArrayNot \PR\corerun.exe 4 0.8868 ns 1.52
BitArrayNot \baseline\corerun.exe 4 0.5938 ns 1.00
BitArrayAnd \PR\corerun.exe 4 1.8233 ns 0.95
BitArrayAnd \baseline\corerun.exe 4 1.9173 ns 1.00
BitArrayOr \PR\corerun.exe 4 1.8242 ns 0.92
BitArrayOr \baseline\corerun.exe 4 1.9889 ns 1.00
BitArrayXor \PR\corerun.exe 4 1.8429 ns 0.93
BitArrayXor \baseline\corerun.exe 4 1.9822 ns 1.00
BitArrayNot \PR\corerun.exe 512 11.4026 ns 1.80
BitArrayNot \baseline\corerun.exe 512 6.3584 ns 1.00
BitArrayAnd \PR\corerun.exe 512 14.6244 ns 1.57
BitArrayAnd \baseline\corerun.exe 512 9.2924 ns 1.00
BitArrayOr \PR\corerun.exe 512 14.7221 ns 1.58
BitArrayOr \baseline\corerun.exe 512 9.3151 ns 1.00
BitArrayXor \PR\corerun.exe 512 14.7987 ns 1.59
BitArrayXor \baseline\corerun.exe 512 9.2756 ns 1.00

The assembly diff for Not can be found here

arm64

BenchmarkDotNet=v0.13.1.1786-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22368.24
  [Host]     : .NET 7.0.0 (7.0.22.36704), Arm64 RyuJIT
Method Toolchain Size Mean Ratio
BitArrayNot /PR/corerun 4 6.358 ns 0.88
BitArrayNot /baseline/corerun 4 7.202 ns 1.00
BitArrayAnd /PR/corerun 4 11.824 ns 1.00
BitArrayAnd /baseline/corerun 4 11.820 ns 1.00
BitArrayOr /PR/corerun 4 11.889 ns 1.01
BitArrayOr /baseline/corerun 4 11.816 ns 1.00
BitArrayXor /PR/corerun 4 11.815 ns 1.01
BitArrayXor /baseline/corerun 4 11.708 ns 1.00
BitArrayNot /PR/corerun 512 75.451 ns 1.94
BitArrayNot /baseline/corerun 512 39.403 ns 1.00
BitArrayAnd /PR/corerun 512 95.542 ns 1.84
BitArrayAnd /baseline/corerun 512 51.936 ns 1.00
BitArrayOr /PR/corerun 512 95.537 ns 1.86
BitArrayOr /baseline/corerun 512 51.505 ns 1.00
BitArrayXor /PR/corerun 512 94.904 ns 1.83
BitArrayXor /baseline/corerun 512 51.989 ns 1.00

@tannergooding I don't want to merge this PR as is, but rather find out whether I am doing everything right and what could be done to close the gap.

cc @stephentoub

@ghost
Copy link

ghost commented Jul 19, 2022

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

Issue Details

I started the learning process by porting very simple BitArray APIs: And, Or, Xor and Not.

To my surprise, there is a notable regression:

x64

BenchmarkDotNet=v0.13.1.1799-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT
Method Toolchain Size Mean Ratio
BitArrayNot \PR\corerun.exe 4 0.8868 ns 1.52
BitArrayNot \baseline\corerun.exe 4 0.5938 ns 1.00
BitArrayAnd \PR\corerun.exe 4 1.8233 ns 0.95
BitArrayAnd \baseline\corerun.exe 4 1.9173 ns 1.00
BitArrayOr \PR\corerun.exe 4 1.8242 ns 0.92
BitArrayOr \baseline\corerun.exe 4 1.9889 ns 1.00
BitArrayXor \PR\corerun.exe 4 1.8429 ns 0.93
BitArrayXor \baseline\corerun.exe 4 1.9822 ns 1.00
BitArrayNot \PR\corerun.exe 512 11.4026 ns 1.80
BitArrayNot \baseline\corerun.exe 512 6.3584 ns 1.00
BitArrayAnd \PR\corerun.exe 512 14.6244 ns 1.57
BitArrayAnd \baseline\corerun.exe 512 9.2924 ns 1.00
BitArrayOr \PR\corerun.exe 512 14.7221 ns 1.58
BitArrayOr \baseline\corerun.exe 512 9.3151 ns 1.00
BitArrayXor \PR\corerun.exe 512 14.7987 ns 1.59
BitArrayXor \baseline\corerun.exe 512 9.2756 ns 1.00

The assembly diff for Not can be found here

arm64

BenchmarkDotNet=v0.13.1.1786-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22368.24
  [Host]     : .NET 7.0.0 (7.0.22.36704), Arm64 RyuJIT
Method Toolchain Size Mean Ratio
BitArrayNot /PR/corerun 4 6.358 ns 0.88
BitArrayNot /baseline/corerun 4 7.202 ns 1.00
BitArrayAnd /PR/corerun 4 11.824 ns 1.00
BitArrayAnd /baseline/corerun 4 11.820 ns 1.00
BitArrayOr /PR/corerun 4 11.889 ns 1.01
BitArrayOr /baseline/corerun 4 11.816 ns 1.00
BitArrayXor /PR/corerun 4 11.815 ns 1.01
BitArrayXor /baseline/corerun 4 11.708 ns 1.00
BitArrayNot /PR/corerun 512 75.451 ns 1.94
BitArrayNot /baseline/corerun 512 39.403 ns 1.00
BitArrayAnd /PR/corerun 512 95.542 ns 1.84
BitArrayAnd /baseline/corerun 512 51.936 ns 1.00
BitArrayOr /PR/corerun 512 95.537 ns 1.86
BitArrayOr /baseline/corerun 512 51.505 ns 1.00
BitArrayXor /PR/corerun 512 94.904 ns 1.83
BitArrayXor /baseline/corerun 512 51.989 ns 1.00

@tannergooding I don't want to merge this PR as is, but rather find out whether I am doing everything right and what could be done to close the gap.

cc @stephentoub

Author: adamsitnik
Assignees: -
Labels:

area-System.Collections

Milestone: -

}
}
else if (Vector128.IsHardwareAccelerated)
if (Vector.IsHardwareAccelerated)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Vector<T> is likely a non-starter right now.

Not only can it not be used with R2R (crossgen) code today since it is variable sized and therefore forces the code to be jitted, but it also is missing certain APIs that are available to Vector128/256<T>. A few of these functions, like ExtractMostSignificantBits can't be exposed on Vector<T>, others like LoadUnsafe could be but aren't today.

There is a design being considered (dotnet/designs#268) where we can extend Vector<T> to better work with such scenarios and better enable its usage in other scenarios, but that isn't available at the moment and its not 100% clear what shape that will end up as.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this case in particular, Vector<T> would likely be similar in perf if we had LoadUnsafe and StoreUnsafe APIs available.

However, it would still come with the restriction that it could not participate in R2R (and therefore forces jitting on first use) and it may regress scenarios where the backing data is typically "small".

For example, Vector<T> on most modern x64 hardware is equivalent to only having the Vector256<T> path. On Arm64, its equivalent to only having the Vector128<T> path. For x64, this means that inputs less than 32-bytes (and potentially 64-bytes in the future) will behave "worse" than the equivalent on Arm64 as they'll execute as "scalar" rather than as "vector". Likewise, depending on data layout, alignment, and processor, it may behave "worse" for inputs up to ~256 bytes as well.

As indicated, these are scenarios that are being looked at and considered, but its not something that we can easily do today.

@adamsitnik
Copy link
Member Author

I was able to reduce the gap using Unsafe APIs, but it's still not acceptable.

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
Method Job Size Mean Ratio
BitArrayNot PR 512 7.6331 ns 1.32
BitArrayNot main 512 5.8448 ns 1.00
BitArrayAnd PR 512 12.5956 ns 1.30
BitArrayAnd main 512 9.7358 ns 1.00
BitArrayOr PR 512 12.7886 ns 1.36
BitArrayOr main 512 9.4089 ns 1.00
BitArrayXor PR 512 12.7228 ns 1.36
BitArrayXor main 512 9.3926 ns 1.00

@adamsitnik adamsitnik closed this Jul 29, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Aug 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants