Handle final elements in SpanHelpers.Contains for byte and char vectorized #67492

Merged
merged 4 commits into from
Apr 25, 2022

@gfoidl gfoidl commented Apr 2, 2022

Description

Assume a searchSpace of length (n + 1) * Vector<T>.Count - k, where T is either byte or char and k is in (0, Vector<T>.Count).
Ignoring alignment for a moment, the current implementation performs n vectorized operations and then falls back to sequential processing of the remaining Vector<T>.Count - k elements.

In numbers for byte, AVX2, n = 2, and k = 1:

Vector<byte>.Count = 32
length     = 95
vectorized = 64
sequential = 31

As a ratio, (Vector<T>.Count - k) / (n * Vector<T>.Count) elements need to be processed sequentially, relative to the elements processed vectorized.
The worst case is k = 1 with small n: for byte with AVX2 and k = 1, 31 elements need to be processed sequentially.
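
As a quick check of the numbers above, here is a minimal sketch that splits a length into its vectorized and sequential parts. The names `TailSplit` and `Split` are hypothetical and not part of the PR; alignment is ignored, as in the description.

```csharp
using System;

static class TailSplit
{
    // Splits a search space of `length` elements into the part covered by
    // full vectors of `vectorCount` elements and the sequential remainder.
    // Alignment handling is deliberately ignored (assumption of this sketch).
    public static (int Vectorized, int Sequential) Split(int length, int vectorCount)
    {
        int vectorized = length - (length % vectorCount);
        return (vectorized, length - vectorized);
    }
}
```

For byte with AVX2, `TailSplit.Split(95, 32)` gives 64 vectorized and 31 sequential elements, matching the example.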

The proposed change avoids the sequential processing of the remaining elements by reading one final vector that ends exactly at the end of the searchSpace.
When exiting the standard vectorized loop, we know that the searchSpace is at least Vector<T>.Count elements long, so reading the last Vector<T>.Count elements is safe. The operation is idempotent too: the final vector may overlap elements that were already compared, but comparing them again cannot change the result of Contains.
Thus in total we do n + 1 vectorized operations.
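
The idea can be sketched roughly as follows. This is not the code from the PR: it is a simplified illustration that assumes a span-based signature, skips the alignment prologue, and uses the public `Vector<T>` API (`Vector.EqualsAny` and the `ReadOnlySpan` constructor) rather than the internal `SpanHelpers` helpers. `ContainsSketch` is a hypothetical name.

```csharp
using System;
using System.Numerics;

static class ContainsSketch
{
    public static bool Contains(ReadOnlySpan<byte> searchSpace, byte value)
    {
        int length = searchSpace.Length;
        int count = Vector<byte>.Count;

        if (length < count)
        {
            // Too short for a single vector: plain sequential scan.
            foreach (byte b in searchSpace)
            {
                if (b == value) return true;
            }
            return false;
        }

        var values = new Vector<byte>(value);
        int lastVectorStart = length - count;
        int offset = 0;

        // The n full vectors.
        while (offset <= lastVectorStart)
        {
            var data = new Vector<byte>(searchSpace.Slice(offset));
            if (Vector.EqualsAny(data, values)) return true;
            offset += count;
        }

        // Elements left over: read one final vector that ends at the end of
        // the search space. It overlaps already-checked elements, which is
        // harmless because the check is idempotent.
        if (offset < length)
        {
            var data = new Vector<byte>(searchSpace.Slice(lastVectorStart));
            if (Vector.EqualsAny(data, values)) return true;
        }

        return false;
    }
}
```

With a 95-byte input and 32-element vectors, the loop handles offsets 0 and 32, and the final vector covers elements 63 to 94, so no sequential tail remains.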

(Note: the same / similar approach is used in #67049, and some other places where idempotency can be used (I commented quite a few times on this 😉))

Benchmark results

Notes

  • The JIT doesn't hoist Vector<T>.Zero out of the loop. This PR hoists it manually, which is why a speedup shows up for length 64 (byte) and 32 (char).
  • For length 95 (byte) and 47 (char) the described effect is most visible, as this is the worst case: before this PR, the largest number of elements was processed sequentially.
  • By not jumping back to the sequential path, quite a lot of comparisons are saved too, resulting in a more streamlined and predictable instruction flow.

For the benchmarks the searchSpace is aligned to 32 bytes to get reproducible results.

machine info
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1586 (21H1/May2021Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-preview.4.22181.7
  [Host]     : .NET 7.0.0 (7.0.22.17907), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.17907), X64 RyuJIT

bool Contains(ref byte searchSpace, byte value, int length)

|  Method | Length |      Mean |     Error |    StdDev | Ratio |
|-------- |------- |----------:|----------:|----------:|------:|
| Default |     63 | 14.623 ns | 0.3486 ns | 0.9946 ns |  1.00 |
|      PR |     63 | 14.479 ns | 0.2283 ns | 0.2135 ns |  0.99 |
|         |        |           |           |           |       |
| Default |     64 |  4.419 ns | 0.1277 ns | 0.1790 ns |  1.00 |
|      PR |     64 |  3.963 ns | 0.0494 ns | 0.0462 ns |  0.87 |
|         |        |           |           |           |       |
| Default |     65 |  6.412 ns | 0.1566 ns | 0.1608 ns |  1.00 |
|      PR |     65 |  4.469 ns | 0.0229 ns | 0.0203 ns |  0.70 |
|         |        |           |           |           |       |
| Default |     95 | 11.318 ns | 0.1033 ns | 0.0966 ns |  1.00 |
|      PR |     95 |  4.502 ns | 0.0543 ns | 0.0508 ns |  0.40 |
|         |        |           |           |           |       |
| Default |    100 |  8.343 ns | 0.1312 ns | 0.1096 ns |  1.00 |
|      PR |    100 |  5.246 ns | 0.1376 ns | 0.1287 ns |  0.63 |

bool Contains(ref char searchSpace, char value, int length)

|  Method | Length |      Mean |     Error |    StdDev | Ratio |
|-------- |------- |----------:|----------:|----------:|------:|
| Default |     31 | 10.349 ns | 0.2485 ns | 0.6325 ns |  1.00 |
|      PR |     31 | 10.232 ns | 0.2426 ns | 0.5998 ns |  0.99 |
|         |        |           |           |           |       |
| Default |     32 |  8.498 ns | 0.1741 ns | 0.1454 ns |  1.00 |
|      PR |     32 |  6.177 ns | 0.1473 ns | 0.1967 ns |  0.71 |
|         |        |           |           |           |       |
| Default |     33 |  8.751 ns | 0.2081 ns | 0.1946 ns |  1.00 |
|      PR |     33 |  4.655 ns | 0.1279 ns | 0.1313 ns |  0.53 |
|         |        |           |           |           |       |
| Default |     47 |  9.665 ns | 0.2301 ns | 0.3373 ns |  1.00 |
|      PR |     47 |  4.563 ns | 0.0421 ns | 0.0374 ns |  0.46 |
|         |        |           |           |           |       |
| Default |    100 |  9.096 ns | 0.1002 ns | 0.0937 ns |  1.00 |
|      PR |    100 |  8.182 ns | 0.1001 ns | 0.0836 ns |  0.90 |

Machine code (x64)

SpanHelpers.Contains(byte)
; SpanHelpersContainsByteBenchmark.Default()
       mov       rdx,[rcx+8]
       movzx     eax,byte ptr [rcx+14]
       mov       r8d,[rcx+10]
       mov       rcx,rdx
       mov       edx,eax
       jmp       qword ptr [7FFB35F51420]
; Total bytes of code 23

; SpanHelpersContainsByteBenchmark.Contains(Byte ByRef, Byte, Int32)
       vzeroupper
       movzx     eax,dl
       mov       edx,eax
       xor       r9d,r9d
       mov       r10d,r8d
       mov       r11,r10
       cmp       r8d,40
       jl        short M01_L00
       mov       r11,rcx
       and       r11,1F
       neg       r11
       add       r11,20
       and       r11,1F
M01_L00:
       cmp       r11,8
       jb        near ptr M01_L02
M01_L01:
       add       r11,0FFFFFFFFFFFFFFF8
       movzx     r8d,byte ptr [rcx+r9]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+1]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+2]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+3]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+4]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+5]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+6]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+7]
       cmp       r8d,edx
       je        near ptr M01_L09
       add       r9,8
       cmp       r11,8
       jae       near ptr M01_L01
M01_L02:
       cmp       r11,4
       jb        short M01_L03
       add       r11,0FFFFFFFFFFFFFFFC
       movzx     r8d,byte ptr [rcx+r9]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+1]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+2]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+3]
       cmp       r8d,edx
       je        near ptr M01_L09
       add       r9,4
M01_L03:
       test      r11,r11
       je        short M01_L05
M01_L04:
       dec       r11
       movzx     r8d,byte ptr [rcx+r9]
       cmp       r8d,edx
       je        short M01_L09
       inc       r9
       test      r11,r11
       jne       short M01_L04
M01_L05:
       cmp       r9,r10
       jae       short M01_L08
       mov       r11,r10
       sub       r11,r9
       and       r11,0FFFFFFFFFFFFFFE0
       imul      r8d,eax,1010101
       vmovd     xmm0,r8d
       vpbroadcastd ymm0,xmm0
       cmp       r11,r9
       jbe       short M01_L07
       nop       dword ptr [rax]
       nop       dword ptr [rax+rax]
M01_L06:
       vpcmpeqb  ymm1,ymm0,[rcx+r9]
       vxorps    ymm2,ymm2,ymm2
       vpxor     ymm1,ymm2,ymm1
       vptest    ymm1,ymm1
       jne       short M01_L09
       add       r9,20
       cmp       r11,r9
       ja        short M01_L06
M01_L07:
       cmp       r9,r10
       jae       short M01_L08
       mov       r11,r10
       sub       r11,r9
       jmp       near ptr M01_L00
M01_L08:
       xor       eax,eax
       vzeroupper
       ret
M01_L09:
       mov       eax,1
       vzeroupper
       ret
; Total bytes of code 397

; SpanHelpersContainsByteBenchmark.PR()
       mov       rdx,[rcx+8]
       movzx     eax,byte ptr [rcx+14]
       mov       r8d,[rcx+10]
       mov       rcx,rdx
       mov       edx,eax
       jmp       qword ptr [7FFB35F61438]
; Total bytes of code 23

; SpanHelpersContainsByteBenchmark.Contains_PR(Byte ByRef, Byte, Int32)
       push      rdi
       push      rsi
       vzeroupper
       movzx     eax,dl
       mov       edx,eax
       xor       r9d,r9d
       mov       r10d,r8d
       mov       r11,r10
       cmp       r8d,40
       jl        short M01_L00
       mov       r11,rcx
       and       r11,1F
       neg       r11
       add       r11,20
       and       r11,1F
M01_L00:
       cmp       r11,8
       jb        short M01_L02
       nop       dword ptr [rax]
M01_L01:
       add       r11,0FFFFFFFFFFFFFFF8
       lea       rsi,[rcx+r9]
       movzx     edi,byte ptr [rsi]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+1]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+2]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+3]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+4]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+5]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+6]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     esi,byte ptr [rsi+7]
       cmp       edx,esi
       je        near ptr M01_L09
       add       r9,8
       cmp       r11,8
       jae       short M01_L01
M01_L02:
       cmp       r11,4
       jb        short M01_L03
       add       r11,0FFFFFFFFFFFFFFFC
       lea       rsi,[rcx+r9]
       movzx     edi,byte ptr [rsi]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+1]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+2]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     esi,byte ptr [rsi+3]
       cmp       edx,esi
       je        near ptr M01_L09
       add       r9,4
M01_L03:
       test      r11,r11
       je        short M01_L05
       nop       dword ptr [rax+rax]
M01_L04:
       dec       r11
       movzx     esi,byte ptr [rcx+r9]
       cmp       esi,edx
       je        short M01_L09
       inc       r9
       test      r11,r11
       jne       short M01_L04
M01_L05:
       cmp       r9,r10
       jae       short M01_L08
       mov       r11,r10
       sub       r11,r9
       and       r11,0FFFFFFFFFFFFFFE0
       vxorps    ymm0,ymm0,ymm0
       imul      eax,1010101
       vmovd     xmm1,eax
       vpbroadcastd ymm1,xmm1
       cmp       r9,r11
       jae       short M01_L07
M01_L06:
       vpcmpeqb  ymm2,ymm1,[rcx+r9]
       vpxor     ymm2,ymm0,ymm2
       vptest    ymm2,ymm2
       jne       short M01_L09
       add       r9,20
       cmp       r9,r11
       jb        short M01_L06
M01_L07:
       cmp       r9,r10
       jae       short M01_L08
       add       r8d,0FFFFFFE0
       mov       r9d,r8d
       vpcmpeqb  ymm2,ymm1,[rcx+r9]
       vpxor     ymm0,ymm0,ymm2
       vptest    ymm0,ymm0
       jne       short M01_L09
M01_L08:
       xor       eax,eax
       vzeroupper
       pop       rsi
       pop       rdi
       ret
M01_L09:
       mov       eax,1
       vzeroupper
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 389

SpanHelpers.Contains(char)
; SpanHelpersContainsCharBenchmark.Default()
       mov       rdx,[rcx+8]
       movzx     eax,word ptr [rcx+14]
       mov       r8d,[rcx+10]
       mov       rcx,rdx
       mov       edx,eax
       jmp       qword ptr [7FFB35F51420]
; Total bytes of code 23

; SpanHelpersContainsCharBenchmark.Contains(Char ByRef, Char, Int32)
       push      rax
       vzeroupper
       xor       eax,eax
       mov       [rsp],rax
       mov       [rsp],rcx
       movsxd    rax,r8d
       lea       r9,[rcx+rax*2]
       cmp       r8d,20
       jl        short M01_L00
       mov       r8d,ecx
       and       r8d,1F
       mov       eax,r8d
       shr       eax,1F
       add       eax,r8d
       sar       eax,1
       mov       r8d,eax
       neg       r8d
       add       r8d,10
       and       r8d,0F
M01_L00:
       cmp       r8d,4
       jl        short M01_L02
       movzx     r10d,dx
M01_L01:
       add       r8d,0FFFFFFFC
       movzx     eax,word ptr [rcx]
       cmp       r10d,eax
       je        near ptr M01_L08
       movzx     eax,word ptr [rcx+2]
       cmp       r10d,eax
       je        near ptr M01_L08
       movzx     eax,word ptr [rcx+4]
       cmp       r10d,eax
       je        near ptr M01_L08
       movzx     eax,word ptr [rcx+6]
       cmp       r10d,eax
       je        near ptr M01_L08
       add       rcx,8
       cmp       r8d,4
       jge       short M01_L01
M01_L02:
       test      r8d,r8d
       jle       short M01_L04
       movzx     r10d,dx
       nop
M01_L03:
       dec       r8d
       movzx     eax,word ptr [rcx]
       cmp       r10d,eax
       je        near ptr M01_L08
       add       rcx,2
       test      r8d,r8d
       jg        short M01_L03
M01_L04:
       cmp       rcx,r9
       jae       short M01_L07
       mov       r8,r9
       sub       r8,rcx
       mov       rax,r8
       shr       rax,3F
       add       rax,r8
       sar       rax,1
       mov       r8d,eax
       and       r8d,0FFFFFFF0
       movzx     r10d,dx
       imul      eax,r10d,10001
       vmovd     xmm0,eax
       vpbroadcastd ymm0,xmm0
       test      r8d,r8d
       jle       short M01_L06
M01_L05:
       vpcmpeqw  ymm1,ymm0,[rcx]
       vxorps    ymm2,ymm2,ymm2
       vpxor     ymm1,ymm2,ymm1
       vptest    ymm1,ymm1
       jne       short M01_L08
       add       rcx,20
       add       r8d,0FFFFFFF0
       test      r8d,r8d
       jg        short M01_L05
M01_L06:
       cmp       rcx,r9
       jae       short M01_L07
       mov       r8,r9
       sub       r8,rcx
       mov       rax,r8
       shr       rax,3F
       add       rax,r8
       sar       rax,1
       mov       r8d,eax
       jmp       near ptr M01_L00
M01_L07:
       xor       eax,eax
       vzeroupper
       add       rsp,8
       ret
M01_L08:
       mov       eax,1
       vzeroupper
       add       rsp,8
       ret
; Total bytes of code 311

; SpanHelpersContainsCharBenchmark.PR()
       mov       rdx,[rcx+8]
       movzx     eax,word ptr [rcx+14]
       mov       r8d,[rcx+10]
       mov       rcx,rdx
       mov       edx,eax
       jmp       qword ptr [7FFB35F51438]
; Total bytes of code 23

; SpanHelpersContainsCharBenchmark.Contains_PR(Char ByRef, Char, Int32)
       push      rsi
       sub       rsp,10
       vzeroupper
       xor       eax,eax
       mov       [rsp+8],rax
       mov       [rsp+8],rcx
       xor       r9d,r9d
       mov       r10d,r8d
       mov       r11,r10
       cmp       r8d,20
       jl        short M01_L00
       mov       r11d,ecx
       and       r11d,1F
       shr       r11d,1
       mov       eax,r11d
       neg       eax
       add       eax,10
       and       eax,0F
       mov       r11d,eax
M01_L00:
       cmp       r11,4
       jb        short M01_L02
       movzx     r8d,dx
M01_L01:
       add       r11,0FFFFFFFFFFFFFFFC
       lea       rax,[rcx+r9*2]
       movzx     esi,word ptr [rax]
       cmp       r8d,esi
       je        near ptr M01_L08
       movzx     esi,word ptr [rax+2]
       cmp       r8d,esi
       je        near ptr M01_L08
       movzx     esi,word ptr [rax+4]
       cmp       r8d,esi
       je        near ptr M01_L08
       movzx     eax,word ptr [rax+6]
       cmp       r8d,eax
       je        near ptr M01_L08
       add       r9,4
       cmp       r11,4
       jae       short M01_L01
M01_L02:
       test      r11,r11
       je        short M01_L04
       movzx     r8d,dx
       nop       dword ptr [rax+rax]
       nop       dword ptr [rax+rax]
M01_L03:
       dec       r11
       movzx     eax,word ptr [rcx+r9*2]
       cmp       eax,r8d
       je        near ptr M01_L08
       inc       r9
       test      r11,r11
       jne       short M01_L03
M01_L04:
       cmp       r9,r10
       jae       short M01_L07
       mov       r11,r10
       sub       r11,r9
       and       r11,0FFFFFFFFFFFFFFF0
       vxorps    ymm0,ymm0,ymm0
       movzx     r8d,dx
       imul      eax,r8d,10001
       vmovd     xmm1,eax
       vpbroadcastd ymm1,xmm1
       cmp       r9,r11
       jae       short M01_L06
       nop       word ptr [rax+rax]
M01_L05:
       vpcmpeqw  ymm2,ymm1,[rcx+r9*2]
       vpxor     ymm2,ymm0,ymm2
       vptest    ymm2,ymm2
       jne       short M01_L08
       add       r9,10
       cmp       r9,r11
       jb        short M01_L05
M01_L06:
       cmp       r9,r10
       jae       short M01_L07
       vpcmpeqw  ymm2,ymm1,[rcx+r10*2+0FFE0]
       vpxor     ymm0,ymm0,ymm2
       vptest    ymm0,ymm0
       jne       short M01_L08
M01_L07:
       xor       eax,eax
       vzeroupper
       add       rsp,10
       pop       rsi
       ret
M01_L08:
       mov       eax,1
       vzeroupper
       add       rsp,10
       pop       rsi
       ret
; Total bytes of code 314

👉 If this looks good, I'd like to look into IndexOf, etc. too.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Apr 2, 2022
ghost commented Apr 2, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Let's assume we have a searchSpace of length (n + 1) * Vector<T>.Count - k, where T is either byte or char, and k in (0, Vector<T>.Count).
So current implementation -- ignoring alignment for a moment -- can perform n vectorized operations, then falls back to sequential processing of the remaining Vector<T>.Count - k elements.

In numbers for byte, AVX2, n = 2, and k = 1:

Vector<byte>.Count = 32
length     = 95
vectorized = 64
sequential = 31

So as ratio there are (Vector<T>.Count - k) / (n * Vector<T>.Count) elements that need to processed sequential.
The worst case is for k = 1 and small n, i.e. for AVX2 and k = 1 31 elements need to be processed sequential.

The proposed change avoids the sequential processing of the remaining elements by reading a final vector from the end of the searchSpace.
When exiting the standard vectorized loop, we know that the searchSpace is at least Vector<T>.Count long, so it is safe to read from that end, and the operation is idempotent too.
Thus in total we do n + 1 vectorized operations.

(Note: the same / similar approach is used in #67049, and some other places where idempotency can be used (I commented quite a few times on this 😉))

Benchmark results

Notes

  • JIT doesn't hoist Vector<T>.Zero outside the loop, this is done manually with this PR and that's why for length 64 (byte) and 32 (char) a speedup is shown
  • for length 95 (byte) and 47 (char) the described effect is most visible, as this is the worst case, meaning most elements were processed sequential before this PR
  • by not jumping back to the sequential path quite a lot of comparisons get saved too, resulting in a more streamlined and predicatable instruction-flow

For the benchmarks the searchSpace is aligned to 32 bytes, to have reproducable results.

machine info
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1586 (21H1/May2021Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-preview.4.22181.7
  [Host]     : .NET 7.0.0 (7.0.22.17907), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.17907), X64 RyuJIT

bool Contains(ref byte searchSpace, byte value, int length)

|  Method | Length |      Mean |     Error |    StdDev | Ratio |
|-------- |------- |----------:|----------:|----------:|------:|
| Default |     63 | 14.623 ns | 0.3486 ns | 0.9946 ns |  1.00 |
|      PR |     63 | 14.479 ns | 0.2283 ns | 0.2135 ns |  0.99 |
|         |        |           |           |           |       |
| Default |     64 |  4.419 ns | 0.1277 ns | 0.1790 ns |  1.00 |
|      PR |     64 |  3.963 ns | 0.0494 ns | 0.0462 ns |  0.87 |
|         |        |           |           |           |       |
| Default |     65 |  6.412 ns | 0.1566 ns | 0.1608 ns |  1.00 |
|      PR |     65 |  4.469 ns | 0.0229 ns | 0.0203 ns |  0.70 |
|         |        |           |           |           |       |
| Default |     95 | 11.318 ns | 0.1033 ns | 0.0966 ns |  1.00 |
|      PR |     95 |  4.502 ns | 0.0543 ns | 0.0508 ns |  0.40 |
|         |        |           |           |           |       |
| Default |    100 |  8.343 ns | 0.1312 ns | 0.1096 ns |  1.00 |
|      PR |    100 |  5.246 ns | 0.1376 ns | 0.1287 ns |  0.63 |

bool Contains(ref char searchSpace, char value, int length)

|  Method | Length |      Mean |     Error |    StdDev | Ratio |
|-------- |------- |----------:|----------:|----------:|------:|
| Default |     31 | 10.349 ns | 0.2485 ns | 0.6325 ns |  1.00 |
|      PR |     31 | 10.232 ns | 0.2426 ns | 0.5998 ns |  0.99 |
|         |        |           |           |           |       |
| Default |     32 |  8.498 ns | 0.1741 ns | 0.1454 ns |  1.00 |
|      PR |     32 |  6.177 ns | 0.1473 ns | 0.1967 ns |  0.71 |
|         |        |           |           |           |       |
| Default |     33 |  8.751 ns | 0.2081 ns | 0.1946 ns |  1.00 |
|      PR |     33 |  4.655 ns | 0.1279 ns | 0.1313 ns |  0.53 |
|         |        |           |           |           |       |
| Default |     47 |  9.665 ns | 0.2301 ns | 0.3373 ns |  1.00 |
|      PR |     47 |  4.563 ns | 0.0421 ns | 0.0374 ns |  0.46 |
|         |        |           |           |           |       |
| Default |    100 |  9.096 ns | 0.1002 ns | 0.0937 ns |  1.00 |
|      PR |    100 |  8.182 ns | 0.1001 ns | 0.0836 ns |  0.90 |

Machine code (x64)

SpanHelpers.Contains(byte)
; SpanHelpersContainsByteBenchmark.Default()
       mov       rdx,[rcx+8]
       movzx     eax,byte ptr [rcx+14]
       mov       r8d,[rcx+10]
       mov       rcx,rdx
       mov       edx,eax
       jmp       qword ptr [7FFB35F51420]
; Total bytes of code 23

; SpanHelpersContainsByteBenchmark.Contains(Byte ByRef, Byte, Int32)
       vzeroupper
       movzx     eax,dl
       mov       edx,eax
       xor       r9d,r9d
       mov       r10d,r8d
       mov       r11,r10
       cmp       r8d,40
       jl        short M01_L00
       mov       r11,rcx
       and       r11,1F
       neg       r11
       add       r11,20
       and       r11,1F
M01_L00:
       cmp       r11,8
       jb        near ptr M01_L02
M01_L01:
       add       r11,0FFFFFFFFFFFFFFF8
       movzx     r8d,byte ptr [rcx+r9]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+1]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+2]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+3]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+4]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+5]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+6]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+7]
       cmp       r8d,edx
       je        near ptr M01_L09
       add       r9,8
       cmp       r11,8
       jae       near ptr M01_L01
M01_L02:
       cmp       r11,4
       jb        short M01_L03
       add       r11,0FFFFFFFFFFFFFFFC
       movzx     r8d,byte ptr [rcx+r9]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+1]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+2]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+3]
       cmp       r8d,edx
       je        near ptr M01_L09
       add       r9,4
M01_L03:
       test      r11,r11
       je        short M01_L05
M01_L04:
       dec       r11
       movzx     r8d,byte ptr [rcx+r9]
       cmp       r8d,edx
       je        short M01_L09
       inc       r9
       test      r11,r11
       jne       short M01_L04
M01_L05:
       cmp       r9,r10
       jae       short M01_L08
       mov       r11,r10
       sub       r11,r9
       and       r11,0FFFFFFFFFFFFFFE0
       imul      r8d,eax,1010101
       vmovd     xmm0,r8d
       vpbroadcastd ymm0,xmm0
       cmp       r11,r9
       jbe       short M01_L07
       nop       dword ptr [rax]
       nop       dword ptr [rax+rax]
M01_L06:
       vpcmpeqb  ymm1,ymm0,[rcx+r9]
       vxorps    ymm2,ymm2,ymm2
       vpxor     ymm1,ymm2,ymm1
       vptest    ymm1,ymm1
       jne       short M01_L09
       add       r9,20
       cmp       r11,r9
       ja        short M01_L06
M01_L07:
       cmp       r9,r10
       jae       short M01_L08
       mov       r11,r10
       sub       r11,r9
       jmp       near ptr M01_L00
M01_L08:
       xor       eax,eax
       vzeroupper
       ret
M01_L09:
       mov       eax,1
       vzeroupper
       ret
; Total bytes of code 397
; SpanHelpersContainsByteBenchmark.PR()
       mov       rdx,[rcx+8]
       movzx     eax,byte ptr [rcx+14]
       mov       r8d,[rcx+10]
       mov       rcx,rdx
       mov       edx,eax
       jmp       qword ptr [7FFB35F61438]
; Total bytes of code 23

; SpanHelpersContainsByteBenchmark.Contains_PR(Byte ByRef, Byte, Int32)
       push      rdi
       push      rsi
       vzeroupper
       movzx     eax,dl
       mov       edx,eax
       xor       r9d,r9d
       mov       r10d,r8d
       mov       r11,r10
       cmp       r8d,40
       jl        short M01_L00
       mov       r11,rcx
       and       r11,1F
       neg       r11
       add       r11,20
       and       r11,1F
M01_L00:
       cmp       r11,8
       jb        short M01_L02
       nop       dword ptr [rax]
M01_L01:
       add       r11,0FFFFFFFFFFFFFFF8
       lea       rsi,[rcx+r9]
       movzx     edi,byte ptr [rsi]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+1]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+2]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+3]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+4]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+5]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+6]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     esi,byte ptr [rsi+7]
       cmp       edx,esi
       je        near ptr M01_L09
       add       r9,8
       cmp       r11,8
       jae       short M01_L01
M01_L02:
       cmp       r11,4
       jb        short M01_L03
       add       r11,0FFFFFFFFFFFFFFFC
       lea       rsi,[rcx+r9]
       movzx     edi,byte ptr [rsi]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+1]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+2]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     esi,byte ptr [rsi+3]
       cmp       edx,esi
       je        near ptr M01_L09
       add       r9,4
M01_L03:
       test      r11,r11
       je        short M01_L05
       nop       dword ptr [rax+rax]
M01_L04:
       dec       r11
       movzx     esi,byte ptr [rcx+r9]
       cmp       esi,edx
       je        short M01_L09
       inc       r9
       test      r11,r11
       jne       short M01_L04
M01_L05:
       cmp       r9,r10
       jae       short M01_L08
       mov       r11,r10
       sub       r11,r9
       and       r11,0FFFFFFFFFFFFFFE0
       vxorps    ymm0,ymm0,ymm0
       imul      eax,1010101
       vmovd     xmm1,eax
       vpbroadcastd ymm1,xmm1
       cmp       r9,r11
       jae       short M01_L07
M01_L06:
       vpcmpeqb  ymm2,ymm1,[rcx+r9]
       vpxor     ymm2,ymm0,ymm2
       vptest    ymm2,ymm2
       jne       short M01_L09
       add       r9,20
       cmp       r9,r11
       jb        short M01_L06
M01_L07:
       cmp       r9,r10
       jae       short M01_L08
       add       r8d,0FFFFFFE0
       mov       r9d,r8d
       vpcmpeqb  ymm2,ymm1,[rcx+r9]
       vpxor     ymm0,ymm0,ymm2
       vptest    ymm0,ymm0
       jne       short M01_L09
M01_L08:
       xor       eax,eax
       vzeroupper
       pop       rsi
       pop       rdi
       ret
M01_L09:
       mov       eax,1
       vzeroupper
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 389
SpanHelpers.Contains(char)
; SpanHelpersContainsCharBenchmark.Default()
       mov       rdx,[rcx+8]
       movzx     eax,word ptr [rcx+14]
       mov       r8d,[rcx+10]
       mov       rcx,rdx
       mov       edx,eax
       jmp       qword ptr [7FFB35F51420]
; Total bytes of code 23

; SpanHelpersContainsCharBenchmark.Contains(Char ByRef, Char, Int32)
       push      rax
       vzeroupper
       xor       eax,eax
       mov       [rsp],rax
       mov       [rsp],rcx
       movsxd    rax,r8d
       lea       r9,[rcx+rax*2]
       cmp       r8d,20
       jl        short M01_L00
       mov       r8d,ecx
       and       r8d,1F
       mov       eax,r8d
       shr       eax,1F
       add       eax,r8d
       sar       eax,1
       mov       r8d,eax
       neg       r8d
       add       r8d,10
       and       r8d,0F
M01_L00:
       cmp       r8d,4
       jl        short M01_L02
       movzx     r10d,dx
M01_L01:
       add       r8d,0FFFFFFFC
       movzx     eax,word ptr [rcx]
       cmp       r10d,eax
       je        near ptr M01_L08
       movzx     eax,word ptr [rcx+2]
       cmp       r10d,eax
       je        near ptr M01_L08
       movzx     eax,word ptr [rcx+4]
       cmp       r10d,eax
       je        near ptr M01_L08
       movzx     eax,word ptr [rcx+6]
       cmp       r10d,eax
       je        near ptr M01_L08
       add       rcx,8
       cmp       r8d,4
       jge       short M01_L01
M01_L02:
       test      r8d,r8d
       jle       short M01_L04
       movzx     r10d,dx
       nop
M01_L03:
       dec       r8d
       movzx     eax,word ptr [rcx]
       cmp       r10d,eax
       je        near ptr M01_L08
       add       rcx,2
       test      r8d,r8d
       jg        short M01_L03
M01_L04:
       cmp       rcx,r9
       jae       short M01_L07
       mov       r8,r9
       sub       r8,rcx
       mov       rax,r8
       shr       rax,3F
       add       rax,r8
       sar       rax,1
       mov       r8d,eax
       and       r8d,0FFFFFFF0
       movzx     r10d,dx
       imul      eax,r10d,10001
       vmovd     xmm0,eax
       vpbroadcastd ymm0,xmm0
       test      r8d,r8d
       jle       short M01_L06
M01_L05:
       vpcmpeqw  ymm1,ymm0,[rcx]
       vxorps    ymm2,ymm2,ymm2
       vpxor     ymm1,ymm2,ymm1
       vptest    ymm1,ymm1
       jne       short M01_L08
       add       rcx,20
       add       r8d,0FFFFFFF0
       test      r8d,r8d
       jg        short M01_L05
M01_L06:
       cmp       rcx,r9
       jae       short M01_L07
       mov       r8,r9
       sub       r8,rcx
       mov       rax,r8
       shr       rax,3F
       add       rax,r8
       sar       rax,1
       mov       r8d,eax
       jmp       near ptr M01_L00
M01_L07:
       xor       eax,eax
       vzeroupper
       add       rsp,8
       ret
M01_L08:
       mov       eax,1
       vzeroupper
       add       rsp,8
       ret
; Total bytes of code 311

; SpanHelpersContainsCharBenchmark.PR()
       mov       rdx,[rcx+8]
       movzx     eax,word ptr [rcx+14]
       mov       r8d,[rcx+10]
       mov       rcx,rdx
       mov       edx,eax
       jmp       qword ptr [7FFB35F51438]
; Total bytes of code 23

; SpanHelpersContainsCharBenchmark.Contains_PR(Char ByRef, Char, Int32)
       push      rsi
       sub       rsp,10
       vzeroupper
       xor       eax,eax
       mov       [rsp+8],rax
       mov       [rsp+8],rcx
       xor       r9d,r9d
       mov       r10d,r8d
       mov       r11,r10
       cmp       r8d,20
       jl        short M01_L00
       mov       r11d,ecx
       and       r11d,1F
       shr       r11d,1
       mov       eax,r11d
       neg       eax
       add       eax,10
       and       eax,0F
       mov       r11d,eax
M01_L00:
       cmp       r11,4
       jb        short M01_L02
       movzx     r8d,dx
M01_L01:
       add       r11,0FFFFFFFFFFFFFFFC
       lea       rax,[rcx+r9*2]
       movzx     esi,word ptr [rax]
       cmp       r8d,esi
       je        near ptr M01_L08
       movzx     esi,word ptr [rax+2]
       cmp       r8d,esi
       je        near ptr M01_L08
       movzx     esi,word ptr [rax+4]
       cmp       r8d,esi
       je        near ptr M01_L08
       movzx     eax,word ptr [rax+6]
       cmp       r8d,eax
       je        near ptr M01_L08
       add       r9,4
       cmp       r11,4
       jae       short M01_L01
M01_L02:
       test      r11,r11
       je        short M01_L04
       movzx     r8d,dx
       nop       dword ptr [rax+rax]
       nop       dword ptr [rax+rax]
M01_L03:
       dec       r11
       movzx     eax,word ptr [rcx+r9*2]
       cmp       eax,r8d
       je        near ptr M01_L08
       inc       r9
       test      r11,r11
       jne       short M01_L03
M01_L04:
       cmp       r9,r10
       jae       short M01_L07
       mov       r11,r10
       sub       r11,r9
       and       r11,0FFFFFFFFFFFFFFF0
       vxorps    ymm0,ymm0,ymm0
       movzx     r8d,dx
       imul      eax,r8d,10001
       vmovd     xmm1,eax
       vpbroadcastd ymm1,xmm1
       cmp       r9,r11
       jae       short M01_L06
       nop       word ptr [rax+rax]
M01_L05:
       vpcmpeqw  ymm2,ymm1,[rcx+r9*2]
       vpxor     ymm2,ymm0,ymm2
       vptest    ymm2,ymm2
       jne       short M01_L08
       add       r9,10
       cmp       r9,r11
       jb        short M01_L05
M01_L06:
       cmp       r9,r10
       jae       short M01_L07
       vpcmpeqw  ymm2,ymm1,[rcx+r10*2+0FFE0]
       vpxor     ymm0,ymm0,ymm2
       vptest    ymm0,ymm0
       jne       short M01_L08
M01_L07:
       xor       eax,eax
       vzeroupper
       add       rsp,10
       pop       rsi
       ret
M01_L08:
       mov       eax,1
       vzeroupper
       add       rsp,10
       pop       rsi
       ret
; Total bytes of code 314
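
For readers who prefer C# to disassembly: the load at M01_L06 in the Contains_PR listing above (`[rcx+r10*2+0FFE0]`, which appears to encode a displacement of -0x20, i.e. the last 32 bytes of the search space) corresponds to the "final vector read from the end" described in this PR. The following is only a minimal sketch of that idea, not the code merged into SpanHelpers. The class and method names are hypothetical, `ushort` stands in for `char`, and the alignment prologue is omitted.

```csharp
using System;
using System.Numerics;

// Hypothetical illustration only; not the runtime's SpanHelpers.Contains.
static class FinalVectorSketch
{
    public static bool Contains(ReadOnlySpan<ushort> searchSpace, ushort value)
    {
        int count = Vector<ushort>.Count;

        // Too short for a single vector (or no SIMD support): plain scalar scan.
        if (!Vector.IsHardwareAccelerated || searchSpace.Length < count)
        {
            foreach (ushort item in searchSpace)
            {
                if (item == value) return true;
            }
            return false;
        }

        var values = new Vector<ushort>(value);        // broadcast the needle
        int lastVectorStart = searchSpace.Length - count;
        int offset = 0;

        // Main loop: whole vectors from the front.
        while (offset <= lastVectorStart)
        {
            if (Vector.EqualsAny(values, new Vector<ushort>(searchSpace.Slice(offset))))
                return true;
            offset += count;
        }

        // Remainder: instead of a scalar loop, read one vector that ends exactly
        // at the end of the span. It overlaps elements already examined, which is
        // harmless because re-checking them cannot change the answer.
        if (offset < searchSpace.Length)
        {
            if (Vector.EqualsAny(values, new Vector<ushort>(searchSpace.Slice(lastVectorStart))))
                return true;
        }

        return false;
    }
}
```

With a length of 47 and 16 elements per vector, this performs two vector compares from the front and one overlapping compare covering elements 31 to 46, rather than 15 scalar compares.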

👉 If this looks good, I'd like to look into IndexOf, etc. too.

Author: gfoidl
Assignees: -
Labels: area-System.Memory, community-contribution

Milestone: -


@danmoseley
Member

It seems we're missing benchmarks for this? (If so can we add yours?)

VS didn't do this for comments (at least in my setup) automatically :-(
@gfoidl
Member Author

gfoidl commented Apr 3, 2022

It seems we're missing benchmarks for this? (If so can we add yours?)

Sure 😃 dotnet/performance#2347

- var matches = Vector.Equals(values, LoadVector(ref searchSpace, offset));
- if (Vector<byte>.Zero.Equals(matches))
+ matches = Vector.Equals(values, LoadVector(ref searchSpace, offset));
+ if (zero.Equals(matches))

Looks like there is a CQ (code quality) issue in this pattern:
[image]

(feel free to file an issue)


Moreover, you don't need to hoist it - it should not be used

Member Author

@gfoidl gfoidl Apr 3, 2022


Ah, vec1 == vec2 emits better code.

For the char-overload:

M01_L05:
       vpcmpeqw  ymm2,ymm1,[rcx+r9*2]
-      vpxor     ymm2,ymm0,ymm2
       vptest    ymm2,ymm2
       jne       short M01_L08
       add       r9,10
       cmp       r9,r11
       jb        short M01_L05
M01_L06:
       cmp       r9,r10
       jae       short M01_L07
       vpcmpeqw  ymm2,ymm1,[rcx+r10*2+0FFE0]
-      vpxor     ymm0,ymm0,ymm2
       vptest    ymm0,ymm0
       jne       short M01_L08

Thanks for the hint!
Will create an issue for that --> #67500
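
For context, the two ways of testing "no element matched" that this thread compares can be written side by side. This is a minimal sketch with hypothetical names. Both methods return the same result; the difference reported above is only in the code the JIT emitted at the time of this PR, and the sketch makes no claim about other runtime versions.

```csharp
using System.Numerics;

// Hypothetical illustration of the two comparison patterns discussed above.
static class ZeroCompareSketch
{
    // Pattern used in the diff under review: instance Equals against Vector<T>.Zero.
    public static bool NoMatchViaEquals(Vector<byte> values, Vector<byte> candidates)
    {
        Vector<byte> matches = Vector.Equals(values, candidates);
        return Vector<byte>.Zero.Equals(matches);
    }

    // Pattern suggested in review: the == operator.
    public static bool NoMatchViaOperator(Vector<byte> values, Vector<byte> candidates)
    {
        Vector<byte> matches = Vector.Equals(values, candidates);
        return matches == Vector<byte>.Zero;
    }
}
```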

Member

@adamsitnik adamsitnik left a comment


LGTM, thank you @gfoidl !


if (Vector.IsHardwareAccelerated && length >= Vector<byte>.Count * 2)
{
    lengthToExamine = UnalignedCountVector(ref searchSpace);
}

SequentialScan:

a removal of anything goto-related is always welcomed 👍

@@ -411,10 +413,17 @@ public static bool Contains(ref byte searchSpace, byte value, int length)
goto Found;
}

if (offset < (nuint)(uint)length)
// The total length is at least Vector<byte>.Count, so instead of falling back to a

thank you for adding the comment (otherwise it would not be obvious to me) 👍

@adamsitnik adamsitnik merged commit 1958c7e into dotnet:main Apr 25, 2022
@gfoidl gfoidl deleted the spanhelpers_final_elements_opt branch April 25, 2022 15:19
@AndyAyersMS
Member

AndyAyersMS commented Apr 27, 2022

Did we expect any perf regressions from this? Seems like it might be related to dotnet/perf-autofiling-issues#4884.

[image: newplot - 2022-04-27T100256 921]

@gfoidl
Member Author

gfoidl commented Apr 30, 2022

Did we expect any perf regressions from this?

No regression is expected, rather it should be an improvement.

When I check the benchmark code, SpanHelpers.Contains doesn't seem to be hit?

What is a proper way to investigate this regression?

@ghost ghost locked as resolved and limited conversation to collaborators May 30, 2022
Labels: area-System.Memory, community-contribution, tenet-performance
5 participants