Handle final elements in SpanHelpers.Contains for byte and char vectorized #67492

gfoidl · 2022-04-02T21:27:25Z

Description

Let's assume we have a searchSpace of length (n + 1) * Vector<T>.Count - k, where T is either byte or char, and 0 < k < Vector<T>.Count.
The current implementation, ignoring alignment for a moment, performs n vectorized operations and then falls back to sequential processing of the remaining Vector<T>.Count - k elements.

In numbers for byte, AVX2, n = 2, and k = 1:

Vector<byte>.Count = 32
length     = 95
vectorized = 64
sequential = 31

So, as a ratio to the vectorized part, (Vector<T>.Count - k) / (n * Vector<T>.Count) of the elements need to be processed sequentially.
The worst case is k = 1 with small n: for byte with AVX2 and k = 1, 31 elements need to be processed sequentially.
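
The arithmetic above can be reproduced with a small sketch. `RemainderSplit` and `Split` are hypothetical names used only for illustration, and, like the description, the helper ignores the alignment prologue:

```csharp
using System;

public static class RemainderSplit
{
    // Hypothetical helper: splits a length into the part covered by full
    // vectors and the tail that the pre-PR code handles element by element.
    // Alignment handling is deliberately ignored, as in the description.
    public static (int Vectorized, int Sequential) Split(int length, int vectorCount)
    {
        int vectorized = length - (length % vectorCount);
        return (vectorized, length - vectorized);
    }
}
```

For byte with AVX2 (`vectorCount` of 32) and length 95 this gives 64 vectorized and 31 sequential elements; for char (`vectorCount` of 16) and length 47 it gives 32 and 15.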

The proposed change avoids the sequential processing of the remaining elements by reading a final vector from the end of the searchSpace.
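When exiting the standard vectorized loop, we know that the searchSpace is at least Vector<T>.Count long, so it is safe to read a full vector ending at the last element. The operation is idempotent too: re-checking elements that overlap the previous vector cannot change the result of Contains.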
Thus in total we do n + 1 vectorized operations.
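
A minimal sketch of the idea, assuming a span-based helper rather than the ref arithmetic and alignment handling used in SpanHelpers. `TailVectorSketch` and its `Contains` method are hypothetical names and not the code of this PR:

```csharp
using System;
using System.Numerics;

public static class TailVectorSketch
{
    // Hypothetical helper: not the runtime's SpanHelpers.Contains.
    public static bool Contains(ReadOnlySpan<byte> searchSpace, byte value)
    {
        int count = Vector<byte>.Count;

        // Fewer elements than one vector: only a scalar scan is possible.
        if (!Vector.IsHardwareAccelerated || searchSpace.Length < count)
        {
            foreach (byte b in searchSpace)
            {
                if (b == value) return true;
            }
            return false;
        }

        Vector<byte> values = new Vector<byte>(value);
        int lastStart = searchSpace.Length - count; // start of the final full vector
        int offset = 0;

        // The n vectorized operations over non-overlapping blocks.
        for (; offset <= lastStart; offset += count)
        {
            if (Vector.EqualsAny(values, new Vector<byte>(searchSpace.Slice(offset))))
                return true;
        }

        // Operation n + 1: if elements remain, read one vector that ends at the
        // last element. It overlaps already-checked elements, which is harmless.
        if (offset < searchSpace.Length)
        {
            return Vector.EqualsAny(values, new Vector<byte>(searchSpace.Slice(lastStart)));
        }

        return false;
    }
}
```

With a length of 95 and 32 elements per vector, the loop covers offsets 0 and 32, and the final read covers elements 63 to 94, so no element is visited sequentially.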

(Note: the same / similar approach is used in #67049, and some other places where idempotency can be used (I commented quite a few times on this 😉))

Benchmark results

Notes

The JIT doesn't hoist Vector<T>.Zero out of the loop; this PR does it manually, which is why lengths 64 (byte) and 32 (char) also show a speedup.
For lengths 95 (byte) and 47 (char) the described effect is most visible, as this is the worst case: before this PR, the largest number of elements was processed sequentially.
Not jumping back to the sequential path also saves quite a lot of comparisons, resulting in a more streamlined and predictable instruction flow.

For the benchmarks, the searchSpace is aligned to 32 bytes to get reproducible results.
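
The manual hoisting mentioned in the first note can be sketched as follows, assuming the match test is written as a comparison against a zero vector (which is what the vpxor / vptest pair in the disassembly suggests). `HoistedZeroSketch` is a hypothetical name, and the helper only looks at full vectors:

```csharp
using System;
using System.Numerics;

public static class HoistedZeroSketch
{
    // Hypothetical helper: checks full vectors only, any tail is ignored.
    public static bool ContainsInFullVectors(ReadOnlySpan<byte> searchSpace, byte value)
    {
        Vector<byte> values = new Vector<byte>(value);

        // Read Vector<byte>.Zero once, before the loop, instead of
        // referencing it inside the loop body on every iteration.
        Vector<byte> zero = Vector<byte>.Zero;

        for (int offset = 0;
             offset + Vector<byte>.Count <= searchSpace.Length;
             offset += Vector<byte>.Count)
        {
            // Lanes that match are all ones, all other lanes are zero.
            Vector<byte> matches = Vector.Equals(values, new Vector<byte>(searchSpace.Slice(offset)));
            if (!zero.Equals(matches)) return true;
        }

        return false;
    }
}
```

In the listings below, the Default loop contains a vxorps inside M01_L06, while the PR version materializes the zero register once before the loop.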

machine info

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1586 (21H1/May2021Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-preview.4.22181.7
  [Host]     : .NET 7.0.0 (7.0.22.17907), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.17907), X64 RyuJIT

bool Contains(ref byte searchSpace, byte value, int length)

|  Method | Length |      Mean |     Error |    StdDev | Ratio |
|-------- |------- |----------:|----------:|----------:|------:|
| Default |     63 | 14.623 ns | 0.3486 ns | 0.9946 ns |  1.00 |
|      PR |     63 | 14.479 ns | 0.2283 ns | 0.2135 ns |  0.99 |
|         |        |           |           |           |       |
| Default |     64 |  4.419 ns | 0.1277 ns | 0.1790 ns |  1.00 |
|      PR |     64 |  3.963 ns | 0.0494 ns | 0.0462 ns |  0.87 |
|         |        |           |           |           |       |
| Default |     65 |  6.412 ns | 0.1566 ns | 0.1608 ns |  1.00 |
|      PR |     65 |  4.469 ns | 0.0229 ns | 0.0203 ns |  0.70 |
|         |        |           |           |           |       |
| Default |     95 | 11.318 ns | 0.1033 ns | 0.0966 ns |  1.00 |
|      PR |     95 |  4.502 ns | 0.0543 ns | 0.0508 ns |  0.40 |
|         |        |           |           |           |       |
| Default |    100 |  8.343 ns | 0.1312 ns | 0.1096 ns |  1.00 |
|      PR |    100 |  5.246 ns | 0.1376 ns | 0.1287 ns |  0.63 |

bool Contains(ref char searchSpace, char value, int length)

|  Method | Length |      Mean |     Error |    StdDev | Ratio |
|-------- |------- |----------:|----------:|----------:|------:|
| Default |     31 | 10.349 ns | 0.2485 ns | 0.6325 ns |  1.00 |
|      PR |     31 | 10.232 ns | 0.2426 ns | 0.5998 ns |  0.99 |
|         |        |           |           |           |       |
| Default |     32 |  8.498 ns | 0.1741 ns | 0.1454 ns |  1.00 |
|      PR |     32 |  6.177 ns | 0.1473 ns | 0.1967 ns |  0.71 |
|         |        |           |           |           |       |
| Default |     33 |  8.751 ns | 0.2081 ns | 0.1946 ns |  1.00 |
|      PR |     33 |  4.655 ns | 0.1279 ns | 0.1313 ns |  0.53 |
|         |        |           |           |           |       |
| Default |     47 |  9.665 ns | 0.2301 ns | 0.3373 ns |  1.00 |
|      PR |     47 |  4.563 ns | 0.0421 ns | 0.0374 ns |  0.46 |
|         |        |           |           |           |       |
| Default |    100 |  9.096 ns | 0.1002 ns | 0.0937 ns |  1.00 |
|      PR |    100 |  8.182 ns | 0.1001 ns | 0.0836 ns |  0.90 |

Machine code (x64)

SpanHelpers.Contains(byte)

; SpanHelpersContainsByteBenchmark.Default()
       mov       rdx,[rcx+8]
       movzx     eax,byte ptr [rcx+14]
       mov       r8d,[rcx+10]
       mov       rcx,rdx
       mov       edx,eax
       jmp       qword ptr [7FFB35F51420]
; Total bytes of code 23

; SpanHelpersContainsByteBenchmark.Contains(Byte ByRef, Byte, Int32)
       vzeroupper
       movzx     eax,dl
       mov       edx,eax
       xor       r9d,r9d
       mov       r10d,r8d
       mov       r11,r10
       cmp       r8d,40
       jl        short M01_L00
       mov       r11,rcx
       and       r11,1F
       neg       r11
       add       r11,20
       and       r11,1F
M01_L00:
       cmp       r11,8
       jb        near ptr M01_L02
M01_L01:
       add       r11,0FFFFFFFFFFFFFFF8
       movzx     r8d,byte ptr [rcx+r9]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+1]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+2]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+3]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+4]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+5]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+6]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+7]
       cmp       r8d,edx
       je        near ptr M01_L09
       add       r9,8
       cmp       r11,8
       jae       near ptr M01_L01
M01_L02:
       cmp       r11,4
       jb        short M01_L03
       add       r11,0FFFFFFFFFFFFFFFC
       movzx     r8d,byte ptr [rcx+r9]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+1]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+2]
       cmp       r8d,edx
       je        near ptr M01_L09
       movzx     r8d,byte ptr [rcx+r9+3]
       cmp       r8d,edx
       je        near ptr M01_L09
       add       r9,4
M01_L03:
       test      r11,r11
       je        short M01_L05
M01_L04:
       dec       r11
       movzx     r8d,byte ptr [rcx+r9]
       cmp       r8d,edx
       je        short M01_L09
       inc       r9
       test      r11,r11
       jne       short M01_L04
M01_L05:
       cmp       r9,r10
       jae       short M01_L08
       mov       r11,r10
       sub       r11,r9
       and       r11,0FFFFFFFFFFFFFFE0
       imul      r8d,eax,1010101
       vmovd     xmm0,r8d
       vpbroadcastd ymm0,xmm0
       cmp       r11,r9
       jbe       short M01_L07
       nop       dword ptr [rax]
       nop       dword ptr [rax+rax]
M01_L06:
       vpcmpeqb  ymm1,ymm0,[rcx+r9]
       vxorps    ymm2,ymm2,ymm2
       vpxor     ymm1,ymm2,ymm1
       vptest    ymm1,ymm1
       jne       short M01_L09
       add       r9,20
       cmp       r11,r9
       ja        short M01_L06
M01_L07:
       cmp       r9,r10
       jae       short M01_L08
       mov       r11,r10
       sub       r11,r9
       jmp       near ptr M01_L00
M01_L08:
       xor       eax,eax
       vzeroupper
       ret
M01_L09:
       mov       eax,1
       vzeroupper
       ret
; Total bytes of code 397

; SpanHelpersContainsByteBenchmark.PR()
       mov       rdx,[rcx+8]
       movzx     eax,byte ptr [rcx+14]
       mov       r8d,[rcx+10]
       mov       rcx,rdx
       mov       edx,eax
       jmp       qword ptr [7FFB35F61438]
; Total bytes of code 23

; SpanHelpersContainsByteBenchmark.Contains_PR(Byte ByRef, Byte, Int32)
       push      rdi
       push      rsi
       vzeroupper
       movzx     eax,dl
       mov       edx,eax
       xor       r9d,r9d
       mov       r10d,r8d
       mov       r11,r10
       cmp       r8d,40
       jl        short M01_L00
       mov       r11,rcx
       and       r11,1F
       neg       r11
       add       r11,20
       and       r11,1F
M01_L00:
       cmp       r11,8
       jb        short M01_L02
       nop       dword ptr [rax]
M01_L01:
       add       r11,0FFFFFFFFFFFFFFF8
       lea       rsi,[rcx+r9]
       movzx     edi,byte ptr [rsi]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+1]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+2]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+3]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+4]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+5]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+6]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     esi,byte ptr [rsi+7]
       cmp       edx,esi
       je        near ptr M01_L09
       add       r9,8
       cmp       r11,8
       jae       short M01_L01
M01_L02:
       cmp       r11,4
       jb        short M01_L03
       add       r11,0FFFFFFFFFFFFFFFC
       lea       rsi,[rcx+r9]
       movzx     edi,byte ptr [rsi]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+1]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     edi,byte ptr [rsi+2]
       cmp       edx,edi
       je        near ptr M01_L09
       movzx     esi,byte ptr [rsi+3]
       cmp       edx,esi
       je        near ptr M01_L09
       add       r9,4
M01_L03:
       test      r11,r11
       je        short M01_L05
       nop       dword ptr [rax+rax]
M01_L04:
       dec       r11
       movzx     esi,byte ptr [rcx+r9]
       cmp       esi,edx
       je        short M01_L09
       inc       r9
       test      r11,r11
       jne       short M01_L04
M01_L05:
       cmp       r9,r10
       jae       short M01_L08
       mov       r11,r10
       sub       r11,r9
       and       r11,0FFFFFFFFFFFFFFE0
       vxorps    ymm0,ymm0,ymm0
       imul      eax,1010101
       vmovd     xmm1,eax
       vpbroadcastd ymm1,xmm1
       cmp       r9,r11
       jae       short M01_L07
M01_L06:
       vpcmpeqb  ymm2,ymm1,[rcx+r9]
       vpxor     ymm2,ymm0,ymm2
       vptest    ymm2,ymm2
       jne       short M01_L09
       add       r9,20
       cmp       r9,r11
       jb        short M01_L06
M01_L07:
       cmp       r9,r10
       jae       short M01_L08
       add       r8d,0FFFFFFE0
       mov       r9d,r8d
       vpcmpeqb  ymm2,ymm1,[rcx+r9]
       vpxor     ymm0,ymm0,ymm2
       vptest    ymm0,ymm0
       jne       short M01_L09
M01_L08:
       xor       eax,eax
       vzeroupper
       pop       rsi
       pop       rdi
       ret
M01_L09:
       mov       eax,1
       vzeroupper
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 389

SpanHelpers.Contains(char)

; SpanHelpersContainsCharBenchmark.Default()
       mov       rdx,[rcx+8]
       movzx     eax,word ptr [rcx+14]
       mov       r8d,[rcx+10]
       mov       rcx,rdx
       mov       edx,eax
       jmp       qword ptr [7FFB35F51420]
; Total bytes of code 23

; SpanHelpersContainsCharBenchmark.Contains(Char ByRef, Char, Int32)
       push      rax
       vzeroupper
       xor       eax,eax
       mov       [rsp],rax
       mov       [rsp],rcx
       movsxd    rax,r8d
       lea       r9,[rcx+rax*2]
       cmp       r8d,20
       jl        short M01_L00
       mov       r8d,ecx
       and       r8d,1F
       mov       eax,r8d
       shr       eax,1F
       add       eax,r8d
       sar       eax,1
       mov       r8d,eax
       neg       r8d
       add       r8d,10
       and       r8d,0F
M01_L00:
       cmp       r8d,4
       jl        short M01_L02
       movzx     r10d,dx
M01_L01:
       add       r8d,0FFFFFFFC
       movzx     eax,word ptr [rcx]
       cmp       r10d,eax
       je        near ptr M01_L08
       movzx     eax,word ptr [rcx+2]
       cmp       r10d,eax
       je        near ptr M01_L08
       movzx     eax,word ptr [rcx+4]
       cmp       r10d,eax
       je        near ptr M01_L08
       movzx     eax,word ptr [rcx+6]
       cmp       r10d,eax
       je        near ptr M01_L08
       add       rcx,8
       cmp       r8d,4
       jge       short M01_L01
M01_L02:
       test      r8d,r8d
       jle       short M01_L04
       movzx     r10d,dx
       nop
M01_L03:
       dec       r8d
       movzx     eax,word ptr [rcx]
       cmp       r10d,eax
       je        near ptr M01_L08
       add       rcx,2
       test      r8d,r8d
       jg        short M01_L03
M01_L04:
       cmp       rcx,r9
       jae       short M01_L07
       mov       r8,r9
       sub       r8,rcx
       mov       rax,r8
       shr       rax,3F
       add       rax,r8
       sar       rax,1
       mov       r8d,eax
       and       r8d,0FFFFFFF0
       movzx     r10d,dx
       imul      eax,r10d,10001
       vmovd     xmm0,eax
       vpbroadcastd ymm0,xmm0
       test      r8d,r8d
       jle       short M01_L06
M01_L05:
       vpcmpeqw  ymm1,ymm0,[rcx]
       vxorps    ymm2,ymm2,ymm2
       vpxor     ymm1,ymm2,ymm1
       vptest    ymm1,ymm1
       jne       short M01_L08
       add       rcx,20
       add       r8d,0FFFFFFF0
       test      r8d,r8d
       jg        short M01_L05
M01_L06:
       cmp       rcx,r9
       jae       short M01_L07
       mov       r8,r9
       sub       r8,rcx
       mov       rax,r8
       shr       rax,3F
       add       rax,r8
       sar       rax,1
       mov       r8d,eax
       jmp       near ptr M01_L00
M01_L07:
       xor       eax,eax
       vzeroupper
       add       rsp,8
       ret
M01_L08:
       mov       eax,1
       vzeroupper
       add       rsp,8
       ret
; Total bytes of code 311

; SpanHelpersContainsCharBenchmark.PR()
       mov       rdx,[rcx+8]
       movzx     eax,word ptr [rcx+14]
       mov       r8d,[rcx+10]
       mov       rcx,rdx
       mov       edx,eax
       jmp       qword ptr [7FFB35F51438]
; Total bytes of code 23

; SpanHelpersContainsCharBenchmark.Contains_PR(Char ByRef, Char, Int32)
       push      rsi
       sub       rsp,10
       vzeroupper
       xor       eax,eax
       mov       [rsp+8],rax
       mov       [rsp+8],rcx
       xor       r9d,r9d
       mov       r10d,r8d
       mov       r11,r10
       cmp       r8d,20
       jl        short M01_L00
       mov       r11d,ecx
       and       r11d,1F
       shr       r11d,1
       mov       eax,r11d
       neg       eax
       add       eax,10
       and       eax,0F
       mov       r11d,eax
M01_L00:
       cmp       r11,4
       jb        short M01_L02
       movzx     r8d,dx
M01_L01:
       add       r11,0FFFFFFFFFFFFFFFC
       lea       rax,[rcx+r9*2]
       movzx     esi,word ptr [rax]
       cmp       r8d,esi
       je        near ptr M01_L08
       movzx     esi,word ptr [rax+2]
       cmp       r8d,esi
       je        near ptr M01_L08
       movzx     esi,word ptr [rax+4]
       cmp       r8d,esi
       je        near ptr M01_L08
       movzx     eax,word ptr [rax+6]
       cmp       r8d,eax
       je        near ptr M01_L08
       add       r9,4
       cmp       r11,4
       jae       short M01_L01
M01_L02:
       test      r11,r11
       je        short M01_L04
       movzx     r8d,dx
       nop       dword ptr [rax+rax]
       nop       dword ptr [rax+rax]
M01_L03:
       dec       r11
       movzx     eax,word ptr [rcx+r9*2]
       cmp       eax,r8d
       je        near ptr M01_L08
       inc       r9
       test      r11,r11
       jne       short M01_L03
M01_L04:
       cmp       r9,r10
       jae       short M01_L07
       mov       r11,r10
       sub       r11,r9
       and       r11,0FFFFFFFFFFFFFFF0
       vxorps    ymm0,ymm0,ymm0
       movzx     r8d,dx
       imul      eax,r8d,10001
       vmovd     xmm1,eax
       vpbroadcastd ymm1,xmm1
       cmp       r9,r11
       jae       short M01_L06
       nop       word ptr [rax+rax]
M01_L05:
       vpcmpeqw  ymm2,ymm1,[rcx+r9*2]
       vpxor     ymm2,ymm0,ymm2
       vptest    ymm2,ymm2
       jne       short M01_L08
       add       r9,10
       cmp       r9,r11
       jb        short M01_L05
M01_L06:
       cmp       r9,r10
       jae       short M01_L07
       vpcmpeqw  ymm2,ymm1,[rcx+r10*2+0FFE0]
       vpxor     ymm0,ymm0,ymm2
       vptest    ymm0,ymm0
       jne       short M01_L08
M01_L07:
       xor       eax,eax
       vzeroupper
       add       rsp,10
       pop       rsi
       ret
M01_L08:
       mov       eax,1
       vzeroupper
       add       rsp,10
       pop       rsi
       ret
; Total bytes of code 314

👉 If this looks good, I'd like to look into IndexOf, etc. too.


ghost · 2022-04-02T21:27:36Z

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

       vpxor     ymm0,ymm0,ymm2
       vptest    ymm0,ymm0
       jne       short M01_L08
M01_L07:
       xor       eax,eax
       vzeroupper
       add       rsp,10
       pop       rsi
       ret
M01_L08:
       mov       eax,1
       vzeroupper
       add       rsp,10
       pop       rsi
       ret
; Total bytes of code 314

👉 If this looks good, I'd like to look into IndexOf, etc. too.

Author:	gfoidl
Assignees:	-
Labels:	`area-System.Memory`, `community-contribution`
Milestone:	-

danmoseley · 2022-04-03T01:58:55Z

It seems we're missing benchmarks for this? (If so can we add yours?)

gfoidl · 2022-04-03T11:28:11Z

> It seems we're missing benchmarks for this? (If so can we add yours?)

Sure 😃 dotnet/performance#2347

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs

EgorBo · 2022-04-03T13:02:39Z

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs

-                    var matches = Vector.Equals(values, LoadVector(ref searchSpace, offset));
-                    if (Vector<byte>.Zero.Equals(matches))
+                    matches = Vector.Equals(values, LoadVector(ref searchSpace, offset));
+                    if (zero.Equals(matches))


Looks like there is a CQ (code quality) issue in this pattern:

(feel free to file an issue)

Moreover, you don't need to hoist it - it should not be used

Ah, vec1 == vec2 emits better code.

For the char-overload:

 M01_L05:
        vpcmpeqw  ymm2,ymm1,[rcx+r9*2]
-       vpxor     ymm2,ymm0,ymm2
        vptest    ymm2,ymm2
        jne       short M01_L08
        add       r9,10
        cmp       r9,r11
        jb        short M01_L05
 M01_L06:
        cmp       r9,r10
        jae       short M01_L07
        vpcmpeqw  ymm2,ymm1,[rcx+r10*2+0FFE0]
-       vpxor     ymm0,ymm0,ymm2
        vptest    ymm0,ymm0
        jne       short M01_L08

Thanks for the hint!
Will create an issue for that --> #67500
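
For readers following the thread, the two ways of writing the "no lane matched" check can be sketched as below. This is only an illustration: `ZeroCheckSketch` and both method names are hypothetical and not part of the runtime. The two forms return the same result; the review comment concerns only the machine code the JIT produced for each at the time.

```csharp
using System;
using System.Numerics;

static class ZeroCheckSketch
{
    // Hypothetical helper: true when no lane of 'data' equals 'value',
    // written with the instance Equals call the PR used at first.
    public static bool NoMatchViaEquals(Vector<byte> data, byte value)
    {
        Vector<byte> matches = Vector.Equals(new Vector<byte>(value), data);
        return Vector<byte>.Zero.Equals(matches);
    }

    // Same check written with the equality operator, as suggested in the review.
    public static bool NoMatchViaOperator(Vector<byte> data, byte value)
    {
        Vector<byte> matches = Vector.Equals(new Vector<byte>(value), data);
        return matches == Vector<byte>.Zero;
    }
}
```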

adamsitnik

LGTM, thank you @gfoidl!

adamsitnik · 2022-04-25T14:37:13Z

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs


            if (Vector.IsHardwareAccelerated && length >= Vector<byte>.Count * 2)
            {
                lengthToExamine = UnalignedCountVector(ref searchSpace);
            }

-        SequentialScan:


Removal of anything goto-related is always welcome 👍

adamsitnik · 2022-04-25T14:43:50Z

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs

@@ -411,10 +413,17 @@ public static bool Contains(ref byte searchSpace, byte value, int length)
                    goto Found;
                }

-                if (offset < (nuint)(uint)length)
+                // The total length is at least Vector<byte>.Count, so instead of falling back to a


thank you for adding the comment (otherwise it would not be obvious to me) 👍
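
The idea behind that code comment can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: `ContainsSketch` is a hypothetical name, it works on a `ReadOnlySpan<byte>` rather than a `ref byte`, and it ignores the alignment handling that the real `SpanHelpers.Contains` performs. After the main vectorized loop, any leftover elements are covered by one more vector read that ends exactly at the end of the search space. That read overlaps elements already examined, which is harmless because the check is idempotent.

```csharp
using System;
using System.Numerics;

static class ContainsSketch
{
    // Hypothetical, simplified variant of the technique discussed in this PR.
    public static bool Contains(ReadOnlySpan<byte> span, byte value)
    {
        int count = Vector<byte>.Count;

        // Too short for a single vector (or no hardware acceleration): scan sequentially.
        if (!Vector.IsHardwareAccelerated || span.Length < count)
        {
            foreach (byte b in span)
            {
                if (b == value) return true;
            }
            return false;
        }

        Vector<byte> values = new Vector<byte>(value);
        int lastVectorStart = span.Length - count; // start of the final full vector
        int offset = 0;

        // Main loop: whole vectors from the front.
        while (offset <= lastVectorStart)
        {
            Vector<byte> matches = Vector.Equals(values, new Vector<byte>(span.Slice(offset)));
            if (matches != Vector<byte>.Zero) return true;
            offset += count;
        }

        // Leftover elements: instead of a sequential tail loop, read one vector
        // that ends at the end of the span. Re-checking some elements is harmless.
        if (offset < span.Length)
        {
            Vector<byte> matches = Vector.Equals(values, new Vector<byte>(span.Slice(lastVectorStart)));
            if (matches != Vector<byte>.Zero) return true;
        }

        return false;
    }
}
```

With a length of 95 and `Vector<byte>.Count == 32`, the loop in this sketch covers elements 0 to 63 and the final read covers elements 63 to 94, so no element is processed sequentially.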

AndyAyersMS · 2022-04-27T17:05:21Z

Did we expect any perf regressions from this? Seems like it might be related to dotnet/perf-autofiling-issues#4884.

gfoidl · 2022-04-30T10:56:50Z

> Did we expect any perf regressions from this?

No regression is expected; if anything, it should be an improvement.

When I check the benchmark code, SpanHelpers.Contains doesn't seem to be hit?

What is a proper way to investigate this regression?

gfoidl added 2 commits April 2, 2022 22:41

Handle final elements in SpanHelpers.Contains(ref byte, byte, int) vectorized

89cc087

Handle final elements in SpanHelpers.Contains(ref char, char, int) vectorized

4763c44

ghost added the community-contribution label Apr 2, 2022

dotnet-issue-labeler bot added the area-System.Memory label Apr 2, 2022

Fixed error SA1028: Code should not contain trailing whitespace

567e692

VS didn't do this for comments (at least in my setup) automatically :-(

gfoidl mentioned this pull request Apr 3, 2022

Added Benchmark for SpanHelpers.Contains {byte, char} dotnet/performance#2347

Open

EgorBo reviewed Apr 3, 2022

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs

EgorBo reviewed Apr 3, 2022

EgorBo approved these changes Apr 3, 2022

Use equality operator instead of Vector<T>.Zero.Equals due to codegen issue Cf. dotnet#67492 (comment)

debc817

gfoidl mentioned this pull request Apr 3, 2022

Vector<T>.Zero.Equals(vec) produces less ideal code than vec == Vector<T>.Zero #67500

Closed

adamsitnik self-assigned this Apr 25, 2022

adamsitnik added the tenet-performance label Apr 25, 2022

adamsitnik approved these changes Apr 25, 2022

adamsitnik merged commit 1958c7e into dotnet:main Apr 25, 2022

gfoidl deleted the spanhelpers_final_elements_opt branch April 25, 2022 15:19

AndyAyersMS mentioned this pull request Apr 27, 2022

[Perf] Changes at 4/25/2022 3:15:54 PM dotnet/perf-autofiling-issues#4884

Closed

ghost locked as resolved and limited conversation to collaborators May 30, 2022
