Optimized string.Replace(char, char) #67049

Merged
merged 13 commits into from
Aug 17, 2022

Member

@gfoidl gfoidl commented Mar 23, 2022

Description

The vectorized path uses Vector<ushort>, so on x64-cpu with AVX2 available (and enabled) the vectorized path processes 16 chars / ushorts at once.
For the remainder there are [0, Vector<ushort>.Count) elements left which are currently processed in a scalar way.
As the operation is idempotent, we can avoid this scalar processing by processing the final Vector<ushort>.Count elements vectorized once more, even though some of them were already handled. We do this unconditionally to avoid an additional branch.
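
A minimal sketch of the overlapping-tail idea, for illustration only. It is not the runtime's implementation: the class and method names are hypothetical, it works on safe spans rather than refs, and it uses `<` in the loop condition so that a full block at the end is handled once by the final pass.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

public static class OverlappingTailSketch
{
    // Hypothetical helper (not the runtime's code): returns a copy of s with oldChar replaced by newChar.
    public static string ReplaceChars(string s, char oldChar, char newChar)
    {
        ReadOnlySpan<ushort> src = MemoryMarshal.Cast<char, ushort>(s.AsSpan());
        var dst = new ushort[src.Length];
        ushort oldValue = oldChar;
        ushort newValue = newChar;
        int count = Vector<ushort>.Count;

        if (src.Length < count)
        {
            // Too short for a single vector: plain scalar loop.
            for (int j = 0; j < src.Length; j++)
            {
                dst[j] = src[j] == oldValue ? newValue : src[j];
            }
        }
        else
        {
            var oldVec = new Vector<ushort>(oldValue);
            var newVec = new Vector<ushort>(newValue);
            int lastStart = src.Length - count;

            // Full blocks that end strictly before the last block.
            for (int offset = 0; offset < lastStart; offset += count)
            {
                ReplaceBlock(src, dst, offset, oldVec, newVec);
            }

            // Final block: may overlap elements handled above. That is harmless,
            // because it reads from src and writes the same values to dst again.
            ReplaceBlock(src, dst, lastStart, oldVec, newVec);
        }

        ReadOnlySpan<char> chars = MemoryMarshal.Cast<ushort, char>(dst.AsSpan());
        return new string(chars);
    }

    private static void ReplaceBlock(ReadOnlySpan<ushort> src, Span<ushort> dst, int offset,
        Vector<ushort> oldVec, Vector<ushort> newVec)
    {
        var v = new Vector<ushort>(src.Slice(offset));
        Vector<ushort> mask = Vector.Equals(v, oldVec);           // all-ones lanes where v == oldVec
        Vector.ConditionalSelect(mask, newVec, v).CopyTo(dst.Slice(offset));
    }
}
```

With a 16-element vector and a 19-char input, the loop handles offset 0 and the final pass handles offset 3, so elements 3..15 are simply written twice with identical values.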

Also streamlined the beginning of the method by avoiding multiple returns, thus collapsing the epilogs.

asm (excerpt) before
       push      r15
       push      r14
       push      r12
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,20
       vzeroupper
       mov       rsi,rcx
       movzx     edi,dx
       movzx     ebx,r8w
       cmp       edi,ebx
       jne       short M01_L00
       mov       rax,rsi
       vzeroupper
       add       rsp,20
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r12
       pop       r14
       pop       r15
       ret
M01_L00:
       mov       r8,[rsi+8]
       lea       rcx,[r8+0C]
       mov       r8d,[r8+8]
       mov       edx,edi
       call      System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)
       mov       ebp,eax
       test      ebp,ebp
       jge       short M01_L01
       mov       rax,rsi
       vzeroupper
       add       rsp,20
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r12
       pop       r14
       pop       r15
       ret
asm (excerpt) after
       push      r15
       push      r14
       push      r12
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,20
       vzeroupper
       mov       rsi,rcx
       movzx     edi,dx
       movzx     ebx,r8w
       cmp       edi,ebx
       je        short M01_L00
       mov       r8,[rsi+8]
       lea       rcx,[r8+0C]
       mov       r8d,[r8+8]
       mov       edx,edi
       call      System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)
       mov       ebp,eax
       test      ebp,ebp
       jge       short M01_L01
M01_L00:
       mov       rax,rsi
       vzeroupper
       add       rsp,20
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r12
       pop       r14
       pop       r15
       ret

And instead of moving the source-ref and dest-ref forward and decreasing the remainingCount, direct indexing is used -- thus eliding two additions and making the loop bodies smaller.
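
To make the two loop shapes concrete, here is a rough sketch using safe spans. The names are hypothetical, only full vector blocks are handled, and the span bounds checks mean the generated code will not match the asm excerpts; it only illustrates the difference in bookkeeping.

```csharp
using System;
using System.Numerics;

public static class LoopShapeSketch
{
    // "Before" shape: advance source and destination and count down the remaining length.
    public static void AdvanceAndCountDown(ReadOnlySpan<ushort> src, Span<ushort> dst, ushort oldValue, ushort newValue)
    {
        var oldVec = new Vector<ushort>(oldValue);
        var newVec = new Vector<ushort>(newValue);
        int remaining = src.Length;

        while (remaining >= Vector<ushort>.Count)
        {
            var v = new Vector<ushort>(src);
            Vector.ConditionalSelect(Vector.Equals(v, oldVec), newVec, v).CopyTo(dst);

            // Three updates per iteration.
            src = src.Slice(Vector<ushort>.Count);
            dst = dst.Slice(Vector<ushort>.Count);
            remaining -= Vector<ushort>.Count;
        }
    }

    // "After" shape: one index drives both the load and the store.
    public static void SingleIndex(ReadOnlySpan<ushort> src, Span<ushort> dst, ushort oldValue, ushort newValue)
    {
        var oldVec = new Vector<ushort>(oldValue);
        var newVec = new Vector<ushort>(newValue);
        int lastStart = src.Length - Vector<ushort>.Count;

        for (int i = 0; i <= lastStart; i += Vector<ushort>.Count)
        {
            var v = new Vector<ushort>(src.Slice(i));
            Vector.ConditionalSelect(Vector.Equals(v, oldVec), newVec, v).CopyTo(dst.Slice(i));
        }
    }
}
```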

asm (excerpt) before
M01_L03:
       vmovupd   ymm2,[rcx]
       vpcmpeqw  ymm3,ymm2,ymm0
       vpand     ymm4,ymm1,ymm3
       vpandn    ymm2,ymm3,ymm2
       vpor      ymm2,ymm4,ymm2
       vmovupd   [rdx],ymm2
       add       rcx,20
       add       rdx,20
       add       r14d,0FFFFFFF0
       cmp       r14d,10
       jge       short M01_L03
asm (excerpt) after
M01_L03:
       lea       r9,[rcx+rcx]
       vmovupd   ymm2,[rax+r9]
       vpcmpeqw  ymm3,ymm2,ymm0
       vpand     ymm4,ymm1,ymm3
       vpandn    ymm2,ymm3,ymm2
       vpor      ymm2,ymm4,ymm2
       vmovupd   [rdx+r9],ymm2
       add       rcx,10
       cmp       rcx,r8
       jle       short M01_L03

Benchmark

info
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1586 (21H1/May2021Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-preview.2.22153.17
  [Host]     : .NET 7.0.0 (7.0.22.15202), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.15202), X64 RyuJIT
|  Method | Length |     Mean |    Error |   StdDev | Ratio | RatioSD |
|-------- |------- |---------:|---------:|---------:|------:|--------:|
| Default |     16 | 19.26 ns | 0.443 ns | 0.435 ns |  1.00 |    0.00 |
|      PR |     16 | 19.76 ns | 0.252 ns | 0.224 ns |  1.03 |    0.03 |
|         |        |          |          |          |       |         |
| Default |     17 | 19.62 ns | 0.306 ns | 0.287 ns |  1.00 |    0.00 |
|      PR |     17 | 19.77 ns | 0.161 ns | 0.134 ns |  1.01 |    0.02 |
|         |        |          |          |          |       |         |
| Default |     24 | 26.78 ns | 0.326 ns | 0.289 ns |  1.00 |    0.00 |
|      PR |     24 | 20.58 ns | 0.318 ns | 0.282 ns |  0.77 |    0.01 |
|         |        |          |          |          |       |         |
| Default |     31 | 33.39 ns | 0.321 ns | 0.268 ns |  1.00 |    0.00 |
|      PR |     31 | 21.32 ns | 0.346 ns | 0.307 ns |  0.64 |    0.01 |
|         |        |          |          |          |       |         |
| Default |     32 | 21.74 ns | 0.305 ns | 0.271 ns |  1.00 |    0.00 |
|      PR |     32 | 22.38 ns | 0.453 ns | 0.539 ns |  1.03 |    0.03 |

There is quite a huge improvement when some elements are remaining, but a little regression if no or just a few elements are remaining.
But in absolute numbers that regression is tiny.

Open questions

Right now Vector<ushort> is used.
This means that on hardware where 256 bit SIMD is supported 16 elements are processed at once. On hardware where only 128 bit SIMD is supported, it's 8 elements.
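
As a small illustration, the width that Vector<ushort> resolves to on a given machine can be inspected like this (the class and member names are hypothetical; the reported values depend on the hardware and runtime configuration):

```csharp
using System;
using System.Numerics;

public static class VectorWidthInfo
{
    // Number of chars / ushorts processed per Vector<ushort> operation.
    public static int CharsPerVector => Vector<ushort>.Count;

    // Vector width in bits, derived from the element count.
    public static int BitsPerVector => Vector<ushort>.Count * sizeof(ushort) * 8;

    public static string Describe() =>
        $"IsHardwareAccelerated={Vector.IsHardwareAccelerated}, chars per vector={CharsPerVector}, bits={BitsPerVector}";
}
```

Calling `VectorWidthInfo.Describe()` should report 16 chars per vector where 256-bit SIMD is used and 8 where only 128-bit SIMD is available.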

There is an ongoing debate about Vector<T> vs. Vector128<T> / Vector256<T> as xplat-intrinsics, so the question arises: should the code be updated to use Vector128<T> or Vector256<T> instead of Vector<T>?
In my opinion Vector<T> is a good choice here and should be kept because

  • it works xplat with a single code-path
  • with Vector128 it's "only" 8 chars at once, when it could be 16 -- so there would have to be an extra path for Vector256, complicating the code
  • I assume for string.Replace(char, char) a length of >= 16 is quite common, so it's best when AVX2 is available

I checked the codegen for Vector128 and Vector256, it emits the same / similar code (that's why #67039, #67038 got opened).
The only open question is the number of elements processed at once, and the potential warm-up cost etc. that comes with AVX.

@danmoseley
Member

Nice - could you maybe get measurements for longer strings? I would imagine they are fairly common, and this change looks even better there.

I see our perf coverage is missing longer string(s) and probably should have one.
https://github.com/dotnet/performance/blob/7d5a03bc880a27b19cb61449aaa269d1c8f4f8e5/src/benchmarks/micro/libraries/System.Runtime/Perf.String.cs#L157-L160

@dotnet-issue-labeler

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@ghost added the community-contribution label Mar 23, 2022
@gfoidl
Member Author

gfoidl commented Mar 23, 2022

measurements for longer strings?

|  Method | Length |      Mean |    Error |    StdDev |    Median | Ratio | RatioSD |
|-------- |------- |----------:|---------:|----------:|----------:|------:|--------:|
| Default |     16 |  22.82 ns | 0.571 ns |  1.676 ns |  22.41 ns |  1.00 |    0.00 |
|      PR |     16 |  21.82 ns | 0.510 ns |  1.030 ns |  21.70 ns |  0.96 |    0.07 |
|         |        |           |          |           |           |       |         |
| Default |     17 |  21.68 ns | 0.500 ns |  0.632 ns |  21.76 ns |  1.00 |    0.00 |
|      PR |     17 |  21.19 ns | 0.380 ns |  0.337 ns |  21.17 ns |  0.99 |    0.03 |
|         |        |           |          |           |           |       |         |
| Default |     24 |  28.24 ns | 0.317 ns |  0.281 ns |  28.21 ns |  1.00 |    0.00 |
|      PR |     24 |  22.37 ns | 0.362 ns |  0.321 ns |  22.36 ns |  0.79 |    0.01 |
|         |        |           |          |           |           |       |         |
| Default |     31 |  37.02 ns | 0.734 ns |  0.686 ns |  36.93 ns |  1.00 |    0.00 |
|      PR |     31 |  22.95 ns | 0.477 ns |  0.423 ns |  23.03 ns |  0.62 |    0.01 |
|         |        |           |          |           |           |       |         |
| Default |     32 |  22.63 ns | 0.506 ns |  0.497 ns |  22.44 ns |  1.00 |    0.00 |
|      PR |     32 |  23.76 ns | 0.527 ns |  0.493 ns |  23.76 ns |  1.05 |    0.03 |
|         |        |           |          |           |           |       |         |
| Default |    100 |  37.25 ns | 0.736 ns |  0.614 ns |  37.27 ns |  1.00 |    0.00 |
|      PR |    100 |  34.26 ns | 0.628 ns |  0.557 ns |  34.32 ns |  0.92 |    0.01 |
|         |        |           |          |           |           |       |         |
| Default |    500 | 104.41 ns | 2.088 ns |  4.881 ns | 102.73 ns |  1.00 |    0.00 |
|      PR |    500 |  97.96 ns | 1.858 ns |  2.837 ns |  98.46 ns |  0.93 |    0.06 |
|         |        |           |          |           |           |       |         |
| Default |   1000 | 201.91 ns | 4.234 ns | 12.484 ns | 199.50 ns |  1.00 |    0.00 |
|      PR |   1000 | 185.13 ns | 3.775 ns |  7.363 ns | 183.77 ns |  0.92 |    0.07 |

Looks better in general, now only one regression in the len 32 case, but tiny in absolute time.

I see our perf coverage is missing longer string(s) and probably should have one.

Will add longer strings there.

@EgorBo
Member

EgorBo commented Mar 23, 2022

question should the code be update to use Vector128 or Vector256 instead of Vector?

If we remove the Vector<T> path, we'll regress Mono, which supports Vector128 only in LLVM mode.

In my opinion Vector is a good choice here and should be kept as

As you noted - It doesn't handle small inputs well when AVX is available 😢 e.g. #64899 (comment)


- if (firstIndex < 0)
+ int firstIndex;
+ if (oldChar == newChar || (firstIndex = IndexOf(oldChar)) < 0)
Member

Isn't it against the guidelines to perform an assignment inside an if?

Member Author

I don't know, but it gives nice machine code here 😉
It's about collapsing the epilogs for the first checks (oldChar == newChar, and firstIndex < 0).
Maybe I'm overcomplicating this, but the other option would be using goto for this.

Member

At the very least it would be good to have a comment calling out the assignment and why it is being done here.

Otherwise, at a glance, it may look like a potential bug or a comparison using ==.

Member

But, the below is also the more "natural" pattern and more readable:

if (oldChar == newChar)
{
    return this;
}

int firstIndex = IndexOf(oldChar);

if (firstIndex < 0)
{
    return this;
}

Ideally the JIT would handle such a pattern "correctly" and optimize it down accordingly.

Member Author

@tannergooding so what's my action here?

  • add the comment explaining that the epilogs get collapsed in generated machine-code
  • write it more naturally and file an issue for the JIT
  • do both, and write it naturally once the JIT issue is fixed

(I'm leaning towards the last option, for perf reasons -- unless the JIT issue gets fixed for .NET 7 😉)

Member

I think writing it naturally and filing an issue for the JIT is the best choice; I don't expect the cost to be significant here.

If the cost is more significant, then adding a comment calling out the assignment and why as well as filing an issue for the JIT is the next best option.

If the issue is actually being fixed for .NET 7, that's all the more reason to do the first approach.

Member Author

Is the codegen issue tracked by #8883?

Member Author

Code changed here (back to where it was) with 30889ac.
If I read the issue from the previous comment correctly, it should cover that case.

- ref ushort pSrc = ref Unsafe.Add(ref Unsafe.As<char, ushort>(ref _firstChar), copyLength);
- ref ushort pDst = ref Unsafe.Add(ref Unsafe.As<char, ushort>(ref result._firstChar), copyLength);
+ ref ushort pSrc = ref Unsafe.Add(ref Unsafe.As<char, ushort>(ref _firstChar), (nint)(uint)copyLength);
+ ref ushort pDst = ref Unsafe.Add(ref Unsafe.As<char, ushort>(ref result._firstChar), (nint)(uint)copyLength);
Member

off-topic: gosh, the same line with raw pointers is basically

ushort* pDst = ((ushort*)result._firstChar)[copyLength]

"Safe" Unsafe is killing me 🤦‍♂️

Member Author

Same here 🙈 It reads (and writes) like a mess.
Pinning wasn't used here, so I didn't use it either -- I'd also expect a small regression with it.

Member

Might become slightly more readable if the Unsafe.As was the outermost.

We could also expose an internal only Vector.LoadUnsafe(ref T, nuint index) API which would also simplify things here.

// Thus we can eliminate the scalar processing of the remaining elements.
// We perform this operation even if there are 0 elements remaining, as it is cheaper than the
// additional check which would introduce a branch here.

Member

Perhaps it's worth adding an assert here, e.g. Debug.Assert(this.Length - i <= Vector<ushort>.Count), to make sure we won't skip any data?

Member Author

Hm, I think in this case a test should fail?
I'll re-check the tests and make sure that case is covered.

Member Author

Tests cover these cases, so I don't see a need for the Debug.Assert -- but I'll add it of course if you want.

// -------------------- For Vector<ushort>.Count == 8 (SSE2 / ARM NEON) --------------------
[InlineData("Aaaaaaaa", 'A', 'a', "aaaaaaaa")] // Single iteration of vectorised path; no remainders through non-vectorised path
// Three leading 'a's before a match (copyLength > 0), Single iteration of vectorised path; no remainders through non-vectorised path
[InlineData("aaaAaaaaaaa", 'A', 'a', "aaaaaaaaaaa")]
// Single iteration of vectorised path; 3 remainders through non-vectorised path
[InlineData("AaaaaaaaaAa", 'A', 'a', "aaaaaaaaaaa")]
// ------------------------- For Vector<ushort>.Count == 16 (AVX2) -------------------------
[InlineData("AaaaaaaaAaaaaaaa", 'A', 'a', "aaaaaaaaaaaaaaaa")] // Single iteration of vectorised path; no remainders through non-vectorised path
// Three leading 'a's before a match (copyLength > 0), Single iteration of vectorised path; no remainders through non-vectorised path
[InlineData("aaaAaaaaaaaAaaaaaaa", 'A', 'a', "aaaaaaaaaaaaaaaaaaa")]
// Single iteration of vectorised path; 3 remainders through non-vectorised path
[InlineData("AaaaaaaaAaaaaaaaaAa", 'A', 'a', "aaaaaaaaaaaaaaaaaaa")]
// ----------------------------------- General test data -----------------------------------

Member

@EgorBo EgorBo left a comment

LGTM, Thanks! cc @stephentoub @tannergooding

@tannergooding
Member

If we remove Vector path we'll regress Mono that supports Vector128 only in LLVM mode

@fanyang-mono is the work being done in Mono only covering Vector64/128 for ARM64 and only in LLVM or does it also extend to Mono JIT and x86/x64?

@ghost

ghost commented Mar 24, 2022

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Author: gfoidl
Assignees: -
Labels:

area-System.Runtime, community-contribution

Milestone: -

@fanyang-mono
Member

fanyang-mono commented Mar 24, 2022

If we remove Vector path we'll regress Mono that supports Vector128 only in LLVM mode

@fanyang-mono is the work being done in Mono only covering Vector64/128 for ARM64 and only in LLVM or does it also extend to Mono JIT and x86/x64?

On arm64 SIMD is only supported when using the LLVM backend. On x86/x64, SIMD is supported either when LLVM is on or when LLVM is off and using Vector128. The code lives here:
https://github.com/dotnet/runtime/blob/main/src/mono/mono/mini/mini.h#L316-L326

@EgorBo
Member

EgorBo commented Mar 24, 2022

On arm64 SIMD is only supported when using the LLVM backend. On x86/x64, SIMD is supported either when LLVM is on or when LLVM is off and using Vector128. The code lives here: https://github.com/dotnet/runtime/blob/main/src/mono/mono/mini/mini.h#L316-L326

The main question basically is - can we replace Vector<T> implementation with Vector128?
So, from my understanding, on the main Mono targets (mobile) Vector<T> is not supported without LLVM. I therefore assume we can replace it, since with LLVM both Vector<T> and Vector128 are supported everywhere?

@gfoidl
Member Author

gfoidl commented Mar 24, 2022

Besides that, do we have any data about the typical range of string lengths for which chars are replaced (from API metrics, etc.)?
With Vector128 it's only 8 chars at once. That will be good for short strings, but I expect it to regress on longer strings, as Vector256 processes 16 chars at once.

@EgorBo
Member

EgorBo commented Mar 24, 2022

btw, I've just checked - Mono on x64 (Linux) doesn't support Vector128 without LLVM. It prints false for all Sse*.IsSupported

while Vector<T> works just fine

@tannergooding
Member

With Vector128 it's only 8 chars at once. Will be good for short strings, but I expect it will regress on longer strings as Vector256 processes 16 chars at once.

It's going to depend on hardware, among other things.

Typically, well-optimized C/C++ code defines the cutoff as around 128 to 256-bytes (2-4 cache lines) and will optionally use V256 in the cases where the data is bigger than this. A lot of our C# code defines a cutoff of 64-bytes instead (1 cache line).
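
A length-based cutoff of this kind could be sketched as follows. This is illustrative only: the 128-byte threshold, the class name, and the method name are assumptions, not values or code from this PR, and the Vector128/Vector256 helpers used here require .NET 7 or later.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class WidthDispatchSketch
{
    // Hypothetical cutoff; the right value depends on hardware and workload.
    private const int Vector256CutoffBytes = 128;

    public static void Replace(ReadOnlySpan<ushort> src, Span<ushort> dst, ushort oldValue, ushort newValue)
    {
        if (dst.Length < src.Length) throw new ArgumentException("Destination too short.", nameof(dst));

        int i = 0;

        if (Vector256.IsHardwareAccelerated && src.Length * sizeof(ushort) >= Vector256CutoffBytes)
        {
            // Large input: 16 elements per iteration.
            var oldVec = Vector256.Create(oldValue);
            var newVec = Vector256.Create(newValue);
            for (; i <= src.Length - Vector256<ushort>.Count; i += Vector256<ushort>.Count)
            {
                Vector256<ushort> v = Vector256.Create(src.Slice(i));
                Vector256.ConditionalSelect(Vector256.Equals(v, oldVec), newVec, v).CopyTo(dst.Slice(i));
            }
        }
        else if (Vector128.IsHardwareAccelerated)
        {
            // Small input, or no 256-bit support: 8 elements per iteration.
            var oldVec = Vector128.Create(oldValue);
            var newVec = Vector128.Create(newValue);
            for (; i <= src.Length - Vector128<ushort>.Count; i += Vector128<ushort>.Count)
            {
                Vector128<ushort> v = Vector128.Create(src.Slice(i));
                Vector128.ConditionalSelect(Vector128.Equals(v, oldVec), newVec, v).CopyTo(dst.Slice(i));
            }
        }

        // Scalar tail for whatever the vector loops left over (kept simple for the sketch).
        for (; i < src.Length; i++)
        {
            dst[i] = src[i] == oldValue ? newValue : src[i];
        }
    }
}
```

The trade-off is the extra code path, which is the complication raised earlier in the thread as an argument for staying with Vector<T>.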

@tannergooding
Member

-- Where I'm defining "well-optimized" as some of the core memory and string functions provided by the C Runtime for MSVC, GCC, and Clang as they are some of the most critical, most profiled, and most deeply invested in pieces of code that are vectorized across the board.

@fanyang-mono
Member

btw, I've just checked - Mono on x64 (Linux) doesn't support Vector128 without LLVM. It prints false for all Sse*.IsSupported

while Vector<T> works just fine

This is different than what I posted earlier. Let me double check.

@fanyang-mono
Member

Mono on x64 (Linux) doesn't support Vector128 without LLVM, while Vector works just fine.

It is a little confusing to say whether Vector or Vector128 is supported on a specific platform, because Mono is still in the process of adding SIMD intrinsics for the individual methods. Recently, we added some for arm64 and some for amd64.

As far as this PR is concerned, these are the methods for which we would like to know whether they are intrinsified/supported:

Vector.IsHardwareAccelerated
Vector<ushort>.Count
Vector.Equals
Vector.ConditionalSelect

And

Vector128.IsHardwareAccelerated
Vector128<ushort>.Count
Vector128.Equals
Vector128.ConditionalSelect

Currently, on amd64 some SIMD instructions were added for Vector128, but no methods are intrinsified without LLVM. With LLVM, Vector128.Equals and Vector128<ushort>.Count currently emit SIMD intrinsics. Support for Vector128.IsHardwareAccelerated could easily be added as well.

On amd64, Vector.Equals and Vector.ConditionalSelect are not intrinsified yet, but the other two do emit SIMD instructions. This is true both with and without LLVM.

In summary, neither Vector128 nor Vector is currently fully accelerated on Mono as far as this PR is concerned. We should choose based on theoretical merit. Full SIMD intrinsics support will be there one day, hopefully soon.

When the remaining length is a multiple of the vector size, then the remainder is processed twice. This is redundant, and not needed.
This commit changes that, so that the remainder is processed only once when the remaining elements match.
@danmoseley danmoseley closed this Aug 16, 2022
@danmoseley danmoseley reopened this Aug 16, 2022
@danmoseley danmoseley merged commit eca5b44 into dotnet:main Aug 17, 2022
@danmoseley
Member

Thanks @gfoidl

@danmoseley
Member

After discussion with @stephentoub, I'm going to backport this to .NET 7. While not meeting the letter of the criteria, it really should have gone in yesterday, before the snap, but CI was broken; and this was open for months, with a fair bit of waiting on us.

@danmoseley
Member

/backport to release/7.0-rc1

@github-actions
Contributor

Started backporting to release/7.0-rc1: https://github.com/dotnet/runtime/actions/runs/2872287360

github-actions bot pushed a commit that referenced this pull request Aug 17, 2022
@gfoidl gfoidl deleted the string-replace-indempotency branch August 17, 2022 14:00
danmoseley pushed a commit that referenced this pull request Aug 17, 2022
* Optimized string.Replace(char, char) vector code path

* Optimized code pathes even further

* Do vectorized operation at the end of the string only once

When the remaining length is a multiple of the vector size, then the remainder is processed twice. This is redundant, and not needed.
This commit changes that, so that the remainder is processed only once when the remaining elements match.

* Don't use trick for collapsed epilogs

Cf. #67049 (comment)

* Handle remainder vectorized even if remainingLength <= Vector<ushort>.Count and added tests for this

* Introduce (internal) Vector.LoadUnsafe and Vector.StoreUnsafe and use it in string.Replace(char, char)

* Avoid Unsafe.As<char, ushort> reinterpret casts by introducing string.GetRawStringDataAsUshort() internal method

* Fixed copy/paste error (from local dev to repo)

* PR Feedback

* Fixed bug and added tests for this

* Make condition about lengthToExamine clearer as suggested

Co-authored-by: Günther Foidl <[email protected]>
@stephentoub
Member

@gfoidl, at least on my machine, comparing string.Replace in .NET 6 vs .NET 7, multiple examples I've tried have shown .NET 7 to have regressed, e.g.

const string Input = """
    Whose woods these are I think I know.
    His house is in the village though;
    He will not see me stopping here
    To watch his woods fill up with snow.
    My little horse must think it queer
    To stop without a farmhouse near
    Between the woods and frozen lake
    The darkest evening of the year.
    He gives his harness bells a shake
    To ask if there is some mistake.
    The only other sound’s the sweep
    Of easy wind and downy flake.
    The woods are lovely, dark and deep,
    But I have promises to keep,
    And miles to go before I sleep,
    And miles to go before I sleep.
    """;

[Benchmark]
public string Replace() => Input.Replace('I', 'U');
|  Method |  Runtime |     Mean | Ratio |
|-------- |--------- |---------:|------:|
| Replace | .NET 6.0 | 108.1 ns |  1.00 |
| Replace | .NET 7.0 | 136.0 ns |  1.26 |

Do you see otherwise?

@gfoidl
Member Author

gfoidl commented Aug 28, 2022

Hm, that is not expected...

When I duplicate the string.Replace(char, char)-method in order to compare the old and the new implementation, both on .NET 7, then I see

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1889 (21H1/May2021Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-preview.7.22377.5
  [Host]     : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT

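|  Method |     Mean |   Error |  StdDev |   Median | Ratio | RatioSD |
|-------- |---------:|--------:|--------:|---------:|------:|--------:|
| Default | 142.0 ns | 3.48 ns | 9.98 ns | 138.6 ns |  1.00 |    0.00 |
|      PR | 132.9 ns | 2.68 ns | 3.40 ns | 132.8 ns |  0.92 |    0.07 |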

That is the result I'd expect: after the vectorized loop, 6 chars remain, which the old code processes in the for-loop while the new code does one vectorized pass.

I checked the dasm (via DisassemblyDiagnoser of BDN) and that looks OK.

Can this be something from different machine-code layout (loops), PGO, etc. that causes the difference between .NET 6 and .NET 7?
How can I investigate this further -- need some guidance on how to check code-layout please.

@stephentoub
Member

stephentoub commented Aug 28, 2022

Thanks, @gfoidl. Do you see a similar 6 vs 7 difference as I do? (It might not be specific to this PR.) @EgorBo, can you advise?

@tannergooding
Member

When I duplicate the string.Replace(char, char) method in order to compare the old and the new implementation, both on .NET 7, I see

This could be related to stale PGO data

@danmoseley
Member

Is there POGO data en-route that has trained with this change in place? I am not sure how to follow it.

@danmoseley
Member

Also, it wouldn't matter here, but are we consuming POGO data trained on main bits in the release branches?

@stephentoub
Member

stephentoub commented Aug 28, 2022

I don't think this particular case is related to stale PGO data. I set COMPlus_JitDisablePGO=1, and I still see an ~20% regression from .NET 6 to .NET 7.

@danmoseley
Member

danmoseley commented Aug 28, 2022

I ran the example above with

        var config = DefaultConfig.Instance
            .AddJob(Job.Default.WithRuntime(CoreRuntime.Core31).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
            .AddJob(Job.Default.WithRuntime(CoreRuntime.Core60).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
            .AddJob(Job.Default.WithRuntime(CoreRuntime.CreateForNewVersion("net7.0", ".NET 7.0")).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
            .AddJob(Job.Default.WithRuntime(ClrRuntime.Net48).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
            .AddJob(Job.Default.WithRuntime(CoreRuntime.Core31).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0"))
            .AddJob(Job.Default.WithRuntime(CoreRuntime.Core60).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0"))
            .AddJob(Job.Default.WithRuntime(CoreRuntime.CreateForNewVersion("net7.0", ".NET 7.0")).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0").AsBaseline())
            .AddJob(Job.Default.WithRuntime(ClrRuntime.Net48).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0"));
        BenchmarkRunner.Run(typeof(Program).Assembly, args: args, config: config);

and got

BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22000.856/21H2)
Intel Core i7-10510U CPU 1.80GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-rc.2.22426.5
  [Host]     : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2
  Job-DGTURM : .NET 6.0.8 (6.0.822.36306), X64 RyuJIT AVX2
  Job-PYGDYG : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2
  Job-ZEPFOF : .NET Core 3.1.28 (CoreCLR 4.700.22.36202, CoreFX 4.700.22.36301), X64 RyuJIT AVX2
  Job-PSEWWK : .NET Framework 4.8 (4.8.4510.0), X64 RyuJIT VectorSize=256
  Job-WGVIGL : .NET 6.0.8 (6.0.822.36306), X64 RyuJIT AVX2
  Job-HBSVYM : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2
  Job-VWWZUC : .NET Core 3.1.28 (CoreCLR 4.700.22.36202, CoreFX 4.700.22.36301), X64 RyuJIT AVX2
  Job-LDCOEC : .NET Framework 4.8 (4.8.4510.0), X64 RyuJIT VectorSize=256


|  Method |    EnvironmentVariables |            Runtime |     Mean |    Error |   StdDev |   Median | Ratio | RatioSD |   Gen0 | Allocated | Alloc Ratio |
|-------- |------------------------ |------------------- |---------:|---------:|---------:|---------:|------:|--------:|-------:|----------:|------------:|
| Replace | COMPlus_JitDisablePGO=0 |           .NET 6.0 | 130.5 ns |  6.76 ns | 18.51 ns | 124.0 ns |  0.92 |    0.17 | 0.3269 |   1.34 KB |        1.00 |
| Replace | COMPlus_JitDisablePGO=0 |           .NET 7.0 | 144.0 ns |  2.95 ns |  5.54 ns | 142.5 ns |  1.00 |    0.00 | 0.3271 |   1.34 KB |        1.00 |
| Replace | COMPlus_JitDisablePGO=0 |      .NET Core 3.1 | 822.1 ns | 16.09 ns | 23.07 ns | 814.0 ns |  5.69 |    0.31 | 0.3262 |   1.34 KB |        1.00 |
| Replace | COMPlus_JitDisablePGO=0 | .NET Framework 4.8 | 750.2 ns | 28.86 ns | 82.82 ns | 730.3 ns |  4.97 |    0.49 | 0.3262 |   1.34 KB |        1.00 |
| Replace | COMPlus_JitDisablePGO=1 |           .NET 6.0 | 127.1 ns |  2.64 ns |  4.75 ns | 126.4 ns |  0.88 |    0.05 | 0.3269 |   1.34 KB |        1.00 |
| Replace | COMPlus_JitDisablePGO=1 |           .NET 7.0 | 144.5 ns |  2.96 ns |  5.97 ns | 144.1 ns |  1.01 |    0.06 | 0.3271 |   1.34 KB |        1.00 |
| Replace | COMPlus_JitDisablePGO=1 |      .NET Core 3.1 | 936.2 ns | 17.96 ns | 22.06 ns | 931.9 ns |  6.50 |    0.37 | 0.3262 |   1.34 KB |        1.00 |
| Replace | COMPlus_JitDisablePGO=1 | .NET Framework 4.8 | 673.2 ns | 12.41 ns | 23.91 ns | 670.5 ns |  4.68 |    0.23 | 0.3262 |   1.34 KB |        1.00 |

code https://gist.github.com/danmoseley/c31bc023d6ec671efebff7352e3b3251

(should we be surprised that disabling PGO didn't make any difference? perhaps it doesn't exercise this method? cc @AndyAyersMS )

@danmoseley
Member

and just for interest


|  Method |                                EnvironmentVariables |  Runtime |       Mean |    Error |    StdDev |     Median | Ratio | RatioSD |   Gen0 | Allocated | Alloc Ratio |
|-------- |---------------------------------------------------- |--------- |-----------:|---------:|----------:|-----------:|------:|--------:|-------:|----------:|------------:|
| Replace |                             COMPlus_JitDisablePGO=1 | .NET 6.0 |   127.8 ns |  2.55 ns |   5.91 ns |   125.8 ns |  0.95 |    0.05 | 0.3266 |   1.34 KB |        1.00 |
| Replace |                             COMPlus_JitDisablePGO=1 | .NET 7.0 |   141.0 ns |  2.73 ns |   2.42 ns |   141.1 ns |  1.00 |    0.00 | 0.3271 |   1.34 KB |        1.00 |
| Replace |        COMPlus_JitDisablePGO=1,COMPlus_EnableAVX2=0 | .NET 6.0 |   163.9 ns |  3.35 ns |   4.81 ns |   163.8 ns |  1.15 |    0.05 | 0.3269 |   1.34 KB |        1.00 |
| Replace |        COMPlus_JitDisablePGO=1,COMPlus_EnableAVX2=0 | .NET 7.0 |   184.9 ns |  3.59 ns |   4.79 ns |   183.7 ns |  1.32 |    0.05 | 0.3271 |   1.34 KB |        1.00 |
| Replace |         COMPlus_JitDisablePGO=1,COMPlus_EnableAVX=0 | .NET 6.0 |   176.1 ns |  3.44 ns |   4.09 ns |   175.9 ns |  1.25 |    0.03 | 0.3269 |   1.34 KB |        1.00 |
| Replace |         COMPlus_JitDisablePGO=1,COMPlus_EnableAVX=0 | .NET 7.0 |   192.1 ns |  3.81 ns |   4.53 ns |   190.1 ns |  1.37 |    0.05 | 0.3271 |   1.34 KB |        1.00 |
| Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableHWIntrinsic=0 | .NET 6.0 | 1,057.4 ns | 20.95 ns |  40.86 ns | 1,047.2 ns |  7.65 |    0.35 | 0.3262 |   1.34 KB |        1.00 |
| Replace | COMPlus_JitDisablePGO=1,COMPlus_EnableHWIntrinsic=0 | .NET 7.0 |   947.1 ns | 13.34 ns |  11.83 ns |   948.3 ns |  6.72 |    0.15 | 0.3262 |   1.34 KB |        1.00 |
| Replace |        COMPlus_JitDisablePGO=1,COMPlus_EnableSSE3=0 | .NET 6.0 |   496.0 ns | 51.61 ns | 152.17 ns |   463.3 ns |  3.67 |    1.67 | 0.3269 |   1.34 KB |        1.00 |
| Replace |        COMPlus_JitDisablePGO=1,COMPlus_EnableSSE3=0 | .NET 7.0 |   395.3 ns | 14.32 ns |  41.10 ns |   388.4 ns |  2.95 |    0.27 | 0.3271 |   1.34 KB |        1.00 |

@gfoidl
Member Author

gfoidl commented Aug 29, 2022

Do you see a similar 6 vs 7 difference as I do?

Yes (sorry for slow response, was Sunday...).
@danmoseley thanks for your numbers.

This is the machine code I get (from BDN) when I run @danmoseley's benchmark (.NET 7 only). I left some comments in it.

; Program.Replace()
       mov       rcx,1C003C090A0
       mov       rcx,[rcx]
       mov       edx,49
       mov       r8d,55
       jmp       qword ptr [7FFEFA7430C0]
; Total bytes of code 30

; System.String.Replace(Char, Char)
       push      r15
       push      r14
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,28
       vzeroupper
       mov       rsi,rcx
       mov       edi,edx
       mov       ebx,r8d
       movzx     ecx,di
       movzx     r8d,bx
       cmp       ecx,r8d
       je        near ptr M01_L09
       lea       rcx,[rsi+0C]
       mov       r8d,[rsi+8]
       movsx     rdx,di
       call      qword ptr [7FFEFA7433C0]
       mov       ebp,eax
       test      ebp,ebp
       jge       short M01_L00
       mov       rax,rsi                ; uncommon case, could jump to M01_L09 instead
       vzeroupper
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       pop       r15
       ret
M01_L00:
       mov       ecx,[rsi+8]
       sub       ecx,ebp
       mov       r14d,ecx
       mov       ecx,[rsi+8]
       call      System.String.FastAllocateString(Int32)
       mov       r15,rax
       test      ebp,ebp
       jg        near ptr M01_L10       ; should be common path, I don't expect to jump to the end, then back to here
M01_L01:
       mov       eax,ebp
       lea       rax,[rsi+rax*2+0C]
       cmp       [r15],r15b
       mov       edx,ebp
       lea       rdx,[r15+rdx*2+0C]
       xor       ecx,ecx
       cmp       dword ptr [rsi+8],10
       jl        near ptr M01_L07
       movzx     r8d,di
       imul      r8d,10001              ; this is tracked in https://github.com/dotnet/runtime/issues/67038, .NET 6 has the same issue, so no difference expected
       vmovd     xmm0,r8d
       vpbroadcastd ymm0,xmm0           ; should be vpbroadcastb, see comment above
       movzx     r8d,bx
       imul      r8d,10001
       vmovd     xmm1,r8d
       vpbroadcastd ymm1,xmm1           ; vpbroadcastb (see above)
       cmp       r14,10
       jbe       short M01_L03
       add       r14,0FFFFFFFFFFFFFFF0
M01_L02:
       lea       r8,[rax+rcx*2]
       vmovupd   ymm2,[r8]
       vpcmpeqw  ymm3,ymm2,ymm0
       vpand     ymm4,ymm3,ymm1         ; the vpand, vpandn, vpor series should be vpblendvb, https://github.com/dotnet/runtime/issues/67039 tracked this
       vpandn    ymm2,ymm3,ymm2         ; the "duplicated code for string.Replace" method emits vpblendvb as expected, but
       vpor      ymm2,ymm4,ymm2         ; if string.Replace from .NET 7.0.0 (7.0.22.42212) (.NET SDK=7.0.100-rc.2.22426.5) is used, then it's this series
       lea       r8,[rdx+rcx*2]
       vmovupd   [r8],ymm2
       add       rcx,10
       cmp       rcx,r14
       jb        short M01_L02
M01_L03:
       mov       ecx,[rsi+8]
       add       ecx,0FFFFFFF0
       add       rsi,0C
       lea       rsi,[rsi+rcx*2]
       vmovupd   ymm2,[rsi]
       vpcmpeqw  ymm3,ymm2,ymm0
       vpand     ymm0,ymm3,ymm1
       vpandn    ymm1,ymm3,ymm2
       vpor      ymm2,ymm0,ymm1
       lea       rax,[r15+0C]
       lea       rax,[rax+rcx*2]
       vmovupd   [rax],ymm2
       jmp       short M01_L08
M01_L04:
       movzx     r8d,word ptr [rax+rcx*2]
       lea       r9,[rdx+rcx*2]
       movzx     r10d,di
       cmp       r8d,r10d
       je        short M01_L05          ; not relevant for .NET 6 -> .NET 7 comparison in this test-case, but
       jmp       short M01_L06          ; one jump could be avoided?!
M01_L05:
       movzx     r8d,bx
M01_L06:
       mov       [r9],r8w
       inc       rcx
M01_L07:
       cmp       rcx,r14
       jb        short M01_L04
M01_L08:
       mov       rax,r15
       vzeroupper
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       pop       r15
       ret
M01_L09:                                ; except for the mov rax,{r15,rsi} the epilogs are the same, can they be collapsed to
       mov       rax,rsi                ; get less machine code?
       vzeroupper
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       pop       r15
       ret
M01_L10:                                ; this block should be common enough, so should be on the jump-root (see comment above)
       cmp       [r15],r15b             ; it's the Memmove-call
       lea       rcx,[r15+0C]
       lea       rdx,[rsi+0C]
       mov       r8d,ebp
       add       r8,r8
       call      qword ptr [7FFEFA7399F0]
       jmp       near ptr M01_L01
; Total bytes of code 383

So in terms of code layout, one major difference to .NET 6 is that the call to System.Buffer.Memmove is moved out of the hot path.
But I doubt that this alone is the cause of the regression.

I also wonder why vpblendvb is gone when the benchmark uses string.Replace from the .NET bits.
If I use duplicated string.Replace code for the benchmark, then it's emitted, which is what I expect as 10d8a36 got merged on 2022-05-25.
But that shouldn't cause the regression either, as for .NET 6 the same series of vector instructions is emitted.
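
For readers unfamiliar with the two instruction patterns, here is a small sketch of the equivalence being discussed. `SelectSketch` and its method names are hypothetical. The sketch only shows that a select on a comparison mask gives the same result as the and / and-not / or combination (the shape of the vpand, vpandn, vpor series above); which instructions the JIT actually emits for `Vector.ConditionalSelect` depends on the runtime version, as described in this comment.

```csharp
using System.Numerics;

// Hypothetical helpers for illustration only.
static class SelectSketch
{
    // A single select on the mask: the operation a blend-style instruction can do in one step.
    public static Vector<ushort> Select(Vector<ushort> mask, Vector<ushort> ifSet, Vector<ushort> ifClear)
        => Vector.ConditionalSelect(mask, ifSet, ifClear);

    // The same result built from three bitwise operations:
    // (mask & ifSet) keeps the replacement where the mask is all ones,
    // (~mask & ifClear) keeps the original elsewhere, and the OR merges both.
    public static Vector<ushort> SelectViaBitOps(Vector<ushort> mask, Vector<ushort> ifSet, Vector<ushort> ifClear)
        => (mask & ifSet) | (~mask & ifClear);
}
```

With a mask produced by `Vector.Equals`, every element is either all ones or all zeros, so both methods return identical vectors.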

The beginning of the method, right after the prolog, looks different between .NET 6 and .NET 7, although this PR didn't change anything here. I don't expect that this causes the regression, as with the given input the vectorized loop with 33 iterations should be dominant enough (just my feeling, maybe wrong).

So far the "static analysis", but I doubt this is enough.
With Intel VTune I see some results, but with my interpretation the conclusions are just the same as stated in this comment.
I hope some JIT experts can shed some light on this (and give some advice on how to investigate, as I'm eager to learn).

Machine code for .NET 6 (for reference)
; System.String.Replace(Char, Char)
       push      r15
       push      r14
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,28
       vzeroupper
       mov       rsi,rcx
       movzx     edi,dx
       movzx     ebx,r8w
       cmp       edi,ebx
       jne       short M01_L00
       mov       rax,rsi
       vzeroupper
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       pop       r15
       ret
M01_L00:
       lea       rbp,[rsi+0C]
       mov       rcx,rbp
       mov       r14d,[rsi+8]
       mov       r8d,r14d
       mov       edx,edi
       call      System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)
       mov       r15d,eax
       test      r15d,r15d
       jge       short M01_L01
       mov       rax,rsi
       vzeroupper
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       pop       r15
       ret
M01_L01:
       mov       esi,r14d
       sub       esi,r15d
       mov       ecx,r14d
       call      System.String.FastAllocateString(Int32)
       mov       r14,rax
       test      r15d,r15d
       jle       short M01_L02
       cmp       [r14],r14d
       lea       rcx,[r14+0C]
       mov       rdx,rbp
       mov       r8d,r15d
       add       r8,r8
       call      System.Buffer.Memmove(Byte ByRef, Byte ByRef, UIntPtr)
M01_L02:
       movsxd    rax,r15d
       add       rax,rax
       add       rbp,rax
       cmp       [r14],r14d
       lea       rdx,[r14+0C]
       add       rdx,rax
       cmp       esi,10
       jl        short M01_L04
       imul      eax,edi,10001
       vmovd     xmm0,eax
       vpbroadcastd ymm0,xmm0
       imul      eax,ebx,10001
       vmovd     xmm1,eax
       vpbroadcastd ymm1,xmm1
M01_L03:
       vmovupd   ymm2,[rbp]
       vpcmpeqw  ymm3,ymm2,ymm0
       vpand     ymm4,ymm1,ymm3
       vpandn    ymm2,ymm3,ymm2
       vpor      ymm2,ymm4,ymm2
       vmovupd   [rdx],ymm2
       add       rbp,20
       add       rdx,20
       add       esi,0FFFFFFF0
       cmp       esi,10
       jge       short M01_L03
M01_L04:
       test      esi,esi
       jle       short M01_L08
       nop       word ptr [rax+rax]
M01_L05:
       movzx     eax,word ptr [rbp]
       mov       rcx,rdx
       cmp       eax,edi
       je        short M01_L06
       jmp       short M01_L07
M01_L06:
       mov       eax,ebx
M01_L07:
       mov       [rcx],ax
       add       rbp,2
       add       rdx,2
       dec       esi
       test      esi,esi
       jg        short M01_L05
M01_L08:
       mov       rax,r14
       vzeroupper
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       pop       r15
       ret
; Total bytes of code 307

@AndyAyersMS
Member

(should we be surprised that disabling PGO didn't make any difference? perhaps it doesn't exercise this method? cc @AndyAyersMS )

Hard to say without looking deeper -- from the .NET 7 code above I would guess PGO is driving the code layout changes.

For .NET 7 you can use DOTNET_JitDisasm in BDN to obtain the JIT disasm, which will tell you if PGO data was found (at least for the root method).
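
One possible way to pass that variable to the benchmark process is the same `WithEnvironmentVariable` mechanism used in the config earlier in this thread. This is only a sketch under assumptions: the `ReplaceBenchmark` class and its short input are placeholders, the value `Replace` is assumed to be an acceptable method-name filter, and where the JIT output ends up (console or BDN log) may depend on the BDN version. It requires the BenchmarkDotNet package.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

public class ReplaceBenchmark
{
    // Placeholder input; the benchmark in this thread uses a longer text.
    private const string Input = "But I have promises to keep, And miles to go before I sleep.";

    [Benchmark]
    public string Replace() => Input.Replace('I', 'U');
}

public static class Program
{
    public static void Main(string[] args)
    {
        // Assumption: "Replace" selects the methods whose disassembly the JIT should print.
        var config = DefaultConfig.Instance
            .AddJob(Job.Default.WithEnvironmentVariable("DOTNET_JitDisasm", "Replace"));

        BenchmarkRunner.Run(typeof(Program).Assembly, args: args, config: config);
    }
}
```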

@danmoseley
Member

I created #74771 and crudely pasted in the above discussion. Let's please continue there.

@ghost ghost locked as resolved and limited conversation to collaborators Sep 28, 2022
Labels
area-System.Runtime community-contribution Indicates that the PR has been added by a community member