-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimized string.Replace(char, char) #67049
Conversation
Nice - could you maybe get measurements for longer strings? I would imagine they are fairly common, and this change looks even better there. I see our perf coverage is missing longer string(s) and probably should have one. |
I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label. |
Looks better in general, now only one regression in the len 32 case, but tiny in absolute time.
Will add longer strings there. |
If we remove Vector path we'll regress Mono that supports Vector128 only in LLVM mode
As you noted - It doesn't handle small inputs well when AVX is available 😢 e.g. #64899 (comment) |
|
||
if (firstIndex < 0) | ||
int firstIndex; | ||
if (oldChar == newChar || (firstIndex = IndexOf(oldChar)) < 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't it against the guidelines to perform an assignment inside an if
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, but it gives nice machine code here 😉
It's about collapsing the epilogs for the first checks (oldChar == newChar
, and firstIndex < 0
).
Maybe I think too complicated now, but the other option would be using goto
for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the very least it would be good to have a comment calling out the assignment why it is being done here.
Otherwise, at a glance it may look like a potential bug or comparison using ==
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But, the below is also the more "natural" pattern and more readable:
if (oldChar == newChar)
{
return this;
}
int firstIndex = IndexOf(oldChar);
if (firstIndex < 0)
{
return this;
}
Ideally the JIT would handle such a pattern "correctly" and optimize it down accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding so what's my action here?
- add the comment explaining that the epilogs get collapsed in generated machine-code
- write it more naturally and file an issue for the JIT
- do both, and write it naturally once the JIT issue is fixed
(I'm leaning towards the last option, for perf-reasons -- except JIT issue will be fixed for .NET 7 😉)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think writing it naturally and filing an issue for the JIT is the best choice and don't expect the cost to be significant here.
If the cost is more significant, then adding a comment calling out the assignment and why as well as filing an issue for the JIT is the next best option.
If the issue is actually being fixed for .NET 7, that's all the more reason to do the first approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the codegen issue tracked by #8883?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code change here (back to where it was) with 30889ac
If I read the issue from the previous comment correct, so this should cover that case.
ref ushort pSrc = ref Unsafe.Add(ref Unsafe.As<char, ushort>(ref _firstChar), copyLength); | ||
ref ushort pDst = ref Unsafe.Add(ref Unsafe.As<char, ushort>(ref result._firstChar), copyLength); | ||
ref ushort pSrc = ref Unsafe.Add(ref Unsafe.As<char, ushort>(ref _firstChar), (nint)(uint)copyLength); | ||
ref ushort pDst = ref Unsafe.Add(ref Unsafe.As<char, ushort>(ref result._firstChar), (nint)(uint)copyLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
off-topic: gosh, the same line with raw pointers is basically
ushort* pDst = ((ushort*)result._firstChar)[copyLength]
"Safe" Unsafe is killing me 🤦♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here 🙈 It reads (and writes) like a mess.
Pinning wasn't used here, so I didn't use it too -- also I expect a little bit of regression then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might become slightly more readable if the Unsafe.As
was the outermost.
We could also expose an internal only Vector.LoadUnsafe(ref T, nuint index)
API which would also simplify things here.
src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs
Outdated
Show resolved
Hide resolved
// Thus we can eliminate the scalar processing of the remaining elements. | ||
// We perform this operation even if there are 0 elements remaining, as it is cheaper than the | ||
// additional check which would introduce a branch here. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps worth adding an assert that current Debug.Assert(this.Length - i <= Vector<ushort>.Count)
to make sure we won't skip any data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I think in this case a test should fail?
I'll re-check the tests and make sure that case is covered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests cover these cases, so I don't see a need for the Debug.Assert -- but I'll add it of course if you want.
runtime/src/libraries/Common/tests/Tests/System/StringTests.cs
Lines 4566 to 4578 in 8ed8517
// -------------------- For Vector<ushort>.Count == 8 (SSE2 / ARM NEON) -------------------- | |
[InlineData("Aaaaaaaa", 'A', 'a', "aaaaaaaa")] // Single iteration of vectorised path; no remainders through non-vectorised path | |
// Three leading 'a's before a match (copyLength > 0), Single iteration of vectorised path; no remainders through non-vectorised path | |
[InlineData("aaaAaaaaaaa", 'A', 'a', "aaaaaaaaaaa")] | |
// Single iteration of vectorised path; 3 remainders through non-vectorised path | |
[InlineData("AaaaaaaaaAa", 'A', 'a', "aaaaaaaaaaa")] | |
// ------------------------- For Vector<ushort>.Count == 16 (AVX2) ------------------------- | |
[InlineData("AaaaaaaaAaaaaaaa", 'A', 'a', "aaaaaaaaaaaaaaaa")] // Single iteration of vectorised path; no remainders through non-vectorised path | |
// Three leading 'a's before a match (copyLength > 0), Single iteration of vectorised path; no remainders through non-vectorised path | |
[InlineData("aaaAaaaaaaaAaaaaaaa", 'A', 'a', "aaaaaaaaaaaaaaaaaaa")] | |
// Single iteration of vectorised path; 3 remainders through non-vectorised path | |
[InlineData("AaaaaaaaAaaaaaaaaAa", 'A', 'a', "aaaaaaaaaaaaaaaaaaa")] | |
// ----------------------------------- General test data ----------------------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thanks! cc @stephentoub @tannergooding
@fanyang-mono is the work being done in Mono only covering Vector64/128 for ARM64 and only in LLVM or does it also extend to Mono JIT and x86/x64? |
Tagging subscribers to this area: @dotnet/area-system-runtime Issue DetailsDescriptionThe vectorized path uses Also streamlined the beginning of the method by avoiding multiple return, thus collapsing the epilogs. asm (excerpt) before push r15
push r14
push r12
push rdi
push rsi
push rbp
push rbx
sub rsp,20
vzeroupper
mov rsi,rcx
movzx edi,dx
movzx ebx,r8w
cmp edi,ebx
jne short M01_L00
mov rax,rsi
vzeroupper
add rsp,20
pop rbx
pop rbp
pop rsi
pop rdi
pop r12
pop r14
pop r15
ret
M01_L00:
mov r8,[rsi+8]
lea rcx,[r8+0C]
mov r8d,[r8+8]
mov edx,edi
call System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)
mov ebp,eax
test ebp,ebp
jge short M01_L01
mov rax,rsi
vzeroupper
add rsp,20
pop rbx
pop rbp
pop rsi
pop rdi
pop r12
pop r14
pop r15
ret asm (excerpt) after push r15
push r14
push r12
push rdi
push rsi
push rbp
push rbx
sub rsp,20
vzeroupper
mov rsi,rcx
movzx edi,dx
movzx ebx,r8w
cmp edi,ebx
je short M01_L00
mov r8,[rsi+8]
lea rcx,[r8+0C]
mov r8d,[r8+8]
mov edx,edi
call System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)
mov ebp,eax
test ebp,ebp
jge short M01_L01
M01_L00:
mov rax,rsi
vzeroupper
add rsp,20
pop rbx
pop rbp
pop rsi
pop rdi
pop r12
pop r14
pop r15
ret And instead of moving the source-ref, dest-ref forward, and decreasing the reaminingCount, direct indexing is used -- thus eliding two additions and making the loop bodies smaller. asm (excerpt) beforeM01_L03:
vmovupd ymm2,[rcx]
vpcmpeqw ymm3,ymm2,ymm0
vpand ymm4,ymm1,ymm3
vpandn ymm2,ymm3,ymm2
vpor ymm2,ymm4,ymm2
vmovupd [rdx],ymm2
add rcx,20
add rdx,20
add r14d,0FFFFFFF0
cmp r14d,10
jge short M01_L03 asm (excerpt) afterM01_L03:
lea r9,[rcx+rcx]
vmovupd ymm2,[rax+r9]
vpcmpeqw ymm3,ymm2,ymm0
vpand ymm4,ymm1,ymm3
vpandn ymm2,ymm3,ymm2
vpor ymm2,ymm4,ymm2
vmovupd [rdx+r9],ymm2
add rcx,10
cmp rcx,r8
jle short M01_L03 BenchmarkinfoBenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1586 (21H1/May2021Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-preview.2.22153.17
[Host] : .NET 7.0.0 (7.0.22.15202), X64 RyuJIT
DefaultJob : .NET 7.0.0 (7.0.22.15202), X64 RyuJIT
There is quite a huge improvement when some elements are remaining, but a litte regression if no or just a few elements are remaining. Open questionsRight now There is the debate / consideration about
I checked the codegen for Vector128 and Vector256, it emits the same / similar code (that's why #67039, #67038 got opened).
|
On arm64 SIMD is only supported when using the LLVM backend. On x86/x64, SIMD is supported either when LLVM is on or when LLVM is off and using Vector128. The code lives here: |
The main question basically is - can we replace |
Besides that do we have any data about typical ranges of string length where chars should be replaced? (from api metrics, etc) |
btw, I've just checked - Mono on x64 (Linux) doesn't support Vector128 without LLVM. It prints false for all Sse*.IsSupported while |
It's going to depend on hardware, among other things. Typically, well-optimized C/C++ code defines the cutoff as around 128 to 256-bytes (2-4 cache lines) and will optionally use V256 in the cases where the data is bigger than this. A lot of our C# code defines a cutoff of 64-bytes instead (1 cache line). |
-- Where I'm defining "well-optimized" as some of the core memory and string functions provided by the C Runtime for MSVC, GCC, and Clang as they are some of the most critical, most profiled, and most deeply invested in pieces of code that are vectorized across the board. |
This is different than what I posted earlier. Let me double check. |
It is a little bit confusing to say that if Vector or Vector128 is supported on a specific platform. Because mono is still in the process of adding SIMD intrinsics for the methods. Recently, we just added some for arm64 and some for amd64. Regarding to what is concerned to this PR, these are the methods that we would like to know if they are intrinsified/supported:
And
Currently, on amd64 some SIMD instructions were added for Vector128 but there isn't any methods being intrinsified without LLVM. With LLVM, On amd64, In summary, currently, neither |
When the remaining length is a multiple of the vector size, then the remainder is processed twice. This is redundant, and not needed. This commit changes that, so that the remainder is processed only once when the remaining elements match.
src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs
Outdated
Show resolved
Hide resolved
Thanks @gfoidl |
After discussion with @stephentoub , I'm going to backport this to .NET 7. While not meeting the letter of the criteria, it really should have gone in yesterday, before the snap, but CI was broken; and this was open for months, a with fair bit of waiting on us. |
/backport to release/7.0-rc1 |
Started backporting to release/7.0-rc1: https://github.com/dotnet/runtime/actions/runs/2872287360 |
* Optimized string.Replace(char, char) vector code path * Optimized code pathes even further * Do vectorized operation at the end of the string only once When the remaining length is a multiple of the vector size, then the remainder is processed twice. This is redundant, and not needed. This commit changes that, so that the remainder is processed only once when the remaining elements match. * Don't use trick for collapsed epilogs Cf. #67049 (comment) * Handle remainder vectorized even if remainingLength <= Vector<ushort>.Count and added tests for this * Introduce (internal) Vector.LoadUnsafe and Vector.StoreUnsafe and use it in string.Replace(char, char) * Avoid Unsafe.As<char, ushort> reinterpret casts by introducing string.GetRawStringDataAsUshort() internal method * Fixed copy/paste error (from local dev to repo) * PR Feedback * Fixed bug and added tests for this * Make condition about lengthToExamine clearer as suggested Co-authored-by: Günther Foidl <[email protected]>
@gfoidl, at least on my machine, comparing string.Replace in .NET 6 vs .NET 7, multiple examples I've tried have shown .NET 7 to have regressed, e.g. const string Input = """
Whose woods these are I think I know.
His house is in the village though;
He will not see me stopping here
To watch his woods fill up with snow.
My little horse must think it queer
To stop without a farmhouse near
Between the woods and frozen lake
The darkest evening of the year.
He gives his harness bells a shake
To ask if there is some mistake.
The only other sound’s the sweep
Of easy wind and downy flake.
The woods are lovely, dark and deep,
But I have promises to keep,
And miles to go before I sleep,
And miles to go before I sleep.
""";
[Benchmark]
public string Replace() => Input.Replace('I', 'U');
Do you see otherwise? |
Hm, that is not expected... When i duplicate the BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1889 (21H1/May2021Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-preview.7.22377.5
[Host] : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT
DefaultJob : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT
so a result I'd expect, as after the vectorized loop 6 chars are remaining that the old-code processes in the for-loop whilst the new-code does one vectorized pass. I checked the dasm (via DisassemblyDiagnoser of BDN) and that looks OK. Can this be something from different machine-code layout (loops), PGO, etc. that causes the difference between .NET 6 and .NET 7? |
This could be related to stale PGO data |
Is there POGO data en-route that has trained with this change in place? I am not sure how to follow it. |
Also, it wouldn't matter here, but are we consuming POGO data trained on main bits in the release branches? |
I don't think this particular case is related to stale PGO data. I set |
I ran the example above with var config = DefaultConfig.Instance
.AddJob(Job.Default.WithRuntime(CoreRuntime.Core31).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
.AddJob(Job.Default.WithRuntime(CoreRuntime.Core60).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
.AddJob(Job.Default.WithRuntime(CoreRuntime.CreateForNewVersion("net7.0", ".NET 7.0")).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
.AddJob(Job.Default.WithRuntime(ClrRuntime.Net48).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
.AddJob(Job.Default.WithRuntime(CoreRuntime.Core31).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0"))
.AddJob(Job.Default.WithRuntime(CoreRuntime.Core60).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0"))
.AddJob(Job.Default.WithRuntime(CoreRuntime.CreateForNewVersion("net7.0", ".NET 7.0")).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0").AsBaseline())
.AddJob(Job.Default.WithRuntime(ClrRuntime.Net48).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0"));
BenchmarkRunner.Run(typeof(Program).Assembly, args: args, config: config); and got
code https://gist.github.com/danmoseley/c31bc023d6ec671efebff7352e3b3251 (should we be surprised that disabling PGO didn't make any difference? perhaps it doesn't exercise this method? cc @AndyAyersMS ) |
and just for interest
|
Yes (sorry for slow response, was Sunday...). This is the machine code I get (from BDN) when run @danmoseley's benchmark (.NET 7 only). Left there some comments. ; Program.Replace()
mov rcx,1C003C090A0
mov rcx,[rcx]
mov edx,49
mov r8d,55
jmp qword ptr [7FFEFA7430C0]
; Total bytes of code 30
; System.String.Replace(Char, Char)
push r15
push r14
push rdi
push rsi
push rbp
push rbx
sub rsp,28
vzeroupper
mov rsi,rcx
mov edi,edx
mov ebx,r8d
movzx ecx,di
movzx r8d,bx
cmp ecx,r8d
je near ptr M01_L09
lea rcx,[rsi+0C]
mov r8d,[rsi+8]
movsx rdx,di
call qword ptr [7FFEFA7433C0]
mov ebp,eax
test ebp,ebp
jge short M01_L00
mov rax,rsi ; uncommon case, could jump to M01_L09 instead
vzeroupper
add rsp,28
pop rbx
pop rbp
pop rsi
pop rdi
pop r14
pop r15
ret
M01_L00:
mov ecx,[rsi+8]
sub ecx,ebp
mov r14d,ecx
mov ecx,[rsi+8]
call System.String.FastAllocateString(Int32)
mov r15,rax
test ebp,ebp
jg near ptr M01_L10 ; should be common path, I don't expect to jump to the end, then back to here
M01_L01:
mov eax,ebp
lea rax,[rsi+rax*2+0C]
cmp [r15],r15b
mov edx,ebp
lea rdx,[r15+rdx*2+0C]
xor ecx,ecx
cmp dword ptr [rsi+8],10
jl near ptr M01_L07
movzx r8d,di
imul r8d,10001 ; this is tracked in https://github.com/dotnet/runtime/issues/67038, .NET 6 has the same issue, so no difference expected
vmovd xmm0,r8d
vpbroadcastd ymm0,xmm0 ; should be vpbroadcastb, see comment above
movzx r8d,bx
imul r8d,10001
vmovd xmm1,r8d
vpbroadcastd ymm1,xmm1 ; vpbroadcastb (see above)
cmp r14,10
jbe short M01_L03
add r14,0FFFFFFFFFFFFFFF0
M01_L02:
lea r8,[rax+rcx*2]
vmovupd ymm2,[r8]
vpcmpeqw ymm3,ymm2,ymm0
vpand ymm4,ymm3,ymm1 ; the vpand, vpandn, vpor series should be vpblendvb, https://github.com/dotnet/runtime/issues/67039 tracked this
vpandn ymm2,ymm3,ymm2 ; the "duplicated code for string.Replace" method emits vpblendvb as expected, but
vpor ymm2,ymm4,ymm2 ; if string.Replace from .NET 7.0.0 (7.0.22.42212) (.NET SDK=7.0.100-rc.2.22426.5) is used, then it's this series
lea r8,[rdx+rcx*2]
vmovupd [r8],ymm2
add rcx,10
cmp rcx,r14
jb short M01_L02
M01_L03:
mov ecx,[rsi+8]
add ecx,0FFFFFFF0
add rsi,0C
lea rsi,[rsi+rcx*2]
vmovupd ymm2,[rsi]
vpcmpeqw ymm3,ymm2,ymm0
vpand ymm0,ymm3,ymm1
vpandn ymm1,ymm3,ymm2
vpor ymm2,ymm0,ymm1
lea rax,[r15+0C]
lea rax,[rax+rcx*2]
vmovupd [rax],ymm2
jmp short M01_L08
M01_L04:
movzx r8d,word ptr [rax+rcx*2]
lea r9,[rdx+rcx*2]
movzx r10d,di
cmp r8d,r10d
je short M01_L05 ; not relevant for .NET 6 -> .NET 7 comparison in this test-case, but
jmp short M01_L06 ; one jump could be avoided?!
M01_L05:
movzx r8d,bx
M01_L06:
mov [r9],r8w
inc rcx
M01_L07:
cmp rcx,r14
jb short M01_L04
M01_L08:
mov rax,r15
vzeroupper
add rsp,28
pop rbx
pop rbp
pop rsi
pop rdi
pop r14
pop r15
ret
M01_L09: ; expect the mov rax,{r15,rsi} the epilogs are the same, can they be collapsed to
mov rax,rsi ; get less machine code?
vzeroupper
add rsp,28
pop rbx
pop rbp
pop rsi
pop rdi
pop r14
pop r15
ret
M01_L10: ; this block should be common enough, so should be on the jump-root (see comment above)
cmp [r15],r15b ; it's the Memmove-call
lea rcx,[r15+0C]
lea rdx,[rsi+0C]
mov r8d,ebp
add r8,r8
call qword ptr [7FFEFA7399F0]
jmp near ptr M01_L01
; Total bytes of code 383 So from code-layout one major difference to .NET 6 is that the call to I also wonder why The beginning of the method, right after the prolog, looks different between .NET 6 and .NET 7, although this PR didn't change anything here. I don't expect that this causes the regression, as with the given input the vectorized loop with 33 iterations should be dominant enough (just my feeling, maybe wrong). So far the "static analysis", but I doubt this is enough. Machine code for .NET 6 (for reference); System.String.Replace(Char, Char)
push r15
push r14
push rdi
push rsi
push rbp
push rbx
sub rsp,28
vzeroupper
mov rsi,rcx
movzx edi,dx
movzx ebx,r8w
cmp edi,ebx
jne short M01_L00
mov rax,rsi
vzeroupper
add rsp,28
pop rbx
pop rbp
pop rsi
pop rdi
pop r14
pop r15
ret
M01_L00:
lea rbp,[rsi+0C]
mov rcx,rbp
mov r14d,[rsi+8]
mov r8d,r14d
mov edx,edi
call System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)
mov r15d,eax
test r15d,r15d
jge short M01_L01
mov rax,rsi
vzeroupper
add rsp,28
pop rbx
pop rbp
pop rsi
pop rdi
pop r14
pop r15
ret
M01_L01:
mov esi,r14d
sub esi,r15d
mov ecx,r14d
call System.String.FastAllocateString(Int32)
mov r14,rax
test r15d,r15d
jle short M01_L02
cmp [r14],r14d
lea rcx,[r14+0C]
mov rdx,rbp
mov r8d,r15d
add r8,r8
call System.Buffer.Memmove(Byte ByRef, Byte ByRef, UIntPtr)
M01_L02:
movsxd rax,r15d
add rax,rax
add rbp,rax
cmp [r14],r14d
lea rdx,[r14+0C]
add rdx,rax
cmp esi,10
jl short M01_L04
imul eax,edi,10001
vmovd xmm0,eax
vpbroadcastd ymm0,xmm0
imul eax,ebx,10001
vmovd xmm1,eax
vpbroadcastd ymm1,xmm1
M01_L03:
vmovupd ymm2,[rbp]
vpcmpeqw ymm3,ymm2,ymm0
vpand ymm4,ymm1,ymm3
vpandn ymm2,ymm3,ymm2
vpor ymm2,ymm4,ymm2
vmovupd [rdx],ymm2
add rbp,20
add rdx,20
add esi,0FFFFFFF0
cmp esi,10
jge short M01_L03
M01_L04:
test esi,esi
jle short M01_L08
nop word ptr [rax+rax]
M01_L05:
movzx eax,word ptr [rbp]
mov rcx,rdx
cmp eax,edi
je short M01_L06
jmp short M01_L07
M01_L06:
mov eax,ebx
M01_L07:
mov [rcx],ax
add rbp,2
add rdx,2
dec esi
test esi,esi
jg short M01_L05
M01_L08:
mov rax,r14
vzeroupper
add rsp,28
pop rbx
pop rbp
pop rsi
pop rdi
pop r14
pop r15
ret
; Total bytes of code 307 |
Hard to say without looking deeper -- from the .NET 7 code above I would guess PGO is driving the code layout changes. For the .NET 7 you can use DOTNET_JitDIsasm in BDN to obtain the jit disasm which will tell you if there was PGO found (at least for the root method). |
I created #74771 and crudely pasted in the above discussion. Let's please continue there. |
Description
The vectorized path uses
Vector<ushort>
, so on x64-cpu with AVX2 available (and enabled) the vectorized path processes 16 chars / ushorts at once.For the remainder there are
[0, Vector<ushort>.Count)
elements left which are currently processed in a scalar way.As the operation is idempotent, we can avoid this scalar processing by just processing the final Vector.Count elements vectorized again. We do this anyway to avoid an additional branch.
Also streamlined the beginning of the method by avoiding multiple return, thus collapsing the epilogs.
asm (excerpt) before
asm (excerpt) after
And instead of moving the source-ref, dest-ref forward, and decreasing the reaminingCount, direct indexing is used -- thus eliding two additions and making the loop bodies smaller.
asm (excerpt) before
asm (excerpt) after
Benchmark
info
There is quite a huge improvement when some elements are remaining, but a litte regression if no or just a few elements are remaining.
But in absolute numbers that regression is tiny.
Open questions
Right now
Vector<ushort>
is used.This means that on hardware where 256 bit SIMD is supported 16 elements are processed at once. On hardware where only 128 bit SIMD is supported, it's 8 elements.
There is the debate / consideration about
Vector<T>
,Vector128<T>
,Vector256<T>
as xplat-intrinsics around, so here arises the question should the code be update to useVector128<T>
orVector256<T>
instead ofVector<T>
?In my opinion
Vector<T>
is a good choice here and should be kept asstring.Replace(char, char)
a length of >= 16 is quite common, so it's best when AVX2 is availableI checked the codegen for Vector128 and Vector256, it emits the same / similar code (that's why #67039, #67038 got opened).
The only question is for the count of elements processed at once, and potential warm-up, etc. that comes with AVX.