Description
On amd64, it looks like memclrNoHeapPointers performs better than the code the compiler substitutes for it (a constant-length "Zero" SSA operation) for sizes between 1024 and 2048 bytes. In that range, memclrNoHeapPointers uses AVX2 instructions to do the clear and does not switch to rep stos until the size is at least 2048 bytes, while the compiler generates a rep stos.
There's a comment in the assembly memclrNoHeapPointers about why that's done:
Line 47 in 96a6e14
I put together CL 681496 with some benchmarks for those sizes and ran it on the C3 perf gomotes. I put the results in the change description.
I think in those cases we shouldn't turn a memclrNoHeapPointers call with a constant size into a Zero? Or we could copy what memclrNoHeapPointers does?
Also: I haven't tested this, but there's a branch in memclrNoHeapPointers that behaves differently for clears of 32M or larger, so we may want to investigate that.
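A minimal sketch of the kind of comparison described above. These are not the benchmarks from the CL. The size 1536 and the names clearConst and clearVar are assumptions chosen for illustration, and whether each function actually lowers to a Zero op or to a memclrNoHeapPointers call depends on the compiler version, so check the generated assembly before drawing conclusions.

```go
package main

import (
	"fmt"
	"testing"
)

// size is a hypothetical value picked from inside the 1024-2048 byte range.
const size = 1536

// clearConst zeroes a fixed-size array. The length is a compile-time
// constant, so the compiler is expected to lower this to a Zero op.
//
//go:noinline
func clearConst(p *[size]byte) {
	*p = [size]byte{}
}

// clearVar zeroes a slice whose length is only known at run time. The
// compiler recognizes this loop idiom and is expected to call the
// runtime's memclr routine instead of emitting inline code.
//
//go:noinline
func clearVar(b []byte) {
	for i := range b {
		b[i] = 0
	}
}

func main() {
	var buf [size]byte

	constRes := testing.Benchmark(func(b *testing.B) {
		b.SetBytes(size)
		for i := 0; i < b.N; i++ {
			clearConst(&buf)
		}
	})
	varRes := testing.Benchmark(func(b *testing.B) {
		b.SetBytes(size)
		for i := 0; i < b.N; i++ {
			clearVar(buf[:])
		}
	})

	fmt.Println("constant size:", constRes)
	fmt.Println("variable size:", varRes)
}
```

Running it with go run prints one benchmark result per variant. The numbers will vary by chip, so they are only meaningful when compared on the same machine.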
randall77 commented on Jun 13, 2025
I was just looking at this code in an effort to get rid of things like the DUFFZERO operations. The relevant CL is https://go-review.googlesource.com/c/go/+/678937. Maybe this issue should wait until that code is in, because it is changing how Zero gets implemented.
I found a cutover between using SSE and rep stos at around 1.4KB. I'm sure this is heavily dependent on what particular chip you're on.
We can't use AVX in generated code without a feature check. A feature check isn't very expensive, especially compared to writing more than a kilobyte, but it does mean ~2x the generated code because we need a fallback strategy.
cherrymui commented on Jun 13, 2025
Perhaps we could use rep stos as the fallback, so there will be less code to generate. SSE-only machines may get slower, but most machines, especially the ones performance-sensitive users care about, probably have AVX.
matloob commented on Jun 25, 2025
Sounds good! I'd be interested in helping once the duffzero changes are in.
Would we be able to only generate the AVX instructions when we compile with GOAMD64=v3+?
randall77 commented on Jun 25, 2025
For GOAMD64<v3, we would need a feature check, similar to how we do math/bits.OnesCount today.
It's just a test of a global variable and a branch to fallback code that doesn't use AVX.
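The shape of such a check can be sketched in plain Go. This is only an illustration of the control flow, not what the compiler would emit. The flag hasWideStores and the functions zeroWide and zeroFallback are hypothetical names, and the two paths are ordinary Go loops standing in for the AVX and rep stos sequences.

```go
package main

import "fmt"

// hasWideStores is a hypothetical stand-in for the CPU feature flag that
// generated code would test (for example, "AVX is available"). Real code
// would have it set once at startup from CPU detection.
var hasWideStores = true

// zeroFallback stands in for the fallback sequence (such as rep stos):
// it clears one byte at a time and works on any machine.
func zeroFallback(b []byte) {
	for i := range b {
		b[i] = 0
	}
}

// zeroWide stands in for the AVX path: it clears 32-byte chunks and
// hands any remaining tail to the fallback.
func zeroWide(b []byte) {
	for len(b) >= 32 {
		chunk := b[:32]
		for i := range chunk {
			chunk[i] = 0
		}
		b = b[32:]
	}
	zeroFallback(b)
}

// zero is the dispatch itself: one load of a global, one branch.
func zero(b []byte) {
	if hasWideStores {
		zeroWide(b)
		return
	}
	zeroFallback(b)
}

// allZero reports whether every byte in b is zero.
func allZero(b []byte) bool {
	for _, v := range b {
		if v != 0 {
			return false
		}
	}
	return true
}

func main() {
	for _, enabled := range []bool{true, false} {
		hasWideStores = enabled
		buf := make([]byte, 1500)
		for i := range buf {
			buf[i] = 0xFF
		}
		zero(buf)
		fmt.Println("wide stores:", enabled, "cleared:", allZero(buf))
	}
}
```

Both paths have to exist in the binary, which is where the roughly doubled code size mentioned earlier in the thread comes from.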