cmd/compile: on amd64, compiler's Zero operation is slower than memclrNoHeapPointers for 1024-2048 bytes #74171

Open
@matloob (Contributor)

cc @randall77 @mknyszek

On amd64, memclrNoHeapPointers appears to perform better than the code the compiler substitutes for it when the length is constant (the "Zero" SSA operation), for sizes between 1024 and 2048 bytes. In that range, memclrNoHeapPointers uses AVX2 instructions to do the clear and does not switch to rep stos until the size is at least 2048 bytes, while the compiler generates a rep stos.

There's a comment in the assembly memclrNoHeapPointers about why that's done:

// If the size is less than 2kb, do not use ERMS as it has a big start-up cost.

I put together CL 681496 with some benchmarks for those sizes and ran it on the C3 perf gomotes. I put the results in the change description.
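For readers who want to try a comparison locally, here is a rough sketch. It is not the benchmark from CL 681496; the function names and the 1536-byte size are picked here for illustration. The assumption is that the constant-size array assignment is lowered to a Zero operation, while the slice clear with a run-time length goes through memclrNoHeapPointers. Checking the generated assembly is the way to confirm that for a given toolchain.

```go
package main

import (
	"fmt"
	"testing"
)

// buf is package-level so the clears cannot be optimized away.
var buf [2048]byte

// clearConst1536 zeroes a constant-size region. The assumption is that
// the compiler lowers this to a constant-length Zero operation.
//
//go:noinline
func clearConst1536(p *[1536]byte) {
	*p = [1536]byte{}
}

// clearVar zeroes a region whose length is only known at run time, so
// the clear is assumed to call the runtime's memclrNoHeapPointers.
//
//go:noinline
func clearVar(b []byte) {
	clear(b)
}

func main() {
	n := 1536
	constRes := testing.Benchmark(func(b *testing.B) {
		p := (*[1536]byte)(buf[:1536])
		for i := 0; i < b.N; i++ {
			clearConst1536(p)
		}
	})
	varRes := testing.Benchmark(func(b *testing.B) {
		s := buf[:n]
		for i := 0; i < b.N; i++ {
			clearVar(s)
		}
	})
	fmt.Println("constant size:", constRes)
	fmt.Println("variable size:", varRes)
}
```

This needs Go 1.21 or later for the clear builtin. Results will vary by CPU, so numbers from one machine should not be read as general.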

I think in those cases we shouldn't turn a memclrNoHeapPointers call with a constant size into a Zero? Or we could copy what memclrNoHeapPointers does?

Also: I haven't tested this, but there's a branch in memclrNoHeapPointers that will behave differently for clears of 32M or larger, so we may want to investigate that.

Activity

added the Implementation label (Issues describing a semantics-preserving change to the Go implementation.) on Jun 13, 2025
randall77 (Contributor) commented on Jun 13, 2025

I was just looking at this code in an effort to get rid of things like the DUFFZERO operations. The relevant CL is https://go-review.googlesource.com/c/go/+/678937. Maybe this issue should wait until that code is in, because it is changing how Zero gets implemented.

I found a cutover between using SSE and rep stos at around 1.4KB. I'm sure this is heavily dependent on what particular chip you're on.

We can't use AVX in generated code without a feature check. A feature check isn't very expensive, especially compared to writing more than a kilobyte, but it does mean ~2x the generated code because we need a fallback strategy.

cherrymui (Member) commented on Jun 13, 2025

> We can't use AVX in generated code without a feature check. A feature check isn't very expensive, especially compared to writing more than a kilobyte, but it does mean ~2x the generated code because we need a fallback strategy.

Perhaps we could use rep stos as the fallback, so there is less code to generate. SSE-only machines may get slower, but most machines, especially the ones performance-sensitive users care about, probably have AVX.

added the NeedsInvestigation label (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.) on Jun 13, 2025
added this to the Backlog milestone on Jun 13, 2025
modified the milestones: Backlog, Go1.26 on Jun 18, 2025
moved this from In Progress to Todo in Go Compiler / Runtime on Jun 18, 2025
matloob (Contributor, Author) commented on Jun 25, 2025

Sounds good! I'd be interested in helping once the duffzero changes are in.

Would we be able to only generate the AVX instructions when we compile with GOAMD64=v3+?

randall77 (Contributor) commented on Jun 25, 2025

For GOAMD64<v3, we would need a feature check. Similar to how we do math/bits.OnesCount today.
It's just a test of a global variable and a branch to fallback code that doesn't use AVX.
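The shape of that check can be sketched in plain Go. This is only an illustration of the control flow, not what the compiler would emit. The flag hasWideStores and all three functions are hypothetical names; in real generated code the flag would be a runtime global set from CPU detection, the wide path would use AVX stores, and the fallback would be something like rep stos.

```go
package main

import "fmt"

// hasWideStores is a hypothetical stand-in for a CPU feature flag that
// would be set once at startup. It is hard-coded here for illustration.
var hasWideStores = false

// zeroWide stands in for the AVX path: it clears 32 bytes per step.
func zeroWide(p *[1536]byte) {
	for i := 0; i < len(p); i += 32 {
		*(*[32]byte)(p[i : i+32]) = [32]byte{}
	}
}

// zeroFallback stands in for the path that does not need AVX.
func zeroFallback(p *[1536]byte) {
	*p = [1536]byte{}
}

// zero1536 tests the global flag and branches to one of the two paths.
// Both paths must exist in the binary, which is where the extra code
// size comes from.
func zero1536(p *[1536]byte) {
	if hasWideStores {
		zeroWide(p)
		return
	}
	zeroFallback(p)
}

func main() {
	var buf [1536]byte
	for i := range buf {
		buf[i] = 0xff
	}
	zero1536(&buf)
	fmt.Println("all zero:", buf == [1536]byte{})
}
```

With GOAMD64=v3 or higher the flag test and the fallback could presumably be dropped, since AVX2 is part of that baseline.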

Metadata

Labels: Implementation, NeedsInvestigation, Performance, compiler/runtime
Status: Todo
Participants: @mknyszek, @randall77, @gopherbot, @cherrymui, @matloob