cross3, dot3, scale, bias benchmark (AOS) - scalar always faster than zmath on M1

I'm consistently seeing the scalar version run faster than the zmath version on an M1 Mac, built with `-Doptimize=ReleaseFast`.

Example: `cross3, dot3, scale, bias benchmark (AOS) - scalar version: 0.9780s, zmath version: 1.0045s`

I noticed that the `swizzle` function call actually generates extra CPU instructions. See the [dot4Old function in this godbolt](https://godbolt.org/z/EfM5b5qs9) and toggle between the commented-out line and the one next to it to compare the output.
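For anyone who wants to compare the two forms locally, here is a minimal sketch that puts a swizzle-style wrapper next to a direct `@shuffle`. The `Component` enum and `swizzleSketch` helper are hypothetical stand-ins written for this example, not zmath's actual definitions, and the code assumes Zig 0.11 or later (for `@intFromEnum`). Both functions should return the same lanes, so any difference would show up only in the generated instructions.

```zig
const std = @import("std");

const Vec = @Vector(4, f32);

// Hypothetical component enum for this sketch; not copied from zmath.
const Component = enum(i32) { x = 0, y = 1, z = 2, w = 3 };

// Hypothetical swizzle-style wrapper: the mask is built from comptime
// components and passed straight to @shuffle.
inline fn swizzleSketch(
    v: Vec,
    comptime c0: Component,
    comptime c1: Component,
    comptime c2: Component,
    comptime c3: Component,
) Vec {
    return @shuffle(f32, v, undefined, [4]i32{
        @intFromEnum(c0),
        @intFromEnum(c1),
        @intFromEnum(c2),
        @intFromEnum(c3),
    });
}

// Same permutation (y, z, x, w) through the wrapper...
fn yzxwViaWrapper(v: Vec) Vec {
    return swizzleSketch(v, .y, .z, .x, .w);
}

// ...and with a literal mask passed to @shuffle directly.
fn yzxwViaShuffle(v: Vec) Vec {
    return @shuffle(f32, v, undefined, [4]i32{ 1, 2, 0, 3 });
}

test "wrapper and direct @shuffle produce the same lanes" {
    const v = Vec{ 10.0, 20.0, 30.0, 40.0 };
    const a: [4]f32 = yzxwViaWrapper(v);
    const b: [4]f32 = yzxwViaShuffle(v);
    try std.testing.expectEqualSlices(f32, &a, &b);
}
```

Running this with `zig test` only checks that the results agree; the instruction counts still need to be inspected in the disassembly for the target in question.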

Changing `cross3` to call `@shuffle` directly seems to help the benchmark:
```
pub inline fn cross3(v0: Vec, v1: Vec) Vec {
    // xmm0 = v0.yzxw, xmm1 = v1.zxyw (the w lane is masked out at the end)
    var xmm0 = @shuffle(f32, v0, undefined, [4]i32{ 1, 2, 0, 3 });
    var xmm1 = @shuffle(f32, v1, undefined, [4]i32{ 2, 0, 1, 3 });
    var result = xmm0 * xmm1;
    // Rotate again: xmm0 = v0.zxyw, xmm1 = v1.yzxw
    xmm0 = @shuffle(f32, xmm0, undefined, [4]i32{ 1, 2, 0, 3 });
    xmm1 = @shuffle(f32, xmm1, undefined, [4]i32{ 2, 0, 1, 3 });
    result = result - xmm0 * xmm1;
    // Zero the w lane.
    return andInt(result, f32x4_mask3);
}
```
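As a sanity check that the shuffle-based `cross3` above still computes the right values, here is a minimal standalone sketch that compares it against a plain scalar cross product. It does not import zmath: `cross3Shuffle` and `cross3Scalar` are names made up for this example, and the w lane is zeroed with `@select` as a stand-in for `andInt(result, f32x4_mask3)`. It assumes Zig 0.11 or later (single-argument `@splat`).

```zig
const std = @import("std");

const Vec = @Vector(4, f32);

// Hypothetical standalone version of the shuffle-based cross3.
fn cross3Shuffle(v0: Vec, v1: Vec) Vec {
    const a_yzx = @shuffle(f32, v0, undefined, [4]i32{ 1, 2, 0, 3 });
    const b_zxy = @shuffle(f32, v1, undefined, [4]i32{ 2, 0, 1, 3 });
    const a_zxy = @shuffle(f32, v0, undefined, [4]i32{ 2, 0, 1, 3 });
    const b_yzx = @shuffle(f32, v1, undefined, [4]i32{ 1, 2, 0, 3 });
    const result = a_yzx * b_zxy - a_zxy * b_yzx;
    // Keep x, y, z and force w to zero (stand-in for the andInt mask).
    const keep_xyz = @Vector(4, bool){ true, true, true, false };
    return @select(f32, keep_xyz, result, @as(Vec, @splat(0.0)));
}

// Plain scalar reference implementation.
fn cross3Scalar(a: [3]f32, b: [3]f32) [3]f32 {
    return .{
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    };
}

test "shuffle-based cross3 matches the scalar reference" {
    const simd = cross3Shuffle(
        Vec{ 1.0, 2.0, 3.0, 9.0 },
        Vec{ 4.0, 5.0, 6.0, 9.0 },
    );
    const ref = cross3Scalar(.{ 1.0, 2.0, 3.0 }, .{ 4.0, 5.0, 6.0 });
    try std.testing.expectEqual(ref[0], simd[0]);
    try std.testing.expectEqual(ref[1], simd[1]);
    try std.testing.expectEqual(ref[2], simd[2]);
    try std.testing.expectEqual(@as(f32, 0.0), simd[3]);
}
```

This only confirms correctness; whether it is faster than the `swizzle` version still has to be measured with the benchmark itself.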

I recommend replacing `swizzle` with a direct `@shuffle` everywhere it is used. `dot2` also looks odd, and in general there seem to be a lot of potential perf improvements in zmath.
