I'm consistently seeing the scalar version run faster than the zmath version on an M1 Mac, built with -Doptimize=ReleaseFast.
Example: cross3, dot3, scale, bias benchmark (AOS): scalar version 0.9780s, zmath version 1.0045s.
I noticed that calling the 'swizzle' function generates extra CPU instructions: see the dot4Old function in this godbolt, and try switching between the commented-out line and the one next to it.
Changing cross3 to call @shuffle directly seems to help the benchmark:
pub inline fn cross3(v0: Vec, v1: Vec) Vec {
    // Rotate lanes to (y, z, x, w) and (z, x, y, w); the w lane is masked off at the end.
    var xmm0 = @shuffle(f32, v0, undefined, [4]i32{ 1, 2, 0, 3 });
    var xmm1 = @shuffle(f32, v1, undefined, [4]i32{ 2, 0, 1, 3 });
    var result = xmm0 * xmm1;
    // Rotate again to get (z, x, y, w) and (y, z, x, w) for the subtracted term.
    xmm0 = @shuffle(f32, xmm0, undefined, [4]i32{ 1, 2, 0, 3 });
    xmm1 = @shuffle(f32, xmm1, undefined, [4]i32{ 2, 0, 1, 3 });
    result = result - xmm0 * xmm1;
    return andInt(result, f32x4_mask3);
}
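For anyone who wants to sanity-check the shuffle masks without pulling in zmath, here is a minimal standalone sketch. It is not the zmath implementation: `Vec` is assumed to be a plain `@Vector(4, f32)`, the names `cross3Shuffle` and `cross3Scalar` are made up for this example, and the final `andInt(result, f32x4_mask3)` is replaced by writing zero into the w lane directly.

```zig
const std = @import("std");

// Assumption: stands in for zmath's 4-lane f32 vector type.
const Vec = @Vector(4, f32);

// Hypothetical standalone version of the shuffle-based cross product.
// The zmath mask (andInt with f32x4_mask3) is swapped for a plain lane write.
fn cross3Shuffle(v0: Vec, v1: Vec) Vec {
    var xmm0 = @shuffle(f32, v0, undefined, [4]i32{ 1, 2, 0, 3 });
    var xmm1 = @shuffle(f32, v1, undefined, [4]i32{ 2, 0, 1, 3 });
    var result = xmm0 * xmm1;
    xmm0 = @shuffle(f32, xmm0, undefined, [4]i32{ 1, 2, 0, 3 });
    xmm1 = @shuffle(f32, xmm1, undefined, [4]i32{ 2, 0, 1, 3 });
    result = result - xmm0 * xmm1;
    result[3] = 0.0; // clear the w lane
    return result;
}

// Plain scalar reference to compare against.
fn cross3Scalar(a: [3]f32, b: [3]f32) [3]f32 {
    return [3]f32{
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    };
}

test "shuffle-based cross3 agrees with the scalar reference" {
    const r = cross3Shuffle(Vec{ 1.0, 2.0, 3.0, 7.0 }, Vec{ 4.0, 5.0, 6.0, 9.0 });
    const s = cross3Scalar([3]f32{ 1.0, 2.0, 3.0 }, [3]f32{ 4.0, 5.0, 6.0 });
    try std.testing.expectEqual(s[0], r[0]);
    try std.testing.expectEqual(s[1], r[1]);
    try std.testing.expectEqual(s[2], r[2]);
    try std.testing.expectEqual(@as(f32, 0.0), r[3]);
}
```

Running this with `zig test` only checks correctness of the lane ordering; it says nothing about the generated instructions, which still need to be compared in the disassembly.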
I recommend making this change everywhere swizzle is used. dot2 also looks odd, and there seem to be a lot of potential performance improvements across zmath.