Mir based implementation #11
Comments
The main loops of Mir kernels are auto-vectorized and unrolled by LDC. In the listing below, all `*ps` instructions operate on SIMD vectors:
LBB2_14:
vmovups (%rsi,%rdi,4), %ymm8
vmulps %ymm15, %ymm8, %ymm9
vfmadd231ps (%rcx,%rdi,4), %ymm2, %ymm9 ; vectorized FMA
vmovups %ymm9, (%rcx,%rdi,4)
vmulps %ymm8, %ymm8, %ymm8
vmulps %ymm13, %ymm8, %ymm8
vfmadd231ps (%rax,%rdi,4), %ymm7, %ymm8
vmulps %ymm1, %ymm8, %ymm10
vrsqrtps %ymm10, %ymm11 ; vectorized reciprocal SQRT estimate
vmulps %ymm9, %ymm12, %ymm9
vmovaps %ymm12, %ymm14
vmulps %ymm11, %ymm10, %ymm12
vfmadd213ps %ymm0, %ymm12, %ymm11
vmulps %ymm6, %ymm12, %ymm12
vmulps %ymm11, %ymm12, %ymm11
vcmpneqps %ymm5, %ymm10, %ymm10
vandps %ymm11, %ymm10, %ymm10
vaddps %ymm4, %ymm10, %ymm10
vrcpps %ymm10, %ymm11
vmovups %ymm8, (%rax,%rdi,4)
vfmadd213ps %ymm3, %ymm11, %ymm10
vfmsub132ps %ymm11, %ymm11, %ymm10
vfmadd213ps (%rdx,%rdi,4), %ymm9, %ymm10
vmovups %ymm10, (%rdx,%rdi,4)
vmovups 32(%rsi,%rdi,4), %ymm8
vmulps %ymm15, %ymm8, %ymm9
vfmadd231ps 32(%rcx,%rdi,4), %ymm2, %ymm9
vmovups %ymm9, 32(%rcx,%rdi,4)
vmulps %ymm8, %ymm8, %ymm8
vmulps %ymm13, %ymm8, %ymm8
vfmadd231ps 32(%rax,%rdi,4), %ymm7, %ymm8
vmulps %ymm1, %ymm8, %ymm10
vrsqrtps %ymm10, %ymm11
vmulps %ymm11, %ymm10, %ymm12
vfmadd213ps %ymm0, %ymm12, %ymm11
vmulps %ymm6, %ymm12, %ymm12
vmulps %ymm11, %ymm12, %ymm11
vmovaps %ymm14, %ymm12
vcmpneqps %ymm5, %ymm10, %ymm10
vandps %ymm11, %ymm10, %ymm10
vaddps %ymm4, %ymm10, %ymm10
vrcpps %ymm10, %ymm11
vmulps %ymm9, %ymm12, %ymm9
vmovups %ymm8, 32(%rax,%rdi,4)
vfmadd213ps %ymm3, %ymm11, %ymm10
vfmsub132ps %ymm11, %ymm11, %ymm10
vfmadd213ps 32(%rdx,%rdi,4), %ymm9, %ymm10
vmovups %ymm10, 32(%rdx,%rdi,4)
addq $16, %rdi ; 2 x 8 floats (two ymm vectors) per loop iteration
cmpq %rdi, %rbp
jne LBB2_14
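For readers unfamiliar with the `vrsqrtps` followed by multiply/FMA pattern in the listing above, here is a minimal scalar sketch in D of one Newton-Raphson refinement step that turns a reciprocal-square-root estimate into a square root. The function name `refinedSqrt`, the constants `-0.5` and `-3`, and the crude estimate `0.49` are assumptions made for illustration; the listing does not show the actual contents of the constant registers.

```d
import std.math : fabs;

// One Newton-Raphson step. Given r ~ 1/sqrt(x), returns an improved
// estimate of sqrt(x): sqrt(x) ~ (x*r) * -0.5 * (x*r*r - 3).
// The constants are assumed, not read from the listing.
float refinedSqrt(float x, float r)
{
    immutable float xr = x * r;           // rough sqrt(x)
    return (xr * -0.5f) * (xr * r - 3f);  // refinement
}

void main()
{
    immutable float x = 4f;
    immutable float r = 0.49f;            // stand-in for a hardware estimate of 1/sqrt(4)
    immutable float s = refinedSqrt(x, r);

    // The refined value is closer to 2 than the unrefined x*r.
    assert(fabs(s - 2f) < fabs(x * r - 2f));
    assert(fabs(s - 2f) < 1e-2f);
}
```

If `r` were exactly `1/sqrt(x)`, the expression would reduce to `sqrt(x)` itself; with an approximate `r` the step roughly squares the relative error.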
Seconded, I am very interested in having a Mir implementation. Do you already have a git repo that I can clone?
Just created https://github.com/libmir/vectorflow
Hi,
You may not want to add additional dependencies. On the other hand, Mir Algorithm is stable enough and is de facto the same thing for Dlang as NumPy is for Python. Would you accept PRs with optimizations based on Mir Algorithm?
Mir Algorithm benefits are:
- `each`, `reduce`, `zip` (and other) implementations are written with `@fastmath` in mind.
- `each` calls such as the one in the following code compile to a single 1D loop, because contiguous matrices can be flattened.

The following Mir example generates code that is a few times faster (if the matrices fit in cache).
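As a separate illustration of the flattening point (this is not the Mir example referred to above, and it uses no Mir API), here is a minimal plain-D sketch. It assumes row-major matrices stored contiguously in flat arrays; the names `addNested` and `addFlat` are hypothetical. Because the storage is contiguous, the two-level loop and the single 1D loop touch the same elements in the same order, and the 1D form is the simpler one for a compiler to vectorize.

```d
// Nested 2D traversal of a contiguous row-major matrix.
void addNested(float[] a, const(float)[] b, size_t rows, size_t cols)
{
    foreach (i; 0 .. rows)
        foreach (j; 0 .. cols)
            a[i * cols + j] += b[i * cols + j];
}

// Equivalent flattened traversal: a single 1D loop over all elements.
void addFlat(float[] a, const(float)[] b)
{
    assert(a.length == b.length);
    foreach (k; 0 .. a.length)
        a[k] += b[k];
}

void main()
{
    enum size_t rows = 3;
    enum size_t cols = 4;
    auto x = new float[rows * cols];
    auto y = new float[rows * cols];
    auto b = new float[rows * cols];
    foreach (k; 0 .. rows * cols)
    {
        x[k] = cast(float) k;
        y[k] = cast(float) k;
        b[k] = cast(float) (2 * k);
    }

    addNested(x, b, rows, cols);
    addFlat(y, b);
    assert(x == y); // both traversals produce the same result
}
```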
BTW, the current vectorflow code is not properly vectorized by LDC anyway, because `1.0 - beta` forces floats to be converted to doubles. It should be `1f - beta` instead.

The current high-level API can be preserved for backward compatibility if necessary. Let me know what you think.
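To illustrate the `1.0 - beta` remark: in D, `1.0` is a `double` literal, so mixing it with a `float` promotes the whole expression to `double`, while `1f` keeps it in single precision. The sketch below checks this at compile time; the function names `mixedPrecision` and `singlePrecision` and the `grad` parameter are hypothetical and not taken from vectorflow.

```d
// `1.0` is a double literal: the result type is double.
auto mixedPrecision(float beta, float grad)
{
    return (1.0 - beta) * grad;
}

// `1f` is a float literal: the result type stays float.
auto singlePrecision(float beta, float grad)
{
    return (1f - beta) * grad;
}

// Compile-time checks of the deduced return types.
static assert(is(typeof(mixedPrecision(0.9f, 0.5f)) == double));
static assert(is(typeof(singlePrecision(0.9f, 0.5f)) == float));

void main() {}
```

In a loop over `float` data, the `double` variant makes the compiler insert float-to-double and double-to-float conversions around the arithmetic, which halves the number of lanes per SIMD register.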
With Mir Algorithm (v0.6.16, reviewed for AVX2)
Current code