- Intel intrinsics
- Neon intrinsics
cd dev git clone --recursive https://github.com/Wunkolo/Intriman.git
cd Intriman mkdir build cd build cmake .. make ./Intriman data-latest.xml
rm -rf ../../docs mv ./docs ../../
The process of switching to SIMDe is quite trivial:
- Include the relevant header from SIMDe. In this case,
sse4.1.h. - Rename relevant types and functions.
__m128ibecomessimde__m128i, and_mm_*functions becomesimde_mm_*. - Change compiler flags from
-msse4.1to-march=native.
_may_i_use_cpu_feature
https://scc.ustc.edu.cn/zlsc/chinagrid/intel/compiler_c/main_cls/index.htm#GUID-4284FE01-6BA7-4459-8973-2DF2DB889B4B.htm
gcc -mavx2 -dM -E - < /dev/null | egrep "SSE|AVX" | sort
x86intrin.h: x86 instructions mmintrin.h: MMX (Pentium MMX!) mm3dnow.h: 3dnow! (K6-2) (deprecated) xmmintrin.h: SSE + MMX (Pentium 3, Athlon XP) emmintrin.h: SSE2 + SSE + MMX (Pentiuem 4, Ahtlon 64) pmmintrin.h: SSE3 + SSE2 + SSE + MMX (Pentium 4 Prescott, Ahtlon 64 San Diego) tmmintrin.h: SSSE3 + SSE3 + SSE2 + SSE + MMX (Core 2, Bulldozer) popcntintrin.h: POPCNT (Core i7, Phenom subset of SSE4.2 and SSE4A) ammintrin.h: SSE4A + SSE3 + SSE2 + SSE + MMX (Phenom) smmintrin.h: SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Core i7, Bulldozer) nmmintrin.h: SSE4_2 + SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Core i7, Bulldozer) wmmintrin.h: AES (Core i7 Westmere, Bulldozer) immintrin.h: AVX, SSE4_2 + SSE4_1 + SSSE3 + SSE3 + SSE2 + SSE + MMX (Core i7 Sandy Bridge, Bulldozer)
- http://www.g-truc.net/post-0359.html
https://github.com/WojciechMula/simd-sort
https://www.geeksforgeeks.org/quicksort-better-mergesort/
https://stackoverflow.com/questions/44479069/generate-code-for-multiple-simd-architectures
target_clones("sse4.1,avx")
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-target_005fclones-function-attribute
https://www.infoq.com/presentations/simdjson-parser/
- Read all content
- Check is valid JSON
- Check Unicode Encoding
- Parse Numbers
- Build DOM
- And... Reduce the number of misprediction by doing more work per iteration:
while loop to write random odd vals w/ simple if => from 3 to 15 cycles per val!
branchless:
while (howmany != 0) {
val = random();
out[index] = val;
index += (val & 1); // write only odd
homany--;
}https://github.com/simdjson/simdjson/blob/master/doc/tape.md
https://github.com/lemire/fastvalidate-utf-8/blob/master/include/simdutf8check.h
verify all bytes <= 244. "saturated subtraction" is non-zero only if x > 244
_mm256_subs_epu8(curr_bytes, 244)
When both AVX2 and SSE4 are available, the decision whether to use AVX2 is non-obvious. AVX2 vectors are twice as wide, but require a higher power license (integer multiplications count as 'heavy' instructions) and can thus reduce the clock frequency of the core or entire socket(!) on Haswell systems. This partially explains the observed 1.25x (not 2x) speedup over SSE4. Moreover, it is inadvisable to only sporadically use AVX2 instructions because there is also a ~56K cycle warmup period during which AVX2 operations are slower, and Haswell can even stall during this period. Thus, we recommend avoiding AVX2 for infrequent hashing if the rest of the application is also not using AVX2. For any input larger than 1 MiB, it is probably worthwhile to enable AVX2.
The best way to init auto variables to zero http://www.pixelbeat.org/programming/gcc/auto_init.html
reducing c++ runtime size overhead using supc++ http://www.pixelbeat.org/programming/gcc/supc++/
One can trigger gdb breakpoints from code by inserting:
asm("int $0x3\n");