55% memchr optimization with SIMD on x86-64 | Macros config SIMD #8421

Kraionix · 2025-02-21T15:05:05Z

macros config SIMD

Added including of a compiler-specific intrinsics library.

#if !defined(IMGUI_DISABLE_SIMD)
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__) || defined(__clang__)
#include <x86intrin.h>
#endif
#endif

Updated macros for configuring a build from SIMD to x86-64.

IMGUI_DISABLE_SIMD
IMGUI_DISABLE_SSE
IMGUI_DISABLE_SSE4_2
IMGUI_DISABLE_AVX
IMGUI_DISABLE_AVX2
IMGUI_ENABLE_SSE
IMGUI_ENABLE_SSE4_2
IMGUI_ENABLE_AVX
IMGUI_ENABLE_AVX2

#if (defined __x86_64__ || defined _M_X64) && !defined(IMGUI_DISABLE_SIMD)
#if (defined __SSE__  || (defined(_M_IX86_FP) && (_M_IX86_FP >= 1))) && !defined(IMGUI_DISABLE_SSE)
#define IMGUI_ENABLE_SSE
#endif
#if defined (__SSE4_2__) && !defined(IMGUI_DISABLE_SSE4_2)
#define IMGUI_ENABLE_SSE4_2
#endif
#if (defined __AVX__) && !defined(IMGUI_DISABLE_AVX)
#define IMGUI_ENABLE_AVX
#endif
#if (defined __AVX2__) && !defined(IMGUI_DISABLE_AVX2)
#define IMGUI_ENABLE_AVX2
#endif
#endif

SIMD ImMemchr

Created optimized ImMemchr functions on SSE and AVX2.
Replaced using memchr with ImMemchr.

Benchmark

Benchmark description

Search for all lines of length 131, ending with \n, in a std::string buffer filled with random ASCII characters. Buffer sizes range from 16 MB to 1 GB. Various memchr implementations using SSE and AVX2 are tested for performance.

System Specifications

CPU: Intel Core i9-10980XE

Cores Frequency: 4400 MHz (All Cores)
Uncore Frequency: 3200 Mhz
Cores/Threads: 18 Cores / 36 Threads

RAM: 128GB (4x32 GB)

Memory Frequency: 4000 MHz
Timings: 18-22-22-42
Channels: Quad-channel

Google benchmark results

Run on (36 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x18)
  L1 Instruction 32 KiB (x18)
  L2 Unified 1024 KiB (x18)
  L3 Unified 25344 KiB (x1)
--------------------------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------
ImMemchr_AVX2_PREFETCH/1048576         0.048 ms        0.048 ms        14452 bytes_per_second=20.5284Gi/s
ImMemchr_AVX2_PREFETCH/2097152         0.096 ms        0.096 ms         7467 bytes_per_second=20.2908Gi/s
ImMemchr_AVX2_PREFETCH/16777216        0.799 ms        0.802 ms          896 bytes_per_second=19.4783Gi/s
ImMemchr_AVX2_PREFETCH/134217728        7.28 ms         7.29 ms           90 bytes_per_second=17.1429Gi/s
ImMemchr_AVX2_PREFETCH/1073741824       58.2 ms         58.2 ms           11 bytes_per_second=17.1707Gi/s
ImMemchr_AVX2/1048576                  0.053 ms        0.053 ms        10000 bytes_per_second=18.3824Gi/s
ImMemchr_AVX2/2097152                  0.106 ms        0.107 ms         6400 bytes_per_second=18.1818Gi/s
ImMemchr_AVX2/16777216                 0.924 ms        0.941 ms          747 bytes_per_second=16.6Gi/s
ImMemchr_AVX2/134217728                 9.97 ms         9.79 ms           75 bytes_per_second=12.766Gi/s
ImMemchr_AVX2/1073741824                79.8 ms         78.1 ms            7 bytes_per_second=12.8Gi/s
ImMemchr_SSE_PREFETCH/1048576          0.056 ms        0.056 ms        11200 bytes_per_second=17.5Gi/s
ImMemchr_SSE_PREFETCH/2097152          0.112 ms        0.112 ms         5600 bytes_per_second=17.5Gi/s
ImMemchr_SSE_PREFETCH/16777216         0.918 ms        0.920 ms          747 bytes_per_second=16.9773Gi/s
ImMemchr_SSE_PREFETCH/134217728         8.25 ms         8.12 ms           75 bytes_per_second=15.3846Gi/s
ImMemchr_SSE_PREFETCH/1073741824        65.9 ms         65.3 ms           11 bytes_per_second=15.3043Gi/s
ImMemchr_SSE/1048576                   0.059 ms        0.059 ms        11200 bytes_per_second=16.6667Gi/s
ImMemchr_SSE/2097152                   0.117 ms        0.117 ms         6400 bytes_per_second=16.6667Gi/s
ImMemchr_SSE/16777216                   1.02 ms         1.00 ms          640 bytes_per_second=15.6098Gi/s
ImMemchr_SSE/134217728                  10.4 ms         10.4 ms           75 bytes_per_second=12Gi/s
ImMemchr_SSE/1073741824                 83.7 ms         83.3 ms            9 bytes_per_second=12Gi/s
ImMemchr_CSTD/1048576                  0.068 ms        0.068 ms        11200 bytes_per_second=14.2857Gi/s
ImMemchr_CSTD/2097152                  0.134 ms        0.134 ms         5600 bytes_per_second=14.5833Gi/s
ImMemchr_CSTD/16777216                  1.17 ms         1.17 ms          640 bytes_per_second=13.3333Gi/s
ImMemchr_CSTD/134217728                 11.5 ms         11.5 ms           64 bytes_per_second=10.8936Gi/s
ImMemchr_CSTD/1073741824                89.9 ms         91.5 ms            7 bytes_per_second=10.9268Gi/s

ocornut · 2025-02-21T15:46:47Z

Hello,

Thanks for the PR!
Out of curiosity, can you describe what prompted you to undergo this optimization?

Where does the code for the 2 implementations come from? As the problem is relatively well defined and "simple", I am wondering if they may be well known implementations? Do you have links that may describe the approach (I'm not sure I understand all the code, so any further comment may be useful to facilitate possible future maintenance)

It's probably well valuable for fast-forwarding during text display, I'll be benchmarking in that specific scenario.

Kraionix · 2025-02-21T16:53:31Z

Hello.
I am learning low-level optimizations using SIMD on x86-64. I wrote the code from scratch for the sake of practice.
I optimized the SSE and AVX2 implementations of ImMemchr to minimize clock cycles as much as possible.
I took into account loading unaligned data, preloading data into the cache, I also took security into account.
I optimized iteratively until I got to the current code, with maximum optimization.
This is the benchmark source code and binary link.

Kraionix · 2025-02-21T17:05:56Z

I also tested ImGui::TextUnformatted with a 16 MB std::string buffer, in which every 131 characters is \n, the performance on AVX2 ImMemchr increased by about 60%, compared to regular memchr. ImGui::TextUnformatted call was 1.6 ms per frame, became 0.98 ms.

Kraionix added 2 commits February 21, 2025 21:34

Merged feature/simd-memchr as a single commit.

fac2bdd

Merge branch 'master' of https://github.com/Kraionix/imgui

992459d

Kraionix changed the title ~~55% memchr optimization with SIMD~~ 55% memchr optimization with SIMD on x86-64 Feb 21, 2025

ocornut added the optimization label Feb 21, 2025

Kraionix changed the title ~~55% memchr optimization with SIMD on x86-64~~ 55% memchr optimization with SIMD on x86-64 | Macros config SIMD Feb 21, 2025

Kraionix marked this pull request as ready for review February 21, 2025 17:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

55% memchr optimization with SIMD on x86-64 | Macros config SIMD #8421

55% memchr optimization with SIMD on x86-64 | Macros config SIMD #8421

Kraionix commented Feb 21, 2025 •

edited

Loading

ocornut commented Feb 21, 2025

Kraionix commented Feb 21, 2025

Kraionix commented Feb 21, 2025

55% memchr optimization with SIMD on x86-64 | Macros config SIMD #8421

Are you sure you want to change the base?

55% memchr optimization with SIMD on x86-64 | Macros config SIMD #8421

Conversation

Kraionix commented Feb 21, 2025 • edited Loading

macros config SIMD

SIMD ImMemchr

Benchmark

Benchmark description

System Specifications

CPU: Intel Core i9-10980XE

RAM: 128GB (4x32 GB)

Google benchmark results

ocornut commented Feb 21, 2025

Kraionix commented Feb 21, 2025

Kraionix commented Feb 21, 2025

Kraionix commented Feb 21, 2025 •

edited

Loading