55% memchr optimization with SIMD on x86-64 | Macros config SIMD #8421

Open
wants to merge 2 commits into
base: master

@Kraionix Kraionix commented Feb 21, 2025

Macros config SIMD

Added an include of the compiler-specific intrinsics header.

#if !defined(IMGUI_DISABLE_SIMD)
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__) || defined(__clang__)
#include <x86intrin.h>
#endif
#endif

Updated the macros that configure SIMD support for x86-64 builds.

  • IMGUI_DISABLE_SIMD
  • IMGUI_DISABLE_SSE
  • IMGUI_DISABLE_SSE4_2
  • IMGUI_DISABLE_AVX
  • IMGUI_DISABLE_AVX2
  • IMGUI_ENABLE_SSE
  • IMGUI_ENABLE_SSE4_2
  • IMGUI_ENABLE_AVX
  • IMGUI_ENABLE_AVX2
#if (defined __x86_64__ || defined _M_X64) && !defined(IMGUI_DISABLE_SIMD)
#if (defined __SSE__  || (defined(_M_IX86_FP) && (_M_IX86_FP >= 1))) && !defined(IMGUI_DISABLE_SSE)
#define IMGUI_ENABLE_SSE
#endif
#if defined (__SSE4_2__) && !defined(IMGUI_DISABLE_SSE4_2)
#define IMGUI_ENABLE_SSE4_2
#endif
#if (defined __AVX__) && !defined(IMGUI_DISABLE_AVX)
#define IMGUI_ENABLE_AVX
#endif
#if (defined __AVX2__) && !defined(IMGUI_DISABLE_AVX2)
#define IMGUI_ENABLE_AVX2
#endif
#endif
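
To illustrate how these macros interact, here is a minimal sketch (not code from the PR). It reproduces the detection block above verbatim and reports the widest level that ends up enabled. The helper name `ImSimdLevelName` is hypothetical and exists only for this example.

```cpp
#include <cstring>

// Detection block reproduced from the PR description above.
#if (defined __x86_64__ || defined _M_X64) && !defined(IMGUI_DISABLE_SIMD)
#if (defined __SSE__  || (defined(_M_IX86_FP) && (_M_IX86_FP >= 1))) && !defined(IMGUI_DISABLE_SSE)
#define IMGUI_ENABLE_SSE
#endif
#if defined (__SSE4_2__) && !defined(IMGUI_DISABLE_SSE4_2)
#define IMGUI_ENABLE_SSE4_2
#endif
#if (defined __AVX__) && !defined(IMGUI_DISABLE_AVX)
#define IMGUI_ENABLE_AVX
#endif
#if (defined __AVX2__) && !defined(IMGUI_DISABLE_AVX2)
#define IMGUI_ENABLE_AVX2
#endif
#endif

// Hypothetical helper (not part of the PR): names the widest enabled level.
static const char* ImSimdLevelName()
{
#if defined(IMGUI_ENABLE_AVX2)
    return "AVX2";
#elif defined(IMGUI_ENABLE_AVX)
    return "AVX";
#elif defined(IMGUI_ENABLE_SSE4_2)
    return "SSE4.2";
#elif defined(IMGUI_ENABLE_SSE)
    return "SSE";
#else
    return "scalar";
#endif
}
```

Defining `IMGUI_DISABLE_AVX2` before the detection block (for example with a `-D` compiler flag) should make the helper fall through to the next lower level, and `IMGUI_DISABLE_SIMD` should make it report "scalar".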

SIMD ImMemchr

Added optimized ImMemchr implementations using SSE and AVX2.
Replaced calls to memchr with ImMemchr.
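
For readers unfamiliar with the general technique, the following is a minimal sketch of a SIMD byte search, assuming SSE2 is available. It is not the PR's implementation: the name `MemchrSse2Sketch` is hypothetical, it has no prefetching, and it falls back to a scalar loop when SSE2 is not detected at compile time.

```cpp
#include <cstddef>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h> // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical name; illustrates the general approach only.
static const void* MemchrSse2Sketch(const void* buf, int ch, size_t count)
{
    const unsigned char* p = static_cast<const unsigned char*>(buf);
    const unsigned char* end = p + count;
    const unsigned char target = static_cast<unsigned char>(ch);
#ifdef SKETCH_HAVE_SSE2
    const __m128i needle = _mm_set1_epi8(static_cast<char>(target));
    // Compare 16 bytes per iteration using unaligned loads.
    // The loop condition guarantees no read goes past 'end'.
    while (end - p >= 16)
    {
        const __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
        const int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask != 0)
        {
            // The lowest set bit marks the first matching byte in the chunk.
            int idx = 0;
            while (((mask >> idx) & 1) == 0)
                ++idx;
            return p + idx;
        }
        p += 16;
    }
#endif
    // Scalar tail (and full fallback when SSE2 is unavailable).
    for (; p != end; ++p)
        if (*p == target)
            return p;
    return nullptr;
}
```

The portable bit loop could be swapped for a compiler-specific bit-scan intrinsic; it is kept simple here so the sketch builds on any compiler.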

Benchmark

Benchmark description

Search for all lines of length 131, ending with \n, in a std::string buffer filled with random ASCII characters. Buffer sizes range from 1 MiB to 1 GiB. Several memchr implementations, using SSE and AVX2, are compared for performance.
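
The setup described above can be sketched as follows. This is not the benchmark source: the names `MakeLineBuffer` and `CountLines` are hypothetical, and the sketch assumes the 131-byte line length includes the trailing \n.

```cpp
#include <cstddef>
#include <cstring>
#include <random>
#include <string>

// Builds a buffer of random printable ASCII where every 'line_len'-th byte is '\n'.
// Assumption: the line length includes the trailing newline.
static std::string MakeLineBuffer(size_t size, size_t line_len = 131, unsigned seed = 42)
{
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> printable(32, 126); // never produces '\n'
    std::string buf(size, '\0');
    for (size_t i = 0; i < size; ++i)
        buf[i] = ((i + 1) % line_len == 0) ? '\n' : static_cast<char>(printable(rng));
    return buf;
}

// Walks the buffer line by line with memchr, which is the access pattern being measured.
static size_t CountLines(const std::string& buf)
{
    size_t lines = 0;
    const char* p = buf.data();
    const char* end = p + buf.size();
    while (p < end)
    {
        const void* hit = std::memchr(p, '\n', static_cast<size_t>(end - p));
        if (hit == nullptr)
            break;
        ++lines;
        p = static_cast<const char*>(hit) + 1; // resume after the newline
    }
    return lines;
}
```

Swapping `std::memchr` for another implementation inside `CountLines` would give a like-for-like comparison on the same buffer.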

System Specifications

CPU: Intel Core i9-10980XE

  • Core frequency: 4400 MHz (all cores)
  • Uncore frequency: 3200 MHz
  • Cores/Threads: 18 Cores / 36 Threads

RAM: 128GB (4x32 GB)

  • Memory Frequency: 4000 MHz
  • Timings: 18-22-22-42
  • Channels: Quad-channel

Google benchmark results

Run on (36 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x18)
  L1 Instruction 32 KiB (x18)
  L2 Unified 1024 KiB (x18)
  L3 Unified 25344 KiB (x1)
--------------------------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------
ImMemchr_AVX2_PREFETCH/1048576         0.048 ms        0.048 ms        14452 bytes_per_second=20.5284Gi/s
ImMemchr_AVX2_PREFETCH/2097152         0.096 ms        0.096 ms         7467 bytes_per_second=20.2908Gi/s
ImMemchr_AVX2_PREFETCH/16777216        0.799 ms        0.802 ms          896 bytes_per_second=19.4783Gi/s
ImMemchr_AVX2_PREFETCH/134217728        7.28 ms         7.29 ms           90 bytes_per_second=17.1429Gi/s
ImMemchr_AVX2_PREFETCH/1073741824       58.2 ms         58.2 ms           11 bytes_per_second=17.1707Gi/s
ImMemchr_AVX2/1048576                  0.053 ms        0.053 ms        10000 bytes_per_second=18.3824Gi/s
ImMemchr_AVX2/2097152                  0.106 ms        0.107 ms         6400 bytes_per_second=18.1818Gi/s
ImMemchr_AVX2/16777216                 0.924 ms        0.941 ms          747 bytes_per_second=16.6Gi/s
ImMemchr_AVX2/134217728                 9.97 ms         9.79 ms           75 bytes_per_second=12.766Gi/s
ImMemchr_AVX2/1073741824                79.8 ms         78.1 ms            7 bytes_per_second=12.8Gi/s
ImMemchr_SSE_PREFETCH/1048576          0.056 ms        0.056 ms        11200 bytes_per_second=17.5Gi/s
ImMemchr_SSE_PREFETCH/2097152          0.112 ms        0.112 ms         5600 bytes_per_second=17.5Gi/s
ImMemchr_SSE_PREFETCH/16777216         0.918 ms        0.920 ms          747 bytes_per_second=16.9773Gi/s
ImMemchr_SSE_PREFETCH/134217728         8.25 ms         8.12 ms           75 bytes_per_second=15.3846Gi/s
ImMemchr_SSE_PREFETCH/1073741824        65.9 ms         65.3 ms           11 bytes_per_second=15.3043Gi/s
ImMemchr_SSE/1048576                   0.059 ms        0.059 ms        11200 bytes_per_second=16.6667Gi/s
ImMemchr_SSE/2097152                   0.117 ms        0.117 ms         6400 bytes_per_second=16.6667Gi/s
ImMemchr_SSE/16777216                   1.02 ms         1.00 ms          640 bytes_per_second=15.6098Gi/s
ImMemchr_SSE/134217728                  10.4 ms         10.4 ms           75 bytes_per_second=12Gi/s
ImMemchr_SSE/1073741824                 83.7 ms         83.3 ms            9 bytes_per_second=12Gi/s
ImMemchr_CSTD/1048576                  0.068 ms        0.068 ms        11200 bytes_per_second=14.2857Gi/s
ImMemchr_CSTD/2097152                  0.134 ms        0.134 ms         5600 bytes_per_second=14.5833Gi/s
ImMemchr_CSTD/16777216                  1.17 ms         1.17 ms          640 bytes_per_second=13.3333Gi/s
ImMemchr_CSTD/134217728                 11.5 ms         11.5 ms           64 bytes_per_second=10.8936Gi/s
ImMemchr_CSTD/1073741824                89.9 ms         91.5 ms            7 bytes_per_second=10.9268Gi/s
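
As a quick check of the headline figure, the relative gain can be computed from the bytes_per_second column. The helper below is a hypothetical illustration, not part of the benchmark.

```cpp
// Relative throughput gain of 'candidate' over 'baseline', in percent.
static double SpeedupPercent(double candidate_gib_s, double baseline_gib_s)
{
    return (candidate_gib_s / baseline_gib_s - 1.0) * 100.0;
}
```

Using the 1 GiB rows, `SpeedupPercent(17.1707, 10.9268)` gives roughly 57% for ImMemchr_AVX2_PREFETCH over ImMemchr_CSTD, while `SpeedupPercent(12.8, 10.9268)` gives roughly 17% for ImMemchr_AVX2 without prefetch. At 1 MiB the prefetch variant's gain is roughly 44%. The 55% in the title therefore appears to correspond to the prefetch variant on the larger buffers.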

@Kraionix Kraionix changed the title from "55% memchr optimization with SIMD" to "55% memchr optimization with SIMD on x86-64" Feb 21, 2025

@ocornut ocornut (Owner) commented Feb 21, 2025

Hello,

Thanks for the PR!
Out of curiosity, can you describe what prompted you to undertake this optimization?

Where does the code for the 2 implementations come from? As the problem is relatively well defined and "simple", I am wondering if they may be well-known implementations. Do you have links that describe the approach? (I'm not sure I understand all the code, so any further comments may be useful to facilitate possible future maintenance.)

It's probably quite valuable for fast-forwarding during text display; I'll be benchmarking in that specific scenario.

@Kraionix Kraionix (Author) commented

Hello.
I am learning low-level optimization with SIMD on x86-64, and I wrote the code from scratch for practice.
I optimized the SSE and AVX2 implementations of ImMemchr to use as few clock cycles as possible.
I accounted for unaligned loads and for prefetching data into the cache, and I also took security into account.
I optimized iteratively until I arrived at the current code.
This is the benchmark source code and binary link.

@Kraionix Kraionix (Author) commented

I also tested ImGui::TextUnformatted with a 16 MB std::string buffer in which every 131st character is \n. With the AVX2 ImMemchr, performance improved by about 60% compared to regular memchr: the ImGui::TextUnformatted call went from 1.6 ms per frame to 0.98 ms.

@Kraionix Kraionix changed the title from "55% memchr optimization with SIMD on x86-64" to "55% memchr optimization with SIMD on x86-64 | Macros config SIMD" Feb 21, 2025
@Kraionix Kraionix marked this pull request as ready for review February 21, 2025 17:15