Arm: Speed up -1..1 soft clipping with Neon #396

agosdahu · 2025-03-28T11:13:06Z

If the signal exceeds -1..1, the soft_clip function forces it back into -1..1 as error handling. This is costly because the search loop that finds the next sample exceeding -1..1 is slow. Where it is cheap on the current platform, we can detect during the -2..2 hard clipping whether the signal ever exceeds -1..1, which avoids the need for a second search loop.
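The idea in the description can be sketched in scalar C as a single pass that clamps to -2..2 and records whether any sample fell outside -1..1. This is only an illustration: `limit2_check_within1` is a hypothetical name and not the function added by the patch, which uses Neon intrinsics for this step.

```c
#include <stddef.h>

/* Hypothetical helper, not the Opus API: hard-clips x[0..n) to [-2, 2] in place
   and returns 1 if every input sample was already within [-1, 1], else 0. */
static int limit2_check_within1(float *x, size_t n)
{
    int all_within1 = 1;
    size_t i;
    for (i = 0; i < n; i++) {
        float v = x[i];
        /* Check the tighter -1..1 bound in the same pass as the clamp. */
        if (v > 1.0f || v < -1.0f) all_within1 = 0;
        if (v > 2.0f) v = 2.0f;
        if (v < -2.0f) v = -2.0f;
        x[i] = v;
    }
    return all_within1;
}
```

When the function returns 1, a caller would have nothing left to soft-clip and could skip the per-sample search entirely.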

agosdahu · 2025-03-28T13:04:35Z

This patch was created concurrently with the float-to-int16 optimisation patch, and all uplift results were measured on top of it.

The performance uplift was measured with the same two methods:

method A) measuring mono and stereo output at 48 kHz sampling via run_vectors.sh on the test vector set attached to the repo,
method B) running 100 stereo runs at a 48 kHz sampling rate for each of the test files testvector09.bit and testvector10.bit, one file after the other.

The test target was a single Cortex-A55 core of a Google Tensor G2 SoC.

Results:

| method A | clang 18.1.8 | gcc 14.2 |
| --- | --- | --- |
| using /usr/bin/time | 3% | 6% |
| using linux perf | 5% | 8% |

| method B | clang 18.1.8 | gcc 14.2 |
| --- | --- | --- |
| using /usr/bin/time | 4% | 6% |
| using linux perf | 4% | 6% |

I also ran the tests described previously in PR #379 and found no errors.

jmvalin · 2025-03-28T17:58:24Z

src/opus.c

 #include "opus_private.h"

 #ifndef DISABLE_FLOAT_API
-OPUS_EXPORT void opus_pcm_soft_clip(float *_x, int N, int C, float *declip_mem)
+
+static void opus_pcm_soft_clip_impl(float *_x, int N, int C, float *declip_mem, int use_arch, int arch)


Why do you need both use_arch and arch, instead of just setting arch=0 to get the C version?

agosdahu · 2025-04-01

I removed use_arch and adjusted the relevant code.
arch is now always passed; when it is 0, the C implementation is used.
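A minimal sketch of the "arch == 0 selects the C version" convention, assuming a function-pointer table indexed by arch. All names here are hypothetical and the second entry is a plain-C stand-in for a SIMD variant so that the example builds on any platform. Both entries do the same hard clipping and differ only in the tag they return, which makes the dispatch observable.

```c
/* All names are hypothetical; this only illustrates arch-indexed dispatch. */
enum { IMPL_PORTABLE = 0, IMPL_SIMD_STANDIN = 1, IMPL_COUNT = 2 };

typedef int (*clip2_fn)(float *x, int n);

/* Shared scalar hard clip to [-2, 2]. */
static void clip2_scalar(float *x, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        if (x[i] > 2.0f) x[i] = 2.0f;
        else if (x[i] < -2.0f) x[i] = -2.0f;
    }
}

static int clip2_portable(float *x, int n)
{
    clip2_scalar(x, n);
    return IMPL_PORTABLE;
}

/* Stand-in for a platform-specific version (would use intrinsics in practice). */
static int clip2_simd_standin(float *x, int n)
{
    clip2_scalar(x, n);
    return IMPL_SIMD_STANDIN;
}

static const clip2_fn CLIP2_TABLE[IMPL_COUNT] = { clip2_portable, clip2_simd_standin };

/* arch == 0 always maps to the portable C entry; unknown values fall back to it. */
static int clip2_dispatch(float *x, int n, int arch)
{
    if (arch < 0 || arch >= IMPL_COUNT) arch = 0;
    return CLIP2_TABLE[arch](x, n);
}
```

With this layout a separate use_arch flag carries no extra information, since passing 0 already requests the portable path.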

jmvalin · 2025-03-29T00:15:44Z

celt/arm/celt_neon_intr.c

+
+#if defined(__ARM_NEON)
+   const int BLOCK_SIZE = 16;
+   const int FRAME_SIZE = (60000 / sizeof(float) / BLOCK_SIZE) * BLOCK_SIZE;


I'm not sure I get this FRAME_SIZE. Looks like a cache optimization thing? In any case, given that Opus packets cannot be longer than 120 ms, I don't think that FRAME_SIZE value can ever be reached unless opus_pcm_soft_clip() gets called explicitly. Or did I miss something here?

agosdahu · 2025-04-01

Originally, yes: the aim was to keep sample processing within cache limits, but we overlooked the maximum Opus packet duration.
I removed FRAME_SIZE and the logic that handled it.
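The arithmetic behind the reviewer's point can be checked with a short sketch. The FRAME_SIZE expression is taken from the diff above; the 48 kHz rate, the 120 ms packet limit and a two-channel interleaved buffer are the assumptions here, and the helper names are hypothetical. Whether the patch compared FRAME_SIZE against the per-channel or the interleaved count is not shown in this thread.

```c
#include <stddef.h>

/* Reproduces the removed expression:
   (60000 / sizeof(float) / BLOCK_SIZE) * BLOCK_SIZE, with the element size
   passed in so the result does not depend on the host's float width. */
static size_t frame_size_for(size_t elem_size, size_t block_size)
{
    return (60000 / elem_size / block_size) * block_size;
}

/* Interleaved sample count of one packet (assumed parameters, not from the patch). */
static size_t samples_per_packet(size_t rate_hz, size_t duration_ms, size_t channels)
{
    return rate_hz * duration_ms / 1000 * channels;
}
```

With 4-byte floats and a block size of 16, `frame_size_for(4, 16)` gives 14992 samples, while `samples_per_packet(48000, 120, 2)` gives 11520, so under these assumptions a decoded packet never reaches the limit.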

jmvalin · 2025-04-11T00:26:10Z

src/opus.c

@@ -135,6 +147,17 @@ OPUS_EXPORT void opus_pcm_soft_clip(float *_x, int N, int C, float *declip_mem)
      declip_mem[c] = a;
   }
 }
+
+void opus_pcm_soft_clip_with_arch(float *_x, int N, int C, float *declip_mem, int arch)


This doesn't seem very useful. You might as well just call the impl() function from everywhere, instead of the _with_arch() one that then calls impl().

jmvalin · 2025-04-11T00:38:43Z

Posted one more comment. Would be good to add some comments about all_within_neg1pos1 and opus_limit2_checkwithin1() to explicitly state what the purpose of the function/variable is. Especially since if you don't look at the ARM version, the code looks a bit silly.
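To illustrate why the flag can look odd without the ARM version in view, here is a sketch of how a caller might consume such a hint. It assumes that a hint of 0 means "unknown" rather than "exceeds", which is what a portable fallback would report if it cannot compute the answer cheaply. The function names are hypothetical and this is not the code from the patch.

```c
/* Hypothetical: index of the first sample outside [-1, 1], or n if there is none. */
static int find_first_outside1(const float *x, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        if (x[i] > 1.0f || x[i] < -1.0f) break;
    }
    return i;
}

/* Hypothetical caller step. all_within1_hint == 1 means the hard-clip pass
   already proved that no sample exceeds -1..1; 0 means nothing is known.
   Returns 1 if the search loop had to run, 0 if it was skipped. */
static int locate_clipping(const float *x, int n, int all_within1_hint, int *first)
{
    if (all_within1_hint) {
        *first = n;   /* nothing to soft-clip */
        return 0;
    }
    *first = find_first_outside1(x, n);
    return 1;
}
```

On a platform that always passes 0 the behaviour is the same as before the change, so the variable only pays off where the hint is actually produced.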

agosdahu · 2025-04-17T12:44:50Z

I added a lengthy explanation (I hope you had something like this in mind) and removed the *_with_arch() calls, using only the *_impl() ones.

jmvalin · 2025-04-18T02:22:43Z

merged

agosdahu · 2025-04-22T05:59:35Z

Thank you Jean-Marc!
Much appreciated.

agosdahu force-pushed the neon_softclip branch from afe49c3 to bfd24af on April 1, 2025 09:02

agosdahu force-pushed the neon_softclip branch from bfd24af to 02968ed on April 17, 2025 12:41
