Skip to content

Arm: Speed up -1..1 soft clipping with Neon #396

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

agosdahu
Copy link

If the signal exceeds -1..1 then, as error handling, the soft_clip function forces the signal back into -1..1. This is problematic since the search loop to find the next sample exceeding -1..1 is slow. If cheap on the current platform, while doing -2..2 hardclipping we can also detect if the signal never exceeds -1..1, avoiding the need for a second search loop.

@agosdahu
Copy link
Author

This patch was created concurrently with the float-to-int16 optimisation patch and all uplift result was measured on top of it.

When measuring performance uplift, the same methods were used:

  • method A) measuring mono/stereo output at 48k sampling via run_vectors.sh on test vector set attached to repo,
  • method B) running 100-100 stereo run at 48k sampling rate for test files testvector09.bit and testvector10.bit, respectively, but in succession.

The test target was a single Cortex A55 core of a Google Tensor G2 SoC.

Results:

method A clang 18.1.8 gcc 14.2
using /usr/bin/time 3% 6%
using linux perf 5% 8%
method B clang 18.1.8 gcc 14.2
using /usr/bin/time 4% 6%
using linux perf 4% 6%

I also ran the tests described in previously in PR#379 and found no error.

src/opus.c Outdated
#include "opus_private.h"

#ifndef DISABLE_FLOAT_API
OPUS_EXPORT void opus_pcm_soft_clip(float *_x, int N, int C, float *declip_mem)

static void opus_pcm_soft_clip_impl(float *_x, int N, int C, float *declip_mem, int use_arch, int arch)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need both use_arch and arch, instead of just setting arch=0 to get the C version?

Copy link
Author

@agosdahu agosdahu Apr 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed use_arch and adjusted relevant code parts.
Arch is always passed and when it's 0, it is going to yield the C implementation.


#if defined(__ARM_NEON)
const int BLOCK_SIZE = 16;
const int FRAME_SIZE = (60000 / sizeof(float) / BLOCK_SIZE) * BLOCK_SIZE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I get this FRAME_SIZE. Looks like a cache optimization thing? In any case, given that Opus packets cannot be longer than 120 ms, I don't think that FRAME_SIZE value can ever be reached unless opus_pcm_soft_clip() gets called explicitly. Or did I miss something here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, yes, it aimed to keep sample processing within cache limits, but we missed the maximum of the opus packet size.
I removed FRAME_SIZE and relevant logic handling it.

src/opus.c Outdated
@@ -135,6 +147,17 @@ OPUS_EXPORT void opus_pcm_soft_clip(float *_x, int N, int C, float *declip_mem)
declip_mem[c] = a;
}
}

void opus_pcm_soft_clip_with_arch(float *_x, int N, int C, float *declip_mem, int arch)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem very useful. You might as well just call the impl() function from everywhere, instead of the _with_arch() one that then calls impl().

@jmvalin
Copy link
Member

jmvalin commented Apr 11, 2025

Posted one more comment. Would be good to add some comments about all_within_neg1pos1 and opus_limit2_checkwithin1() to explicitly state what the purpose of the function/variable is. Especially since if you don't look at the ARM version, the code looks a bit silly.

If the signal exceeds -1..1 then, as error handling, the soft_clip
function forces the signal back into -1..1. This is problematic since
the search loop to find the next sample exceeding -1..1 is slow.
If cheap on the current platform, while doing -2..2 hardclipping we
can also detect if the signal never exceeds -1..1, avoiding the need
for a second search loop.
@agosdahu
Copy link
Author

I added a lengthy explanation (I hope you had something like this in mind) and removed the *_with_arch() calls, using only the *_impl() ones.

@jmvalin
Copy link
Member

jmvalin commented Apr 18, 2025

merged

@agosdahu
Copy link
Author

Thank you Jean-Marc!
Much appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants