-
Notifications
You must be signed in to change notification settings - Fork 663
Arm: Speed up -1..1 soft clipping with Neon #396
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This patch was created concurrently with the float-to-int16 optimisation patch and all uplift result was measured on top of it. When measuring performance uplift, the same methods were used:
The test target was a single Cortex A55 core of a Google Tensor G2 SoC. Results:
I also ran the tests described in previously in PR#379 and found no error. |
src/opus.c
Outdated
#include "opus_private.h" | ||
|
||
#ifndef DISABLE_FLOAT_API | ||
OPUS_EXPORT void opus_pcm_soft_clip(float *_x, int N, int C, float *declip_mem) | ||
|
||
static void opus_pcm_soft_clip_impl(float *_x, int N, int C, float *declip_mem, int use_arch, int arch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need both use_arch and arch, instead of just setting arch=0 to get the C version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed use_arch and adjusted relevant code parts.
Arch is always passed and when it's 0, it is going to yield the C implementation.
celt/arm/celt_neon_intr.c
Outdated
|
||
#if defined(__ARM_NEON) | ||
const int BLOCK_SIZE = 16; | ||
const int FRAME_SIZE = (60000 / sizeof(float) / BLOCK_SIZE) * BLOCK_SIZE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I get this FRAME_SIZE. Looks like a cache optimization thing? In any case, given that Opus packets cannot be longer than 120 ms, I don't think that FRAME_SIZE value can ever be reached unless opus_pcm_soft_clip() gets called explicitly. Or did I miss something here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally, yes, it aimed to keep sample processing within cache limits, but we missed the maximum of the opus packet size.
I removed FRAME_SIZE
and relevant logic handling it.
src/opus.c
Outdated
@@ -135,6 +147,17 @@ OPUS_EXPORT void opus_pcm_soft_clip(float *_x, int N, int C, float *declip_mem) | |||
declip_mem[c] = a; | |||
} | |||
} | |||
|
|||
void opus_pcm_soft_clip_with_arch(float *_x, int N, int C, float *declip_mem, int arch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem very useful. You might as well just call the impl() function from everywhere, instead of the _with_arch() one that then calls impl().
Posted one more comment. Would be good to add some comments about all_within_neg1pos1 and opus_limit2_checkwithin1() to explicitly state what the purpose of the function/variable is. Especially since if you don't look at the ARM version, the code looks a bit silly. |
If the signal exceeds -1..1 then, as error handling, the soft_clip function forces the signal back into -1..1. This is problematic since the search loop to find the next sample exceeding -1..1 is slow. If cheap on the current platform, while doing -2..2 hardclipping we can also detect if the signal never exceeds -1..1, avoiding the need for a second search loop.
I added a lengthy explanation (I hope you had something like this in mind) and removed the *_with_arch() calls, using only the *_impl() ones. |
merged |
Thank you Jean-Marc! |
If the signal exceeds -1..1 then, as error handling, the soft_clip function forces the signal back into -1..1. This is problematic since the search loop to find the next sample exceeding -1..1 is slow. If cheap on the current platform, while doing -2..2 hardclipping we can also detect if the signal never exceeds -1..1, avoiding the need for a second search loop.