Math: Inline function sofm_lut_sin_fixed_16b() for performance #9798
singalsu wants to merge 1 commit into thesofproject:main
This patch inlines the function sofm_lut_sin_fixed_16b() and moves it to the header file lut_trig.h. The lookup table stays in lut_trig.c and is made global.

The DRC component makes heavy use of the sine function (the fast lookup-table version). The function does not seem to benefit from a rewrite with HiFi intrinsics, but inlining it improves DRC performance on the MTL platform by 0.54 MCPS, from 12.62 MCPS to 12.08 MCPS. In Multiband-DRC the saving is multiplied by the number of bands, e.g. a 1.58 MCPS saving with three bands.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
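The split described above (function body in the header, table defined once in the .c file) can be sketched in a single file as follows. This is only an illustration of the pattern: the names `example_sine_table` and `example_sine_lookup`, the 5-entry table and the index clamping are assumptions made for the example and are not taken from lut_trig.h or lut_trig.c.

```c
#include <stdint.h>

/* ---- Part that would go into the header ---- */

/* Declaration only: every file that includes the header sees the same table.
 * Name and size are hypothetical.
 */
extern const int16_t example_sine_table[5];

/* static inline: the body is visible at each call site, so the compiler can
 * expand it there instead of emitting a function call.
 */
static inline int16_t example_sine_lookup(int idx)
{
    /* Clamp the index so the sketch never reads outside the table */
    if (idx < 0)
        idx = 0;
    if (idx > 4)
        idx = 4;
    return example_sine_table[idx];
}

/* ---- Part that would stay in the .c file ---- */

/* Single definition of the table: sin(k * pi / 8) for k = 0..4 in Q1.15,
 * with the last entry saturated to 32767.
 */
const int16_t example_sine_table[5] = { 0, 12540, 23170, 30274, 32767 };
```

Because the table has external linkage, it exists only once in the image even though the lookup code may be duplicated at every call site.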
    delta = s1 - s0; /* Q1.16 */
    sine = s0 + q_mults_32x32(frac, delta, Q_SHIFT_BITS_64(31, 16, 16)); /* Q1.16 */
    return sat_int16((sine + 1) >> 1); /* Round to Q1.15 */
}
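The interpolation and rounding in the excerpt above can be tried out with a standalone sketch. It assumes that `frac` is in Q1.31 and that the shift works out to 31 + 16 - 16 = 31 bits; the helpers `example_sat_int16` and `example_interp_q15` are hypothetical stand-ins written for this example, not the SOF implementations of `sat_int16()` and `q_mults_32x32()`.

```c
#include <stdint.h>

/* Hypothetical stand-in for sat_int16(): clamp a 32-bit value to int16_t */
static inline int16_t example_sat_int16(int32_t x)
{
    if (x > INT16_MAX)
        return INT16_MAX;
    if (x < INT16_MIN)
        return INT16_MIN;
    return (int16_t)x;
}

/* s0, s1: neighbouring table values in Q1.16
 * frac:   position between them in Q1.31 (assumption)
 * Returns the interpolated value rounded to Q1.15.
 */
static inline int16_t example_interp_q15(int32_t s0, int32_t s1, int32_t frac)
{
    int32_t delta = s1 - s0; /* Q1.16 */
    /* Q1.31 x Q1.16 product in 64 bits, shifted right by 31 back to Q1.16.
     * Right shift of a negative value is assumed to be arithmetic, as it is
     * on common compilers.
     */
    int32_t prod = (int32_t)(((int64_t)frac * delta) >> 31);
    int32_t sine = s0 + prod; /* Q1.16 */

    /* Add half an LSB of the target format, drop one bit, then saturate */
    return example_sat_int16((sine + 1) >> 1);
}
```

For example, interpolating halfway (`frac = 0x40000000`) between 0 and 1.0 in Q1.16 (`65536`) gives 16384, which is 0.5 in Q1.15. An input of exactly 1.0 rounds to 32768 and is saturated to 32767.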
With this, every call to sofm_lut_sin_fixed_16b() inlines sofm_sine_lookup_16b() twice. The former is called from drc_sin_fixed(), which is also an inline function in drc_math.h. That one is called 4 times from the C, HiFi3 and HiFi4 DRC versions, so the resulting image (or the DRC module) should become somewhat larger. @singalsu have you compared sizes? You could also try inlining only one of them; I wonder how much of the performance improvement that would give. Also, you could convert lines 55-56 to a 2-iteration loop, which would reduce the size a bit, unless the compiler decides to unroll that loop.
In general, I'd guess that we could make similar performance improvements by identifying all the functions that are called when processing data and moving them to headers.
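The 2-iteration loop suggested in the comment above could look roughly like the sketch below. It assumes that lines 55-56 of the patch are the two back-to-back table lookups, which is not visible in the excerpt. The table, `example_lookup()` and both `example_delta_*()` functions are hypothetical names made up for this illustration.

```c
#include <stdint.h>

/* Hypothetical quarter-wave table, sin(k * pi / 8) for k = 0..4 in Q1.15 */
static const int16_t example_table[5] = { 0, 12540, 23170, 30274, 32767 };

/* Hypothetical stand-in for the per-index lookup that gets inlined.
 * Valid for idx 0..8: the second quarter mirrors the first.
 */
static inline int32_t example_lookup(int idx)
{
    return idx <= 4 ? example_table[idx] : example_table[8 - idx];
}

/* Two explicit call sites: the inlined lookup body may be emitted twice */
static inline int32_t example_delta_two_calls(int idx)
{
    int32_t s0 = example_lookup(idx);
    int32_t s1 = example_lookup(idx + 1);

    return s1 - s0;
}

/* One call site inside a 2-iteration loop: the lookup body appears once,
 * unless the compiler chooses to unroll the loop.
 */
static inline int32_t example_delta_loop(int idx)
{
    int32_t s[2];
    int i;

    for (i = 0; i < 2; i++)
        s[i] = example_lookup(idx + i);

    return s[1] - s[0];
}
```

Both variants return the same values. Whether the loop form actually shrinks the code depends on the compiler and optimization level, so comparing the sizes of the built module would be needed to confirm any saving.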