
NEON-optimize 5-3 IDWT #1630

Merged
rouault merged 1 commit into uclouvain:master from nico:neon-reversible
Apr 24, 2026
Conversation

@nico nico commented Apr 12, 2026

Contributor

Takes bin/bench_dwt from 1.618 s to 0.432 s on my system.


Sadly, despite being a much bigger win in bench_dwt time, this is a much smaller end-to-end improvement than #1629, because for lossless files less time is spent in the IDWT than in T1 / MQC decoding (as there's more data). Still, it seems nice to have a fast 5-3 IDWT anyway.
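For readers who want to see what is being vectorized, here is a minimal scalar sketch of the reversible 5-3 lifting steps on a 1-D signal that starts at an even index. It is not the code in dwt.c and not the NEON version from this PR; the function names (`dwt53_1d`, `idwt53_1d`, `high_at`, `floor_div`) and the separate low/high array layout are assumptions made for illustration only.

```c
#include <stddef.h>

/* Floor division by a positive divisor; avoids relying on the
 * implementation-defined right shift of negative values. */
static int floor_div(int a, int b)
{
    int q = a / b;
    return (a % b != 0 && a < 0) ? q - 1 : q;
}

/* Hypothetical helper: read high[i] with the index clamped to [0, nh-1],
 * which is what whole-sample symmetric extension gives at the borders. */
static int high_at(const int *high, size_t nh, ptrdiff_t i)
{
    if (i < 0) i = 0;
    if ((size_t)i >= nh) i = (ptrdiff_t)nh - 1;
    return high[i];
}

/* Forward 5-3: split `in` (n samples) into (n+1)/2 low and n/2 high values. */
void dwt53_1d(const int *in, int *low, int *high, size_t n)
{
    size_t nl = (n + 1) / 2, nh = n / 2, i;
    if (n == 0) return;
    if (n == 1) { low[0] = in[0]; return; }
    /* Predict: odd samples minus the floor-average of their even neighbours. */
    for (i = 0; i < nh; i++) {
        int left = in[2 * i];
        int right = (2 * i + 2 < n) ? in[2 * i + 2] : left;
        high[i] = in[2 * i + 1] - floor_div(left + right, 2);
    }
    /* Update: even samples plus a rounded quarter of neighbouring highs. */
    for (i = 0; i < nl; i++) {
        int d0 = high_at(high, nh, (ptrdiff_t)i - 1);
        int d1 = high_at(high, nh, (ptrdiff_t)i);
        low[i] = in[2 * i] + floor_div(d0 + d1 + 2, 4);
    }
}

/* Inverse 5-3: undo the two lifting steps in reverse order. */
void idwt53_1d(const int *low, const int *high, int *out, size_t n)
{
    size_t nl = (n + 1) / 2, nh = n / 2, i;
    if (n == 0) return;
    if (n == 1) { out[0] = low[0]; return; }
    /* Undo update: recover the even samples. */
    for (i = 0; i < nl; i++) {
        int d0 = high_at(high, nh, (ptrdiff_t)i - 1);
        int d1 = high_at(high, nh, (ptrdiff_t)i);
        out[2 * i] = low[i] - floor_div(d0 + d1 + 2, 4);
    }
    /* Undo predict: recover the odd samples from the even ones. */
    for (i = 0; i < nh; i++) {
        int left = out[2 * i];
        int right = (2 * i + 2 < n) ? out[2 * i + 2] : left;
        out[2 * i + 1] = high[i] + floor_div(left + right, 2);
    }
}
```

Both inverse loops have no dependency between iterations, which is the property that makes this kind of transform a natural fit for SIMD.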

For a large lossless file I have, it takes bin/opj_decompress -i balloon-reversible.jp2 -o test.ppm -threads 0 from printing "decode time: 588 ms" to "decode time: 559 ms", around a 5% speedup for decoding. It's not nothing, but much less than the other PR.

I created that input file with bin/opj_compress -i image.ppm -o balloon-reversible.jp2 -M 1, where image.ppm is the decompressed balloon.jp2 (the usual jp2 balloon test file).


nico commented Apr 24, 2026

Contributor Author

For this one, I also verified that ctest --parallel has the same 50 failures as without the PR, and I checked that dwt.c compiles fine with --target=armv7a-linux-gnueabihf --sysroot ~/src/chrome/src/build/linux/debian_bullseye_armhf-sysroot (to test 32-bit arm).

@rouault rouault merged commit 530bebd into uclouvain:master Apr 24, 2026
14 checks passed
@nico nico deleted the neon-reversible branch April 25, 2026 14:31

2 participants