
Remove excessive floating-point divides #4312

Open · wants to merge 3 commits into main

Conversation

@heshpdx (Contributor) commented Sep 3, 2024

Hoist the loop-invariant divide outside the hot loops, and/or invert the divisor to turn FDIV into FMUL.

Most CPUs are slower at FP division compared to FP multiplication. This should provide some uplift in performance. I was testing with the integer models.

Files with resolved review threads:
src/textord/pithsync.cpp
src/lstm/networkio.cpp
@stweil (Contributor) commented Sep 3, 2024

Do you have timing values from your tests?

@heshpdx (Author) commented Sep 3, 2024

> Do you have timing values from your tests?

It will be CPU specific, but I see +10% on my Ampere Altra.

@stweil commented Sep 3, 2024

> It will be CPU specific, but I see +10% on my Ampere Altra.

That's a very significant improvement! I wonder how this ARM64 CPU compares to Intel / AMD CPUs for Tesseract recognition and training.

@heshpdx commented Sep 3, 2024

If there are standard tests that you run, please do share the results.

I was using -l deu --tessdata-dir ./tessdata_orig --oem 0 and -l Arabic --tessdata-dir ./tessdata_fast. The -l rus --tessdata-dir ./tessdata_orig --oem 2 run did not show much improvement.

@stweil commented Sep 3, 2024

Does Ampere Altra offer additional opcodes which could be used to make Tesseract's neural network code faster? We currently use Neon code for ARM64 (see src/arch/*neon.cpp).

@stweil commented Sep 3, 2024

> If there are standard tests that you run, please do share the results.

You can run make check (after installing the required packages, repositories and git submodules) and compare the times for the individual tests. lstm_test is the test with the longest execution time.

Here are my results on a Mac mini M2 for running time ./lstm_test:

# git main branch, extract from lstm_test.log and log message from `time`.
[==========] 11 tests from 1 test suite ran. (278833 ms total)
./lstm_test  274,78s user 2,87s system 99% cpu 4:38,88 total

# git main branch with PR applied, extract from lstm_test.log and log message from `time`.
[==========] 11 tests from 1 test suite ran. (276981 ms total)
./lstm_test  273,60s user 2,50s system 99% cpu 4:37,03 total
