Run LSTM recognition in multiple threads #4275

jkarthic · 2024-06-27T13:12:55Z

Init time option lstm_num_threads should be used to set the number of LSTM threads. This will ensure that word recognition can run independently in multiple threads, thus effectively utilizing multi-core processors.

Following are my test results for a sample screenshot.
CPU : Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz
OS : WIndows
Compiler : MSVC 19.38.33130.0 (Installed from Visual Studio 2022)
Model: eng.traineddata from tessfast
PSM: 6

Total time taken for Recognize API call, Built without OpenMP
With lstm_num_threads=1, total time taken = 3.95 seconds
With lstm_num_threads=4, total time taken = 1.4 seconds

On the other hand, here are the numbers with OpenMP
OMP_THREAD_LIMIT not set, total time taken = 3.59 seconds
OMP_THREAD_LIMIT=4, total time taken = 3.57 seconds
OMP_THREAD_LIMIT=1, total time taken = 4.19 seconds

As we can observe, this branch with lstm_num_threads set as 4, performs way better than the openmp multithreading supported currently. Setting lstm_num_threads equal to the number of cores in the processor will give the best performance.

Init time option lstm_num_threads should be used to set the number of LSTM threads

stweil · 2024-06-27T13:23:57Z

src/ccmain/tesseractclass.cpp

@@ -477,7 +481,10 @@ Tesseract::~Tesseract() {
  for (auto *lang : sub_langs_) {
    delete lang;
  }
-  delete lstm_recognizer_;
+  for (int i = 0; i < lstm_recognizers_.size(); ++i) {


i should use the same data type as the return value of size().

for (auto &&r : lstm_recognizers_) delete r;

Thanks for your review comments. I have addressed this now.

egorpugin · 2024-06-27T20:39:47Z

src/ccmain/tessedit.cpp

-      lstm_recognizer_ = new LSTMRecognizer(language_data_path_prefix.c_str());
-      ASSERT_HOST(lstm_recognizer_->Load(this->params(), lstm_use_matrix ? language : "", mgr));
+      for (int i = 0; i < lstm_num_threads; ++i) {
+        lstm_recognizers_.push_back(new LSTMRecognizer(language_data_path_prefix.c_str()));


Fine for me.
But maybe it is possible to switch to unique_ptrs here?

Upd.:
Since this is new code std::vector<LSTMRecognizer *> lstm_recognizers_;, could you try to use uniq ptrs here?

I was just following the existing design pattern of tesseract of using LSTMRecognizer * and performing new and delete for allocation and deallocation.
As per the C++ best practice, we should ideally use shared_ptrs everywhere for accessing LSTMRecognizer, to avoid any dangling ptrs. But that will involve a lot of refactoring, deviating away from the main purpose of this PR. Hence I just followed the existing tesseract practice of using LSTMRecognizer *.

egorpugin · 2024-06-27T20:42:37Z

src/ccmain/control.cpp

+  auto segment_start = words->begin() + segment_size;
+  for (int i = 1; i < lstm_num_threads; ++i) {
+      auto segment_end = (i == lstm_num_threads - 1) ? words->end() : segment_start + segment_size;
+      futures.push_back(std::async(std::launch::async, &Tesseract::RecogWordsSegment,


What about static thread pool here?
If code is called multiple times, how is std::async performance compared to static thread pool?
Considering async will create a new thread each time.

Good question.
As stweil has clarified in his latest comment, this multiple lstm threads is not meant for mass production. This feature is probably meant for consumer-end devices running a single page OCR once in a while, but with least possible latency. In such cases thread creation/deletion is not a big overhead. But if we really come across use-cases where thread creation/deletion overhead becomes significant, we could look at replacing this with a thread pool at that point.

The default case (lstm_num_threads == 1) must not create a new thread.

yes, the default case (lstm_num_threads == 1) does not create a new thread. As you see, the loop starts with i = 1, and hence does not execute when lstm_num_threads == 1.

stweil · 2024-06-27T21:13:29Z

Many thanks for this nice contribution.

With this pull request users have the choice of using the new argument --lstm-num-threads N or setting the new parameter with -c lstm-num-threads=N. Do we need both ways? If a command line argument is desired (like in the case of --dpi), I think that there might be more user friendly variants. Although --lstm-num-thread describes the technical implementation correctly, it is a lengthy argument which maybe requires too much explanation. Do we expect more --xxx-num-thread arguments in the future? Or would --threads be sufficient?

Maybe we could also extend the command line syntax to have --PARAMETER VALUE as an alternative for -c PARAMETER=VALUE for any Tesseract parameter.

stweil · 2024-06-27T21:17:28Z

Setting lstm_num_threads equal to the number of cores in the processor will give the best performance.

Just to clarify this statement: it's only true for the OCR of a single page. For mass production it is still better to run (number of cores) parallel Tesseract processes because then all processing steps use 100 % of the available resources.

jkarthic · 2024-06-28T05:43:24Z

Many thanks for this nice contribution.

And many thanks to you for reviewing this patiently.

With this pull request users have the choice of using the new argument --lstm-num-threads N or setting the new parameter with -c lstm-num-threads=N. Do we need both ways?

This lstm_num_threads is a init time parameter. The LSTMRecognizer instances are created during init. Setting this new parameter with -c lstm-num-threads=N will not work, as it is setting the variable after the init is done.

Although --lstm-num-thread describes the technical implementation correctly, it is a lengthy argument which maybe requires too much explanation. Do we expect more --xxx-num-thread arguments in the future? Or would --threads be sufficient?

When I tested tesseract with a psm of 3(which is the default for tesseract.exe), page segmentation was taking significantly more time than the actual LSTM recognition. For example, in one of my tests, page segmentation was taking ~7 seconds, and lstm was taking ~3 seconds, taking the total to ~10 seconds. Users running with default psm parameter should not expect that the entire 10 seconds will be run in multiple threads. In this case, the major part of ~7 seconds will run single threaded and only a minor part of ~3 seconds will be multi threaded. Hence I thought adding a longer name is setting the user expectation right, that only a portion of tesseract will be running multithreaded.
Also there are other numthreads variables related to OpenMP, inside the code which were named generically such as kNumThreads, __num_threads and num_threads. Naming this as lstm_num_threads also differentiates this as a seperate variable, not to be confused with OpenMP num threads.

jkarthic · 2024-06-28T05:44:59Z

Setting lstm_num_threads equal to the number of cores in the processor will give the best performance.

Just to clarify this statement: it's only true for the OCR of a single page. For mass production it is still better to run (number of cores) parallel Tesseract processes because then all processing steps use 100 % of the available resources.

Totally agreed. This is meant for latency-sensitive real-time applications, with ocr probably running in the consumer's device itself.

Changed the WERD_RES linked link to use shared pointers instead of raw pointers. This is needed so that even if one thread deletes a WERD_RES object, other thread's which needs to iterate thru them can still access it safely. In terms of LSTM processing, only one threads processes one WERD_RES. This change is needed as all the threads can iterate thru due to single linked list data structure.

jkarthic · 2024-07-05T14:56:20Z

@stweil
I observed a crash issue in the earlier code due to WERD_RES objects freed by one thread was used by another thread for iterating thru the WERD_RES singly linked list.
To fix the above above issue, I have modified WERD_RES linked list to use shared pointer instead of raw pointers, so that lifetime of the objects are managed automatically. I have also added mutex protections around the PAGE_RES_IT functions that modify this list in order to avoid race conditions.
Please take a look at the modifications whenever you get some time for this.

egorpugin

Now it is much much worse.

Remove std::async.

egorpugin · 2024-07-05T15:40:18Z

I suggest to use previous version as base.

jkarthic · 2024-07-05T16:17:01Z

Now it is much much worse.

@egorpugin I am not sure, if I understand your comment here. Could you please elaborate what is "much much worse"?

egorpugin · 2024-07-05T16:41:17Z

@egorpugin I am not sure, if I understand your comment here. Could you please elaborate what is "much much worse"?

More complex code.
A lot of sync.
Much harder to review.
Most likely a 'no go' in current state.

You need to provide a very detailed description of:

algorithm. How it works? Is it possible to sync less?
changes in files. I see new types, mutex locks in some existing functions. See example how this can be described from gcc commit messages, e.g. gcc-mirror/gcc@5185274 (bottom part of the commit message)

burningfireplace · 2024-09-25T12:18:47Z

I tested this PR on my Mac (M1 chip). I have a few observations to share:

The latest commit (d1eed6a) does not compile successfully on my system. I encountered multiple errors related to ELIST_ITERATOR_T, BLOBNBOX_IT, and other classes. It seems some changes may be missing or incomplete in this commit.
The initial commit (6a2e239) does compile and produces a working binary. Using the new --lstm-num-threads option, I was able to achieve a ~2x speedup (compared to 5.4.1 + OpenMP) by finding the optimal N value for my system.
However, I noticed an issue with language/script recognition when using multiple languages. For example:
- Using -l "eng+ukr" tries to recognize Cyrillic chars as Latin chars
- Using -l "ukr+eng" vice-versa: incorrectly recognizes Cyrillic in places with Latin chars

This behavior differs from Tesseract 5.4.1, which produces correct output for the same inputs in both cases.

I hope this feedback is helpful.

jkarthic · 2024-09-25T15:32:46Z

@burningfireplace

Thanks for trying it out and providing a detailed analysis. Here is my reply.

I was using windows while developing this. Never tested this on Mac, as mac had apple's inbuilt OCR Vision API, which in my experience performs better than tesseract. But that's no reason for breaking the build on Mac. I will try to fix the compile issues on Mac, when I get some time.
Nice to know, this is improving speed on Mac as well.
The initial commit had a few race condition bugs that might be causing this. Once compile issues are resolved, I would encourage you to try the latest commit once again. Ideally the latest commit should provide identical output as the single threaded version.

Run LSTM recognition in multiple threads

6a2e239

Init time option lstm_num_threads should be used to set the number of LSTM threads

stweil reviewed Jun 27, 2024

View reviewed changes

Addressed review comments while deleting lstm recognizers vector

b7d6739

jkarthic requested review from stweil and egorpugin June 27, 2024 16:38

egorpugin reviewed Jun 27, 2024

View reviewed changes

egorpugin approved these changes Jun 27, 2024

View reviewed changes

egorpugin reviewed Jun 27, 2024

View reviewed changes

egorpugin requested changes Jul 5, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Run LSTM recognition in multiple threads #4275

Run LSTM recognition in multiple threads #4275

jkarthic commented Jun 27, 2024

stweil Jun 27, 2024

egorpugin Jun 27, 2024

jkarthic Jun 27, 2024

egorpugin Jun 27, 2024 •

edited

Loading

jkarthic Jun 28, 2024

egorpugin Jun 27, 2024 •

edited

Loading

jkarthic Jun 28, 2024

stweil Jun 28, 2024 •

edited

Loading

jkarthic Jun 28, 2024

stweil commented Jun 27, 2024

stweil commented Jun 27, 2024

jkarthic commented Jun 28, 2024

jkarthic commented Jun 28, 2024

jkarthic commented Jul 5, 2024

egorpugin left a comment

egorpugin commented Jul 5, 2024

jkarthic commented Jul 5, 2024

egorpugin commented Jul 5, 2024 •

edited

Loading

burningfireplace commented Sep 25, 2024

jkarthic commented Sep 25, 2024

Run LSTM recognition in multiple threads #4275

Are you sure you want to change the base?

Run LSTM recognition in multiple threads #4275

Conversation

jkarthic commented Jun 27, 2024

stweil Jun 27, 2024

Choose a reason for hiding this comment

egorpugin Jun 27, 2024

Choose a reason for hiding this comment

jkarthic Jun 27, 2024

Choose a reason for hiding this comment

egorpugin Jun 27, 2024 • edited Loading

Choose a reason for hiding this comment

jkarthic Jun 28, 2024

Choose a reason for hiding this comment

egorpugin Jun 27, 2024 • edited Loading

Choose a reason for hiding this comment

jkarthic Jun 28, 2024

Choose a reason for hiding this comment

stweil Jun 28, 2024 • edited Loading

Choose a reason for hiding this comment

jkarthic Jun 28, 2024

Choose a reason for hiding this comment

stweil commented Jun 27, 2024

stweil commented Jun 27, 2024

jkarthic commented Jun 28, 2024

jkarthic commented Jun 28, 2024

jkarthic commented Jul 5, 2024

egorpugin left a comment

Choose a reason for hiding this comment

egorpugin commented Jul 5, 2024

jkarthic commented Jul 5, 2024

egorpugin commented Jul 5, 2024 • edited Loading

burningfireplace commented Sep 25, 2024

jkarthic commented Sep 25, 2024

egorpugin Jun 27, 2024 •

edited

Loading

egorpugin Jun 27, 2024 •

edited

Loading

stweil Jun 28, 2024 •

edited

Loading

egorpugin commented Jul 5, 2024 •

edited

Loading