require an update of japanese trained data #119

horohoro · 2019-04-05T03:06:28Z

I ran the OCR on an image containing ¥ symbols and the engine failed to recognize any of them. It usually rendered them as "\ く".
The ¥ symbols were the backslashes of a UNC path (on Japanese Windows, every \ is displayed as ¥).

Also, all the digits (except 0) were converted to their circled versions.
For example:
1 -> ①
2 -> ②
3 -> ③
...
Japanese does use circled numbers sometimes (more often than the rest of the world), but not that often. Digits should still be output in their normal form.

I think these issues come from the training data, which apparently did not include ¥ and whose expected output used circled numbers where the input had plain digits.

I am new to ML and do not know whether, or how, I can extract the source data from a traineddata file.
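Until the model itself is fixed, the circled-digit substitution described above could be undone after OCR. The following is only a minimal sketch of such a workaround, not something taken from Tesseract: the helper name `uncircle_digits` is hypothetical, and it assumes that no circled number in the OCR output is genuine, so a number that really was circled in the image gets flattened too.

```python
# Map circled numbers ① (U+2460) .. ⑳ (U+2473) back to plain decimal numbers.
# Only these 20 code points are touched; all other text, including ¥, is left as-is.
CIRCLED_TO_PLAIN = {0x2460 + i: str(i + 1) for i in range(20)}


def uncircle_digits(text: str) -> str:
    """Replace circled numbers in OCR output with ordinary digits (hypothetical helper)."""
    return text.translate(CIRCLED_TO_PLAIN)


if __name__ == "__main__":
    # Example OCR output containing circled digits
    print(uncircle_digits("①②③"))
```

A targeted table is used here instead of a general Unicode normalization such as NFKC, because NFKC would also rewrite other characters common in Japanese text (for example half-width katakana).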

amitdo · 2019-04-05T04:42:29Z

You can try the Japanese.traineddata from the best or fast repo.

gitvophu · 2020-04-09T11:20:10Z

> You can try the Japanese.traineddata from the best or fast repo.

Do you mean jpn.traineddata?

amitdo · 2020-04-09T14:19:07Z

https://github.com/tesseract-ocr/tessdata_fast/tree/master/script

https://github.com/tesseract-ocr/tessdata_best/tree/master/script
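For illustration, here is a rough sketch of how the script model might be selected on the command line once it has been downloaded. It assumes that Japanese.traineddata from one of the repositories above has been saved directly into a local directory, and that the `tesseract` executable is on the PATH. The function name and the paths `scan.png`, `result`, and `./tessdata` are hypothetical placeholders.

```python
import subprocess


def build_tesseract_command(image, output_base, tessdata_dir, lang="Japanese"):
    """Build a tesseract CLI call that uses the script model (hypothetical helper).

    Assumes <tessdata_dir>/<lang>.traineddata exists, e.g. Japanese.traineddata.
    """
    return [
        "tesseract",
        str(image),            # input image
        str(output_base),      # output file name without extension
        "--tessdata-dir", str(tessdata_dir),
        "-l", lang,            # "Japanese" (script model) instead of "jpn"
    ]


if __name__ == "__main__":
    cmd = build_tesseract_command("scan.png", "result", "./tessdata")
    print(" ".join(cmd))
    # Uncomment to actually run OCR (requires tesseract and the model file):
    # subprocess.run(cmd, check=True)
```

The only difference from a run with the language model is the name passed to `-l`: `Japanese` refers to the script model, `jpn` to the language model.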


wallace11 mentioned this issue Jul 31, 2021

Issue with Japanese and numbers (digits) eikek/docspell#973

Closed

eikek added a commit to eikek/docspell that referenced this issue Aug 13, 2021

Use different japanese train files for tesseract

326cf1c

They seem to work better as suggested here: tesseract-ocr/tessdata#119 Refs: #973

eikek mentioned this issue Aug 13, 2021

Use different japanese train files for tesseract eikek/docspell#1005

Merged

amitdo mentioned this issue Nov 17, 2021

Numbers in Japanese use the wrong Unicode characters in output tesseract-ocr/tesseract#3646

Closed
