Modern Greek data issues #160
Comments
https://github.com/tesseract-ocr/langdata_lstm/tree/main/ell contains training text and a word list with the same issues, so the model was trained to produce such results.
Right, so how can this be fixed? For example, https://github.com/tesseract-ocr/langdata_lstm/blob/main/ell/desired_characters and https://github.com/tesseract-ocr/langdata_lstm/blob/main/ell/ell.unicharset contain polytonic characters that should not be there.
In a first step you could send a pull request for
OK, I may need some guidance, please. I created a fork. So do I simply have to remove the invalid characters from the above-mentioned files? I also see
I am not sure whether this line should be there going forward.
Remove or replace, whichever fits better.
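For the "replace" option, a minimal sketch of one possible approach follows. It is an illustration only, not part of the Tesseract tooling: the function name `to_monotonic` and the mark tables are hypothetical, and the mapping rules (drop breathings and iota subscript, turn grave and perispomeni into the acute used for the tonos) are an assumption about what a sensible monotonic replacement looks like. It only touches marks that follow a Greek base letter, so accented Latin text is left alone.

```python
import unicodedata

MICRO_SIGN = "\u00b5"       # µ, MICRO SIGN
GREEK_SMALL_MU = "\u03bc"   # μ, GREEK SMALL LETTER MU
ACUTE = "\u0301"            # combining acute, used for the monotonic tonos

# Assumed rules: marks to drop and marks to turn into an acute accent.
DROP_MARKS = {"\u0313", "\u0314", "\u0345", "\u0304", "\u0306"}  # psili, dasia, ypogegrammeni, macron, breve
TO_ACUTE_MARKS = {"\u0300", "\u0342"}                            # varia (grave), perispomeni


def to_monotonic(text: str) -> str:
    """Replace the micro sign with Greek mu and simplify polytonic marks."""
    text = text.replace(MICRO_SIGN, GREEK_SMALL_MU)
    out = []
    base_is_greek = False
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) == 0:
            # A base character: remember whether it is in the Greek block.
            base_is_greek = 0x0370 <= ord(ch) <= 0x03FF
            out.append(ch)
        elif base_is_greek and ch in DROP_MARKS:
            continue
        elif base_is_greek and ch in TO_ACUTE_MARKS:
            out.append(ACUTE)
        else:
            out.append(ch)
    # Recompose so accented letters end up as single code points again.
    return unicodedata.normalize("NFC", "".join(out))
```

The result is probably not orthographically perfect. For instance, monotonic spelling generally leaves monosyllables unaccented, which this sketch does not handle, and a µ that really is a unit prefix would also be rewritten. Reviewing the diff by hand before sending a pull request would still be needed.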
Thanks. If I replace, I need to know about the structure, for example,
How is the
You can leave the unicharset file unmodified. A replacement will be created when a new training run is done.
That line tells Tesseract to always use
I am not sure, then, which files I should change. I don't think I have the knowledge to do any training (I also use Windows).
So this line should be removed.
There are two major issues with the Greek data.
The models tend to produce µ (micro sign, U+00B5) instead of μ (Greek small letter mu, U+03BC), and, despite choosing Modern Greek (ell), some characters in the output carry accents that belong to polytonic Greek.
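To check a text for the two problems described above, something like the following sketch could be used. It is only an illustration: the helper names are hypothetical, and it assumes that "polytonic" can be approximated by the Unicode Greek Extended block (U+1F00 to U+1FFF), which will also flag the oxia variants that are canonically equivalent to monotonic letters.

```python
import unicodedata
from collections import Counter

MICRO_SIGN = "\u00b5"  # µ, MICRO SIGN (not the Greek letter μ, U+03BC)


def is_greek_extended(ch: str) -> bool:
    """True for code points in the Greek Extended block (assumed polytonic)."""
    return 0x1F00 <= ord(ch) <= 0x1FFF


def find_suspect_characters(text: str) -> Counter:
    """Count micro signs and Greek Extended characters in the text."""
    return Counter(
        ch for ch in text if ch == MICRO_SIGN or is_greek_extended(ch)
    )


def describe(counts: Counter) -> list:
    """Format each suspect character as 'U+XXXX NAME xCOUNT'."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')} x{n}"
        for ch, n in sorted(counts.items())
    ]
```

Running `describe(find_suspect_characters(text))` on the contents of a file such as the training text or the word list would give a list of the characters worth reviewing.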