Enable specification of a config file, and generate hocr output if option set #92
Guys,
Sorry it took so long to get to this. I've opened it as a new pull request because it is substantially different from the prior one (#81). Please note the most important thing: this change adds a dependency on nokogiri.
First, I took the suggestion to make the command line option one that sets a config file. If a config file is set, we look in the file itself for the variable that enables hocr generation. One thing to note is that tesseract allows you to use "+[config]", in which case it looks for the config file in a well-known location. I decided to just require the full, explicit path when this option is specified, to avoid install issues.
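A minimal sketch of the config check described above, not the code in this pull request. It assumes the config file uses tesseract's usual "name value" line format and that the relevant variable is `tessedit_create_hocr`; the helper name `hocr_enabled?` is hypothetical, and only the value `1` is treated as "on" here.

```ruby
# Hypothetical helper (not part of this PR): returns true when the config
# file at config_path contains a line setting tessedit_create_hocr to 1.
def hocr_enabled?(config_path)
  File.foreach(config_path).any? do |line|
    stripped = line.strip
    # Skip blank lines and comment lines.
    next false if stripped.empty? || stripped.start_with?("#")

    # Each remaining line is assumed to be "<variable> <value>".
    name, value = stripped.split(/\s+/, 2)
    name == "tessedit_create_hocr" && value.to_s.strip == "1"
  end
end
```

Because the full path is required, a check like this can read the file directly without knowing where tesseract keeps its bundled configs.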
When hocr generation is on, tesseract appears to produce only HTML output, not text. So I've added some routines to generate a text file as well. Additionally, the original hocr output is annotated so that the word tags have two new data attributes, "data-start" and "data-stop", which are the character start / stop positions of that word in the generated text.
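To illustrate the start / stop bookkeeping, here is a rough sketch in plain Ruby that leaves out the nokogiri parsing. It assumes words are joined with a single space and that the stop position is exclusive; the actual convention in this change may differ, and the names `word_offsets` and `plain_text` are hypothetical.

```ruby
# Hypothetical sketch: given the words in reading order, compute the
# character offsets each word would occupy in the generated text.
# Assumes a single space between words and an exclusive stop position.
def word_offsets(words)
  position = 0
  words.map do |word|
    start = position
    stop = start + word.length
    position = stop + 1 # account for the separating space
    { word: word, start: start, stop: stop }
  end
end

# The plain text that the offsets above index into.
def plain_text(words)
  words.join(" ")
end
```

With these assumptions, `plain_text(words)[start...stop]` gives back the original word for every entry, which is the property the "data-start" and "data-stop" attributes are meant to provide.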
Lastly, note that the cleaning is a little simplified. I run the text from the xml node through the method that checks for a garbage word, but I didn't do anything fancy looking for too much whitespace. I figured this was good enough for now.