Enable specification of a config file, and generate hocr output if option set #92

Open
wants to merge 1 commit into
base: master

jhosteny

Guys,

Sorry it took so long to get to this. I've opened it as a new pull request because it is substantially different from the prior one (#81). The most important thing to note: this change adds a dependency on nokogiri.

First, I took the suggestion to make the command-line option specify a config file. If a config file is set, we look in the file itself for the variable that enables hocr generation. Note that tesseract lets you use "+[config]" and will then look for the config file in a well-known location. I decided to require an explicit full path when this option is specified, to avoid install issues.
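As a rough illustration of the config-file check described above, here is a minimal sketch. The helper name `hocr_enabled?` is hypothetical and not taken from this pull request. It assumes the config file uses one "name value" pair per line and that `tessedit_create_hocr 1` is the setting that turns hocr on.

```ruby
# Hypothetical helper (not the code in this pull request): returns true when
# the given Tesseract config file sets tessedit_create_hocr to 1.
# Assumes one whitespace-separated "name value" pair per line.
def hocr_enabled?(config_path)
  File.foreach(config_path) do |line|
    name, value = line.strip.split(/\s+/, 2)
    next unless name == 'tessedit_create_hocr'
    # The first matching line decides the result.
    return value.to_s.strip == '1'
  end
  false
end
```

The caller would pass the explicit full path given on the command line, so no lookup in tesseract's well-known config location is attempted.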

When hocr generation is on, tesseract appears to produce only HTML output, not text, so I've added some routines to generate a text file as well. Additionally, the original hocr output is annotated so that the word tags have two new data attributes, "data-start" and "data-stop", which are the character start and stop positions of that word.
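The text generation and back annotation could look roughly like the sketch below. It is not the code in this pull request: it uses REXML from the Ruby distribution instead of nokogiri so that it runs without extra gems, and the names `inner_text` and `annotate_hocr` are hypothetical. It also assumes that words carry the class `ocrx_word`, that words are joined by a single space in the text output, and that "data-stop" is exclusive (one past the last character).

```ruby
require 'rexml/document'

# Collects the text of a node and all of its descendants, so nested
# markup such as <strong> inside a word span is handled.
def inner_text(node)
  node.children.map do |child|
    case child
    when REXML::Text then child.value
    when REXML::Element then inner_text(child)
    else ''
    end
  end.join
end

# Hypothetical helper: builds plain text from hocr word spans and records
# each word's character offsets on its span as data attributes.
# Returns [text, annotated_document].
def annotate_hocr(hocr_xml)
  doc = REXML::Document.new(hocr_xml)
  text = String.new
  doc.each_recursive do |el|
    next unless el.name == 'span'
    next unless el.attributes['class'].to_s.split.include?('ocrx_word')
    word = inner_text(el).strip
    next if word.empty?
    text << ' ' unless text.empty?          # assumed single-space separator
    el.add_attribute('data-start', text.length.to_s)
    text << word
    el.add_attribute('data-stop', text.length.to_s) # assumed exclusive end
  end
  [text, doc]
end
```

With these assumptions, `text[start...stop]` gives back the word for any annotated span, which is what makes the offsets useful for mapping text positions to bounding boxes.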

Lastly, note that the cleaning is somewhat simplified. I run the text from each xml node through the method that checks for a garbage word, but I don't do anything fancy to look for excessive whitespace. I figured this was good enough for now.
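To make the simplified cleaning concrete, here is a minimal sketch of a per-word filter. The heuristic below is a hypothetical stand-in, not the project's actual garbage-word method: it assumes a word is garbage when it is very long or when fewer than half of its characters are ASCII letters or digits. The names `garbage_word?` and `clean_words` and the length limit are illustrative only.

```ruby
# Assumed limit for illustration; not taken from the project.
MAX_WORD_LENGTH = 40

# Hypothetical stand-in for the garbage-word check.
def garbage_word?(word)
  return true if word.length > MAX_WORD_LENGTH
  alnum = word.count('a-zA-Z0-9')
  alnum * 2 < word.length
end

# Drops garbage words one at a time; no whitespace analysis is done,
# matching the simplified cleaning described above.
def clean_words(words)
  words.reject { |word| garbage_word?(word) }
end
```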

…f hocr output is enabled. If so, we also generate the text file from the hocr output, and back annotate the hocr output with word positions in HTML data attributes.
@knowtheory
Member

bump for the PDF Liberation Hackathon

@tpendragon

What's the progress on this? Generating hOCR would be really useful to me.

@CaseKey

CaseKey commented May 17, 2015

I could really use this as well. Using the outdated LegalSifter hocr branch gives errors regarding OpenOffice in the production version.

@be42day

be42day commented Jun 22, 2020

Use Tesseract OCR directly. Type this in cmd:
tesseract "image_file" "hocr_file" -c tessedit_create_hocr=1
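For anyone calling this from Ruby, the same command can be built as an argument list, which avoids quoting problems with paths that contain spaces. This is only a sketch: the helper name `hocr_command` and the example paths are hypothetical, and it assumes `tesseract` is on the PATH.

```ruby
require 'shellwords'

# Hypothetical helper: builds the argument list for the command above.
def hocr_command(image_path, output_base)
  ['tesseract', image_path, output_base, '-c', 'tessedit_create_hocr=1']
end

cmd = hocr_command('scan page.png', 'out/page')
# Print a shell-escaped form for logging.
puts Shellwords.join(cmd)
# To actually run it (requires tesseract to be installed):
# system(*cmd)
```

Passing the array to `system(*cmd)` bypasses the shell, so no manual quoting of the file names is needed.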
