Add option --naming for different line name patterns #184

Open
wants to merge 3 commits into
base: master
from

Conversation

zuphilip
Collaborator

zuphilip commented Feb 11, 2017

Lines are currently named with a 6-digit hexadecimal number whose first two digits are always 01, starting from 010001.bin.png. However, hexadecimal names sometimes sort differently than expected and do not match a simple regexp such as `\d+`.
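To illustrate the regexp point, here is a small sketch. It assumes the current default pattern is `01%04x` (as mentioned later in this thread); the helper name `hex_line_name` and the sample range are made up for the example.

```python
import re

def hex_line_name(i):
    # Assumed current default: "01" followed by four hexadecimal digits.
    return "01%04x" % i

# Names for the first 16 lines of a page.
names = [hex_line_name(i) for i in range(1, 17)]

# A "digits only" check, as a downstream script might use it.
digits_only = re.compile(r"^\d+$")
non_matching = [n for n in names if not digits_only.match(n)]

print(non_matching)
```

From the tenth line on, names such as 01000a contain letters and are no longer matched by the digits-only expression.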

This new option for ocropus-gpageseg makes the naming pattern configurable. With the default, args.naming="hex", the behavior is the same as it is now. The option --naming dec switches to a decimal pattern, which produces 4-digit decimal numbers starting from 0001.bin.png. Any other value is interpreted as a custom pattern, e.g. --naming ex%05d.
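The three cases could be resolved roughly as follows. This is only a sketch of the behavior described above, not the code from the commits; `resolve_naming` and `line_filename` are hypothetical helper names.

```python
def resolve_naming(naming="hex"):
    """Map the --naming value to a printf-style pattern (hypothetical helper)."""
    if naming == "hex":
        return "01%04x"   # current behavior: 010001, 010002, ...
    if naming == "dec":
        return "%04d"     # decimal: 0001, 0002, ...
    return naming         # anything else is taken as a custom pattern

def line_filename(naming, index, suffix=".bin.png"):
    """Build the file name for line number `index`."""
    return resolve_naming(naming) % index + suffix

print(line_filename("hex", 1))
print(line_filename("dec", 1))
print(line_filename("ex%05d", 12))
```

A custom value that is not a valid format string would raise a TypeError or ValueError at the `%` step, so a real implementation would probably want to validate it once up front.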

Note: The pseg format saves the page segmentation by labeling the lines with 6-digit hexadecimal numbers, which are then stored as numbers. Therefore, ocropus-hocr also needed a change. Are there other consequences of not using 6-digit hexadecimal numbers for naming the lines?

Any other reactions to this idea?

@wrznr

wrznr commented Apr 21, 2017

This is a very valuable enhancement. However, I would drop the predefined choices and rely on pure format strings only.

@zuphilip
Collaborator Author

@wrznr Do you mean that the naming option should accept only a user-defined pattern, dropping the hex and dec cases? In that case the parameter should be optional, and the hexadecimal pattern should remain the default, as it is now, so that tools built on ocropus do not break.

Then, these calls would be possible:

ocropus-gpageseg test.bin.png
# by default the pattern "01%04x" is used

ocropus-gpageseg test.bin.png --naming %04d
# pattern with 4 decimal digits
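With pure format strings, the option could be declared along these lines. This is a sketch only: the parser below is a stand-in for the real ocropus-gpageseg argument parser, and `build_parser` is a made-up name.

```python
import argparse

def build_parser():
    # Stand-in parser; only the arguments relevant to this discussion.
    parser = argparse.ArgumentParser(description="sketch of the --naming option")
    parser.add_argument("files", nargs="+")
    # Optional; the default keeps the current hexadecimal names.
    parser.add_argument("--naming", default="01%04x")
    return parser

default_args = build_parser().parse_args(["test.bin.png"])
custom_args = build_parser().parse_args(["test.bin.png", "--naming", "%04d"])

print(default_args.naming % 1)
print(custom_args.naming % 1)
```

Keeping the default equal to the existing pattern means that calls without --naming produce the same file names as before.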

@zuphilip
Collaborator Author

zuphilip commented May 8, 2017

There may be more places that need to account for file names of a different length:

root@zuphilip-VirtualBox:/etc/ocropy# grep -r \?\?\?\?\?\? *
ocrolib/common.py:        return sorted(glob.glob(args[0]+"/????/??????.png"))
ocropus-hocr:xhfiles = python.sum([glob.glob(d+"/??????.xheight") for d in dirs],[])
ocropus-hocr:    lfiles = python.sum([glob.glob(d+"/??????.bin.png") for d in dirs],[])
ocropus-visualize-results:            images = sorted(glob.glob("??????.bin.png"))
ocropus-visualize-results:        for fname in sorted(glob.glob(d+"/??????.txt")):
README.md:    ./ocropus-rpred -Q 4 -m models/fraktur.pyrnn.gz 'book/????/??????.bin.png'
run-coverage:python -m coverage run -p ocropus-rpred -n 'temp/????/??????.bin.png'
run-coverage:python -m coverage run -p ocropus-gtedit html temp/????/??????.bin.png -o temp-correction.html
run-test:ocropus-rpred -n 'temp/????/??????.bin.png'
run-test:ocropus-gtedit html temp/????/??????.bin.png -o temp-correction.html
run-test-ci:ocropus-gtedit html temp/????/??????.bin.png -o temp-correction.html
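The `??????` globs above match only names with exactly six characters before the suffix. One possible way to make such lookups independent of the name length is a looser `*` glob, as sketched below; `find_line_images` is a hypothetical helper, not an existing ocrolib function, and the file names are sample data.

```python
import glob
import os
import tempfile

def find_line_images(directory, suffix=".bin.png"):
    """Return line images in `directory`, whatever the length of their names."""
    return sorted(glob.glob(os.path.join(directory, "*" + suffix)))

with tempfile.TemporaryDirectory() as d:
    # Sample files using the hex, dec and a custom naming pattern.
    for name in ["010001.bin.png", "0001.bin.png", "ex00012.bin.png", "010001.txt"]:
        open(os.path.join(d, name), "w").close()

    fixed = sorted(glob.glob(os.path.join(d, "??????.bin.png")))
    loose = find_line_images(d)

    fixed_names = [os.path.basename(p) for p in fixed]
    loose_names = [os.path.basename(p) for p in loose]

print(fixed_names)
print(loose_names)
```

The looser glob is also less selective, so it would pick up any other file with the same suffix in the directory; whether that is acceptable depends on what else the tools write there.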

@zuphilip
Collaborator Author

@amitdo What do you want to point me to with these two links?

@amitdo
Contributor

amitdo commented Nov 29, 2017

At some point in history, Tom switched from %04d to %04x.

Just digging in ocropus history...

3 participants