OCR-D wrapper for prima-page-to-pdf
This package offers OCR-D compliant workspace processors for conversion of OCR data represented in METS (on the document level) and PAGE or ALTO (on the page level) to PDF.
It transforms both the scan image (facsimile) and annotations (text overlay), optionally drawing polygon outlines for text regions / lines / words / glyphs.
It optionally validates the structural annotation and repairs its coordinates before attempting conversion.
The text layer is generated from the textual annotation on the configured level of the structural hierarchy (region / line / word / glyph). It is rendered with a configurable font (which is useful to make sure all codepoints are covered by adequate glyphs, esp. in historic prints and manuscripts).
The page labels can be configured to use various attributes from the physical pages of the METS.
A table of contents will be added according to the labels of the recursive mets:div logical structure.
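Regarding the font coverage mentioned above, it can help to know which characters actually occur in your text before picking a font. The following is a minimal sketch using only the Python standard library; the function name `required_codepoints` and the sample string are illustrative and not part of this package. Comparing the result against a font's character map needs a font library, which is not shown here.

```python
import unicodedata

def required_codepoints(text):
    """Return the distinct non-whitespace characters of text with their Unicode names."""
    chars = sorted({c for c in text if not c.isspace()})
    return [(f"U+{ord(c):04X}", c, unicodedata.name(c, "<unnamed>")) for c in chars]

# Sample line with a long s, as is common in historic prints (illustrative only).
for codepoint, char, name in required_codepoints("Waſſer"):
    print(codepoint, char, name)
```

Any character in this listing that the chosen font lacks would be rendered without an adequate glyph in the text layer.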
This is the best option if you want to run the software in a container.
You need to have Docker installed. To pull the image:
docker pull ocrd/pagetopdf
To run with docker:
docker run -v path/to/workspaces:/data ocrd/pagetopdf ocrd-pagetopdf ...
This is the best option if you want to use the stable, released version.
After installing Python and Java, simply do:
pip install ocrd_pagetopdf
Use this option if you want to change the source code or install the latest, unpublished changes.
We strongly recommend using venv.
After installing make, assuming you are on a Debian/Ubuntu OS, you can do:
sudo make deps-ubuntu
Otherwise, simulate this step and install requirements with equivalent actions on your system:
make -n deps-ubuntu
...
Finally, to install the Python package, do:
make install
# or equivalently:
pip install .
The command-line interface ocrd-pagetopdf conforms to OCR-D processor specifications.
Assuming you have an OCR-D workspace in your current working directory, simply do:
ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -P textequiv_level word
This will run the processor and create one PDF file per page, each with a text layer based on word-level annotations.
In order to create an additional multipage file for the entire document, named merged.pdf, concatenating the single-page PDFs in physical order and with page labels and contents, do:
ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -P textequiv_level word -P multipage merged
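If you call the processor from a script, the command line above can be assembled programmatically. This is a minimal sketch, not part of the package: the helper name `build_pagetopdf_command` is hypothetical, and the fileGrp names are the same placeholders as in the examples above. It only relies on the documented `-I`, `-O` and `-P KEY VAL` options.

```python
import shlex

def build_pagetopdf_command(input_grp, output_grp, params, executable="ocrd-pagetopdf"):
    """Assemble an argv list with -I, -O and one "-P KEY VAL" pair per parameter."""
    argv = [executable, "-I", input_grp, "-O", output_grp]
    for key, value in params.items():
        # Booleans are written in JSON spelling, as in "-P negative2zero true".
        if isinstance(value, bool):
            value = "true" if value else "false"
        argv += ["-P", key, str(value)]
    return argv

cmd = build_pagetopdf_command(
    "PAGE-FILEGRP", "PDF-FILEGRP",  # placeholder fileGrp names
    {"textequiv_level": "word", "multipage": "merged"},
)
print(shlex.join(cmd))

# To actually run it from within a workspace directory
# (requires ocrd-pagetopdf to be installed and on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues, for example with font paths that contain spaces.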
In case your workspace does not contain fulltext in PAGE format, but ALTO, there is a dedicated processor CLI ocrd-altotopdf, with some limitations compared to the former:
- You need to manually select the fileGrp providing the images which match the annotation coordinates, passing it as the second input fileGrp. (The image references are required by PAGE, but not by ALTO.)
- The images are not generated on-the-fly according to all annotations (from existing AlternativeImage elements, or by cropping via coordinates into the higher-level image, and deskewing when applicable), and are not chosen via the image_feature_selector / image_feature_filter mechanism. Instead, only the original images can be used here.
- The annotations are not comprehensively tested for validity and consistency of coordinates and then repaired. Instead, only superficial checks and repairs can be applied (such as for negative coordinates).
Assuming you have a workspace representing a typical DFG-conforming METS, with FULLTEXT for ALTO and DEFAULT for the original images, do:
ocrd-altotopdf -I FULLTEXT,DEFAULT -O PDF-FILEGRP -P textequiv_level word -P multipage merged
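Since the image fileGrp has to be chosen by hand for ALTO input, it may be worth checking beforehand that every physical page has a file in both groups. The following is a rough sketch using only the standard library; the function name, the file IDs and the sample METS are made up for illustration. It only checks that files are present, not whether the image actually matches the annotation coordinates.

```python
import xml.etree.ElementTree as ET

METS = "{http://www.loc.gov/METS/}"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

def pair_alto_with_images(mets_xml, alto_grp="FULLTEXT", image_grp="DEFAULT"):
    """Map each physical page ID to (ALTO href, image href); None marks a missing file."""
    root = ET.fromstring(mets_xml)
    # Index all files by ID, remembering their fileGrp and location.
    files = {}
    for grp in root.iter(METS + "fileGrp"):
        for f in grp.findall(METS + "file"):
            flocat = f.find(METS + "FLocat")
            href = flocat.get(XLINK_HREF) if flocat is not None else None
            files[f.get("ID")] = (grp.get("USE"), href)
    pairs = {}
    for smap in root.iter(METS + "structMap"):
        if smap.get("TYPE") != "PHYSICAL":
            continue
        for div in smap.iter(METS + "div"):
            if div.get("TYPE") != "page":
                continue
            by_grp = {}
            for fptr in div.findall(METS + "fptr"):
                use, href = files.get(fptr.get("FILEID"), (None, None))
                by_grp[use] = href
            pairs[div.get("ID")] = (by_grp.get(alto_grp), by_grp.get(image_grp))
    return pairs

# Made-up sample: the second page has an image but no ALTO file.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="DEFAULT">
      <mets:file ID="IMG_0001"><mets:FLocat LOCTYPE="URL" xlink:href="DEFAULT/0001.jpg"/></mets:file>
      <mets:file ID="IMG_0002"><mets:FLocat LOCTYPE="URL" xlink:href="DEFAULT/0002.jpg"/></mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="FULLTEXT">
      <mets:file ID="ALTO_0001"><mets:FLocat LOCTYPE="URL" xlink:href="FULLTEXT/0001.xml"/></mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div TYPE="physSequence">
      <mets:div TYPE="page" ID="PHYS_0001"><mets:fptr FILEID="IMG_0001"/><mets:fptr FILEID="ALTO_0001"/></mets:div>
      <mets:div TYPE="page" ID="PHYS_0002"><mets:fptr FILEID="IMG_0002"/></mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""

for page_id, (alto, image) in pair_alto_with_images(SAMPLE).items():
    status = "ok" if alto and image else "incomplete"
    print(page_id, alto, image, status)
```

For a real workspace, you would read the METS from disk, e.g. `pair_alto_with_images(open("mets.xml", encoding="utf-8").read())`.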
For more options and explanations, see below.
OCR-D CLI
Usage: ocrd-pagetopdf [worker|server] [OPTIONS]

  Convert text and layout annotations from PAGE to PDF format (overlaying original image with text layer and polygon outlines)

  > Converts all pages of the document to PDF

  > For each page, open and deserialize PAGE input file and its
  > respective image. Then extract a derived image of the (cropped,
  > deskewed, binarized...) page, with features depending on
  > ``image_feature_selector`` (a comma-separated list of required image
  > features, cf. :py:func:`ocrd.workspace.Workspace.image_from_page`)
  > and ``image_feature_filter`` (a comma-separated list of forbidden
  > image features).

  > Next, generate a temporary PAGE output file for that very image
  > (adapting all coordinates if necessary). If ``negative2zero`` is
  > set, validate and repair invalid or inconsistent coordinates.

  > Convert the PAGE/image pair with PRImA PageToPdf, applying
  > - ``textequiv_level`` (i.e. `-text-source`) to retrieve a text layer, if set;
  > - ``outlines`` to draw boundary polygons, if set;
  > - ``font`` accordingly.

  > Copy the resulting PDF file to the output file group and reference
  > it in the METS.

  > Finally, if ``multipage`` is set, then concatenate all generated
  > files to a multi-page PDF file, setting ``pagelabels`` accordingly,
  > as well as PDF metadata and bookmarks. Reference it with
  > ``multipage`` as ID in the output file group, too. If
  > ``multipage_only`` is also set, then remove the single-page PDF
  > files afterwards.

Subcommands:
    worker      Start a processing worker rather than do local processing
    server      Start a processor server rather than do local processing

Options for processing:
  -m, --mets URL-PATH             URL or file path of METS to process [./mets.xml]
  -w, --working-dir PATH          Working directory of local workspace [dirname(URL-PATH)]
  -I, --input-file-grp USE        File group(s) used as input
  -O, --output-file-grp USE       File group(s) used as output
  -g, --page-id ID                Physical page ID(s) to process instead of full document []
  --overwrite                     Remove existing output pages/images
                                  (with "--page-id", remove only those).
                                  Short-hand for OCRD_EXISTING_OUTPUT=OVERWRITE
  --debug                         Abort on any errors with full stack trace.
                                  Short-hand for OCRD_MISSING_OUTPUT=ABORT
  --profile                       Enable profiling
  --profile-file PROF-PATH        Write cProfile stats to PROF-PATH. Implies "--profile"
  -p, --parameter JSON-PATH       Parameters, either verbatim JSON string
                                  or JSON file path
  -P, --param-override KEY VAL    Override a single JSON object key-value pair,
                                  taking precedence over --parameter
  -U, --mets-server-url URL       URL of a METS Server for parallel incremental access to METS
                                  If URL starts with http:// start an HTTP server there,
                                  otherwise URL is a path to an on-demand-created unix socket
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Override log level globally [INFO]
  --log-filename LOG-PATH         File to redirect stderr logging to (overriding ocrd_logging.conf).

Options for information:
  -C, --show-resource RESNAME     Dump the content of processor resource RESNAME
  -L, --list-resources            List names of processor resources
  -J, --dump-json                 Dump tool description as JSON
  -D, --dump-module-dir           Show the 'module' resource location path for this processor
  -h, --help                      Show this message
  -V, --version                   Show version

Parameters:
   "image_feature_selector" [string - ""]
    comma-separated list of required image features (e.g.
    binarized,despeckled,cropped,deskewed,rotated-90)
   "image_feature_filter" [string - ""]
    comma-separated list of forbidden image features (e.g.
    binarized,despeckled,cropped,deskewed,rotated-90)
   "font" [string - ""]
    Font file to be used in PDF file. If unset, AletheiaSans.ttf is
    used. (Make sure to pick a font which covers all glyphs!)
   "outlines" [string - ""]
    What segment hierarchy to draw coordinate outlines for. If unset, no
    outlines are drawn.
    Possible values: ["", "region", "line", "word", "glyph"]
   "textequiv_level" [string - ""]
    What segment hierarchy level to render text output from. If unset,
    no text is rendered.
    Possible values: ["", "region", "line", "word", "glyph"]
   "negative2zero" [boolean - false]
    Repair invalid or inconsistent coordinates before trying to convert.
   "ext" [string - ".pdf"]
    Output filename extension
   "multipage" [string - ""]
    Merge all PDFs into one multipage file. The value is used as METS
    file ID and file basename for the PDF.
   "multipage_only" [boolean - false]
    When producing a `multipage`, do not add single-page files into the
    output fileGrp (but use a temporary directory for them).
   "pagelabel" [string - "pageId"]
    Parameter for 'multipage': Set the labels used as page outlines.
    - 'pageId': physical page ID,
    - 'pagenumber': use consecutive numbers,
    - 'pagelabel': use '@ORDERLABEL - @LABEL',
    - 'basename': use the name of the input file,
    - 'local_filename': use the href relative path of the input file,
    - 'url': use the href URL of the input file,
    - 'ID': use the file ID of the input file
    Possible values: ["pagenumber", "pagelabel", "pageId", "basename",
    "basename_without_extension", "local_filename", "ID", "url"]
   "script-args" [string - ""]
    Extra arguments to PageToPdf (see https://github.com/PRImA-Research-Lab/prima-page-to-pdf)
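The parameters listed above can also be collected in a JSON object and passed via `-p`. Below is a small pre-flight check, offered as a sketch only: the allowed values are copied from the help text above, while the function name `check_parameters` and the rule that `multipage_only` needs `multipage` are assumptions of this example, not behaviour confirmed for the processor.

```python
import json

LEVELS = ["", "region", "line", "word", "glyph"]
ALLOWED = {
    "outlines": LEVELS,
    "textequiv_level": LEVELS,
    "pagelabel": ["pagenumber", "pagelabel", "pageId", "basename",
                  "basename_without_extension", "local_filename", "ID", "url"],
}
BOOLEANS = {"negative2zero", "multipage_only"}

def check_parameters(params):
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    for key, value in params.items():
        if key in ALLOWED and value not in ALLOWED[key]:
            problems.append(f"{key}: {value!r} is not one of {ALLOWED[key]}")
        if key in BOOLEANS and not isinstance(value, bool):
            problems.append(f"{key}: expected true or false, got {value!r}")
    # Assumption of this sketch: multipage_only is only meaningful together with multipage.
    if params.get("multipage_only") and not params.get("multipage"):
        problems.append("multipage_only is set, but multipage is empty")
    return problems

params = {"textequiv_level": "word", "outlines": "line",
          "multipage": "merged", "pagelabel": "pagenumber"}
problems = check_parameters(params)
if problems:
    print("\n".join(problems))
else:
    # This JSON string can be given to -p verbatim or written to a file first.
    print(json.dumps(params))
```

Parameters that the sketch does not know about (such as `font` or `script-args`) are passed through unchecked.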
OCR-D CLI
Usage: ocrd-altotopdf [worker|server] [OPTIONS]

  Convert text and layout annotations from ALTO to PDF format (overlaying original image with text layer and polygon outlines)

  > Converts all pages of the document to PDF

  > For each page, find the ALTO input file in the first fileGrp,
  > together with the image input file in the second fileGrp.

  > Then convert ALTO to PAGE with PRImA PageConverter in a temporary
  > location.

  > Next convert the PAGE/image pair with PRImA PageToPdf in a temporary location,
  > applying
  > - ``textequiv_level`` (i.e. `-text-source`) to retrieve a text layer, if set;
  > - ``outlines`` to draw boundary polygons, if set;
  > - ``font`` accordingly;
  > - ``negative2zero`` (i.e. `-neg-coords toZero`) to repair negative coordinates.

  > Copy the resulting PDF file to the output file group and
  > reference it in the METS.

  > Finally, if ``multipage`` is set, then concatenate all generated
  > files to a multi-page PDF file, setting ``pagelabels`` accordingly,
  > as well as PDF metadata and bookmarks. Reference it with
  > ``multipage`` as ID in the output fileGrp, too. If
  > ``multipage_only`` is also set, then remove the single-page PDF
  > files afterwards.

Subcommands:
    worker      Start a processing worker rather than do local processing
    server      Start a processor server rather than do local processing

Options for processing:
  -m, --mets URL-PATH             URL or file path of METS to process [./mets.xml]
  -w, --working-dir PATH          Working directory of local workspace [dirname(URL-PATH)]
  -I, --input-file-grp USE        File group(s) used as input
  -O, --output-file-grp USE       File group(s) used as output
  -g, --page-id ID                Physical page ID(s) to process instead of full document []
  --overwrite                     Remove existing output pages/images
                                  (with "--page-id", remove only those).
                                  Short-hand for OCRD_EXISTING_OUTPUT=OVERWRITE
  --debug                         Abort on any errors with full stack trace.
                                  Short-hand for OCRD_MISSING_OUTPUT=ABORT
  --profile                       Enable profiling
  --profile-file PROF-PATH        Write cProfile stats to PROF-PATH. Implies "--profile"
  -p, --parameter JSON-PATH       Parameters, either verbatim JSON string
                                  or JSON file path
  -P, --param-override KEY VAL    Override a single JSON object key-value pair,
                                  taking precedence over --parameter
  -U, --mets-server-url URL       URL of a METS Server for parallel incremental access to METS
                                  If URL starts with http:// start an HTTP server there,
                                  otherwise URL is a path to an on-demand-created unix socket
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Override log level globally [INFO]
  --log-filename LOG-PATH         File to redirect stderr logging to (overriding ocrd_logging.conf).

Options for information:
  -C, --show-resource RESNAME     Dump the content of processor resource RESNAME
  -L, --list-resources            List names of processor resources
  -J, --dump-json                 Dump tool description as JSON
  -D, --dump-module-dir           Show the 'module' resource location path for this processor
  -h, --help                      Show this message
  -V, --version                   Show version

Parameters:
   "font" [string - ""]
    Font file to be used in PDF file. If unset, AletheiaSans.ttf is
    used. (Make sure to pick a font which covers all glyphs!)
   "outlines" [string - ""]
    What segment hierarchy to draw coordinate outlines for. If unset, no
    outlines are drawn.
    Possible values: ["", "region", "line", "word", "glyph"]
   "textequiv_level" [string - ""]
    What segment hierarchy level to render text output from. If unset,
    no text is rendered.
    Possible values: ["", "region", "line", "word", "glyph"]
   "negative2zero" [boolean - false]
    Repair invalid or inconsistent coordinates before trying to convert.
   "ext" [string - ".pdf"]
    Output filename extension
   "multipage" [string - ""]
    Merge all PDFs into one multipage file. The value is used as METS
    file ID and file basename for the PDF.
   "multipage_only" [boolean - false]
    When producing a `multipage`, do not add single-page files into the
    output fileGrp (but use a temporary directory for them).
   "pagelabel" [string - "pageId"]
    Parameter for 'multipage': Set the labels used as page outlines.
    - 'pageId': physical page ID,
    - 'pagenumber': use consecutive numbers,
    - 'pagelabel': use '@ORDERLABEL - @LABEL',
    - 'basename': use the name of the input file,
    - 'local_filename': use the href relative path of the input file,
    - 'url': use the href URL of the input file,
    - 'ID': use the file ID of the input file
    Possible values: ["pagenumber", "pagelabel", "pageId", "basename",
    "basename_without_extension", "local_filename", "ID", "url"]
   "script-args" [string - ""]
    Extra arguments to PageToPdf (see https://github.com/PRImA-Research-Lab/prima-page-to-pdf)
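To illustrate what repairing negative coordinates to zero means for a polygon, here is a tiny sketch operating on a PAGE-style points string (`x1,y1 x2,y2 ...`). It only demonstrates the idea; the function name `clamp_points` is made up, and the actual repair is performed by the processor and PageToPdf, not by this code.

```python
def clamp_points(points):
    """Clamp negative x/y values in a PAGE-style points string to zero."""
    fixed = []
    for pair in points.split():
        x, y = (int(v) for v in pair.split(","))
        fixed.append(f"{max(x, 0)},{max(y, 0)}")
    return " ".join(fixed)

print(clamp_points("-5,10 120,-3 120,80 -5,80"))
# prints: 0,10 120,0 120,80 0,80
```

Note that clamping changes the polygon shape slightly whenever a point lies outside the image.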
- Illegal reflective access by com.itextpdf.text.io.ByteBufferRandomAccessSource$1 to method java.nio.DirectByteBuffer.cleaner()
  If that appears, try installing OpenJDK 8.
- java.lang.NullPointerException
  If that appears, as a workaround, try setting negative coordinates to zero:
  ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP ... -P negative2zero true
- Some letters are illegible? Please note that the default font (AletheiaSans.ttf) does not support all Unicode glyphs. In case yours are missing, set a (monospace) Unicode font yourself:
  ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP ... -P font /usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf
  Fonts can also be referenced by file name if they are installed as processor resources. A number of options have been preconfigured, cf. ocrd resmgr list-available -e ocrd-pagetopdf.
- The multipage file's page labels can be configured, e.g. consecutively via pagelabel=pagenumber, or from @ORDERLABEL and @LABEL via pagelabel=pagelabel:
  ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP ... -P pagelabel pagelabel
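To make the `pagelabel` choices more concrete, the following sketch shows how a label could be derived for each option. It follows the descriptions in the parameter help, but it is not the processor's implementation: the function `page_label`, the dictionary keys, the sample values, the 1-based numbering and the handling of missing attributes are all assumptions of this example.

```python
import os

def page_label(option, index, page):
    """Derive a label for the page at zero-based position index.

    page is a plain dict with the (hypothetical) keys pageId, ORDERLABEL,
    LABEL, local_filename, ID and url.
    """
    if option == "pagenumber":
        return str(index + 1)  # assumption: counting starts at 1
    if option == "pagelabel":
        # '@ORDERLABEL - @LABEL'; assumption: missing attributes are skipped
        parts = [page.get("ORDERLABEL"), page.get("LABEL")]
        return " - ".join(p for p in parts if p)
    if option == "basename":
        return os.path.basename(page["local_filename"])
    if option == "basename_without_extension":
        return os.path.splitext(os.path.basename(page["local_filename"]))[0]
    if option in ("pageId", "local_filename", "ID", "url"):
        return page[option]
    raise ValueError(f"unknown pagelabel option: {option}")

# Made-up page record for demonstration.
page = {
    "pageId": "PHYS_0007",
    "ORDERLABEL": "7",
    "LABEL": "iii",
    "local_filename": "PAGE-FILEGRP/FILE_0007.xml",
    "ID": "FILE_0007",
    "url": "https://example.org/FILE_0007.xml",
}
for option in ("pageId", "pagenumber", "pagelabel", "basename_without_extension"):
    print(option, "->", page_label(option, 6, page))
```

With the default `pageId`, the labels are simply the physical page IDs from the METS.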