Skip to content

OCR-D/ocrd_pagetopdf

Repository files navigation

ocrd-pagetopdf

OCR-D wrapper for prima-page-to-pdf

Python CI Docker CD PyPI CD

Contents:

Introduction

This package offers OCR-D compliant workspace processors for conversion of OCR data represented in METS (on the document level) and PAGE or ALTO (on the page level) to PDF.

It transforms both the scan image (facsimile) and annotations (text overlay), optionally drawing polygon outlines for text regions / lines / words / glyphs.

Optionally validates the structural annotation and fixes its coordinates before attempting conversion.

The text layer is generated from the textual annotation on the configured level of the structural hierarchy (region / line / word / glyph). It is rendered with a configurable font (which is useful to make sure all codepoints are covered by adequate glyphs, esp. in historic prints and manuscripts).

The page labels can be configured to use various attributes from the physical pages of the METS.

A table of contents will be added according to the labels of the recursive mets:div logical structure.

Requirements

  • GNU make
  • Python 3 with pip and venv
  • OCR-D
  • Java runtime (OpenJDK ≥8 works for PageToPdf 1.1.2)

Installation

With Docker

This is the best option if you want to run the software in a container.

You need to have Docker

docker pull ocrd/pagetopdf

To run with docker:

docker run -v path/to/workspaces:/data ocrd/pagetopdf ocrd-pagetopdf ...

Native, from PyPI

This is the best option if you want to use the stable, released version.

After installing Python and Java, simply do:

pip install ocrd_pagetopdf

Native, from git

Use this option if you want to change the source code or install the latest, unpublished changes.

We strongly recommend to use venv.

After installing make, assuming you are on a Debian/Ubuntu OS, you can do:

sudo make deps-ubuntu

Otherwise, simulate this step and install requirements with equivalent actions on your system:

make -n deps-ubuntu
...

Finally, to install the Python package, do:

make install
# or equivalently:
pip install .

Usage

The command-line interface ocrd-pagetopdf conforms to OCR-D processor specifications.

Assuming you have an OCR-D workspace in your current working directory, simply do:

ocrd-pagetopdf -I PAGE-FILGRP -O PDF-FILEGRP -P textequiv_level word

This will run the script and create PDF files for each page with a text layer based on word-level annotations.

In order to create an additional multipage file for the entire document, named merged.pdf, concatenating the single page PDFs in physical order and with page labels and contents, do:

ocrd-pagetopdf -I PAGE-FILGRP -O PDF-FILEGRP -P textequiv_level word -P multipage merged

In case your workspace does not contain fulltext in PAGE format, but ALTO, there is a dedicated processor CLI ocrd-altotopdf, with some limitations compared to the former:

  • You need to manually select the fileGrp providing the images which match the annotation coordinates, passing it as second input fileGrp. (The image references are required by PAGE, but not by ALTO.)
  • The images are not generated on-the-fly according to all annotations (from existing AlternativeImages, or by cropping via coordinates into the higher-level image, and deskewing when applicable), and not chosen via input_feature_selector / input_feature_filter mechanism. Instead, only the original images can be used here.
  • The annotations are not tested comprehensively regarding validity and consistency of coordinates and then repaired. Instead, only superficial checks and repairs can be applied (like negative coordinates).

Assuming you have a workspace representing a typical DFG-conforming METS, with FULLTEXT for ALTO and DEFAULT for the original images, do:

ocrd-altotopdf -I FULLTEXT,DEFAULT -O PDF-FILEGRP -P textequiv_level word -P multipage merged

For more options and explanations, see below.

ocrd-pagetopdf

OCR-D CLI
Usage: ocrd-pagetopdf [worker|server] [OPTIONS]

  Convert text and layout annotations from PAGE to PDF format (overlaying original image with text layer and polygon outlines)

  > Converts all pages of the document to PDF

  > For each page, open and deserialize PAGE input file and its
  > respective image. Then extract a derived image of the (cropped,
  > deskewed, binarized...) page, with features depending on
  > ``image_feature_selector`` (a comma-separated list of required image
  > features, cf. :py:func:`ocrd.workspace.Workspace.image_from_page`)
  > and ``image_feature_filter`` (a comma-separated list of forbidden
  > image features).

  > Next, generate a temporary PAGE output file for that very image
  > (adapting all coordinates if necessary). If ``negative2zero`` is
  > set, validate and repair invalid or inconsistent coordinates.

  > Convert the PAGE/image pair with PRImA PageToPdf, applying
  > - ``textequiv_level`` (i.e. `-text-source`) to retrieve a text layer, if set;
  > - ``outlines`` to draw boundary polygons, if set;
  > - ``font`` accordingly.

  > Copy the resulting PDF file to the output file group and reference
  > it in the METS.

  > Finally, if ``multipage`` is set, then concatenate all generated
  > files to a multi-page PDF file, setting ``pagelabels`` accordingly,
  > as well as PDF metadata and bookmarks. Reference it with
  > ``multipage`` as ID in the output file group, too. If
  > ``multipage_only`` is also set, then remove the single-page PDF
  > files afterwards.

Subcommands:
    worker      Start a processing worker rather than do local processing
    server      Start a processor server rather than do local processing

Options for processing:
  -m, --mets URL-PATH             URL or file path of METS to process [./mets.xml]
  -w, --working-dir PATH          Working directory of local workspace [dirname(URL-PATH)]
  -I, --input-file-grp USE        File group(s) used as input
  -O, --output-file-grp USE       File group(s) used as output
  -g, --page-id ID                Physical page ID(s) to process instead of full document []
  --overwrite                     Remove existing output pages/images
                                  (with "--page-id", remove only those).
                                  Short-hand for OCRD_EXISTING_OUTPUT=OVERWRITE
  --debug                         Abort on any errors with full stack trace.
                                  Short-hand for OCRD_MISSING_OUTPUT=ABORT
  --profile                       Enable profiling
  --profile-file PROF-PATH        Write cProfile stats to PROF-PATH. Implies "--profile"
  -p, --parameter JSON-PATH       Parameters, either verbatim JSON string
                                  or JSON file path
  -P, --param-override KEY VAL    Override a single JSON object key-value pair,
                                  taking precedence over --parameter
  -U, --mets-server-url URL       URL of a METS Server for parallel incremental access to METS
                                  If URL starts with http:// start an HTTP server there,
                                  otherwise URL is a path to an on-demand-created unix socket
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Override log level globally [INFO]
  --log-filename LOG-PATH         File to redirect stderr logging to (overriding ocrd_logging.conf).

Options for information:
  -C, --show-resource RESNAME     Dump the content of processor resource RESNAME
  -L, --list-resources            List names of processor resources
  -J, --dump-json                 Dump tool description as JSON
  -D, --dump-module-dir           Show the 'module' resource location path for this processor
  -h, --help                      Show this message
  -V, --version                   Show version

Parameters:
   "image_feature_selector" [string - ""]
    comma-separated list of required image features (e.g.
    binarized,despeckled,cropped,deskewed,rotated-90)
   "image_feature_filter" [string - ""]
    comma-separated list of forbidden image features (e.g.
    binarized,despeckled,cropped,deskewed,rotated-90)
   "font" [string - ""]
    Font file to be used in PDF file. If unset, AletheiaSans.ttf is used.
    (Make sure to pick a font which covers all glyphs!)
   "outlines" [string - ""]
    What segment hierarchy to draw coordinate outlines for. If unset, no
    outlines are drawn.
    Possible values: ["", "region", "line", "word", "glyph"]
   "textequiv_level" [string - ""]
    What segment hierarchy level to render text output from. If unset, no
    text is rendered.
    Possible values: ["", "region", "line", "word", "glyph"]
   "negative2zero" [boolean - false]
    Repair invalid or inconsistent coordinates before trying to convert.
   "ext" [string - ".pdf"]
    Output filename extension
   "multipage" [string - ""]
    Merge all PDFs into one multipage file. The value is used as METS
    file ID and file basename for the PDF.
   "multipage_only" [boolean - false]
    When producing a `multipage`, do not add single-page files into the
    output fileGrp (but use a temporary directory for them).
   "pagelabel" [string - "pageId"]
    Parameter for 'multipage': Set the labels used as page outlines.

    - 'pageId': physical page ID,

    - 'pagenumber': use consecutive numbers,

    - 'pagelabel': use '@ORDERLABEL - @LABEL',

    - 'basename': use the name of the input file,

    - 'local_filename': use the href relative path of the input file,

    - 'url': use the href URL of the input file,

    - 'ID': use the file ID of the input file
    Possible values: ["pagenumber", "pagelabel", "pageId", "basename",
    "basename_without_extension", "local_filename", "ID", "url"]
   "script-args" [string - ""]
    Extra arguments to PageToPdf (see https://github.com/PRImA-Research-
    Lab/prima-page-to-pdf)

ocrd-altotopdf

OCR-D CLI
Usage: ocrd-altotopdf [worker|server] [OPTIONS]

  Convert text and layout annotations from ALTO to PDF format (overlaying original image with text layer and polygon outlines)

  > Converts all pages of the document to PDF

  > For each page, find the ALTO input file in the first fileGrp,
  > together with the image input file in the second fileGrp.

  > Then convert ALTO to PAGE with PRImA PageConverter in a temporary
  > location.

  > Next convert the PAGE/image pair with PRImA PageToPdf in a temporary location,
  > applying
  > - ``textequiv_level`` (i.e. `-text-source`) to retrieve a text layer, if set;
  > - ``outlines`` to draw boundary polygons, if set;
  > - ``font`` accordingly;
  > - ``negative2zero`` (i.e. `-neg-coords toZero`) to repair negative coordintes.

  > Copy to the resulting PDF file to the output file group and
  > reference it in the METS.

  > Finally, if ``multipage`` is set, then concatenate all generated
  > files to a multi-page PDF file, setting ``pagelabels`` accordingly,
  > as well as PDF metadata and bookmarks. Reference it with
  > ``multipage`` as ID in the output fileGrp, too. If
  > ``multipage_only`` is also set, then remove the single-page PDF
  > files afterwards.

Subcommands:
    worker      Start a processing worker rather than do local processing
    server      Start a processor server rather than do local processing

Options for processing:
  -m, --mets URL-PATH             URL or file path of METS to process [./mets.xml]
  -w, --working-dir PATH          Working directory of local workspace [dirname(URL-PATH)]
  -I, --input-file-grp USE        File group(s) used as input
  -O, --output-file-grp USE       File group(s) used as output
  -g, --page-id ID                Physical page ID(s) to process instead of full document []
  --overwrite                     Remove existing output pages/images
                                  (with "--page-id", remove only those).
                                  Short-hand for OCRD_EXISTING_OUTPUT=OVERWRITE
  --debug                         Abort on any errors with full stack trace.
                                  Short-hand for OCRD_MISSING_OUTPUT=ABORT
  --profile                       Enable profiling
  --profile-file PROF-PATH        Write cProfile stats to PROF-PATH. Implies "--profile"
  -p, --parameter JSON-PATH       Parameters, either verbatim JSON string
                                  or JSON file path
  -P, --param-override KEY VAL    Override a single JSON object key-value pair,
                                  taking precedence over --parameter
  -U, --mets-server-url URL       URL of a METS Server for parallel incremental access to METS
                                  If URL starts with http:// start an HTTP server there,
                                  otherwise URL is a path to an on-demand-created unix socket
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Override log level globally [INFO]
  --log-filename LOG-PATH         File to redirect stderr logging to (overriding ocrd_logging.conf).

Options for information:
  -C, --show-resource RESNAME     Dump the content of processor resource RESNAME
  -L, --list-resources            List names of processor resources
  -J, --dump-json                 Dump tool description as JSON
  -D, --dump-module-dir           Show the 'module' resource location path for this processor
  -h, --help                      Show this message
  -V, --version                   Show version

Parameters:
   "font" [string - ""]
    Font file to be used in PDF file. If unset, AletheiaSans.ttf is used.
    (Make sure to pick a font which covers all glyphs!)
   "outlines" [string - ""]
    What segment hierarchy to draw coordinate outlines for. If unset, no
    outlines are drawn.
    Possible values: ["", "region", "line", "word", "glyph"]
   "textequiv_level" [string - ""]
    What segment hierarchy level to render text output from. If unset, no
    text is rendered.
    Possible values: ["", "region", "line", "word", "glyph"]
   "negative2zero" [boolean - false]
    Repair invalid or inconsistent coordinates before trying to convert.
   "ext" [string - ".pdf"]
    Output filename extension
   "multipage" [string - ""]
    Merge all PDFs into one multipage file. The value is used as METS
    file ID and file basename for the PDF.
   "multipage_only" [boolean - false]
    When producing a `multipage`, do not add single-page files into the
    output fileGrp (but use a temporary directory for them).
   "pagelabel" [string - "pageId"]
    Parameter for 'multipage': Set the labels used as page outlines.

    - 'pageId': physical page ID,

    - 'pagenumber': use consecutive numbers,

    - 'pagelabel': use '@ORDERLABEL - @LABEL',

    - 'basename': use the name of the input file,

    - 'local_filename': use the href relative path of the input file,

    - 'url': use the href URL of the input file,

    - 'ID': use the file ID of the input file
    Possible values: ["pagenumber", "pagelabel", "pageId", "basename",
    "basename_without_extension", "local_filename", "ID", "url"]
   "script-args" [string - ""]
    Extra arguments to PageToPdf (see https://github.com/PRImA-Research-
    Lab/prima-page-to-pdf)

FAQ

  • Illegal reflective access by com.itextpdf.text.io.ByteBufferRandomAccessSource$1 to method java.nio.DirectByteBuffer.cleaner() If that appears, try installing OpenJDK 8.

  • java.lang.NullPointerException If that appears, try (a little workaround) and set negative coordinates to zero:

    ocrd-pagetopdf -I PAGE-FILGRP -O PDF-FILEGRP ... -P negative2zero true
    
  • Some letters are illegible? Please note that the standard displayed font (AletheiaSans.ttf) does not support all Unicode glyphs. In case yours are missing, set a (monospace) Unicode font yourself:

    ocrd-pagetopdf -I PAGE-FILGRP -O PDF-FILEGRP ... -P font /usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf
    

    Fonts can also be referenced by file name if they are installed as processor resources. A number of options have been preconfigured, cf. ocrd resmgr list-available -e ocrd-pagetopdf.

  • The multipage file's page labels can be configured, e.g. consecutively via pagelabel=pagenumber or from @ORDERLABEL and @LABEL via pagelabel=pagelabel:

    ocrd-pagetopdf -I PAGE-FILGRP -O PDF-FILEGRP ... -P pagelabel pagelabel