Create .tsv files that can be viewed and edited with neat.
Python 3.11 is required. Consider using pyenv if that version is not available on your system.
Activate virtual environment (virtualenv):
source venv/bin/activate
or (pyenv):
pyenv activate my-python-3.11-virtualenv
Update pip:
pip install -U pip
Install tsvtools:
pip install git+https://github.com/qurator-spk/page2tsv.git
Create a TSV file from OCR in PAGE-XML format (with word segmentation):
page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
To create a single TSV file from multiple PAGE-XML files, call the tool repeatedly with the same TSV output file:
page2tsv PAGE1.xml PAGE.tsv --image-url=http://link-to-corresponding-image-1
page2tsv PAGE2.xml PAGE.tsv --image-url=http://link-to-corresponding-image-2
page2tsv PAGE3.xml PAGE.tsv --image-url=http://link-to-corresponding-image-3
page2tsv PAGE4.xml PAGE.tsv --image-url=http://link-to-corresponding-image-4
page2tsv PAGE5.xml PAGE.tsv --image-url=http://link-to-corresponding-image-5
...
...
...
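For many pages, the successive calls above could be scripted. The following is a minimal sketch, not part of the tool itself: the helper names `build_page2tsv_commands` and `run_all` are hypothetical, and the file names and image URLs are the placeholders used above. It only relies on the `page2tsv PAGE_XML_FILE TSV_OUT_FILE --image-url=...` invocation shown in this README.

```python
import shlex
import subprocess
from typing import Iterable, List, Tuple


def build_page2tsv_commands(
    pages: Iterable[Tuple[str, str]], tsv_out: str
) -> List[List[str]]:
    """Return one page2tsv argument list per (PAGE-XML path, image URL) pair.

    Every command writes to the same TSV file, mirroring the manual calls.
    """
    return [
        ["page2tsv", page_xml, tsv_out, f"--image-url={image_url}"]
        for page_xml, image_url in pages
    ]


def run_all(commands: List[List[str]]) -> None:
    """Run the commands in order and stop at the first failing call."""
    for cmd in commands:
        print("running:", shlex.join(cmd))
        subprocess.run(cmd, check=True)


# Placeholder inputs (assumptions for illustration only).
pages = [
    ("PAGE1.xml", "http://link-to-corresponding-image-1"),
    ("PAGE2.xml", "http://link-to-corresponding-image-2"),
]
commands = build_page2tsv_commands(pages, "PAGE.tsv")

# Uncomment once page2tsv is installed and the input files exist:
# run_all(commands)
```

Building the argument lists separately from running them makes it easy to inspect the commands before anything is written to the TSV file.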
For instance, to convert the file example.xml:
page2tsv example.xml example.tsv --image-url=http://content.staatsbibliothek-berlin.de/zefys/SNP27646518-18800101-0-3-0-0/left,top,width,height/full/0/default.jpg
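The `left,top,width,height` segment in the image URL is kept literally; it looks like a placeholder that a viewer can replace with the pixel coordinates of a token, much like the region segment of a IIIF Image API URL. The sketch below illustrates that idea under this assumption. The function `region_url` and the coordinates are hypothetical, and the snippet does not reproduce how neat actually performs the substitution.

```python
PLACEHOLDER = "left,top,width,height"


def region_url(template: str, left: int, top: int, width: int, height: int) -> str:
    """Replace the literal placeholder segment with concrete pixel coordinates."""
    if PLACEHOLDER not in template:
        raise ValueError(f"template does not contain '{PLACEHOLDER}'")
    # Replace only the first occurrence; a template should contain exactly one.
    return template.replace(PLACEHOLDER, f"{left},{top},{width},{height}", 1)


# Template taken from the example call; the coordinates are made up.
template = (
    "http://content.staatsbibliothek-berlin.de/zefys/"
    "SNP27646518-18800101-0-3-0-0/left,top,width,height/full/0/default.jpg"
)
url = region_url(template, left=100, top=200, width=300, height=50)
print(url)
```

If the placeholder is missing from the template, the helper raises an error instead of silently returning an unusable URL.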
Create a URL-annotated TSV file from an existing TSV file:
annotate-tsv enp_DE.tsv enp_DE-annotated.tsv
page2tsv --help
Usage: page2tsv [OPTIONS] PAGE_XML_FILE TSV_OUT_FILE
  Converts a page-XML file into a TSV file that can be edited with neat.
  Optionally the tool also accepts NER and Entity Linking API endpoints as
  parameters and performs NER and EL on the document if these are provided.
  PAGE_XML_FILE: The source page-XML file. TSV_OUT_FILE: Resulting TSV file.
Options:
  --purpose [NERD|OCR]       Purpose of output tsv file.
                             
                             NERD: NER/NED application/ground-truth creation.
                             
                             OCR: OCR application/ground-truth creation.
                             
                             default: NERD.
  --image-url TEXT           An image retrieval link that enables neat to show
                             the scan images corresponding to the text tokens.
                             Example:
                             https://content.staatsbibliothek-berlin.de/zefys/SNP26824620-18371109-0-1-0-0/left,top,width,height/full/0/default.jpg
  --ner-rest-endpoint TEXT   REST endpoint of sbb_ner service. See
                             https://github.com/qurator-spk/sbb_ner for
                             details. Only applicable in case of NERD.
  --ned-rest-endpoint TEXT   REST endpoint of sbb_ned service. See
                             https://github.com/qurator-spk/sbb_ned for
                             details. Only applicable in case of NERD.
  --noproxy                  disable proxy. default: enabled.
  --scale-factor FLOAT       default: 1.0
  --ned-threshold FLOAT
  --min-confidence FLOAT
  --max-confidence FLOAT
  --ned-priority INTEGER
  --normalization-file PATH
  --help                     Show this message and exit.
tsv2tsv --help
Usage: tsv2tsv [OPTIONS] TSV_IN_FILE
Options:
  --tsv-out-file PATH          Write modified TSV to this file.
  --ner-rest-endpoint TEXT     REST endpoint of sbb_ner service. See
                               https://github.com/qurator-spk/sbb_ner for
                               details.
  --noproxy                    disable proxy. default: enabled.
  --num-tokens                 Print number of tokens in input/output file.
  --sentence-count             Print sentence count in input/output file.
  --max-sentence-len           Print maximum sentence len for input/output
                               file.
  --keep-tokenization          Keep the word tokenization exactly as it is.
  --sentence-split-only        Do only sentence splitting.
  --show-urls                  Print contained visualization URLs.
  --just-zero                  Process only files that have max sentence
                               length zero, i.e., that do not have sentence
                               splitting.
  --sanitize-sentence-numbers  Sanitize sentence numbering.
  --show-columns               Show TSV columns.
  --drop-column TEXT           Drop column
  --help                       Show this message and exit.
alto2tsv --help
Usage: alto2tsv [OPTIONS] ALTO_XML_FILE TSV_OUT_FILE
  Converts an ALTO-XML file into a TSV file that can be edited with neat.
  Optionally the tool also accepts NER and Entity Linking API endpoints as
  parameters and performs NER and EL on the document if these are provided.
  ALTO_XML_FILE: The source ALTO-XML file. 
  TSV_OUT_FILE: Resulting TSV file.
Options:
  --purpose [NERD|OCR]      Purpose of output tsv file.
                            
                            NERD: NER/NED application/ground-truth creation.
                            
                            OCR: OCR application/ground-truth creation.
                            
                            default: NERD.
  --image-url TEXT          An image retrieval link that enables neat to show
                            the scan images corresponding to the text tokens.
                            Example:
                            https://content.staatsbibliothek-berlin.de/zefys/SNP26824620-18371109-0-1-0-0/left,top,width,height/full/0/default.jpg
  --ner-rest-endpoint TEXT  REST endpoint of sbb_ner service. See
                            https://github.com/qurator-spk/sbb_ner for
                            details. Only applicable in case of NERD.
  --ned-rest-endpoint TEXT  REST endpoint of sbb_ned service. See
                            https://github.com/qurator-spk/sbb_ned for
                            details. Only applicable in case of NERD.
  --noproxy                 disable proxy. default: enabled.
  --scale-factor FLOAT      default: 1.0
  --ned-threshold FLOAT
  --ned-priority INTEGER
  --help                    Show this message and exit.