FIN-CLARIN VRT Tools

These command-line tools implement composable manipulations of segmented and annotated text in a VRT format, also known as verticalized text. The format is related to the Corpus Workbench, which is used in the backend of the Korp concordance engine.

The VRT tools proper include (or are planned to include):

  • generic and special format manipulations
  • tokenization to produce VRT from a preliminary format (initially UDPipe tokenizers, soon also HFST tokenizers)
  • morphosyntactic annotators (initially Finnish pipeline from Turku NLP)
  • name recognition (FiNER planned soon)
  • report generation (sentence length already implemented)
  • conversion to other formats (to be planned)

Some tools depend on separate programs and models that are installed in the Taito command-line environment. These are typically free software, available for installation elsewhere.

Highlights

The basic function of the VRT tools is to preserve previous annotations, including structural markup that may contain valuable information about the text units, without the underlying tools even knowing that their input sentences are extracted from such context. New annotations from an underlying tool are added to their proper place in the input document.
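The idea of annotating in place can be sketched in Python as follows. This is only an illustration, not the code of the actual tools: the function name `annotate_vrt` and the callback `annotate` are hypothetical, and the sketch assumes that fields are tab-separated, that the word is the first field, that every markup line starts with `<`, and that a sentence is a run of token lines between markup lines.

```python
from typing import Callable, Iterable, Iterator, List


def annotate_vrt(lines: Iterable[str],
                 annotate: Callable[[List[str]], List[str]]) -> Iterator[str]:
    """Yield VRT lines with one new field appended to every token line.

    Markup lines pass through unchanged. The hypothetical `annotate`
    callback receives only the words of one sentence and returns one
    new value per word; it never sees the markup or the other fields.
    """
    sentence: List[str] = []  # token lines of the sentence being read

    def flush() -> Iterator[str]:
        # Annotate the collected sentence and emit its extended lines.
        if sentence:
            words = [line.split("\t")[0] for line in sentence]
            values = annotate(words)
            if len(values) != len(words):
                raise ValueError("annotator changed the number of tokens")
            for line, value in zip(sentence, values):
                yield line + "\t" + value
            sentence.clear()

    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("<"):
            # Markup ends the current sentence and is kept as it is.
            yield from flush()
            yield line
        else:
            sentence.append(line)
    yield from flush()
```

The sketch leaves the comment that declares the field names untouched; a real tool would also add the new name there.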

The major innovation in FIN-CLARIN VRT is the use of names for the fields, which are identified only by position in the basic format. In the basic format the declaration of names is only a comment, but these VRT tools use it extensively.

<!-- #vrt positional-attributes: word lemma pos -->

Field names facilitate further annotation of tokens regardless of what previous annotations exist.

(Please note that the format of the attribute name comment was changed slightly for VRT tools version 0.7.2 (2019-04-05). The old-format comment with Positional attributes instead of #vrt positional-attributes is still recognized by the tools.)
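A minimal Python sketch of how such a name comment could be read and used is given here. The helper names `parse_names`, `format_names` and `token_record` are hypothetical and not taken from the tools. The sketch assumes tab-separated fields and assumes that the old-format comment differs from the current one only in the words before the colon.

```python
import re
from typing import Dict, List, Optional

# Accepts the current form and, by assumption, the old form of the comment.
NAME_COMMENT = re.compile(
    r"<!--\s*(?:#vrt positional-attributes|Positional attributes):"
    r"\s*(.*?)\s*-->"
)


def parse_names(line: str) -> Optional[List[str]]:
    """Return the field names declared on the line, or None."""
    match = NAME_COMMENT.fullmatch(line.strip())
    return match.group(1).split() if match else None


def format_names(names: List[str]) -> str:
    """Build a name comment in the current format."""
    return "<!-- #vrt positional-attributes: {} -->".format(" ".join(names))


def token_record(line: str, names: List[str]) -> Dict[str, str]:
    """Map the fields of one token line to their declared names."""
    return dict(zip(names, line.rstrip("\n").split("\t")))


names = parse_names("<!-- #vrt positional-attributes: word lemma pos -->")
record = token_record("dogs\tdog\tN", names)
print(record["lemma"])               # field looked up by name, not position
print(format_names(names + ["ner"])) # header after adding a new field
```

With names available, a tool that adds a field (here a made-up `ner`) does not need to know how many fields the input already has or in what order they appear.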

A minor innovation is the use of auxiliary formats to facilitate the production of VRT from other formats and manipulation of large VRT corpora in conveniently sized fragments.

Most of the tools transform an input document into an output document. Such tools have a common set of options that allow them to compose flexibly in different ways:

  • read from a named file or from standard input;
  • write to standard output, or
  • write to an explicitly named file, or
  • write to a sibling of the input file, or
  • replace the input file with the output, optionally keeping a backup file.

In case of failure, any partial output is left in a transparently named temporary file.
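The replace-with-backup behaviour and the handling of failure can be sketched in Python as follows. This is an illustration under assumptions, not the implementation or the option set of the actual tools: the function `transform_in_place`, its parameters, and the naming scheme of the temporary file are all hypothetical.

```python
import os
import tempfile
from typing import Callable, Iterable, Optional


def transform_in_place(path: str,
                       transform: Callable[[Iterable[str]], Iterable[str]],
                       backup_suffix: Optional[str] = None) -> None:
    """Replace the file at `path` with its transformed content.

    Output goes first to a temporary file next to the input, named after
    the input. If `transform` raises, the exception propagates, the input
    stays untouched, and the partial output remains in the temporary file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, temp_path = tempfile.mkstemp(
        prefix=os.path.basename(path) + ".", suffix=".tmp", dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as out, \
            open(path, encoding="utf-8") as source:
        for line in transform(source):
            out.write(line)
    # Reached only when the whole output was written successfully.
    if backup_suffix is not None:
        os.replace(path, path + backup_suffix)  # keep the original
    os.replace(temp_path, path)
```

Writing the temporary file in the same directory as the input keeps the final rename on one file system, and a name derived from the input makes a leftover file easy to identify.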

To be continued