Plain-text version of ccKres, the corpus of written Slovenian language

mfilej/cckres-plain

Plain-text ccKres

Plain-text version of ccKres 1.0, the corpus of written Slovenian language. Includes a few examples of tools for extracting useful information from TEI files.

The plain-text files can be found in the plain-text-corpus directory, sorted as follows:

SSJ
├── I             - internet
└── T             - tisk (print)
    ├── D         - drugo (other)
    ├── K         - knjižno (literary)
    │   ├── L     - leposlovje (fiction)
    │   └── S     - strokovno (non-fiction)
    └── P         - periodično (periodicals)
        ├── C     - časopis (newspapers)
        └── R     - revija (magazines)
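
The directory codes above can be turned into readable labels when processing files. The following is a minimal sketch, not part of this repository: the `LABELS` table and `category_labels` helper are hypothetical names, and the example path (including the file name) is an assumption about how files sit below `SSJ`.

```ruby
# Labels copied from the directory tree above, keyed by the path below SSJ.
LABELS = {
  "I"     => "internet",
  "T"     => "print",
  "T/D"   => "other",
  "T/K"   => "literary",
  "T/K/L" => "fiction",
  "T/K/S" => "non-fiction",
  "T/P"   => "periodicals",
  "T/P/C" => "newspapers",
  "T/P/R" => "magazines"
}.freeze

# Returns the chain of category labels for a file path, e.g.
# ["print", "literary", "fiction"]. Returns [] if the path has no SSJ component.
def category_labels(path)
  parts = path.split("/")
  start = parts.index("SSJ")
  return [] if start.nil?

  labels = []
  prefix = []
  parts.drop(start + 1).each do |code|
    prefix << code
    label = LABELS[prefix.join("/")]
    break if label.nil? # reached the file name or an unknown directory
    labels << label
  end
  labels
end

# Hypothetical path; the file name is made up for illustration.
puts category_labels("plain-text-corpus/SSJ/T/K/L/example.txt").join(" > ")
```

Keying the table by the full path rather than by a single letter keeps the lookup unambiguous if a code letter is ever reused at a different level.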

Although the text files were produced using teitomarkdown (part of TEIC/Stylesheets), the result is barely formatted into paragraphs. The text will therefore need substantial preprocessing for most uses.
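
As a starting point for such preprocessing, the sketch below normalizes whitespace and splits a text into paragraphs. It assumes that paragraphs are separated by one or more blank lines, which may not hold for every file in the corpus; the `paragraphs` helper is a hypothetical name and not part of this repository.

```ruby
# Splits raw text into paragraphs, assuming blank lines separate them.
# Line breaks and repeated spaces inside a paragraph are collapsed to one space.
def paragraphs(text)
  text
    .gsub(/\r\n?/, "\n")                        # normalize Windows/old Mac line endings
    .split(/\n\s*\n/)                           # one or more blank lines end a paragraph
    .map { |chunk| chunk.gsub(/\s+/, " ").strip }
    .reject(&:empty?)
end

sample = "Prva  vrstica\nse nadaljuje.\n\n\n  Druga.\n"
paragraphs(sample).each { |para| puts para }
```

Real use will likely need further steps (sentence splitting, tokenization, removal of leftover markup), which depend on the task at hand.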

Additionally, we have extracted a list of all words found in the corpus, together with their morphosyntactic annotations (see morphosyntax_dict.txt).

Examples

Generating plain-text files

This repository already contains the extracted plain-text files. If, for whatever reason, you want to regenerate them, this is how they were originally generated:

$ rake kres:download[cckres]
$ rake kres:extract[cckres]
$ rake kres:sort[cckres,plain-text-corpus]

To generate the morphosyntactic dictionary:

$ rake kres:msd[~/Downloads/cckresV1_0/xml] \
  | sort \
  | uniq \
  > morphosyntax_dict.txt
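
The `sort | uniq` stage of the pipeline above only removes duplicate lines and orders the rest. For readers who prefer to stay in Ruby, here is a rough equivalent; `unique_sorted_lines` is a hypothetical helper, not part of this repository. Ruby compares strings by their bytes, so the order can differ from a locale-aware `sort`.

```ruby
# Removes duplicate lines and sorts the remainder, like `sort | uniq`.
# Trailing newlines are stripped so "a\n" and "a" count as the same entry.
def unique_sorted_lines(lines)
  lines.map(&:chomp).uniq.sort
end

puts unique_sorted_lines(["b\n", "a\n", "b\n"])
```

To use it on real output, pass any enumerable of lines, for example `File.foreach(path)` on a file holding the raw task output.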

The above tasks require the following programs to be available in your PATH:

  • curl
  • unzip
  • find
  • ruby
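
A quick way to confirm that the programs listed above are available is to search each directory in PATH for them. This is a minimal sketch and not part of the repository's Rakefile; `REQUIRED_PROGRAMS`, `executable_in_path?`, and `missing_programs` are hypothetical names. It does not handle Windows executable extensions.

```ruby
# Programs listed above as requirements for the rake tasks.
REQUIRED_PROGRAMS = %w[curl unzip find ruby].freeze

# True if an executable file named `program` exists in any directory of `path`.
def executable_in_path?(program, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, program)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Returns the subset of `programs` that could not be found in `path`.
def missing_programs(programs = REQUIRED_PROGRAMS, path = ENV.fetch("PATH", ""))
  programs.reject { |name| executable_in_path?(name, path) }
end

missing = missing_programs
warn "Missing programs: #{missing.join(', ')}" unless missing.empty?
```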

The Ruby dependencies must be installed as well (the first command is only needed if Bundler is not yet installed):

$ gem install bundler
$ bundle install

License

The code is licensed under the MIT license. The ccKres corpus is licensed under CC BY-NC-SA 4.0.
