diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 22bc7af..36714d2 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -11,8 +11,8 @@ Specifically, the following properties were checked: 2. If the library is implemented in Java, does it use the [Apache Maven](https://maven.apache.org/) project management tool? 3. Does the library exist as an OSGi bundle? Specifically, do releases contain an OSGi manifest? 4. Is an OSGi bundle of the library available for consumption from a [p2](https://www.eclipse.org/equinox/p2/) repository? -5. Does it provide the following features? If it does, does it also provide extension mechanisms for the feature? Which are the input and output data types or formats? - 1. Tokenization/segmentation +5. Does it provide the following features? If it does, does it also provide extension mechanisms for the feature? Which are the input and output types or formats? + 1. Tokenization/
segmentation 2. Sentencing 3. Part-of-speech tagging 4. Contituency parsing @@ -22,36 +22,478 @@ Specifically, the following properties were checked: The results of the survey are below. ---- - +## Conclusion -### [library-name]() +The survey helped us to reduce the number of suitable candidates for integration in Hexatomic to four. +In a first step, we eliminated libraries based on their feature set and usefulness as all-purpose libraries, their implementation language, their development status and how up to date they are, and our own experience. +Thus, we excluded most Python-based libraries, with the exception of NLTK and SpaCy, of which we eliminated NLTK due to its organically grown ecosystem and the integration impracticalities this would bring with it. +SpaCy remained included as it promised implementations closer to the state of the art. + +This left us with four potential candidates: [Apache OpenNLP](#apache-opennlp), [SpaCy](#spacy), [Spark NLP](#spark-nlp), and [Stanford CoreNLP](#stanford-corenlp). + +Of those, the only feature-complete candidate implemented in Java was Stanford CoreNLP: +Apache OpenNLP did not seem to support dependency parsing, and neither Spark NLP nor SpaCy seemed to support constituency parsing (directly). + +In the end, we settled on integration of Stanford CoreNLP in Hexatomic. +It seemed to cater best to the core target group of early versions of Hexatomic, i.e., users with a strong linguistic rather than NLP background. +It also seemed architecturally easiest to integrate, as we will not have to bridge between Java and Python code at this stage. + +When Stanford CoreNLP is successfully integrated in Hexatomic, we will, however, attempt to integrate another, more state-of-the-art library, for which SpaCy and the unsurveyed [flairNLP](https://github.com/flairNLP/flair) seem to be best suited. + +### AllenNLP + +> [AllenNLP website](https://allennlp.org/) + +1. [ ] Implemented in Java + 1. 
[ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: There are some [taggers]() and [NLPstack]() available as Scala-packages. AllenNLP also provides [Docker images]() +2. [ ] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: n/a + +#### Feature matrix + +
+ +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- | :-------: | --------------------------------------------------------------------------------------------------------------------------- | --------------- | ------------------------ | +| Tokenization/
segmentation | X | [- Spacy Tokenizer]()
[- Pretrained Transformer Tokenizer]() | [Tokenizer Documentation]() | | | text as str | list[token] | +| Sentencing | X | - Sentence Splitter
- SpaCy Sentence Splitter | [Sentence Splitter API]() | | | text as str | list[str] | +| POS-tagging | | | | X | [Train Sentence Tagger Predictor]() | sentence as str | dict[str, numpy.ndarray] | +| Constituency parsing | X | | [Constituency Parser Demo]() | | | | | +| Dependency parsing | X | | [Dependency Parser Demo]() | | | | | +| Named Entity Recognition | | | | X | [Train Sentence Tagger Predictor]() | sentence as str | dict[str, numpy.ndarray] | +| Functionalities extensible | | | | | | | | +| Can consume own models | X | | [Building own models in AllenNLP]() | | | | | + + +
+ + +### Apache OpenNLP + +> [Apache OpenNLP website](https://opennlp.apache.org/) + +1. [x] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [x] Uses Maven (`pom.xml` exists) +3. [?] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [?] Is available from a p2 repository: +#### Feature matrix + +
+ +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- | :-------: | ---------------------------------------------------------------------------------------------------------------------------- | --------------------------------------- | ------------------------------------------------- | +| Tokenization/
segmentation | X | [- Whitespace Tokenizer]()
[- Character Tokenizer]()
[- Maximum Entropy Tokenizer]() | [Tokenizer Documentation]() | X | [Tokenizer Training docs]() | text as string | - array of strings
- array of token spans | +| Sentencing | X | [- Newline Sentence Splitter]()
[- Maximum Entropy Sentence Splitter]() | [Sentence Splitter Documentation]() | X | [Sentence Splitter Training docs]() | text as string | - array of strings
- array of sentence spans | +| POS-tagging | X | | [POS Tagger Documentation]() | X | [POS Tagger Training docs]() | string array of tokens | string array of POS tags | +| Constituency parsing | X | | [Constituency Parser Documentation]() | X | [Constituency Parser Documentation]() | string of whitespace tokenized sentence | array of OpenNLP's `Parse` type | +| Dependency parsing | | | | | | | | +| Named Entity Recognition | X | | [NER Documentation]() | X | [NER Training docs]() | string array of tokens | array of name spans | +| Functionalities extensible | X | | [Extension writing Documentation]() | | | | | +| Can consume own models | X | | [General API description]() | | | | | + +
+ + + +### Clear NLP + +> [Clear NLP website]() + +1. [X] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [X] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: + +#### Feature matrix + +
+ +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ---------------- | -------------------------------------------------------------------------------------------------------------------------- | :-------: | ------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Tokenization/
segmentation | X | | [Tokenizer Documentation]() | | | input file in [raw format]() | output file in [line format]() | +| Sentencing | X | | [Sentence Splitter Documentation]() | | | input file in [raw format]() | output file in [line format]() | +| POS-tagging | X | | | X | [POS Tagger Documentation]() | input file in [raw format]() | output file in [tab separated values format]() | +| Constituency parsing | | | | | | | | +| Dependency parsing | X | | [Dep Parse Documentation]() | X | [General Training docs]() | input file in [raw format]() | output file in [tab separated values format]() | +| Named Entity Recognition | X | | [NER Documentation]() | X | [General Training docs]() | input file in [raw format]() | output file in [tab separated values format]() | +| Functionalities extensible | X | | [Configuration documentation]() | | | | | +| Can consume own models | X | | [How to add models]() | | | | | + + +
+ + + +### CogComp NLP Pipeline + +> [CogComp NLP Pipeline website]() + +1. [X] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [X] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: + +#### Feature matrix + +
+ +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----------------------------: | ---------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------: | ----------------------------------------------------------------------------------------------------------------------- | ------------------ | ------------------------------------------------------------------------------------ | +| Tokenization/
segmentation | X | | [General description]() | | | string of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each annotation | +| Sentencing | | | | | | | | +| POS-tagging | X | | [General description]() | | | string of raw text | `TextAnnotation` data structure with `.getView(Viewname)` for each annotation | +| Constituency parsing | X | | [General description]()
[Parser documentation]( ) | | | string of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each annotation | +| Dependency parsing | X | - Stanford NLP Parser
- CogComp Parser (requires POS-Tagger and Chunker as part of the Pipeline) | [- General description]()
[- Stanford Parser documentation]( ) | | | string of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each annotation | +| Named Entity Recognition | X (isn't part of the Pipeline) | | [NER Documentation]() | X | [Training docs]() | string of raw text | `TextAnnotation` data structure with `.getView(Viewname)` for each annotation | +| Functionalities extensible | X | | [Configuration documentation]() | | | | | +| Can consume own models | X | | [Use own Tokenizer]() | | | | | + + +
+ + +### GATE & ANNIE + +> [GATE & ANNIE website]() + +1. [X] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [X] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: + +#### Feature matrix + +
+ +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----------------------: | --------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------: | ------------- | ----------------------- | ----------- | +| Tokenization/
segmentation | X | | [ANNIE Tokenizer Documentation]( ) | | | string of document text | output file | +| Sentencing | X | - ANNIE Default Sentence Splitter
- ANNIE Regex Sentence Splitter | [ANNIE Sentence Splitter Documentation]( ) | | | | | +| POS-tagging | X | | [ANNIE POS Tagger Documentation]() | | | | | +| Constituency parsing | | | | | | | | +| Dependency parsing | | | | | | | | +| Named Entity Recognition | X | | [Gazetteer documentation]()
[- Semantic Tagger]() | | | | | +| Functionalities extensible | | | | | | | +| Can consume own models | | | | | | | + + +
+ +### Lingpipe + +> [Lingpipe website]() + +1. [X] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [?] Uses Maven (`pom.xml` exists) +3. [X] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [X] Is available from a p2 repository: + +#### Feature matrix + +
+ +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----------------------: | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------: | ----------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------- | ------------------------------------------------------------------------ | +| Tokenization/
segmentation | X | [- IndoEuropeanTokenizer]()
[- CharacterTokenizer]()
[- RegEx Tokenizer]()
[- NGram Tokenizer]()
[- Line Tokenizer Factory]() | [Lingpipe Book]() Chapter 3 (p.33)
[Tokenizer API]()
[Tokenization API]( ) | | | - string of text
- char[] array, startindex, endindex | array of token strings | +| Sentencing | X | | [Sentence Splitter Tutorial]( ) | X | | char[] array of text | sentences as set of chunks | +| POS-tagging | X | [- Chain CRF Tagger]()
[- Classifier Tagger]()
[- HMM Decoder]() | [- Lingpipe Book Chapter 11]()
[- POS-Tagger Tutorial]() | X | [POS Tagger Tutorial Paragraph: Training Part-of-Speech Models]() | list of tokens as strings | [Tagging]() | +| Constituency parsing | | | | | | | +| Dependency parsing | | | | | | | | +| Named Entity Recognition | X | | [NER Documentation]() | | | | | +| Functionalities extensible | X | | [Paragraph 3. Evaluating and Tuning Tagging Models]() | | | | | +| Can consume own models | X | | [Developing and Tuning Sentence Models]() | | | | + + +
+ + +### NLP Architect + +> [NLP Architect website]() + +1. [ ] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: to the best of our knowledge it's only accessible via common ways to integrate Python scripts in Java (e.g. JEPP, PythonInterpreter, Runtime.exec(), ...) +2. [ ] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: n/a + +#### Feature matrix + +
+ +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----------------------: | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------- | :--------------------------: | -------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------ | +| Tokenization/
segmentation | | | | | | | | +| Sentencing | | | | | | | | +| POS-tagging | X | X | [POS Tagger Documentation]() | X | [POS Tagger Training docs]() | | | +| Constituency parsing | | | | | | | | +| Dependency parsing | X | X | [DepParse Documentation]() | | | - File path to a POS-tagged dataset in CoNLL-U format
- list of list of `ConllEntry`, where each entry represents a POS-tagged token and each nested list a sentence | | +| Named Entity Recognition | X | [Neural Tagger- CNNLSTM module]()
[Neural Tagger- IDCNN]()
[Transformer Token Classifier]() | [NER Documentation]() | X | [NER Training docs]() | | | +| Trainable models | X | | | | | | +| Can consume own models | X | | | | | | + + +
+ +### NLP4J + +> [NLP4J website]() + +1. [x] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [X] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: + +#### Feature matrix + +
+ +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------: | -------------------------------------------------------------------------------------- | ------------------------------------------------------------ | ---------------------------- | +| Tokenization/
segmentation | X | | [Tokenizer basic information]()
[Tokenizer Demo]() | X | [Basic Training information]() | - string line of text
- raw document text
tsv-format | list of `Token` types | +| Sentencing | | | | | | | | +| POS-tagging | X | | [POS Tag basic information]() | X | [Basic Training information]() | - string line of text
- raw document text
tsv-format | output-file with annotations | +| Constituency parsing | | | | | | | | +| Dependency parsing | X | | [Dependency Parse basic information]() | X | [Basic Training information]() | - string line of text
- raw document text
tsv-format | output-file with annotations | +| Named Entity Recognition | X | | [NER Documentation]() | X | [Basic Training information]() | - string line of text
- raw document text
tsv-format | output-file with annotations | +| Functionalities extensible | | | | | | | | +| Can consume own models | | | | | | | | + + +
+ + +### NLTK + +> [NLTK website]() 1. [ ] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: to the best of our knowledge it's only accessible via common ways to integrate Python scripts in Java (e.g. JEPP, PythonInterpreter, Runtime.exec(), ...) +2. [ ] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: n/a + +#### Feature matrix + +
+ +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------: | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | 
---------------------------------------------------------- | +| Tokenization/
segmentation | X | [- Casual module]()
[- Destructive module]()
[- Regexp Tokenizer]()
[- Repp Tokenizer]()
[- Simple Tokenizer]()
[- Stanford Segmenter]()
[- TokTok Tokenizer]()
[- Penn Treebank Tokenizer]() | [- NLTK Tokenizer/Sentence Splitter Documentation]()
[- NLTK Book Chapter 3]() | | | str of text | list(str) of token | +| Sentencing | X | [- Punkt Sent Module]()
[- Regexp Tokenizer]()
[- Simple Tokenizer]() | [- NLTK Tokenizer/Sentence Splitter Documentation]()
[- NLTK Book Chapter 3]() | X | [- Punkt Sent Module]() | str of text | list(str) of sentences | +| POS-tagging | X | [- Brill Tagger]()
[- CRF Tagger]()
[- HMM Tagger]()
[- HunPos Tagger]()
[- Perceptron Tagger]()
[- Senna POS Tagger]()
[- Sequential Backoff Tagger]()
[- Stanford Tagger]()
[- TNT Tagger]() | [- NLTK Tagger Documentation]()
[- NLTK Book Chapter 5]() | X | [- Brill Tagger Training]()
[- CRF Tagger]()
[- HMM Tagger]()
[HunPos Tagger]()
[- Perceptron Tagger]()
[- Sequential Backoff Tagger]()
[- Stanford Tagger]()
[- TNT Tagger]() | - tokens (list(str))
- sentences (list(list(str))) | - list(tuple(str, str))
- list(list(tuple(str, str))) | +| Constituency parsing | X | [- Earley Chart Parser]()
[- Recursive descent Parser]()
[- Shift Reduce Parser]()
[- Stanford Parser]()
| [- NLTK Parser Documentation]()
[- NLTK Book - Chapter 8]() | | | sent (list(str))
sentences (list(list(str))) | - iter(Tree)
- iter(iter(Tree)) | +| Dependency parsing | X | [- CoreNLP Dependency Parser]()
[- Malt Parser]()
[- Nonprojective Dependency Parser]()
[- Projective Dependency Parser]()
[- Transition Parser]() | [- NLTK Parser Documentation]()
[- NLTK Book - Chapter 8 Paragraph 5]() | X | [- Malt Parser]()
[- Nonprojective Dependency Parser]()
[- Projective Dependency Parser]()
[- Transition Parser]() | - CoreNLP: sentences (list(str)) – input sentences to parse
- Malt: str sentences | CoreNLP: iter(iter(Tree))
Malt: iter(DependencyGraph) | +| Named Entity Recognition | X | | [- NLTK Book Chapter 7 Paragraph 5](<https://www.nltk.org/book/ch07.html>)
[- NE Chunker]() | X | [- NE Chunker]() | list of POS-tagged tokens | NE-tagged parse-tree | +| Functionalities extensible | | | | | | | | +| Can consume own models | X | | | | | | | + + +
+ + +### Pattern + +> [Pattern website]() + +1. [ ] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: to the best of our knowledge it's only accessible via common ways to integrate Python scripts in Java (e.g. JEPP, PythonInterpreter, Runtime.exec(), ...) +2. [ ] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: ? + +#### Feature matrix + +
+ +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ---------------- | ---------------------------------------------------------------- | :-------: | ------------- | -------------- | ----------------------------------------- | +| Tokenization/
segmentation | X | | [General documentation]() | | | string of text | nested list : [sentences[chunks[tokens]]] | +| Sentencing | X | | [General documentation]() | | | string of text | nested list : [sentences[chunks[tokens]]] | | | +| POS-tagging | X | | [General documentation]() | | | string of text | nested list : [sentences[chunks[tokens]]] | +| Constituency parsing | | | | | | | | +| Dependency parsing | | | | | | | | +| Named Entity Recognition | | | | | | | | +| Functionalities extensible | | | | | | | +| Can consume own models | | | | | | | | + + +
+ + + + + + +### SpaCy + +> [SpaCy website]() + +1. [ ] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: [SpaCy Container & APIs]() +2. [ ] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: n/a + +#### Feature matrix + +
+ +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------: | --------------------------------------------------------------------------------------------------------------------------------------------- | ------------------ | --------------------------------------------------------------------- | +| Tokenization/
segmentation | X | | [Tokenizer Documentation]( ) | | | raw document text | SpaCy's `Doc` type - tokens accessible through `token.text` property | +| Sentencing | X | | [Sentencizer Documentation]() | | | SpaCy's `Doc` type | SpaCy's `Doc` type - sentence access through `doc.sents` property | +| POS-tagging | X | | [POS Tagger Documentation]() | X | [Training Basics]()
[Updating POS Tagger]() | SpaCy's `Doc` type | SpaCy's `Doc` type - POS-Tags access through `token.pos_` attribute | +| Constituency parsing | | | | | | | | +| Dependency parsing | X | | [Parser Documentation]() | X | [Training Basics]()
[Updating Dependency Parser]() | SpaCy's `Doc` type | SpaCy's `Doc` type - Parse-tags access through `token.dep_` attribute | +| Named Entity Recognition | X | | [NER Documentation]() | | | SpaCy's `Doc` type | SpaCy's `Doc` type - entity-tags access through `doc.ents` property | +| Functionalities extensible | X | | [Extend Tokenizer]()
[Customize Tokenizer]()
[Custom Component]() | | | | | +| Can consume own models | (X) | | [Load different Tokenizer]() | | | | | + + +
+ +### Spark NLP + +> [Spark NLP website]() + +1. [ ] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: [Using Spark NLP via Scala and Maven]() +2. [X] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: + +#### Feature matrix + +
+ +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------- | :-------: | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------- | ------------------------------ | +| Tokenization/
segmentation | X | [^2] | [Tokenizer Documentation]() | | | `Document` type | `Token` types | +| Sentencing | X | [^2] | [Sentence Detector Documentation]() | | | `Document` type | `Sentence` types | +| POS-tagging | X | [^2] | [POS Tagger Documentation]() | X | [POS Training docs]()
[General Training docs]() | `Document` type
`Token`types | `POS` Type | +| Constituency parsing | | | | | | | | +| Dependency parsing | X | [- Typed Dependency Parser]()
[Untyped Dependency Parser]() | [Dependency Parser Documentation]() | X | [General Training docs]() | `Document` type
`POS` type
`Token` type | (unlabeled) `Dependency` types | +| Named Entity Recognition | X | [- NER CRF]()
[- NER DL]() | [NER Documentation](<>) | X | [- NER CRF]()
[- NER DL]() | | | +| Functionalities extensible | X | | [Manipulating Pipelines]() | | | `Document` type | `Named_Entity` types | +| Can consume own models | X | | [General Concept Documentation]() | | | | | + + +
+ +[^2]: Spark NLP provides a [library of pretrained Pipelines]() and a [library of models](). This survey, however, refers to the general Annotators. + +### Stanford CoreNLP + +> [Stanford CoreNLP website]() + +1. [X] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java - Can be addressed as follows: n/a +2. [X] Uses Maven (`pom.xml` exists) +3. [X] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [X] Is available from a p2 repository: + +#### Feature matrix + +
+ +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------- | :-------: | ------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Tokenization/
segmentation | X | [- Abstract Tokenizer]()
[- CHBT Tokenizer]()
[- Lexer Tokenizer]()
[- Negra Penn Tokenizer]()
[- Penn Treebank Tokenizer]()
[- PTB Tokenizer]()
[- Robust Tokenizer]()
[- WhitespaceTokenizer]() | [Tokenizer Documentation]() | | | - string of document text
- CoreNLP's `CoreDocument` (instantiated with string document text) | - list of strings
- list of character offset begin indices
- list of character offset end indices
`CoreDocument` with previous annotation properties | +| Sentencing | X | | [Sentencer Documentation]() | | | tokenized `CoreDocument` | `CoreDocument` with list of sentences as property | +| POS-tagging | X | | [POS Tag Documentation]() | X | | tokenized and sentence-split `CoreDocument` | `CoreDocument` with String List of POS-Tags as property | +| Constituency parsing | X | [Viterbi Parser]()
[Shift reduce Parser]()
[Iterative CKYPCFG Parser]()
[Fast Factored Parser]()
[Exhaustive PCFG Parser]() | [Constituency Parser Documentation]() | | | tokenized, sentence-split (and for some models POS-tagged) `CoreDocument` | `CoreDocument` with TreeAnnotation (exact form depends on chosen parser) | +| Dependency parsing | X | [BiLexPCFGParser]()
[Exhaustive Dependency Parser]() | [Dep Parse Documentation]() | X | [Train own Model]() | tokenized, sentence-split and POS-tagged `CoreDocument` | `CoreDocument` with DependencyAnnotation (exact form depends on chosen parser) | +| Named Entity Recognition | X | [NER Classifier Combiner]()
[- Regex NER Annotator]() | [NER Documentation]() | X | [NER Training docs]() | tokenized, sentence-split, POS-tagged, (lemmatized) `CoreDocument` | `CoreDocument` with `Named Entity Tag Annotation` or `Normalized Named Entity Tag Annotation` | +| Functionalities extensible | X | [Custom annotator]() | | | | +| Can consume own models | X | | [Example of including own (caseless) model]() | | | | + + +
+ +### Talismane + +> [Talismane website]() + +1. [X] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [X] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: + +#### Feature matrix + +
+ +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------- | :-------: | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------ | ------------ | +| Tokenization/
segmentation | X | - Simple tokenizer
- pattern tokenizer | [Tokenization Documentation](<>) | X | [Tokenizer Training docs]() | string of raw text | CoNLL format | +| Sentencing | X | | | X | [Sentence Splitter Training docs]() | string of raw text | CoNLL format | +| POS-tagging | X | | | X | [POS Tagger Training docs]() | string of raw text | CoNLL format | +| Constituency parsing | | | | | | | | +| Dependency parsing | X | | [DepParser Documentation (under construction)]() | X | [DepParser Training docs]() | string of raw text | CoNLL format | +| Named Entity Recognition | | | | | | | | +| Functionalities extensible | X | | [Advanced Usage]() | | | | | +| Can consume own models | | | | | | | | + + + +
+ + +### TextBlob + +> [TextBlob website]() + +1. [ ] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: to the best of our knowledge it's only accessible via common ways to integrate Python scripts in Java (e.g. JEPP, PythonInterpreter, Runtime.exec(), ...) 2. [ ] Uses Maven (`pom.xml` exists) 3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) 4. [ ] Is available from a p2 repository: n/a +4. [ ] Is available from a p2 repository: #### Feature matrix -
+
+ +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | -------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------: | ------------- | ------------------ | -------------------------------------------------------------------------------------------- | +| Tokenization/
segmentation | X | | [Tokenization Tutorial]()
[Advanced Tokenization Documentation]() | | | string of raw text | `TextBlob` data type - access through `words` property: WordList of word strings | +| Sentencing | X | | [Sentence Splitting Tutorial]()
[Advanced Sentence Splitting Documentation]() | | | string of raw text | `TextBlob` data type - access through `sentences` property: list of sentence objects | +| POS-tagging | X | - PatternTagger
- NLTKTagger | [POS Tagger Tutorial]()
[POS Tagger Advanced Usage]() | | | string of raw text | `TextBlob` data type - access through `tags` property: list of (word string, tag string) tuples | +| Constituency parsing | X | | [Parser Tutorial]()
[Parser Advanced Usage]() | | | string of raw text | `TextBlob` data type - access through `parse()` method: TaggedString | +| Dependency parsing | X | | [Parser Tutorial]()
[Parser Advanced Usage]() | | | string of raw text | `TextBlob` data type - access through `parse()` method: TaggedString | +| Named Entity Recognition | | | | | | | | +| Functionalities extensible | | | | | | | | +| Can consume own models | X | | [Passing models into the Pipeline]()
[Training own data]() | | | | | -| | Has functionality | Functionality extensible | Functionality documentation | Extension documentation | Input data | Output data | -|---------------------------|--------------------------|--------------------------|-----------------------------|------------------------------|--------------------------------------|-------------| -| Tokenization/segmentation | | | | | | | -| Sentencing | | | | | | | -| POS-tagging | | | | | | | -| Constituency parsing | | | | | | | -| Dependency parsing | | | | | | | -| Trainable models | | | | | | | -| Can consume own models | | | | | | |
---- + -[^1]: The survey was carried out by Prashant Dangwal. \ No newline at end of file +[^1]: The survey was carried out by Clara Lachenmaier.