From 61e98ddeaa728092c902393d871f8e82517efebf Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Wed, 6 Jan 2021 08:21:37 +0100 Subject: [PATCH 01/39] Add OpenNLP --- src/architecture/extensibility/nlp-survey.md | 29 +++++++++++++++++++- 1 file changed, 28 insertions(+), 1 deletion(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 22bc7af..c2c6f04 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -52,6 +52,33 @@ The results of the survey are below. --- + +### [Apache OpenNLP]() + +1. [x] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [x] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: n/a +#### Feature matrix + +
+ +| | Has functionality | Functionality extensible | Functionality documentation | Extension documentation | Input data | Output data | +|---------------------------|:--------------------------:|:--------------------------:|-----------------------------|------------------------------|--------------------------------------|-------------| +| Tokenization/segmentation | X | X | | | string of (untokenized) text |-array of strings
-array of token spans| +| Sentencing | X | X || | string of document text | -array of strings
-array of sentence spans| +| POS-tagging | X | X | | | string array of tokens | string array of tags | +| Constituency parsing | X | X | | | String of whitespace tokenized Sentence | array of OpenNLP's Parse Type | +| Dependency parsing | | | | | | | +| Trainable models | X | X | | | | | +| Can consume own models | X | X | -
-
-
- | | | + +
+ + + -[^1]: The survey was carried out by Prashant Dangwal. \ No newline at end of file +[^1]: The survey was carried out by Clara Lachenmaier. From 69ff61d90549d92b814597e40d5a39c3b33dd807 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Wed, 6 Jan 2021 08:24:26 +0100 Subject: [PATCH 02/39] Add Spacy --- src/architecture/extensibility/nlp-survey.md | 28 ++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index c2c6f04..ef2ddfc 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -79,6 +79,34 @@ The results of the survey are below. +### [SpaCy]() + +1. [ ] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [ ] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: n/a + +#### Feature matrix + +
+ +| | Has functionality | Functionality extensible | Functionality documentation | Extension documentation | Input data | Output data | +|---------------------------|--------------------------|--------------------------|-----------------------------|------------------------------|--------------------------------------|-------------| +| Tokenization/segmentation | X | X | | -
- | raw document text | SpaCy's `Doc` type | +| Sentencing | X | X | | | SpaCy's `Doc` type | SpaCy's `Doc` type | +| POS-tagging | X | | | | Spacy's `Doc` type |Spacy's `Doc` type | +| Constituency parsing | | | | | | | +| Dependency parsing | X | | | | Spacy's `Doc` type | Spacy's `Doc` type | +| Trainable models | X | | | | | | +| Can consume own models | X | | | | | | + + +
+ + + [^1]: The survey was carried out by Clara Lachenmaier. From 235b82ed616453913e87a1466ea4cc5acf6c3ce5 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Wed, 6 Jan 2021 09:39:28 +0100 Subject: [PATCH 03/39] Add CoreNLP --- src/architecture/extensibility/nlp-survey.md | 27 ++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index ef2ddfc..197186e 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -79,6 +79,33 @@ The results of the survey are below. +### [CoreNLP]() + +1. [X] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [X] Uses Maven (`pom.xml` exists) +3. [X] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: n/a + +#### Feature matrix + +
+ +| | Has functionality | Functionality extensible | Functionality documentation | Extension documentation | Input data | Output data | +|---------------------------|--------------------------|--------------------------|-----------------------------|------------------------------|--------------------------------------|-------------| +| Tokenization/segmentation | X | X | | | -string of document text
-CoreNLP's `CoreDocument` (instantiated with string document text) | -list of strings
-list of characteroffsetbegin indices
-list of characteroffsetend indices
`CoreDocument` with previous annotation properties | +| Sentencing | X | X | | | tokenized `CoreDocument` | `CoreDocument` with Sentence List of POS-Tags as property | +| POS-tagging | X | X | | | tokenized and sentence-splitted `CoreDocument` | `CoreDocument` with String List of POS-Tags as property | +| Constituency parsing | X | X | | | tokenized, sentence-splitted (and for some models POS-tagged) `CoreDocument` | `CoreDocument` with TreeAnnotation (exact form depends on chosen parser) | +| Dependency parsing | X | X | | | tokenized, sentence-splitted and POS-tagged `CoreDocument` | `CoreDocument` with DependencyAnnotation (exact form depends on chosen parser) | +| Trainable models | X | | | | | | +| Can consume own models | X | | | | | | + + +
+ + ### [SpaCy]() 1. [ ] Implemented in Java From ae2df8102d3698b656772552f87dd9e8797a520f Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Wed, 27 Jan 2021 06:50:47 +0100 Subject: [PATCH 04/39] Add Spark NLP --- src/architecture/extensibility/nlp-survey.md | 28 ++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 197186e..e1a78fc 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -106,6 +106,34 @@ The results of the survey are below. +### [Lingpipe]() + +1. [X] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [?] Uses Maven (`pom.xml` exists) +3. [X] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [?] Is available from a p2 repository: ? + +#### Feature matrix + +
+ +| | Has functionality | Functionality extensible | Functionality documentation | Extension documentation | Input data | Output data | +|---------------------------|--------------------------|--------------------------|-----------------------------|------------------------------|--------------------------------------|-------------| +| Tokenization/segmentation | X | | Chapter 3 (p.33)

- | | -String of text
-Character array, Startindex, Endindex | | +| Sentencing | X | | | | | | +| POS-tagging | X | | | | | | +| Constituency parsing | | | | | | | +| Dependency parsing | | | | | | | +| Trainable models | | | | | | | +| Can consume own models | | | | | | | + + +
+ + + ### [SpaCy]() 1. [ ] Implemented in Java From da258088ecfbcc0ac6e79b11e9a98fb9cb2d6cda Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Wed, 27 Jan 2021 06:52:57 +0100 Subject: [PATCH 05/39] Add Pattern --- src/architecture/extensibility/nlp-survey.md | 53 ++++++++++++++++++++ 1 file changed, 53 insertions(+) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index e1a78fc..ca4dcd2 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -103,6 +103,32 @@ The results of the survey are below. | Can consume own models | X | | | | | | + + +### [Pattern]() + +1. [ ] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [ ] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: ? + +#### Feature matrix + +
+ +| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | +|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| +| Tokenization/segmentation | X | |[General documentation]() | | | String of text | nested list : [sentences[chunks[tokens]]] | +| Sentencing | X | | [General documentation]()| | | String of text | nested list : [sentences[chunks[tokens]]] | | | +| POS-tagging | X | | [General documentation]() | | | String of text | nested list : [sentences[chunks[tokens]]] | +| Constituency parsing | | | | | | | | +| Dependency parsing | | | | | | | | +| Functionalities extensible | | | | | | | +| Can consume own models | | | | | | | | + +
@@ -160,6 +186,33 @@ The results of the survey are below. +### [Spark NLP]() + +1. [ ] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: [Using Spark NLP via Scala and Maven]() +2. [X] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: ? https://mvnrepository.com/artifact/JohnSnowLabs/spark-nlp + +#### Feature matrix + +
+ +| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | +|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| +| Tokenization/segmentation | X | [^2] | [Tokenizer Documentation]() | | | Document | Token | +| Sentencing | X | [^2] | [Sentence Detector Documentation]() | | | Document | Sentences | +| POS-tagging | X | [^2] | [POS-Tagger Documentation]() | X | [POS Training Documentation]()
[General Training Documentation]()| Document, Token | POS | +| Constituency parsing | | | | | | | | +| Dependency parsing | X | -[Typed Dependency Parser]()
[Untyped Dependency Parser]() | [Dependency Parser Documentation]()| X | [General Training Documentation]() | Document,POS,Token | (unlabeled) Dependency | +| Functionalities extensible | X | | [Manipulating Pipelines]() | | | | | +| Can consume own models | X | | [General Concept Documentation]() | | | | | + + +
+ +[^2]:Spark NLP provides a [library of pretrained Pipelines]() and a [library of models](). This survey however refers to the general Annotators. From 540c583303b8ef8212645bb307258a4e41eb3ae3 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Wed, 27 Jan 2021 06:55:06 +0100 Subject: [PATCH 06/39] Add NLP Architect --- src/architecture/extensibility/nlp-survey.md | 28 ++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index ca4dcd2..56d0da7 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -105,6 +105,34 @@ The results of the survey are below. +### [NLP Architect]() + +1. [ ] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [ ] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: n/a + +#### Feature matrix + +
+ +| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | +|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| +| Tokenization/segmentation | | | | | | | +| Sentencing | | | | | | | +| POS-tagging | X | X | [POS-Tagger Documentation]() | https://intellabs.github.io/nlp-architect/tagging/sequence_tagging.html#custom-training-parameters | | | +| Constituency parsing | | | | | | | +| Dependency parsing | X | X | [DepParse Documentation]() | | | - Filepath with POS-tagged Dataset in CoNLL-U format
-list of lists of `ConllEntry`, where each entry represents a POS-tagged token and each nested list a sentence | | +| Trainable models | X | | | | | | +| Can consume own models | X | | | | | | + + +
+ + + ### [Pattern]() 1. [ ] Implemented in Java From fe31d3b34d5e54f3330f20dba6ea71a8d0a885b1 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Wed, 27 Jan 2021 06:56:23 +0100 Subject: [PATCH 07/39] Add Textblob --- src/architecture/extensibility/nlp-survey.md | 28 ++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 56d0da7..306ce28 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -243,6 +243,34 @@ The results of the survey are below. [^2]:Spark NLP provides a [library of pretrained Pipelines]() and a [library of models](). This survey however refers to the general Annotators. +### [TextBlob]() + +1. [ ] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [ ] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: + +#### Feature matrix + +
+ +| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | +|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| +| Tokenization/segmentation | X | | [Tokenization Tutorial]()
[Advanced Tokenization Documentation]() | | | String of raw text | `TextBlob` data type - access through `words` property: WordList of word strings | +| Sentencing | X | | [Sentence-Splitting Tutorial]()
[Advanced Sentence-Splitting Documentation]() | | | String of raw text | `TextBlob` data type - access through `sentences` property: List of Sentence Objects| +| POS-tagging | X | -PatternTagger
-NLTKTagger | [POS-Tagger Tutorial]()
[POS-Tagger Advanced Usage]() | | | String of raw text | `TextBlob` data type - access through `tags` property: List of word string tag string tuples | +| Constituency parsing | X | | [Parser Tutorial]()
[Parser Advanced Usage]()| | | String of raw text | `TextBlob` data type - access through `parse()` method: TaggedString | +| Dependency parsing | X | |[Parser Tutorial]()
[Parser Advanced Usage]()| | |String of raw text | `TextBlob` data type - access through `parse()` method: TaggedString | +|Functionalities extensible | | | | | | | | +| Can consume own models | X | | [Passing models into the Pipeline]()
[Training own data]() | | | | | + + + +
+ + [^1]: The survey was carried out by Clara Lachenmaier. From 155636be0add9420d1894d7520d0eddcc4975eeb Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Wed, 27 Jan 2021 06:57:56 +0100 Subject: [PATCH 08/39] Add CogComp NLP Pipeline --- src/architecture/extensibility/nlp-survey.md | 26 ++++++++++++++++++++ 1 file changed, 26 insertions(+) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 306ce28..48f46e1 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -79,6 +79,32 @@ The results of the survey are below. +### [CogComp NLP Pipeline]() + +1. [X] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [X] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: n/a + +#### Feature matrix + +
+ +| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | +|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| +| Tokenization/segmentation | X | | [General description]() | | | String of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each Annotation | +| Sentencing | | | | | | | | +| POS-tagging | X | | [General description]() | | | String of raw text | `TextAnnotation` data structure with `View` for each Annotation | +| Constituency parsing | X | | [General description]()
[Parser documentation]( ) | | | String of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each Annotation | +| Dependency parsing | X | -Stanford NLP Parser
-CogComp Parser (requires POS-Tagger and Chunker as part of the Pipeline) | [General description]()
[Stanford Parser documentation]( ) | | | String of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each Annotation | +| Functionalities extensible | X | | [Configuration documentation]() | | | | | +| Can consume own models | X | | [Use own Tokenizer]() | | | | | + + +
+ ### [CoreNLP]() 1. [X] Implemented in Java From dfc855a036271db67807967414e9cb38cb41d772 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Wed, 27 Jan 2021 07:00:59 +0100 Subject: [PATCH 09/39] Revise SpaCy --- src/architecture/extensibility/nlp-survey.md | 21 ++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 48f46e1..c92c695 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -214,7 +214,7 @@ The results of the survey are below. -### [SpaCy]() +### [SpaCy]() 1. [ ] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java @@ -227,15 +227,16 @@ The results of the survey are below.
-| | Has functionality | Functionality extensible | Functionality documentation | Extension documentation | Input data | Output data | -|---------------------------|--------------------------|--------------------------|-----------------------------|------------------------------|--------------------------------------|-------------| -| Tokenization/segmentation | X | X | | -
- | raw document text | SpaCy's `Doc` type | -| Sentencing | X | X | | | SpaCy's `Doc` type | SpaCy's `Doc` type | -| POS-tagging | X | | | | Spacy's `Doc` type |Spacy's `Doc` type | -| Constituency parsing | | | | | | | -| Dependency parsing | X | | | | Spacy's `Doc` type | Spacy's `Doc` type | -| Trainable models | X | | | | | | -| Can consume own models | X | | | | | | +| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | +|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| +| Tokenization/segmentation | X | | [Tokenizer Documentation]( )| | |raw document text | SpaCy's `Doc` type - tokens accessible through `token.text` property | +| Sentencing | X | | [Sentencizer Documentation]() | | | SpaCy's `Doc` type | SpaCy's `Doc` type - sentence access through `doc.sents` property | +| POS-tagging | X | | [POS-Tagger Documentation]() |X | [Training Basics]()
[Updating POS-Tagger]() | Spacy's `Doc` type |Spacy's `Doc` type - POS-Tags access through`token.pos_` attribute| +| Constituency parsing | | | | | | | | +| Dependency parsing | X | | [Parser Documentation]() | X| [Training Basics]()
[Updating Dependency Parser]() | Spacy's `Doc` type | Spacy's `Doc` type - Parse-tags access through `token.dep_` attribute | +| Functionalities extensible | X | | [Extend Tokenizer]()
[Customize Tokenizer]()
[Custom Component]() | | | | +| Can consume own models | (X) | | [Load different Tokenizer]() + | | | |
From f8e845970458be5bd04b19468fe00e654c3c94f8 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Wed, 27 Jan 2021 07:04:27 +0100 Subject: [PATCH 10/39] Revise Standford CoreNLP --- src/architecture/extensibility/nlp-survey.md | 52 ++++++++++---------- 1 file changed, 26 insertions(+), 26 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index c92c695..7ed2261 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -103,32 +103,6 @@ The results of the survey are below. | Can consume own models | X | | [Use own Tokenizer]() | | | | | - - -### [CoreNLP]() - -1. [X] Implemented in Java - 1. [ ] Not Java, but API can be addressed from Java - - Can be addressed as follows: n/a -2. [X] Uses Maven (`pom.xml` exists) -3. [X] Is available as OSGi bundle (has `MANIFEST.MF`) -4. [ ] Is available from a p2 repository: n/a - -#### Feature matrix - -
- -| | Has functionality | Functionality extensible | Functionality documentation | Extension documentation | Input data | Output data | -|---------------------------|--------------------------|--------------------------|-----------------------------|------------------------------|--------------------------------------|-------------| -| Tokenization/segmentation | X | X | | | -string of document text
-CoreNLP's `CoreDocument` (instantiated with string document text) | -list of strings
-list of characteroffsetbegin indices
-list of characteroffsetend indices
`CoreDocument` with previous annotation properties | -| Sentencing | X | X | | | tokenized `CoreDocument` | `CoreDocument` with Sentence List of POS-Tags as property | -| POS-tagging | X | X | | | tokenized and sentence-splitted `CoreDocument` | `CoreDocument` with String List of POS-Tags as property | -| Constituency parsing | X | X | | | tokenized, sentence-splitted (and for some models POS-tagged) `CoreDocument` | `CoreDocument` with TreeAnnotation (exact form depends on chosen parser) | -| Dependency parsing | X | X | | | tokenized, sentence-splitted and POS-tagged `CoreDocument` | `CoreDocument` with DependencyAnnotation (exact form depends on chosen parser) | -| Trainable models | X | | | | | | -| Can consume own models | X | | | | | | - -
### [NLP Architect]() @@ -269,6 +243,32 @@ The results of the survey are below. [^2]:Spark NLP provides a [library of pretrained Pipelines]() and a [library of models](). This survey however refers to the general Annotators. +### [Stanford CoreNLP]() + +1. [X] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [X] Uses Maven (`pom.xml` exists) +3. [X] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: n/a + +#### Feature matrix + +
+ +| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | +|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| +| Tokenization/segmentation | X | see [list]() of classes implementing the Tokenizer interface | [Tokenizer Documentation]() | | | -string of document text
-CoreNLP's `CoreDocument` (instantiated with string document text) | -list of strings
-list of characteroffsetbegin indices
-list of characteroffsetend indices
`CoreDocument` with previous annotation properties | +| Sentencing | X | | [Sentencer Documentation]() | | | tokenized `CoreDocument` | `CoreDocument` with Sentence List of POS-Tags as property | +| POS-tagging | X | | [POS-Tag Documentation]() | | | tokenized and sentence-splitted `CoreDocument` | `CoreDocument` with String List of POS-Tags as property | +| Constituency parsing | X | [Viterbi Parser]()
[Shift reduce Parser]()
[Iterative CKYPCFG Parser]()
[Fast Factored Parser]()
[Exhaustive PCFG Parser]() | [Constituency Parser Documentation]() | | | tokenized, sentence-splitted (and for some models POS-tagged) `CoreDocument` | `CoreDocument` with TreeAnnotation (exact form depends on chosen parser) | +| Dependency parsing | X |[BiLexPCFGParser]()
[Exhaustive Dependency Parser]() |[DepParse Documentation]() | X | [Train own Model]() |tokenized, sentence-splitted and POS-tagged `CoreDocument` | `CoreDocument` with DependencyAnnotation (exact form depends on chosen parser) | +| Functionalities extensible | X | [Custom annotator]() | | | | +| Can consume own models | X | | [Example of including own (caseless) model]() | | | | + + +
+ ### [TextBlob]() From 763bc604c79973e9833bf9074e20a9943c4ca438 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Wed, 27 Jan 2021 07:05:45 +0100 Subject: [PATCH 11/39] Revise Apache OpenNlp --- src/architecture/extensibility/nlp-survey.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 7ed2261..3131aac 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -65,15 +65,15 @@ The results of the survey are below.
-| | Has functionality | Functionality extensible | Functionality documentation | Extension documentation | Input data | Output data | -|---------------------------|:--------------------------:|:--------------------------:|-----------------------------|------------------------------|--------------------------------------|-------------| -| Tokenization/segmentation | X | X | | | string of (untokenized) text |-array of strings
-array of token spans| -| Sentencing | X | X || | string of document text | -array of strings
-array of sentence spans| -| POS-tagging | X | X | | | string array of tokens | string array of tags | -| Constituency parsing | X | X | | | String of whitespace tokenized Sentence | array of OpenNLP's Parse Type | -| Dependency parsing | | | | | | | -| Trainable models | X | X | | | | | -| Can consume own models | X | X | -
-
-
- | | | +| | Has functionality | Multiple Options | Functionality documentation | Is Trainable | Training documentation | Input data | Output data | +|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| +| Tokenization/segmentation |X|-Whitespace Tokenizer
-Character Tokenizer
-Maximum Entropy Tokenizer || X | |string of (untokenized) text | array of strings
-array of token spans| +| Sentencing|X| || X | |string of document text | -array of strings
-array of sentence spans | +| POS-tagging| X | | | x | | string array of tokens | string array of tags | +| Constituency parsing | X | | | X | | String of whitespace tokenized Sentence | array of OpenNLP's Parse Type | +| Dependency parsing | | | | | | | | +|Functionalities extensible|X| | | | | | +| Can consume own models | | | | | | | |
From c18f95aea4b3785e7fba740551a401979edf127a Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Wed, 27 Jan 2021 07:45:42 +0100 Subject: [PATCH 12/39] Add NLP4J --- src/architecture/extensibility/nlp-survey.md | 24 ++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 3131aac..b1688fe 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -131,7 +131,31 @@ The results of the survey are below. +### [NLP4J]() +1. [x] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [X] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: + +#### Feature matrix + +
+ +| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | +|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| +| Tokenization/segmentation | X | | [Tokenizer basic information]()
[Tokenizer Demo]() | X | [Basic Training information]() | -string line of text
-Raw document text
tsv-format | List of Token types | +| Sentencing | | | | | | | | +| POS-tagging | X | | [POS-Tag basic information]() | X | [Basic Training information]() | -string line of text
-Raw document text
tsv-format | Output-file with annotations| +| Constituency parsing | | | | | | | | +| Dependency parsing | X | | [Dependency Parse basic information]() | X | [Basic Training information]() | -string line of text
-Raw document text
tsv-format | Output-file with annotations | +| Functionalities extensible | | | | | | | | +| Can consume own models | | | | | | | | + + +
### [Pattern]() From ed86e4fc8976018078ee90e20b8829fcbf4807b3 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Wed, 27 Jan 2021 09:17:22 +0100 Subject: [PATCH 13/39] Add NLTK --- src/architecture/extensibility/nlp-survey.md | 28 ++++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index b1688fe..f526b13 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -157,6 +157,34 @@ The results of the survey are below. + +### [NLTK]() + +1. [ ] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [ ] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: n/a + +#### Feature matrix + +
+ +| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | +|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| +| Tokenization/segmentation | X | [-Casual module]()
[-Destructive module]()
[-Regexp Tokenizer]()
[-Repp Tokenizer]()
[-Simple Tokenizer]()
[-Stanford Segmenter]()
[-TokTok Tokenizer]()
[-Penn Treebank Tokenizer]()| [-NLTK Tokenizer/Sentence Splitter Documentation]()
[-NLTK Book Chapter 3]() | | | String of Text | List of Token Strings | +| Sentencing | X | [-Punkt Sent Module]()
[-Regexp Tokenizer]()
[-Simple Tokenizer]() | [-NLTK Tokenizer/Sentence Splitter Documentation]()
[-NLTK Book Chapter 3]() |X | [-Punkt Sent Module]() | String of Text | List of Sentence Strings | +| POS-tagging | X | [-Brill Tagger]()
[-CRF Tagger]()
[-HMM Tagger]()
[-HunPos Tagger]()
[-Perceptron Tagger]()
[-Senna POS Tagger]()
[-Sequential Backoff Tagger]()
[-Stanford Tagger]()
[-TNT Tagger]()| [-NLTK Tagger Documentation]()
[-NLTK Book Chapter 5]() | X| [-Brill Tagger Training]()
[-CRF Tagger]()
[-HMM Tagger]()
[HunPos Tagger]()
[-Perceptron Tagger]()
[-Sequential Backoff Tagger]()
[-Stanford Tagger]()
[-TNT Tagger]() | tokens (list(str))
sentences (list(list(str))) | list(tuple(str, str))
list(list(tuple(str, str)) ) | +| Constituency parsing | X | [-Earley Chart Parser]()
[-Recursive descent Parser]()
[-Shift Reduce Parser]()
[-Stanford Parser]()
| [-NLTK Parser Documentation]()
[-NLTK Book - Chapter 8]() | | | sent (list(str))
sentences (list(list(str))) | iter(Tree)
iter(iter(Tree)) | +| Dependency parsing | X | [-CoreNLP Dependency Parser]()
[-Malt Parser]()
[-Nonprojective Dependency Parser]()
[-Projective Dependency Parser]()
[-Transition Parser]() | [-NLTK Parser Documentation]()[-NLTK Book - Chapter 8 Paragraph 5]() | X | [Malt Parser]()
[-Nonprojective Dependency Parser]()
[-Projective Dependency Parser]()
[-Transition Parser]() | CoreNLP: sentences (list(str)) – Input sentences to parse
Malt: sentences | CoreNLP: iter(iter(Tree))
Malt:iter(DependencyGraph) | +| Functionalities extensible | | | | | | | | +| Can consume own models | X | | | | | | | + + +
+ + ### [Pattern]() 1. [ ] Implemented in Java From b9db3d20d5675db1723d94f43239a4cfe47b8543 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Thu, 28 Jan 2021 09:34:49 +0100 Subject: [PATCH 14/39] Add AllenNLP --- src/architecture/extensibility/nlp-survey.md | 27 ++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index f526b13..f974868 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -52,6 +52,33 @@ The results of the survey are below. --- +### [AllenNLP]() + +1. [ ] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [ ] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: n/a + +#### Feature matrix + +
+ +| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | +|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| +| Tokenization/segmentation | X | [-SpacyTokenizer]()
[-PretrainedTransformerTokenizer]() | [AllenNLP Tokenizer Documentation]() | | | String of document text | List of Token Strings | +| Sentencing | X | -Sentence Splitter
-SpaCy Sentence Splitter | [Sentence Splitter API]() | | | Text as String | List of Sentence Strings | +| POS-tagging | | | | | | | | +| Constituency parsing | X | | [Constituency Parser Demo]() | | | | | +| Dependency parsing | X | | [Dependency Parser Demo]() | | | | | +| Functionalities extensible | | | | | | | | +| Can consume own models | X | | [Building own models in AllenNLP]() | | | | | + + +
+ + ### [Apache OpenNLP]() From 23a917f03d1885031b0a64740c44ec9c7f602564 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Mon, 1 Feb 2021 07:50:46 +0100 Subject: [PATCH 15/39] Add repositories --- src/architecture/extensibility/nlp-survey.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index f974868..36001cf 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -87,7 +87,7 @@ The results of the survey are below. - Can be addressed as follows: n/a 2. [x] Uses Maven (`pom.xml` exists) 3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) -4. [ ] Is available from a p2 repository: n/a +4. [X] Is available from a p2 repository: #### Feature matrix
@@ -113,7 +113,7 @@ The results of the survey are below. - Can be addressed as follows: n/a 2. [X] Uses Maven (`pom.xml` exists) 3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) -4. [ ] Is available from a p2 repository: n/a +4. [X] Is available from a p2 repository: #### Feature matrix @@ -165,7 +165,7 @@ The results of the survey are below. - Can be addressed as follows: n/a 2. [X] Uses Maven (`pom.xml` exists) 3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) -4. [ ] Is available from a p2 repository: +4. [X] Is available from a p2 repository: #### Feature matrix @@ -246,7 +246,7 @@ The results of the survey are below. - Can be addressed as follows: n/a 2. [?] Uses Maven (`pom.xml` exists) 3. [X] Is available as OSGi bundle (has `MANIFEST.MF`) -4. [?] Is available from a p2 repository: ? +4. [X] Is available from a p2 repository: #### Feature matrix @@ -301,7 +301,7 @@ The results of the survey are below. - Can be addressed as follows: [Using Spark NLP via Scala and Maven]() 2. [X] Uses Maven (`pom.xml` exists) 3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) -4. [ ] Is available from a p2 repository: ? https://mvnrepository.com/artifact/JohnSnowLabs/spark-nlp +4. [X] Is available from a p2 repository: #### Feature matrix @@ -329,7 +329,7 @@ The results of the survey are below. - Can be addressed as follows: n/a 2. [X] Uses Maven (`pom.xml` exists) 3. [X] Is available as OSGi bundle (has `MANIFEST.MF`) -4. [ ] Is available from a p2 repository: n/a +4. 
[X] Is available from a p2 repository: #### Feature matrix From d8cdb81a440c172f11ad5660ae3787d7fce4f7db Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Fri, 5 Feb 2021 13:21:09 +0100 Subject: [PATCH 16/39] Add NER Information --- src/architecture/extensibility/nlp-survey.md | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 36001cf..7c93997 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -72,6 +72,7 @@ The results of the survey are below. | POS-tagging | | | | | | | | | Constituency parsing | X | | [Constituency Parser Demo]() | | | | | | Dependency parsing | X | | [Dependency Parser Demo]() | | | | | +| Named Entity Recognition | | Functionalities extensible | | | | | | | | | Can consume own models | X | | [Building own models in AllenNLP]() | | | | | @@ -99,7 +100,8 @@ The results of the survey are below. | POS-tagging| X | | | x | | string array of tokens | string array of tags | | Constituency parsing | X | | | X | | String of whitespace tokenized Sentence | array of OpenNLP's Parse Type | | Dependency parsing | | | | | | | | -|Functionalities extensible|X| | | | | | +| Named Entity Recognition | X | | [NER Documentaion]()| X | [NER Training Documentation] () | | | +|Functionalities extensible|X| | | |String of Text | Span Array | | Can consume own models | | | | | | | |
@@ -126,6 +128,7 @@ The results of the survey are below. | POS-tagging | X | | [General description]() | | | String of raw text | `TextAnnotation` data structure with `View` for each Annotation | | Constituency parsing | X | | [General description]()
[Parser documentation]( ) | | | String of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each Annotation | | Dependency parsing | X | -Stanford NLP Parser
-CogComp Parser (requires POS-Tagger and Chunker as part of the Pipeline) | [General description]()
[Stanford Parser documentation]( ) | | | String of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each Annotation | +| Named Entity Recognition | X (isn't part of the Pipeline) | | [NER Documentation]() | X | [Training Documentation]() | String of raw text | `TextAnnotation` data structure with `View` for each Annotation | | Functionalities extensible | X | | [Configuration documentation]() | | | | | | Can consume own models | X | | [Use own Tokenizer]() | | | | | @@ -152,6 +155,7 @@ The results of the survey are below. | POS-tagging | X | X | [POS-Tagger Documentation]() | https://intellabs.github.io/nlp-architect/tagging/sequence_tagging.html#custom-training-parameters | | | | Constituency parsing | | | | | | | | Dependency parsing | X | X | [DepParse Documentation]() | | | - Filepath with POS-tagged Dataset in CONNLL-U format
-list of list of `ConllEntry`, where each entry represents a POS-tagged Token and each nested List a sentence | | +| Named Entity Recognition | X | [Neural Tagger- CNNLSTM module]()
[Neural Tagger- IDCNN]()
[Transformer Token Classifier]() | [NER Documentation]() | X | [NER Training Documentation]() | | | | Trainable models | X | | | | | | | Can consume own models | X | | | | | | @@ -178,6 +182,7 @@ The results of the survey are below. | POS-tagging | X | | [POS-Tag basic information]() | X | [Basic Training information]() | -string line of text
-Raw document text
tsv-format | Output-file with annotations| | Constituency parsing | | | | | | | | | Dependency parsing | X | | [Dependency Parse basic information]() | X | [Basic Training information]() | -string line of text
-Raw document text
tsv-format | Output-file with annotations | +| Named Entity Recognition | X | | [NER Documentation]()| X | [Basic Training information]() | -string line of text
-Raw document text
tsv-format | Output-file with annotations | | Functionalities extensible | | | | | | | | | Can consume own models | | | | | | | | @@ -205,6 +210,7 @@ The results of the survey are below. | POS-tagging | X | [-Brill Tagger]()
[-CRF Tagger]()
[-HMM Tagger]()
[-HunPos Tagger]()
[-Perceptron Tagger]()
[-Senna POS Tagger]()
[-Sequential Backoff Tagger]()
[-Stanford Tagger]()
[-TNT Tagger]()| [-NLTK Tagger Documentation]()
[-NLTK Book Chapter 5]() | X| [-Brill Tagger Training]()
[-CRF Tagger]()
[-HMM Tagger]()
[-HunPos Tagger]()
[-Perceptron Tagger]()
[-Sequential Backoff Tagger]()
[-Stanford Tagger]()
[-TNT Tagger]() | tokens (list(str))
sentences (list(list(str))) | list(tuple(str, str))
list(list(tuple(str, str)) ) | | Constituency parsing | X | [-Early Chart Parser]()
[-Recursive descent Parser]()
[-Shift Reduce Parser]()
[-Stanford Parser]()
| [-NLTK Parser Documentation]()
[-NLTK Book - Chapter 8]() | | | sent (list(str))
sentences (list(list(str))) | iter(Tree)
iter(iter(Tree)) | | Dependency parsing | X | [-CoreNLP Dependency Parser]()
[-Malt Parser]()
[-Nonprojective Dependency Parser]()
[-Projective Dependency Parser]()
[-Transition Parser]() | [-NLTK Parser Documentation]()[-NLTK Book - Chapter 8 Paragraph 5]() | X | [Malt Parser]()
[-Nonprojective Dependency Parser]()
[-Projective Dependency Parser]()
[-Transition Parser]() | CoreNLP: sentences (list(str)) – Input sentences to parse
Malt: sentences | CoreNLP: iter(iter(Tree))
Malt:iter(DependencyGraph) | +| Named Entity Recognition | X | | [NLTK Book Chapter 7 Paragraph 5](< https://www.nltk.org/book/ch07.html>)
[-NE Chunker]()| X | [-NE Chunker]()| list of Pos-Tagged tokens | | | Functionalities extensible | | | | | | | | | Can consume own models | X | | | | | | | @@ -232,6 +238,7 @@ The results of the survey are below. | POS-tagging | X | | [General documentation]() | | | String of text | nested list : [sentences[chunks[tokens]]] | | Constituency parsing | | | | | | | | | Dependency parsing | | | | | | | | +| Named Entity Recognition | | | | | | | | | Functionalities extensible | | | | | | | | Can consume own models | | | | | | | | @@ -259,6 +266,7 @@ The results of the survey are below. | POS-tagging | X | | | | | | | Constituency parsing | | | | | | | | Dependency parsing | | | | | | | +| Named Entity Recognition | X | | [NER Documentation]()| | | | | | Trainable models | | | | | | | | Can consume own models | | | | | | | @@ -287,6 +295,7 @@ The results of the survey are below. | POS-tagging | X | | [POS-Tagger Documentation]() |X | [Training Basics]()
[Updating POS-Tagger]() | Spacy's `Doc` type |Spacy's `Doc` type - POS-Tags access through`token.pos_` attribute| | Constituency parsing | | | | | | | | | Dependency parsing | X | | [Parser Documentation]() | X| [Training Basics]()
[Updating Dependency Parser]() | Spacy's `Doc` type | Spacy's `Doc` type - Parse-tags access through `token.dep_` attribute | +| Named Entity Recognition | X | | [NER Documentation]()| | |SpaCy's `Doc` type | SpaCy's `Doc` type - entity tags access through `doc.ents` property| | Functionalities extensible | X | | [Extend Tokenizer]()
[Customize Tokenizer]()
[Custom Component]() | | | | | Can consume own models | (X) | | [Load different Tokenizer]() | | | | @@ -313,7 +322,8 @@ The results of the survey are below. | Sentencing | X | [^2] | [Sentence Detector Documentation]() | | | Document | Sentences | | POS-tagging | X | [^2] | [POS-Tagger Documentation]() | X | [POS Training Documentation]()
[General Training Documentation]()| Document, Token | POS | | Constituency parsing | | | | | | | | -| Dependency parsing | X | -[Typed Dependency Parser]()
[Untyped Dependency Parser]() | [Dependency Parser Documentation]()| X | [General Training Documentation]() | Document,POS,Token | (unlabeled) Dependency | +| Dependency parsing | X | [-Typed Dependency Parser]()
[Untyped Dependency Parser]() | [Dependency Parser Documentation]()| X | [General Training Documentation]() | Document,POS,Token | (unlabeled) Dependency | +| Named Entity Recognition | X | [-NER CRF]()
[-NER DL]()| [NER Documentation](<>)| X | [-NER CRF]()
[-NER DL]() | | | | Functionalities extensible | X | | [Manipulating Pipelines]() | | | | | | Can consume own models | X | | [General Concept Documentation]() | | | | | @@ -339,9 +349,10 @@ The results of the survey are below. |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| | Tokenization/segmentation | X | see [list]() of classes implementing the Tokenizer interface | [Tokenizer Documentation]() | | | -string of document text
-CoreNLP's `CoreDocument` (Instantiated with string document text) | -list of strings
-list of character-offset begin indices
-list of character-offset end indices
`CoreDocument` with previous annotation properties | | Sentencing | X | | [Sentencer Documentation]() | | | tokenized `CoreDocument` | `CoreDocument` with Sentence List of POS-Tags as property | -| POS-tagging | X | | [POS-Tag Documentation]() | | | tokenized and sentence-splitted `CoreDocument` | `CoreDocument` with String List of POS-Tags as property | +| POS-tagging | X | | [POS-Tag Documentation]() | | | tokenized and sentence-splitted `CoreDocument` | `CoreDocument` with String List of POS-Tags as property | | Constituency parsing | X | [Viterbi Parser]()
[Shift reduce Parser]()
[Iterative CKYPCFG Parser]()
[Fast Factored Parser]()
[Exhaustive PCFG Parser]() | [Constituency Parser Documentation]() | | | tokenized, sentence-splitted (and for some models POS-tagged) `CoreDocument` | `CoreDocument` with TreeAnnotation (exact form depends on chosen parser) | | Dependency parsing | X |[BiLexPCFGParser]()
[Exhaustive Dependency Parser]() |[DepParse Documentation]() | X | [Train own Model]() |tokenized, sentence-splitted and POS-tagged `CoreDocument` | `CoreDocument` with DependencyAnnotation (exact form depends on chosen parser) | +| Named Entity Recognition | X | [NER Classifier Combiner](<>https://stanfordnlp.github.io/CoreNLP/ner.html)
[-RegexNERAnnotator]() | [NER Documentation]()| X | [NER Training Documentation]() | tokenized, ssplitted, pos-tagged, (lemmatized) `CoreDocument` | `CoreDocument` with `NamedEntityTagAnnotation` or `NormalizedNamedEntityTagAnnotation`| | Functionalities extensible | X | [Custom annotator]() | | | | | Can consume own models | X | | [Example of including own (caseless) model]() | | | | @@ -369,7 +380,8 @@ The results of the survey are below. | POS-tagging | X | -PatternTagger
-NLTKTagger | [POS-Tagger Tutorial]()
[POS-Tagger Advanced Usage]() | | | String of raw text | `TextBlob` data type - access through `tags` property: List of word string tag string tuples | | Constituency parsing | X | | [Parser Tutorial]()
[Parser Advanced Usage]()| | | String of raw text | `TextBlob` data type - access through `parse()` method: TaggedString | | Dependency parsing | X | |[Parser Tutorial]()
[Parser Advanced Usage]()| | |String of raw text | `TextBlob` data type - access through `parse()` method: TaggedString | -|Functionalities extensible | | | | | | | | +| Named Entity Recognition | | | | | | | | +| Functionalities extensible | | | | | | | | | Can consume own models | X | | [Passing models into the Pipeline]()
[Training own data]() | | | | | From 0f6025cf50e7601f59f4cdacd073b11579869e38 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Fri, 5 Feb 2021 13:22:28 +0100 Subject: [PATCH 17/39] Revise Lingpipe --- src/architecture/extensibility/nlp-survey.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 7c93997..603fa48 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -246,6 +246,7 @@ The results of the survey are below. + ### [Lingpipe]() 1. [X] Implemented in Java @@ -259,16 +260,16 @@ The results of the survey are below.
-| | Has functionality | Functionality extensible | Functionality documentation | Extension documentation | Input data | Output data | -|---------------------------|--------------------------|--------------------------|-----------------------------|------------------------------|--------------------------------------|-------------| -| Tokenization/segmentation | X | | Chapter 3 (p.33)

- | | -String of text
-Character array, Startindex, Endindex | | -| Sentencing | X | | | | | | -| POS-tagging | X | | | | | | +| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | +|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| +| Tokenization/segmentation | X | [-IndoEuropeanTokenizer]()
[-CharacterTokenizer]()
[-RegExTokenizer]()
[-NGramTokenizer]()
[-LineTokenizerFactory]() | [Lingpipe Book]() Chapter 3 (p.33)
[Tokenizer API]()
[Tokenization API]( )| | | -String of text
-Character array, Startindex, Endindex | | +| Sentencing | X | | [Sentence-Splitter Tutorial]( )| X | | | | +| POS-tagging | X | [-Chain CRF Tagger]()
[-Classifier Tagger]()
[-HMMDecoder]() | [-Lingpipe Book Chapter 11]()
[-POS-Tagger Tutorial]() | X | [POS-Tagger Tutorial Paragraph: Training Training Part-of-Speech Models]() | List tokenList | [Tagging]() | | Constituency parsing | | | | | | | -| Dependency parsing | | | | | | | +| Dependency parsing | | | | | | | | | Named Entity Recognition | X | | [NER Documentation]()| | | | | -| Trainable models | | | | | | | -| Can consume own models | | | | | | | +| Functionalities extensible | X | | [Paragraph 3. Evaluating and Tuning Tagging Models]() | | | | | +| Can consume own models | X | | [Developing and Tuning Sentence Models]() | | | |
From b979ad89d629db61ece486199b4b492781dd8bf9 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Fri, 5 Feb 2021 13:28:11 +0100 Subject: [PATCH 18/39] Add GATE & Annie --- src/architecture/extensibility/nlp-survey.md | 30 ++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 603fa48..35d334e 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -135,6 +135,36 @@ The results of the survey are below. + +### [GATE & ANNIE]() + +1. [X] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [X] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [ ] Is available from a p2 repository: + +#### Feature matrix + +
+ +| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | +|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| +| Tokenization/segmentation | X | | [ANNIE Tokenizer Documentation]( )| | | | | +| Sentencing | X | -ANNIE Default Sentence Splitter
-ANNIE Regex Sentence Splitter | [ANNIE Sentence Splitter Documentation]( ) | | | | +| POS-tagging | X | | [ANNIE POS-Tagger Documentation]() | | | | +| Constituency parsing | | | | | | | +| Dependency parsing | | | | | | | +| Named Entity Recognition | x | | [Gazetteer documentation]() | | | | | +| Trainable models | | | | | | | +| Can consume own models | | | | | | | + + +
+ + + ### [NLP Architect]() 1. [ ] Implemented in Java From d1f3f001fa46a4fb0f2a3b1315b59979b4339747 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Mon, 8 Feb 2021 14:17:12 +0100 Subject: [PATCH 19/39] Remove template --- src/architecture/extensibility/nlp-survey.md | 28 -------------------- 1 file changed, 28 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 35d334e..bcf8aaa 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -23,34 +23,6 @@ Specifically, the following properties were checked: The results of the survey are below. --- - - -### [library-name]() - -1. [ ] Implemented in Java - 1. [ ] Not Java, but API can be addressed from Java - - Can be addressed as follows: n/a -2. [ ] Uses Maven (`pom.xml` exists) -3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) -4. [ ] Is available from a p2 repository: n/a - -#### Feature matrix - -
- -| | Has functionality | Functionality extensible | Functionality documentation | Extension documentation | Input data | Output data | -|---------------------------|--------------------------|--------------------------|-----------------------------|------------------------------|--------------------------------------|-------------| -| Tokenization/segmentation | | | | | | | -| Sentencing | | | | | | | -| POS-tagging | | | | | | | -| Constituency parsing | | | | | | | -| Dependency parsing | | | | | | | -| Trainable models | | | | | | | -| Can consume own models | | | | | | | - - -
- --- ### [AllenNLP]() From 96ff84999dc54ceb226f09ca714dc7fd9c1fa382 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Mon, 8 Feb 2021 14:49:35 +0100 Subject: [PATCH 20/39] Revise AllenNLP --- src/architecture/extensibility/nlp-survey.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index bcf8aaa..107c542 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -28,7 +28,7 @@ The results of the survey are below. 1. [ ] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java - - Can be addressed as follows: n/a + - Can be addressed as follows: there are some [taggers]() and [NLPstack]() available as Scala-packages. AllenNLP also provides [Docker images]() 2. [ ] Uses Maven (`pom.xml` exists) 3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) 4. [ ] Is available from a p2 repository: n/a @@ -39,12 +39,12 @@ The results of the survey are below. | | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | [-SpacyTokenizer]()
[-PretrainedTransformerTokenizer]() | [AllenNLP Tokenizer Documentation]() | | | String of document text | List of Token Strings | -| Sentencing | X | -Sentence Splitter
-SpaCy Sentence Splitter | [Sentence Splitter API]() | | | Text as String | List of Sentence Strings | -| POS-tagging | | | | | | | | +| Tokenization/segmentation | X | [-SpacyTokenizer]()
[-PretrainedTransformerTokenizer]() | [AllenNLP Tokenizer Documentation]() | | | Text as str | List[Token] | +| Sentencing | X | -Sentence Splitter
-SpaCy Sentence Splitter | [Sentence Splitter API]() | | | Text as str | List[str] | +| POS-tagging | | | | X | [Train SentenceTaggerPredictor]() | Sentence as str | Dict[str, numpy.ndarray] | | Constituency parsing | X | | [Constituency Parser Demo]() | | | | | | Dependency parsing | X | | [Dependency Parser Demo]() | | | | | -| Named Entity Recognition | +| Named Entity Recognition | | | | X | [Train SentenceTaggerPredictor]() | Sentence as str | Dict[str, numpy.ndarray] | | Functionalities extensible | | | | | | | | | Can consume own models | X | | [Building own models in AllenNLP]() | | | | | From a296e656c9f9c9871a921bd4b19ae99c9a9f62c3 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Tue, 9 Feb 2021 09:37:44 +0100 Subject: [PATCH 21/39] Revise OpenNLP --- src/architecture/extensibility/nlp-survey.md | 27 ++++++++++---------- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 107c542..2ebb7bb 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -28,7 +28,7 @@ The results of the survey are below. 1. [ ] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java - - Can be addressed as follows: there are some [taggers]() and [NLPstack]() available as Scala-packages. AllenNLP also provides [Docker images]() + - Can be addressed as follows: There are some [taggers]() and [NLPstack]() available as Scala-packages. AllenNLP also provides [Docker images]() 2. [ ] Uses Maven (`pom.xml` exists) 3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) 4. [ ] Is available from a p2 repository: n/a @@ -39,8 +39,8 @@ The results of the survey are below. 
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | [-SpacyTokenizer]()
[-PretrainedTransformerTokenizer]() | [AllenNLP Tokenizer Documentation]() | | | Text as str | List[Token] | -| Sentencing | X | -Sentence Splitter
-SpaCy Sentence Splitter | [Sentence Splitter API]() | | | Text as str | List[str] | +| Tokenization/segmentation | X | [-SpacyTokenizer]()
[-PretrainedTransformerTokenizer]() | [Tokenizer Documentation]() | | | Text as str | List[Token] | +| Sentencing | X | -Sentence Splitter
-SpaCy Sentence Splitter | [Sentence-Splitter API]() | | | Text as str | List[str] | | POS-tagging | | | | X | [Train SentenceTaggerPredictor]() | Sentence as str | Dict[str, numpy.ndarray] | | Constituency parsing | X | | [Constituency Parser Demo]() | | | | | | Dependency parsing | X | | [Dependency Parser Demo]() | | | | | @@ -51,7 +51,7 @@ The results of the survey are below. - +--- ### [Apache OpenNLP]() @@ -60,25 +60,26 @@ The results of the survey are below. - Can be addressed as follows: n/a 2. [x] Uses Maven (`pom.xml` exists) 3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) -4. [X] Is available from a p2 repository: +4. [X] Is available from a p2 repository: [OpenNLP Repository]() #### Feature matrix
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable | Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation |X|-Whitespace Tokenizer
-Character Tokenizer
-Maximum Entropy Tokenizer || X | |string of (untokenized) text | array of strings
-array of token spans| -| Sentencing|X| || X | |string of document text | -array of strings
-array of sentence spans | -| POS-tagging| X | | | x | | string array of tokens | string array of tags | -| Constituency parsing | X | | | X | | String of whitespace tokenized Sentence | array of OpenNLP's Parse Type | +| Tokenization/segmentation |X|[-Whitespace Tokenizer]()
[-Character Tokenizer]()
[-Maximum Entropy Tokenizer]() |[Tokenizer Documentation]()| X | [Tokenizer Training Documentation]() |Text as String | -Array of Strings
-Array of Token Spans| +| Sentencing|X|[-Newline Sentence-Splitter]()
[-Maximum Entropy Sentence-Splitter]() |[Sentence-Splitter Documentation]()| X | [Sentence-Splitter Training Documentation]() |Text as String | -Array of Strings
-Array of sentence Spans | -| POS-tagging| X | | | x | | string array of tokens | string array of tags | -| Constituency parsing | X | | | X | | String of whitespace tokenized Sentence | array of OpenNLP's Parse Type | +| POS-tagging| X | | [POS-Tagger Documentation]() | x | [POS-Tagger Training Documentation]() | String Array of Tokens | String Array of POS-tags | +| Constituency parsing | X | | [Constituency Parser Documentation]() | X | [Constituency Parser Documentation]()| String of whitespace tokenized Sentence | Array of OpenNLP's Parse Type | | Dependency parsing | | | | | | | | -| Named Entity Recognition | X | | [NER Documentation]()| X | [NER Training Documentation] () | | | -|Functionalities extensible|X| | | |String of Text | Span Array | -| Can consume own models | | | | | | | | +| Named Entity Recognition | X | | [NER Documentation]()| X | [NER Training Documentation]() | String Array of Tokens | Array of name Spans | +|Functionalities extensible|X| | [Extension writing Documentation]()| | | | +| Can consume own models | X | | [General API description]() | | | | |
- + +--- ### [CogComp NLP Pipeline]() From c0b9e04577fd13768a6f45147f3df28ba729c6fd Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Tue, 9 Feb 2021 09:41:08 +0100 Subject: [PATCH 22/39] Revise CogComp NLP Pipeline --- src/architecture/extensibility/nlp-survey.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 2ebb7bb..5faf5ad 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -98,10 +98,10 @@ The results of the survey are below. |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| | Tokenization/segmentation | X | | [General description]() | | | String of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each Annotation | | Sentencing | | | | | | | | -| POS-tagging | X | | [General description]() | | | String of raw text | `TextAnnotation` data structure with `View` for each Annotation | +| POS-tagging | X | | [General description]() | | | String of raw text | `TextAnnotation` data structure with `.getView(Viewname)` for each Annotation | | Constituency parsing | X | | [General description]()
[Parser documentation]( ) | | | String of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each Annotation | -| Dependency parsing | X | -Stanford NLP Parser
-CogComp Parser (requires POS-Tagger and Chunker as part of the Pipeline) | [General description]()
[Stanford Parser documentation]( ) | | | String of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each Annotation | -| Named Entity Recognition | X (isn't part of the Pipeline) | | [NER Documentation]() | X | [Training Documentation]() | String of raw text | `TextAnnotation` data structure with `View` for each Annotation | +| Dependency parsing | X | -Stanford NLP Parser
-CogComp Parser (requires POS-Tagger and Chunker as part of the Pipeline) | [-General description]()
[-Stanford Parser documentation]( ) | | | String of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each Annotation | +| Named Entity Recognition | X (isn't part of the Pipeline) | | [NER Documentation]() | X | [Training Documentation]() | String of raw text | `TextAnnotation` data structure with `.getView(Viewname)` for each Annotation | | Functionalities extensible | X | | [Configuration documentation]() | | | | | | Can consume own models | X | | [Use own Tokenizer]() | | | | | From c40cb63aceae95a2cbf83d98dd6c8710872c0148 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Tue, 9 Feb 2021 10:34:12 +0100 Subject: [PATCH 23/39] Revise Architect NLP --- src/architecture/extensibility/nlp-survey.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 5faf5ad..016c6c4 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -124,13 +124,13 @@ The results of the survey are below. | | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | | [ANNIE Tokenizer Documentation]( )| | | | | +| Tokenization/segmentation | X | | [ANNIE Tokenizer Documentation]( )| | | string of document text | output file | | Sentencing | X | -ANNIE Default Sentence Splitter
-ANNIE Regex Sentence Splitter | [ANNIE Sentence Splitter Documentation]( ) | | | | | POS-tagging | X | | [ANNIE POS-Tagger Documentation]() | | | | | Constituency parsing | | | | | | | | Dependency parsing | | | | | | | -| Named Entity Recognition | x | | [Gazetteer documentation]() | | | | | -| Trainable models | | | | | | | +| Named Entity Recognition | x | | [Gazetteer documentation]()
[-Semantic Tagger]() | | | | | +| Functionalities extensible | | | | | | | | Can consume own models | | | | | | | @@ -142,7 +142,7 @@ The results of the survey are below. 1. [ ] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java - - Can be addressed as follows: n/a + - Can be addressed as follows: to the best of our knowledge it's only accessible via common ways to integrate Python scripts in Java (e.g. JEPP, PythonInterpreter, Runtime.exec(), ...) 2. [ ] Uses Maven (`pom.xml` exists) 3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) 4. [ ] Is available from a p2 repository: n/a @@ -155,7 +155,7 @@ The results of the survey are below. |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| | Tokenization/segmentation | | | | | | | | Sentencing | | | | | | | -| POS-tagging | X | X | [POS-Tagger Documentation]() | https://intellabs.github.io/nlp-architect/tagging/sequence_tagging.html#custom-training-parameters | | | +| POS-tagging | X | X | [POS-Tagger Documentation]() | X|[POS-Tagger Training Documentation]() | | | | Constituency parsing | | | | | | | | Dependency parsing | X | X | [DepParse Documentation]() | | | - Filepath with POS-tagged Dataset in CoNLL-U format
-list of list of `ConllEntry`, where each entry represents a POS-tagged Token and each nested List a sentence | | | Named Entity Recognition | X | [Neural Tagger- CNNLSTM module]()
[Neural Tagger- IDCNN]()
[Transformer Token Classifier]() | [NER Documentation]() | X | [NER Training Documentation]() | | | From feea0fa7d7eaaa051e8c71e1bf5d85044fed2ae9 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Tue, 9 Feb 2021 10:49:01 +0100 Subject: [PATCH 24/39] Revise NLTK --- src/architecture/extensibility/nlp-survey.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 016c6c4..1dcff66 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -197,7 +197,7 @@ The results of the survey are below. 1. [ ] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java - - Can be addressed as follows: n/a + - Can be addressed as follows: to the best of our knowledge it's only accessible via common ways to integrate Python scripts in Java (e.g. JEPP, PythonInterpreter, Runtime.exec(), ...) 2. [ ] Uses Maven (`pom.xml` exists) 3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) 4. [ ] Is available from a p2 repository: n/a @@ -208,12 +208,12 @@ The results of the survey are below. | | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | [-Casual module]()
[-Destructive module]()
[-Regexp Tokenizer]()
[-Repp Tokenizer]()
[-Simple Tokenizer]()
[-Stanford Segmenter]()
[-TokTok Tokenizer]()
[-Penn Treebank Tokenizer]()| [-NLTK Tokenizer/Sentence Splitter Documentation]()
[-NLTK Book Chapter 3]() | | | String | String of Text | List of Token Strings | -| Sentencing | X | [-Punkt Sent Module]()
[-Regexp Tokenizer]()
[-Simple Tokenizer]() | [-NLTK Tokenizer/Sentence Splitter Documentation]()
[-NLTK Book Chapter 3]() |X | [-Punkt Sent Module]() | String of Text | List of Sentence Strings | -| POS-tagging | X | [-Brill Tagger]()
[-CRF Tagger]()
[-HMM Tagger]()
[-HunPos Tagger]()
[-Perceptron Tagger]()
[-Senna POS Tagger]()
[-Sequential Backoff Tagger]()
[-Stanford Tagger]()
[-TNT Tagger]()| [-NLTK Tagger Documentation]()
[-NLTK Book Chapter 5]() | X| [-Brill Tagger Training]()
[-CRF Tagger]()
[-HMM Tagger]()
[HunPos Tagger]()
[-Perceptron Tagger]()
[-Sequential Backoff Tagger]()
[-Stanford Tagger]()
[-TNT Tagger]() | tokens (list(str))
sentences (list(list(str))) | list(tuple(str, str))
list(list(tuple(str, str)) ) | -| Constituency parsing | X | [-Earley Chart Parser]()
[-Recursive descent Parser]()
[-Shift Reduce Parser]()
[-Stanford Parser]()
| [-NLTK Parser Documentation]()
[-NLTK Book - Chapter 8]() | | | sent (list(str))
sentences (list(list(str))) | iter(Tree)
iter(iter(Tree)) | -| Dependency parsing | X | [-CoreNLP Dependency Parser]()
[-Malt Parser]()
[-Nonprojective Dependency Parser]()
[-Projective Dependency Parser]()
[-Transition Parser]() | [-NLTK Parser Documentation]()[-NLTK Book - Chapter 8 Paragraph 5]() | X | [Malt Parser]()
[-Nonprojective Dependency Parser]()
[-Projective Dependency Parser]()
[-Transition Parser]() | CoreNLP: sentences (list(str)) – Input sentences to parse
Malt: sentences | CoreNLP: iter(iter(Tree))
Malt:iter(DependencyGraph) | -| Named Entity Recognition | X | | [NLTK Book Chapter 7 Paragraph 5](< https://www.nltk.org/book/ch07.html>)
[-NE Chunker]()| X | [-NE Chunker]()| list of Pos-Tagged tokens | | +| Tokenization/segmentation | X | [-Casual module]()
[-Destructive module]()
[-Regexp Tokenizer]()
[-Repp Tokenizer]()
[-Simple Tokenizer]()
[-Stanford Segmenter]()
[-TokTok Tokenizer]()
[-Penn Treebank Tokenizer]()| [-NLTK Tokenizer/Sentence Splitter Documentation]()
[-NLTK Book Chapter 3]() | | | str of text | list(str) of token | +| Sentencing | X | [-Punkt Sent Module]()
[-Regexp Tokenizer]()
[-Simple Tokenizer]() | [-NLTK Tokenizer/Sentence Splitter Documentation]()
[-NLTK Book Chapter 3]() |X | [-Punkt Sent Module]() | str of text | list(str) of sentences | +| POS-tagging | X | [-Brill Tagger]()
[-CRF Tagger]()
[-HMM Tagger]()
[-HunPos Tagger]()
[-Perceptron Tagger]()
[-Senna POS Tagger]()
[-Sequential Backoff Tagger]()
[-Stanford Tagger]()
[-TNT Tagger]()| [-NLTK Tagger Documentation]()
[-NLTK Book Chapter 5]() | X| [-Brill Tagger Training]()
[-CRF Tagger]()
[-HMM Tagger]()
[HunPos Tagger]()
[-Perceptron Tagger]()
[-Sequential Backoff Tagger]()
[-Stanford Tagger]()
[-TNT Tagger]() | -tokens (list(str))
-sentences (list(list(str))) | -list(tuple(str, str))
-list(list(tuple(str, str)) ) | +| Constituency parsing | X | [-Earley Chart Parser]()
[-Recursive descent Parser]()
[-Shift Reduce Parser]()
[-Stanford Parser]()
| [-NLTK Parser Documentation]()
[-NLTK Book - Chapter 8]() | | | sent (list(str))
sentences (list(list(str))) | -iter(Tree)
-iter(iter(Tree)) | +| Dependency parsing | X | [-CoreNLP Dependency Parser]()
[-Malt Parser]()
[-Nonprojective Dependency Parser]()
[-Projective Dependency Parser]()
[-Transition Parser]() | [-NLTK Parser Documentation]()
[-NLTK Book - Chapter 8 Paragraph 5]() | X | [-Malt Parser]()
[-Nonprojective Dependency Parser]()
[-Projective Dependency Parser]()
[-Transition Parser]() | -CoreNLP: sentences (list(str)) – Input sentences to parse
-Malt: str sentences | CoreNLP: iter(iter(Tree))
Malt:iter(DependencyGraph) | +| Named Entity Recognition | X | | [-NLTK Book Chapter 7 Paragraph 5](< https://www.nltk.org/book/ch07.html>)
[-NE Chunker]()| X | [-NE Chunker]()| list of POS-Tagged tokens | NE-tagged parse-tree | | Functionalities extensible | | | | | | | | | Can consume own models | X | | | | | | | From 7fb59e5f4d789f98544793b624d7451a8a2e8486 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Tue, 9 Feb 2021 11:00:41 +0100 Subject: [PATCH 25/39] Revise Pattern --- src/architecture/extensibility/nlp-survey.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 1dcff66..074419f 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -225,7 +225,7 @@ The results of the survey are below. 1. [ ] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java - - Can be addressed as follows: n/a + - Can be addressed as follows: to the best of our knowledge it's only accessible via common ways to integrate Python scripts in Java (e.g. JEPP, PythonInterpreter, Runtime.exec(), ...) 2. [ ] Uses Maven (`pom.xml` exists) 3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) 4. [ ] Is available from a p2 repository: ? From b3ec1d0195505e019f0dba0c87f63bfbdc14b1ba Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Tue, 9 Feb 2021 11:09:30 +0100 Subject: [PATCH 26/39] Revise Lingpipe --- src/architecture/extensibility/nlp-survey.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 074419f..8c4f4af 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -265,8 +265,8 @@ The results of the survey are below.
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | [-IndoEuropeanTokenizer]()
[-CharacterTokenizer]()
[-RegExTokenizer]()
[-NGramTokenizer]()
[-LineTokenizerFactory]() | [Lingpipe Book]() Chapter 3 (p.33)
[Tokenizer API]()
[Tokenization API]( )| | | -String of text
-Character array, Startindex, Endindex | | -| Sentencing | X | | [Sentence-Splitter Tutorial]( )| X | | | | +| Tokenization/segmentation | X | [-IndoEuropeanTokenizer]()
[-CharacterTokenizer]()
[-RegExTokenizer]()
[-NGramTokenizer]()
[-LineTokenizerFactory]() | [Lingpipe Book]() Chapter 3 (p.33)
[Tokenizer API]()
[Tokenization API]( )| | | -String of text
-Character array, Startindex, Endindex | array of token strings | +| Sentencing | X | | [Sentence-Splitter Tutorial]( )| X | | Char[] array of text | Sentences as set of chunks | | POS-tagging | X | [-Chain CRF Tagger]()
[-Classifier Tagger]()
[-HMMDecoder]() | [-Lingpipe Book Chapter 11]()
[-POS-Tagger Tutorial]() | X | [POS-Tagger Tutorial Paragraph: Training Part-of-Speech Models]() | List tokenList | [Tagging]() | | Constituency parsing | | | | | | | | Dependency parsing | | | | | | | | From 55cc534a3107d292c69156c664fe0d0d9c390555 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Tue, 9 Feb 2021 11:13:22 +0100 Subject: [PATCH 27/39] Revise SpaCy --- src/architecture/extensibility/nlp-survey.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 8c4f4af..1e9abfe 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -283,7 +283,7 @@ The results of the survey are below. 1. [ ] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java - - Can be addressed as follows: n/a + - Can be addressed as follows: [SpaCy Container & APIs]() 2. [ ] Uses Maven (`pom.xml` exists) 3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) 4. [ ] Is available from a p2 repository: n/a From f8ac2bb57f20175dff28324c31e6ee5433fe0baf Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Tue, 9 Feb 2021 11:14:14 +0100 Subject: [PATCH 28/39] Fix alphabetical order --- src/architecture/extensibility/nlp-survey.md | 52 ++++++++++---------- 1 file changed, 26 insertions(+), 26 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 1e9abfe..5df21a8 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -136,6 +136,32 @@ The results of the survey are below. +### [Lingpipe]() + +1. [X] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [?] Uses Maven (`pom.xml` exists) +3. [X] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [X] Is available from a p2 repository: + +#### Feature matrix + +
+ +| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | +|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| +| Tokenization/segmentation | X | [-IndoEuropeanTokenizer]()
[-CharacterTokenizer]()
[-RegExTokenizer]()
[-NGramTokenizer]()
[-LineTokenizerFactory]() | [Lingpipe Book]() Chapter 3 (p.33)
[Tokenizer API]()
[Tokenization API]( )| | | -String of text
-Character array, Startindex, Endindex | array of token strings | +| Sentencing | X | | [Sentence-Splitter Tutorial]( )| X | | Char[] array of text | Sentences as set of chunks | +| POS-tagging | X | [-Chain CRF Tagger]()
[-Classifier Tagger]()
[-HMMDecoder]() | [-Lingpipe Book Chapter 11]()
[-POS-Tagger Tutorial]() | X | [POS-Tagger Tutorial Paragraph: Training Part-of-Speech Models]() | List tokenList | [Tagging]() | +| Constituency parsing | | | | | | | +| Dependency parsing | | | | | | | | +| Named Entity Recognition | X | | [NER Documentation]()| | | | | +| Functionalities extensible | X | | [Paragraph 3. Evaluating and Tuning Tagging Models]() | | | | | +| Can consume own models | X | | [Developing and Tuning Sentence Models]() | | | | + + +
### [NLP Architect]() @@ -250,32 +276,6 @@ The results of the survey are below. -### [Lingpipe]() - -1. [X] Implemented in Java - 1. [ ] Not Java, but API can be addressed from Java - - Can be addressed as follows: n/a -2. [?] Uses Maven (`pom.xml` exists) -3. [X] Is available as OSGi bundle (has `MANIFEST.MF`) -4. [X] Is available from a p2 repository: - -#### Feature matrix - -
- -| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | -|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | [-IndoEuropeanTokenizer]()
[-CharacterTokenizer]()
[-RegExTokenizer]()
[-NGramTokenizer]()
[-LineTokenizerFactory]() | [Lingpipe Book]() Chapter 3 (p.33)
[Tokenizer API]()
[Tokenization API]( )| | | -String of text
-Character array, Startindex, Endindex | array of token strings | -| Sentencing | X | | [Sentence-Splitter Tutorial]( )| X | | Char[] array of text | Sentences as set of chunks | -| POS-tagging | X | [-Chain CRF Tagger]()
[-Classifier Tagger]()
[-HMMDecoder]() | [-Lingpipe Book Chapter 11]()
[-POS-Tagger Tutorial]() | X | [POS-Tagger Tutorial Paragraph: Training Part-of-Speech Models]() | List tokenList | [Tagging]() | -| Constituency parsing | | | | | | | -| Dependency parsing | | | | | | | | -| Named Entity Recognition | X | | [NER Documentation]()| | | | | -| Functionalities extensible | X | | [Paragraph 3. Evaluating and Tuning Tagging Models]() | | | | | -| Can consume own models | X | | [Developing and Tuning Sentence Models]() | | | | - - -
From fa375fcca7700f3b3300539b9d81ce0f3f848a67 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Tue, 9 Feb 2021 11:21:25 +0100 Subject: [PATCH 29/39] Revise SparkNLP --- src/architecture/extensibility/nlp-survey.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 5df21a8..42b7fdf 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -322,13 +322,13 @@ The results of the survey are below. | | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | [^2] | [Tokenizer Documentation]() | | | Document | Token | -| Sentencing | X | [^2] | [Sentence Detector Documentation]() | | | Document | Sentences | -| POS-tagging | X | [^2] | [POS-Tagger Documentation]() | X | [POS Training Documentation]()
[General Training Documentation]()| Document, Token | POS | +| Tokenization/segmentation | X | [^2] | [Tokenizer Documentation]() | | | `Document` Type | `Token` Type | +| Sentencing | X | [^2] | [Sentence Detector Documentation]() | | | `Document` Type | `Sentence` Types | +| POS-tagging | X | [^2] | [POS-Tagger Documentation]() | X | [POS Training Documentation]()
[General Training Documentation]()| `Document` Type
`Token` Type | `POS` Type | | Constituency parsing | | | | | | | | -| Dependency parsing | X | [-Typed Dependency Parser]()
[Untyped Dependency Parser]() | [Dependency Parser Documentation]()| X | [General Training Documentation]() | Document,POS,Token | (unlabeled) Dependeny | +| Dependency parsing | X | [-Typed Dependency Parser]()
[Untyped Dependency Parser]() | [Dependency Parser Documentation]()| X | [General Training Documentation]() | `Document` Type
`POS` Type
`Token` Type | (unlabeled) `Dependency` Type | | Named Entity Recognition | X | [-NER CRF]()
[-NER DL]()| [NER Documentation](<>)| X | [-NER CRF]()
[-NER DL]() | | | -| Functionalities extensible | X | | [Manipulating Pipelines]() | | | | | +| Functionalities extensible | X | | [Manipulating Pipelines]() | | | `Document` Type | `Named_Entity` Type | | Can consume own models | X | | [General Concept Documentation]() | | | | | From b3a96d33fed21f82d23f82894d346441dcb789ed Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Tue, 9 Feb 2021 11:22:31 +0100 Subject: [PATCH 30/39] Revise Stanford CoreNLP --- src/architecture/extensibility/nlp-survey.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 42b7fdf..f81b77f 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -356,7 +356,7 @@ The results of the survey are below. | POS-tagging | X | | [POS-Tag Documentation]() | | | tokenized and sentence-split `CoreDocument` | `CoreDocument` with String List of POS-Tags as property | | Constituency parsing | X | [Viterbi Parser]()
[Shift reduce Parser]()
[Iterative CKYPCFG Parser]()
[Fast Factored Parser]()
[Exhaustive PCFG Parser]() | [Constituency Parser Documentation]() | | | tokenized, sentence-split (and for some models POS-tagged) `CoreDocument` | `CoreDocument` with TreeAnnotation (exact form depends on chosen parser) | | Dependency parsing | X |[BiLexPCFGParser]()
[Exhaustive Dependency Parser]() |[DepParse Documentation]() | X | [Train own Model]() |tokenized, sentence-split and POS-tagged `CoreDocument` | `CoreDocument` with DependencyAnnotation (exact form depends on chosen parser) | -| Named Entity Recognition | X | [NER Classifier Combiner](<>https://stanfordnlp.github.io/CoreNLP/ner.html)
[-RegexNERAnnotator]() | [NER Documentation]()| X | [NER Training Documentation]() | tokenized, ssplitted, pos-tagged, (lemmatized) `CoreDocument` | `CoreDocument` with `NamedEntityTagAnnotation` or `NormalizedNamedEntityTagAnnotation`| +| Named Entity Recognition | X | [NER Classifier Combiner]()
[-RegexNERAnnotator]() | [NER Documentation]()| X | [NER Training Documentation]() | tokenized, ssplitted, pos-tagged, (lemmatized) `CoreDocument` | `CoreDocument` with `NamedEntityTagAnnotation` or `NormalizedNamedEntityTagAnnotation`| | Functionalities extensible | X | [Custom annotator]() | | | | | Can consume own models | X | | [Example of including own (caseless) model]() | | | | From 6f01bfef27b42da950de445e4d7238cc056fa092 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Tue, 9 Feb 2021 11:23:36 +0100 Subject: [PATCH 31/39] Revise Textblob --- src/architecture/extensibility/nlp-survey.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index f81b77f..5a2d052 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -368,7 +368,7 @@ The results of the survey are below. 1. [ ] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java - - Can be addressed as follows: n/a + - Can be addressed as follows: to the best of our knowledge it's only accessible via common ways to integrate Python scripts in Java (e.g. JEPP, PythonInterpreter, Runtime.exec(), ...) 2. [ ] Uses Maven (`pom.xml` exists) 3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) 4.
[ ] Is available from a p2 repository: From a65999427b570e59dee54fe84386a41ec3297d13 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Tue, 9 Feb 2021 12:02:24 +0100 Subject: [PATCH 32/39] Add Clear NLP --- src/architecture/extensibility/nlp-survey.md | 33 ++++++++++++++++++-- 1 file changed, 30 insertions(+), 3 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 5a2d052..6365232 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -23,7 +23,7 @@ Specifically, the following properties were checked: The results of the survey are below. --- ---- + ### [AllenNLP]() 1. [ ] Implemented in Java @@ -51,7 +51,6 @@ The results of the survey are below. ---- ### [Apache OpenNLP]() @@ -79,7 +78,35 @@ The results of the survey are below. ---- + +### [Clear NLP]() + +1. [X] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: n/a +2. [X] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [X] Is available from a p2 repository: + +#### Feature matrix + +
+ +| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | +|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| +| Tokenization/segmentation | X | | [Tokenizer Documentation]() | | | input file in [raw format]() | output file in [line format]() | +| Sentencing | X | | [Sentence-Splitter Documentation]() | | | input file in [raw format]() | output file in [line format]() | +| POS-tagging | X | | | X | [POS-Tagger Documentation]() | input file in [raw format]() | output file in [tab separated values format]() | +| Constituency parsing | | | | | | | | +| Dependency parsing | X | | [DepParse Documentation]() | X | [General Training Documentation]() | input file in [raw format]() | output file in [tab separated values format]() | +| Named Entity Recognition | X | | [NER Documentation]() | X | [General Training Documentation]() | input file in [raw format]() | output file in [tab separated values format]() | +| Functionalities extensible | X | | [Configuration documentation]() | | | | | +| Can consume own models | X | | [How to add models]() | | | | | + + +
+ + ### [CogComp NLP Pipeline]() From 1d92e9a245288b48e1f6c404e31b370c05b2e338 Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Tue, 9 Feb 2021 12:44:34 +0100 Subject: [PATCH 33/39] Unify capitalization of data formats --- src/architecture/extensibility/nlp-survey.md | 102 ++++++++++++------- 1 file changed, 65 insertions(+), 37 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 6365232..b969ba7 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -22,7 +22,7 @@ Specifically, the following properties were checked: The results of the survey are below. ---- + ### [AllenNLP]() @@ -39,12 +39,12 @@ The results of the survey are below. | | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | [-SpacyTokenizer]()
[-PretrainedTransformerTokenizer]() | [Tokenizer Documentation]() | | | Text as str | List[Token] | -| Sentencing | X | -Sentence Splitter
-SpaCy Sentence Splitter | [Sentence-Splitter API]() | | | Text as str | List[str] | -| POS-tagging | | | | X | [Train SentenceTaggerPredictor]() | Sentence as str | Dict[str, numpy.ndarray] | +| Tokenization/segmentation | X | [-SpacyTokenizer]()
[-PretrainedTransformerTokenizer]() | [Tokenizer Documentation]() | | | text as str | list[token] | +| Sentencing | X | -Sentence Splitter
-SpaCy Sentence Splitter | [Sentence-Splitter API]() | | | text as str | list[str] | +| POS-tagging | | | | X | [Train SentenceTaggerPredictor]() | sentence as str | dict[str, numpy.ndarray] | | Constituency parsing | X | | [Constituency Parser Demo]() | | | | | | Dependency parsing | X | | [Dependency Parser Demo]() | | | | | -| Named Entity Recognition | | | | X | [Train SentenceTaggerPredictor]() | Sentence as str | Dict[str, numpy.ndarray] | +| Named Entity Recognition | | | | X | [Train SentenceTaggerPredictor]() | sentence as str | dict[str, numpy.ndarray] | | Functionalities extensible | | | | | | | | | Can consume own models | X | | [Building own models in AllenNLP]() | | | | | @@ -66,12 +66,12 @@ The results of the survey are below. | | Has functionality | Multiple Options | Functionality documentation | Is Trainable | Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation |X|[-Whitespace Tokenizer]()
[-Character Tokenizer]()
[-Maximum Entropy Tokenizer]() |[Tokenizer Documentation]()| X | [Tokenizer Training Documentation]() |Text as String | -Array of Strings
-Array of Token Spans| -| Sentencing|X|[-Newline Sentence-Splitter]()
[-Maximum Entropy Sentence-Splitter]() |[Sentence-Splitter Documentation]()| X | [Sentence-Splitter Training Documentation]() |Text as String | -Array of Strings
-Array of sentence Spans | -| POS-tagging| X | | [POS-Tagger Documentation]() | x | [POS-Tagger Training Documentation]() | String Array of Tokens | String Array of POS-tags | -| Constituency parsing | X | | [Constituency Parser Documentation]() | X | [Constituency Parser Documentation]()| String of whitespace tokenized Sentence | Array of OpenNLP's Parse Type | +| Tokenization/segmentation |X|[-Whitespace Tokenizer]()
[-Character Tokenizer]()
[-Maximum Entropy Tokenizer]() |[Tokenizer Documentation]()| X | [Tokenizer Training Documentation]() |text as string | -array of strings
-array of token spans| +| Sentencing|X|[-Newline Sentence-Splitter]()
[-Maximum Entropy Sentence-Splitter]() |[Sentence-Splitter Documentation]()| X | [Sentence-Splitter Training Documentation]() |text as string | -array of strings
-array of sentence spans | +| POS-tagging| X | | [POS-Tagger Documentation]() | x | [POS-Tagger Training Documentation]() | string array of tokens | string array of POS-tags | +| Constituency parsing | X | | [Constituency Parser Documentation]() | X | [Constituency Parser Documentation]()| string of whitespace tokenized sentence | array of OpenNLP's `Parse` type | | Dependency parsing | | | | | | | | -| Named Entity Recognition | X | | [NER Documentation]()| X | [NER Training Documentation]() | String Array of Tokens | Array of name Spans | +| Named Entity Recognition | X | | [NER Documentation]()| X | [NER Training Documentation]() | string array of tokens | array of name spans | |Functionalities extensible|X| | [Extension writing Documentation]()| | | | | Can consume own models | X | | [General API description]() | | | | | @@ -123,12 +123,12 @@ The results of the survey are below. | | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | | [General description]() | | | String of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each Annotation | +| Tokenization/segmentation | X | | [General description]() | | | string of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each annotation | | Sentencing | | | | | | | | -| POS-tagging | X | | [General description]() | | | String of raw text | `TextAnnotation` data structure with `.getView(Viewname)` for each Annotation | -| Constituency parsing | X | | [General description]()
[Parser documentation]( ) | | | String of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each Annotation | -| Dependency parsing | X | -Stanford NLP Parser
-CogComp Parser (requires POS-Tagger and Chunker as part of the Pipeline) | [-General description]()
[-Stanford Parser documentation]( ) | | | String of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each Annotation | -| Named Entity Recognition | X (isn't part of the Pipeline) | | [NER Documentation]() | X | [Training Documentation]() | String of raw text | `TextAnnotation` data structure with `.getView(Viewname)` for each Annotation | +| POS-tagging | X | | [General description]() | | | string of raw text | `TextAnnotation` data structure with `.getView(Viewname)` for each annotation | +| Constituency parsing | X | | [General description]()
[Parser documentation]( ) | | | string of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each annotation | +| Dependency parsing | X | -Stanford NLP Parser
-CogComp Parser (requires POS-Tagger and Chunker as part of the Pipeline) | [-General description]()
[-Stanford Parser documentation]( ) | | | string of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each annotation | +| Named Entity Recognition | X (isn't part of the Pipeline) | | [NER Documentation]() | X | [Training Documentation]() | string of raw text | `TextAnnotation` data structure with `.getView(Viewname)` for each annotation | | Functionalities extensible | X | | [Configuration documentation]() | | | | | | Can consume own models | X | | [Use own Tokenizer]() | | | | | @@ -178,9 +178,9 @@ The results of the survey are below. | | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | [-IndoEuropeanTokenizer]()
[-CharacterTokenizer]()
[-RegExTokenizer]()
[-NGramTokenizer]()
[-LineTokenizerFactory]() | [Lingpipe Book]() Chapter 3 (p.33)
[Tokenizer API]()
[Tokenization API]( )| | | -String of text
-Character array, Startindex, Endindex | array of token strings | -| Sentencing | X | | [Sentence-Splitter Tutorial]( )| X | | Char[] array of text | Sentences as set of chunks | -| POS-tagging | X | [-Chain CRF Tagger]()
[-Classifier Tagger]()
[-HMMDecoder]() | [-Lingpipe Book Chapter 11]()
[-POS-Tagger Tutorial]() | X | [POS-Tagger Tutorial Paragraph: Training Training Part-of-Speech Models]() | List tokenList | [Tagging]() | +| Tokenization/segmentation | X | [-IndoEuropeanTokenizer]()
[-CharacterTokenizer]()
[-RegExTokenizer]()
[-NGramTokenizer]()
[-LineTokenizerFactory]() | [Lingpipe Book]() Chapter 3 (p.33)
[Tokenizer API]()
[Tokenization API]( )| | | -string of text
-char[] array, startindex, endindex | array of token strings | +| Sentencing | X | | [Sentence-Splitter Tutorial]( )| X | | char[] array of text | sentences as set of chunks | +| POS-tagging | X | [-Chain CRF Tagger]()
[-Classifier Tagger]()
[-HMMDecoder]() | [-Lingpipe Book Chapter 11]()
[-POS-Tagger Tutorial]() | X | [POS-Tagger Tutorial Paragraph: Training Training Part-of-Speech Models]() | list of tokens as strings | [Tagging]() | | Constituency parsing | | | | | | | | Dependency parsing | | | | | | | | | Named Entity Recognition | X | | [NER Documentation]()| | | | | @@ -210,7 +210,7 @@ The results of the survey are below. | Sentencing | | | | | | | | POS-tagging | X | X | [POS-Tagger Documentation]() | X|[POS-Tagger Training Documentation]() | | | | Constituency parsing | | | | | | | -| Dependency parsing | X | X | [DepParse Documentation]() | | | - Filepath with POS-tagged Dataset in CONNLL-U format
-list of list of `ConllEntry`, where each entry represents a POS-tagged Token and each nested List a sentence | | +| Dependency parsing | X | X | [DepParse Documentation]() | | | - Filepath with POS-tagged Dataset in CONNLL-U format
-list of list of `ConllEntry`, where each entry represents a POS-tagged token and each nested list a sentence | | | Named Entity Recognition | X | [Neural Tagger- CNNLSTM module]()
[Neural Tagger- IDCNN]()
[Transformer Token Classifier]() | [NER Documentation]() | X | [NER Training Documentation]() | | | | Trainable models | X | | | | | | | Can consume own models | X | | | | | | @@ -233,12 +233,12 @@ The results of the survey are below. | | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | | [Tokenizer basic information]()
[Tokenizer Demo]() | X | [Basic Training information]() | -string line of text
-Raw document text
tsv-format | List of Token types | +| Tokenization/segmentation | X | | [Tokenizer basic information]()
[Tokenizer Demo]() | X | [Basic Training information]() | -string line of text
-raw document text
tsv-format | list of `Token` types | | Sentencing | | | | | | | | -| POS-tagging | X | | [POS-Tag basic information]() | X | [Basic Training information]() | -string line of text
-Raw document text
tsv-format | Output-file with annotations| +| POS-tagging | X | | [POS-Tag basic information]() | X | [Basic Training information]() | -string line of text
-raw document text
tsv-format | output-file with annotations| | Constituency parsing | | | | | | | | -| Dependency parsing | X | | [Dependency Parse basic information]() | X | [Basic Training information]() | -string line of text
-Raw document text
tsv-format | Output-file with annotations | -| Named Entity Recognition | X | | [NER Documentation]()| X | [Basic Training information]() | -string line of text
-Raw document text
tsv-format | Output-file with annotations | +| Dependency parsing | X | | [Dependency Parse basic information]() | X | [Basic Training information]() | -string line of text
-raw document text
tsv-format | output-file with annotations | +| Named Entity Recognition | X | | [NER Documentation]()| X | [Basic Training information]() | -string line of text
-raw document text
tsv-format | output-file with annotations | | Functionalities extensible | | | | | | | | | Can consume own models | | | | | | | | @@ -265,8 +265,8 @@ The results of the survey are below. | Sentencing | X | [-Punkt Sent Module]()
[-Regexp Tokenizer]()
[-Simple Tokenizer]() | [-NLTK Tokenizer/Sentence Splitter Documentation]()
[-NLTK Book Chapter 3]() |X | [-Punkt Sent Module]() | str of text | list(str) of sentences | | POS-tagging | X | [-Brill Tagger]()
[-CRF Tagger]()
[-HMM Tagger]()
[-HunPos Tagger]()
[-Perceptron Tagger]()
[-Senna POS Tagger]()
[-Sequential Backoff Tagger]()
[-Stanford Tagger]()
[-TNT Tagger]()| [-NLTK Tagger Documentation]()
[-NLTK Book Chapter 5]() | X| [-Brill Tagger Training]()
[-CRF Tagger]()
[-HMM Tagger]()
[HunPos Tagger]()
[-Perceptron Tagger]()
[-Sequential Backoff Tagger]()
[-Stanford Tagger]()
[-TNT Tagger]() | -tokens (list(str))
-sentences (list(list(str))) | -list(tuple(str, str))
-list(list(tuple(str, str)) ) | | Constituency parsing | X | [-Early Chart Parser]()
[-Recursive descent Parser]()
[-Shift Reduce Parser]()
[-Standford Parser]()
| [-NLTK Parser Documentation]()
[-NLTK Book - Chapter 8]() | | | sent (list(str))
sentences (list(list(str))) | -iter(Tree)
-iter(iter(Tree)) | -| Dependency parsing | X | [-CoreNLP Dependency Parser]()
[-Malt Parser]()
[-Nonprojective Dependency Parser]()
[-Projective Dependency Parser]()
[-Transition Parser]() | [-NLTK Parser Documentation]()
[-NLTK Book - Chapter 8 Pargraph 5]() | X | [-Malt Parser]()
[-Nonprojective Dependency Parser]()
[-Projective Dependency Parser]()
[-Transition Parser]() | -CoreNLP: sentences (list(str)) – Input sentences to parse
-Malt: str sentences | CoreNLP: iter(iter(Tree))
Malt:iter(DependencyGraph) | -| Named Entity Recognition | X | | [-NLTK Book Chapter 7 Paragraph 5](< https://www.nltk.org/book/ch07.html>)
[-NE Chunker]()| X | [-NE Chunker]()| list of POS-Tagged tokens | NE-tagged parse-tree | +| Dependency parsing | X | [-CoreNLP Dependency Parser]()
[-Malt Parser]()
[-Nonprojective Dependency Parser]()
[-Projective Dependency Parser]()
[-Transition Parser]() | [-NLTK Parser Documentation]()
[-NLTK Book - Chapter 8 Paragraph 5]() | X | [-Malt Parser]()
[-Nonprojective Dependency Parser]()
[-Projective Dependency Parser]()
[-Transition Parser]() | -CoreNLP: sentences (list(str)) – input sentences to parse
-Malt: str sentences | CoreNLP: iter(iter(Tree))
Malt:iter(DependencyGraph) | +| Named Entity Recognition | X | | [-NLTK Book Chapter 7 Paragraph 5](< https://www.nltk.org/book/ch07.html>)
[-NE Chunker]()| X | [-NE Chunker]()| list of POS-tagged tokens | NE-tagged parse-tree | | Functionalities extensible | | | | | | | | | Can consume own models | X | | | | | | | @@ -289,9 +289,9 @@ The results of the survey are below. | | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | |[General documentation]() | | | String of text | nested list : [sentences[chunks[tokens]]] | -| Sentencing | X | | [General documentation]()| | | String of text | nested list : [sentences[chunks[tokens]]] | | | -| POS-tagging | X | | [General documentation]() | | | String of text | nested list : [sentences[chunks[tokens]]] | +| Tokenization/segmentation | X | |[General documentation]() | | | string of text | nested list : [sentences[chunks[tokens]]] | +| Sentencing | X | | [General documentation]()| | | string of text | nested list : [sentences[chunks[tokens]]] | | | +| POS-tagging | X | | [General documentation]() | | | string of text | nested list : [sentences[chunks[tokens]]] | | Constituency parsing | | | | | | | | | Dependency parsing | | | | | | | | | Named Entity Recognition | | | | | | | | @@ -326,7 +326,7 @@ The results of the survey are below. | POS-tagging | X | | [POS-Tagger Documentation]() |X | [Training Basics]()
[Updating POS-Tagger]() | Spacy's `Doc` type |Spacy's `Doc` type - POS-Tags access through`token.pos_` attribute| | Constituency parsing | | | | | | | | | Dependency parsing | X | | [Parser Documentation]() | X| [Training Basics]()
[Updating Dependencs Parser]() | Spacy's `Doc` type | Spacy's `Doc` type - Parse-tags access through `token.dep_` attribute | -| Named Entity Recognition | X | | [NER Documentation]()| | |SpaCy's `Doc` type | SpaCy's `Doc` type - entitytags access through `doc.ents` property| +| Named Entity Recognition | X | | [NER Documentation]()| | |SpaCy's `Doc` type | SpaCy's `Doc` type - entity-tags access through `doc.ents` property| | Functionalities extensible | X | | [Extend Tokenizer]()
[Customize Tokenizer]()
[Custom Component]() | | | | | Can consume own models | (X) | | [Load different Tokenizer]() | | | | @@ -349,13 +349,13 @@ The results of the survey are below. | | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | [^2] | [Tokenizer Documentation]() | | | `Document` Type | `Token` Type | -| Sentencing | X | [^2] | [Sentence Detector Documentation]() | | | `Document` Type | `Sentence` Types | -| POS-tagging | X | [^2] | [POS-Tagger Documentation]() | X | [POS Training Documentation]()
[General Training Documentation]()| `Document` Type
`Token`Type | `POS` Type | +| Tokenization/segmentation | X | [^2] | [Tokenizer Documentation]() | | | `Document` type | `Token` types | +| Sentencing | X | [^2] | [Sentence Detector Documentation]() | | | `Document` type | `Sentence` types | +| POS-tagging | X | [^2] | [POS-Tagger Documentation]() | X | [POS Training Documentation]()
[General Training Documentation]()| `Document` type
`Token`types | `POS` Type | | Constituency parsing | | | | | | | | -| Dependency parsing | X | [-Typed Dependency Parser]()
[Untyped Dependency Parser]() | [Dependency Parser Documentation]()| X | [General Training Documentation]() | `Document` Type
`POS` Type
`Token` Type | (unlabeled) `Dependency` Type | +| Dependency parsing | X | [-Typed Dependency Parser]()
[Untyped Dependency Parser]() | [Dependency Parser Documentation]()| X | [General Training Documentation]() | `Document` type
`POS` type
`Token` type | (unlabeled) `Dependency` types | | Named Entity Recognition | X | [-NER CRF]()
[-NER DL]()| [NER Documentation](<>)| X | [-NER CRF]()
[-NER DL]() | | | -| Functionalities extensible | X | | [Manipulating Pipelines]() | | | `Document` Type | `Named_Entity` Type | +| Functionalities extensible | X | | [Manipulating Pipelines]() | | | `Document` type | `Named_Entity` types | | Can consume own models | X | | [General Concept Documentation]() | | | | | @@ -388,6 +388,34 @@ The results of the survey are below. | Can consume own models | X | | [Example of including own (caseless) model]() | | | | + + +### [Talismane]() + +1. [X] Implemented in Java + 1. [ ] Not Java, but API can be addressed from Java + - Can be addressed as follows: to the best of our knowledge it's only accesible via common ways to integrate Python scripts in Java (e.g. JEPP, PythonInterpreter, Runtime.exec(),..) +2. [X] Uses Maven (`pom.xml` exists) +3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [X] Is available from a p2 repository: https://mvnrepository.com/artifact/com.joliciel.talismane/talismane-core + +#### Feature matrix + +
+ +| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | +|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| +| Tokenization/segmentation | X | -Simple tokenizer
-pattern tokenizer | [Tokenization Documentation](<>) | X | [Tokenizer Training Documentation]() | string of raw text | CoNLL format | +| Sentencing | X | | [Sentence-Splitting Tutorial]()
[Advanced Sentence-Splitting Documentation]() | X | [Sentence-Splitter Training Documentation]() | string of raw text | CoNLL format | +| POS-tagging | X | | [POS-Tagger Documentation]()
[POS-Tagger Advanced Usage]() | X | [POS-Tagger Training Documentation]() | String of raw text | CoNLL format | +| Constituency parsing | | | | | | | +| Dependency parsing | X | |[DepParser Documentation (under construction)]() | X | [DepParser Training Documentation]() | string of raw text | CoNLL format | +| Named Entity Recognition | | | | | | | | +| Functionalities extensible | X | | [Advanced Usage]() | | | | | +| Can consume own models | X | | [Passing models into the Pipeline]()
[Training own data]() | | | | | + + +
@@ -406,9 +434,9 @@ The results of the survey are below. | | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | | [Tokenization Tutorial]()
[Advanced Tokenization Documentation]() | | | String of raw text | `TextBlob` data type - access through `words` property: WordList of word strings | -| Sentencing | X | | [Sentence-Splitting Tutorial]()
[Advanced Sentence-Splitting Documentation]() | | | String of raw text | `TextBlob` data type - access through `sentences` property: List of Sentence Objects| -| POS-tagging | X | -PatternTagger
-NLTKTagger | [POS-Tagger Tutorial]()
[POS-Tagger Advanced Usage]() | | | String of raw text | `TextBlob` data type - access through `tags` property: List of word string tag string tuples | +| Tokenization/segmentation | X | | [Tokenization Tutorial]()
[Advanced Tokenization Documentation]() | | | string of raw text | `TextBlob` data type - access through `words` property: WordList of word strings | +| Sentencing | X | | [Sentence-Splitting Tutorial]()
[Advanced Sentence-Splitting Documentation]() | | | string of raw text | `TextBlob` data type - access through `sentences` property: list of sentence objects| +| POS-tagging | X | -PatternTagger
-NLTKTagger | [POS-Tagger Tutorial]()
[POS-Tagger Advanced Usage]() | | | string of raw text | `TextBlob` data type - access through `tags` property: List of word string tag string tuples | | Constituency parsing | X | | [Parser Tutorial]()
[Parser Advanced Usage]()| | | String of raw text | `TextBlob` data type - access through `parse()` method: TaggedString | | Dependency parsing | X | |[Parser Tutorial]()
[Parser Advanced Usage]()| | |String of raw text | `TextBlob` data type - access through `parse()` method: TaggedString | | Named Entity Recognition | | | | | | | | From 3a983936b4e7d912d47356d5724509663743e1ca Mon Sep 17 00:00:00 2001 From: clachenmaier Date: Wed, 24 Feb 2021 14:37:58 +0100 Subject: [PATCH 34/39] Remove wrong repositories --- src/architecture/extensibility/nlp-survey.md | 28 ++++++++++---------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index b969ba7..51d473b 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -58,8 +58,8 @@ The results of the survey are below. 1. [ ] Not Java, but API can be addressed from Java - Can be addressed as follows: n/a 2. [x] Uses Maven (`pom.xml` exists) -3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) -4. [X] Is available from a p2 repository: [OpenNLP Repository]() +3. [?] Is available as OSGi bundle (has `MANIFEST.MF`) +4. [?] Is available from a p2 repository: #### Feature matrix
@@ -86,7 +86,7 @@ The results of the survey are below. - Can be addressed as follows: n/a 2. [X] Uses Maven (`pom.xml` exists) 3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) -4. [X] Is available from a p2 repository: +4. [ ] Is available from a p2 repository: #### Feature matrix @@ -115,7 +115,7 @@ The results of the survey are below. - Can be addressed as follows: n/a 2. [X] Uses Maven (`pom.xml` exists) 3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) -4. [X] Is available from a p2 repository: +4. [ ] Is available from a p2 repository: #### Feature matrix @@ -225,7 +225,7 @@ The results of the survey are below. - Can be addressed as follows: n/a 2. [X] Uses Maven (`pom.xml` exists) 3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) -4. [X] Is available from a p2 repository: +4. [ ] Is available from a p2 repository: #### Feature matrix @@ -341,7 +341,7 @@ The results of the survey are below. - Can be addressed as follows: [Using Spark NLP via Scala and Maven]() 2. [X] Uses Maven (`pom.xml` exists) 3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) -4. [X] Is available from a p2 repository: +4. [ ] Is available from a p2 repository: #### Feature matrix @@ -370,17 +370,17 @@ The results of the survey are below. - Can be addressed as follows: n/a 2. [X] Uses Maven (`pom.xml` exists) 3. [X] Is available as OSGi bundle (has `MANIFEST.MF`) -4. [X] Is available from a p2 repository: - +4. [X] Is available from a p2 repository: ++ #### Feature matrix
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | see [list]() of classes implementing the Tokenizer interface | [Tokenizer Documentation]() | | | -string of document text
-CoreNLPs `CoreDocument` (Instatiated with string document text) | -list of strings
-list of characteroffsetbegin indices
-list of characteroffsetendindices
`CoreDocument` with previous annotation properties | +| Tokenization/segmentation | X | [-Abstract Tokenizer]()
[-CHBT Tokenizer]()
[-Lexer Tokenizer]()
[-NegraPennTokenizer]()
[-PennTreebankTokenizer]()
[-PTBTokenizer]()
[-Robust Tokenizer]()
[-WhitespaceTokenizer]() | [Tokenizer Documentation]() | | | -string of document text
-CoreNLP's `CoreDocument` (instantiated with string document text) | -list of strings
-list of character offset begin indices
-list of character offset end indices
`CoreDocument` with previous annotation properties | | Sentencing | X | | [Sentencer Documentation]() | | | tokenized `CoreDocument` | `CoreDocument` with Sentence List of POS-Tags as property | -| POS-tagging | X | | [POS-Tag Documentation]() | | | tokenized and sentence-splitted `CoreDocument` | `CoreDocument` with String List of POS-Tags as property | +| POS-tagging | X | | [POS-Tag Documentation]() | X | | tokenized and sentence-splitted `CoreDocument` | `CoreDocument` with String List of POS-Tags as property | | Constituency parsing | X | [Viterbi Parser]()
[Shift reduce Parser]()
[Iterative CKYPCFG Parser]()
[Fast Factored Parser]()
[Exhaustive PCFG Parser]() | [Constituency Parser Documentation]() | | | tokenized, sentence-splitted (and for some models POS-tagged) `CoreDocument` | `CoreDocument` with TreeAnnotation (exact form depends on chosen parser) | | Dependency parsing | X |[BiLexPCFGParser]()
[Exhaustive Dependency Parser]() |[DepParse Documentation]() | X | [Train own Model]() |tokenized, sentence-splitted and POS-tagged `CoreDocument` | `CoreDocument` with DependencyAnnotation (exact form depends on chosen parser) | | Named Entity Recognition | X | [NER Classifier Combiner]()
[-RegexNERAnnotator]() | [NER Documentation]()| X | [NER Training Documentation]() | tokenized, ssplitted, pos-tagged, (lemmatized) `CoreDocument` | `CoreDocument` with `NamedEntityTagAnnotation` or `NormalizedNamedEntityTagAnnotation`| @@ -397,7 +397,7 @@ The results of the survey are below. - Can be addressed as follows: to the best of our knowledge it's only accesible via common ways to integrate Python scripts in Java (e.g. JEPP, PythonInterpreter, Runtime.exec(),..) 2. [X] Uses Maven (`pom.xml` exists) 3. [ ] Is available as OSGi bundle (has `MANIFEST.MF`) -4. [X] Is available from a p2 repository: https://mvnrepository.com/artifact/com.joliciel.talismane/talismane-core +4. [ ] Is available from a p2 repository: #### Feature matrix @@ -406,13 +406,13 @@ The results of the survey are below. | | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| | Tokenization/segmentation | X | -Simple tokenizer
-pattern tokenizer | [Tokenization Documentation](<>) | X | [Tokenizer Training Documentation]() | string of raw text | CoNLL format | -| Sentencing | X | | [Sentence-Splitting Tutorial]()
[Advanced Sentence-Splitting Documentation]() | X | [Sentence-Splitter Training Documentation]() | string of raw text | CoNLL format | -| POS-tagging | X | | [POS-Tagger Documentation]()
[POS-Tagger Advanced Usage]() | X | [POS-Tagger Training Documentation]() | String of raw text | CoNLL format | +| Sentencing | X | | | X | [Sentence-Splitter Training Documentation]() | string of raw text | CoNLL format | +| POS-tagging | X | | | X | [POS-Tagger Training Documentation]() | String of raw text | CoNLL format | | Constituency parsing | | | | | | | | Dependency parsing | X | |[DepParser Documentation (under construction)]() | X | [DepParser Training Documentation]() | string of raw text | CoNLL format | | Named Entity Recognition | | | | | | | | | Functionalities extensible | X | | [Advanced Usage]() | | | | | -| Can consume own models | X | | [Passing models into the Pipeline]()
[Training own data]() | | | | | +| Can consume own models | | | | | | | | From f5a4985f98d9e7f58de9dd815db4845c2eb2bb9e Mon Sep 17 00:00:00 2001 From: Stephan Druskat Date: Wed, 24 Feb 2021 18:52:50 +0100 Subject: [PATCH 35/39] Remove whitespaces and prepare PR --- src/architecture/extensibility/nlp-survey.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 51d473b..f114e39 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -22,8 +22,6 @@ Specifically, the following properties were checked: The results of the survey are below. - - ### [AllenNLP]() 1. [ ] Implemented in Java From efa89a34e7ae223ce5c4ee189525b0022d73916c Mon Sep 17 00:00:00 2001 From: Stephan Druskat Date: Wed, 24 Feb 2021 19:17:36 +0100 Subject: [PATCH 36/39] Clean up whitespaces --- src/architecture/extensibility/nlp-survey.md | 110 +++++++++---------- 1 file changed, 55 insertions(+), 55 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index f114e39..b54e52a 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -33,12 +33,12 @@ The results of the survey are below. #### Feature matrix -
+
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | [-SpacyTokenizer]()
[-PretrainedTransformerTokenizer]() | [Tokenizer Documentation]() | | | text as str | list[token] | -| Sentencing | X | -Sentence Splitter
-SpaCy Sentence Splitter | [Sentence-Splitter API]() | | | text as str | list[str] | +| Tokenization/segmentation | X | [-SpacyTokenizer]()
[-PretrainedTransformerTokenizer]() | [Tokenizer Documentation]() | | | text as str | list[token] | +| Sentencing | X | -Sentence Splitter
-SpaCy Sentence Splitter | [Sentence-Splitter API]() | | | text as str | list[str] | | POS-tagging | | | | X | [Train SentenceTaggerPredictor]() | sentence as str | dict[str, numpy.ndarray] | | Constituency parsing | X | | [Constituency Parser Demo]() | | | | | | Dependency parsing | X | | [Dependency Parser Demo]() | | | | | @@ -60,12 +60,12 @@ The results of the survey are below. 4. [?] Is available from a p2 repository: #### Feature matrix -
+
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable | Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation |X|[-Whitespace Tokenizer]()
[-Character Tokenizer]()
[-Maximum Entropy Tokenizer]() |[Tokenizer Documentation]()| X | [Tokenizer Training Documentation]() |text as string | -array of strings
-array of token spans| -| Sentencing|X|[-Newline Sentence-Splitter]()
[-Maximum Entropy Sentence-Splitter]() |[Sentence-Splitter Documentation]()| X | [Sentence-Splitter Training Documentation]() |text as string | -array of strings
-array of sentence spans | +| Tokenization/segmentation |X|[-Whitespace Tokenizer]()
[-Character Tokenizer]()
[-Maximum Entropy Tokenizer]() |[Tokenizer Documentation]()| X | [Tokenizer Training Documentation]() |text as string | -array of strings
-array of token spans| +| Sentencing|X|[-Newline Sentence-Splitter]()
[-Maximum Entropy Sentence-Splitter]() |[Sentence-Splitter Documentation]()| X | [Sentence-Splitter Training Documentation]() |text as string | -array of strings
-array of sentence spans | | POS-tagging| X | | [POS-Tagger Documentation]() | x | [POS-Tagger Training Documentation]() | string array of tokens | string array of POS-tags | | Constituency parsing | X | | [Constituency Parser Documentation]() | X | [Constituency Parser Documentation]()| string of whitespace tokenized sentence | array of OpenNLP's `Parse` type | | Dependency parsing | | | | | | | | @@ -88,7 +88,7 @@ The results of the survey are below. #### Feature matrix -
+
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| @@ -117,15 +117,15 @@ The results of the survey are below. #### Feature matrix -
+
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| | Tokenization/segmentation | X | | [General description]() | | | string of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each annotation | | Sentencing | | | | | | | | | POS-tagging | X | | [General description]() | | | string of raw text | `TextAnnotation` data structure with `.getView(Viewname)` for each annotation | -| Constituency parsing | X | | [General description]()
[Parser documentation]( ) | | | string of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each annotation | -| Dependency parsing | X | -Stanford NLP Parser
-CogComp Parser (requires POS-Tagger and Chunker as part of the Pipeline) | [-General description]()
[-Stanford Parser documentation]( ) | | | string of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each annotation | +| Constituency parsing | X | | [General description]()
[Parser documentation]( ) | | | string of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each annotation | +| Dependency parsing | X | -Stanford NLP Parser
-CogComp Parser (requires POS-Tagger and Chunker as part of the Pipeline) | [-General description]()
[-Stanford Parser documentation]( ) | | | string of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each annotation | | Named Entity Recognition | X (isn't part of the Pipeline) | | [NER Documentation]() | X | [Training Documentation]() | string of raw text | `TextAnnotation` data structure with `.getView(Viewname)` for each annotation | | Functionalities extensible | X | | [Configuration documentation]() | | | | | | Can consume own models | X | | [Use own Tokenizer]() | | | | | @@ -145,16 +145,16 @@ The results of the survey are below. #### Feature matrix -
+
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| | Tokenization/segmentation | X | | [ANNIE Tokenizer Documentation]( )| | | string of document text | output file | -| Sentencing | X | -ANNIE Default Sentence Splitter
-ANNIE Regex Sentence Splitter | [ANNIE Sentence Splitter Documentation]( ) | | | | +| Sentencing | X | -ANNIE Default Sentence Splitter
-ANNIE Regex Sentence Splitter | [ANNIE Sentence Splitter Documentation]( ) | | | | | POS-tagging | X | | [ANNIE POS-Tagger Documentation]() | | | | | Constituency parsing | | | | | | | | Dependency parsing | | | | | | | -| Named Entity Recognition | x | | [Gazeteer documentation]()
[-Semantic Tagger]() | | | | | +| Named Entity Recognition | x | | [Gazeteer documentation]()
[-Semantic Tagger]() | | | | | | Functionalities extensible | | | | | | | | Can consume own models | | | | | | | @@ -172,13 +172,13 @@ The results of the survey are below. #### Feature matrix -
+
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | [-IndoEuropeanTokenizer]()
[-CharacterTokenizer]()
[-RegExTokenizer]()
[-NGramTokenizer]()
[-LineTokenizerFactory]() | [Lingpipe Book]() Chapter 3 (p.33)
[Tokenizer API]()
[Tokenization API]( )| | | -string of text
-char[] array, startindex, endindex | array of token strings | +| Tokenization/segmentation | X | [-IndoEuropeanTokenizer]()
[-CharacterTokenizer]()
[-RegExTokenizer]()
[-NGramTokenizer]()
[-LineTokenizerFactory]() | [Lingpipe Book]() Chapter 3 (p.33)
[Tokenizer API]()
[Tokenization API]( )| | | -string of text
-char[] array, startindex, endindex | array of token strings | | Sentencing | X | | [Sentence-Splitter Tutorial]( )| X | | char[] array of text | sentences as set of chunks | -| POS-tagging | X | [-Chain CRF Tagger]()
[-Classifier Tagger]()
[-HMMDecoder]() | [-Lingpipe Book Chapter 11]()
[-POS-Tagger Tutorial]() | X | [POS-Tagger Tutorial Paragraph: Training Training Part-of-Speech Models]() | list of tokens as strings | [Tagging]() | +| POS-tagging | X | [-Chain CRF Tagger]()
[-Classifier Tagger]()
[-HMMDecoder]() | [-Lingpipe Book Chapter 11]()
[-POS-Tagger Tutorial]() | X | [POS-Tagger Tutorial Paragraph: Training Training Part-of-Speech Models]() | list of tokens as strings | [Tagging]() | | Constituency parsing | | | | | | | | Dependency parsing | | | | | | | | | Named Entity Recognition | X | | [NER Documentation]()| | | | | @@ -200,7 +200,7 @@ The results of the survey are below. #### Feature matrix -
+
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| @@ -208,8 +208,8 @@ The results of the survey are below. | Sentencing | | | | | | | | POS-tagging | X | X | [POS-Tagger Documentation]() | X|[POS-Tagger Training Documentation]() | | | | Constituency parsing | | | | | | | -| Dependency parsing | X | X | [DepParse Documentation]() | | | - Filepath with POS-tagged Dataset in CONNLL-U format
-list of list of `ConllEntry`, where each entry represents a POS-tagged token and each nested list a sentence | | -| Named Entity Recognition | X | [Neural Tagger- CNNLSTM module]()
[Neural Tagger- IDCNN]()
[Transformer Token Classifier]() | [NER Documentation]() | X | [NER Training Documentation]() | | | +| Dependency parsing | X | X | [DepParse Documentation]() | | | - Filepath with POS-tagged Dataset in CONNLL-U format
-list of list of `ConllEntry`, where each entry represents a POS-tagged token and each nested list a sentence | | +| Named Entity Recognition | X | [Neural Tagger- CNNLSTM module]()
[Neural Tagger- IDCNN]()
[Transformer Token Classifier]() | [NER Documentation]() | X | [NER Training Documentation]() | | | | Trainable models | X | | | | | | | Can consume own models | X | | | | | | @@ -227,16 +227,16 @@ The results of the survey are below. #### Feature matrix -
+
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | | [Tokenizer basic information]()
[Tokenizer Demo]() | X | [Basic Training information]() | -string line of text
-raw document text
tsv-format | list of `Token` types | +| Tokenization/segmentation | X | | [Tokenizer basic information]()
[Tokenizer Demo]() | X | [Basic Training information]() | -string line of text
-raw document text
tsv-format | list of `Token` types | | Sentencing | | | | | | | | -| POS-tagging | X | | [POS-Tag basic information]() | X | [Basic Training information]() | -string line of text
-raw document text
tsv-format | output-file with annotations| +| POS-tagging | X | | [POS-Tag basic information]() | X | [Basic Training information]() | -string line of text
-raw document text
tsv-format | output-file with annotations| | Constituency parsing | | | | | | | | -| Dependency parsing | X | | [Dependency Parse basic information]() | X | [Basic Training information]() | -string line of text
-raw document text
tsv-format | output-file with annotations | -| Named Entity Recognition | X | | [NER Documentation]()| X | [Basic Training information]() | -string line of text
-raw document text
tsv-format | output-file with annotations | +| Dependency parsing | X | | [Dependency Parse basic information]() | X | [Basic Training information]() | -string line of text
-raw document text
tsv-format | output-file with annotations | +| Named Entity Recognition | X | | [NER Documentation]()| X | [Basic Training information]() | -string line of text
-raw document text
tsv-format | output-file with annotations | | Functionalities extensible | | | | | | | | | Can consume own models | | | | | | | | @@ -255,16 +255,16 @@ The results of the survey are below. #### Feature matrix -
+
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | [-Casual module]()
[-Destructive module]()
[-Regexp Tokenizer]()
[-Repp Tokenizer]()
[-Simple Tokenizer]()
[-Stanford Segmenter]()
[-TokTok Tokenizer]()
[-Penn Treebank Tokenizer]()| [-NLTK Tokenizer/Sentence Splitter Documentation]()
[-NLTK Book Chapter 3]() | | | str of text | list(str) of token | -| Sentencing | X | [-Punkt Sent Module]()
[-Regexp Tokenizer]()
[-Simple Tokenizer]() | [-NLTK Tokenizer/Sentence Splitter Documentation]()
[-NLTK Book Chapter 3]() |X | [-Punkt Sent Module]() | str of text | list(str) of sentences | -| POS-tagging | X | [-Brill Tagger]()
[-CRF Tagger]()
[-HMM Tagger]()
[-HunPos Tagger]()
[-Perceptron Tagger]()
[-Senna POS Tagger]()
[-Sequential Backoff Tagger]()
[-Stanford Tagger]()
[-TNT Tagger]()| [-NLTK Tagger Documentation]()
[-NLTK Book Chapter 5]() | X| [-Brill Tagger Training]()
[-CRF Tagger]()
[-HMM Tagger]()
[HunPos Tagger]()
[-Perceptron Tagger]()
[-Sequential Backoff Tagger]()
[-Stanford Tagger]()
[-TNT Tagger]() | -tokens (list(str))
-sentences (list(list(str))) | -list(tuple(str, str))
-list(list(tuple(str, str)) ) | -| Constituency parsing | X | [-Early Chart Parser]()
[-Recursive descent Parser]()
[-Shift Reduce Parser]()
[-Stanford Parser]()
| [-NLTK Parser Documentation]()
[-NLTK Book - Chapter 8]() | | | sent (list(str))
sentences (list(list(str))) | -iter(Tree)
-iter(iter(Tree)) | -| Dependency parsing | X | [-CoreNLP Dependency Parser]()
[-Malt Parser]()
[-Nonprojective Dependency Parser]()
[-Projective Dependency Parser]()
[-Transition Parser]() | [-NLTK Parser Documentation]()
[-NLTK Book - Chapter 8 Paragraph 5]() | X | [-Malt Parser]()
[-Nonprojective Dependency Parser]()
[-Projective Dependency Parser]()
[-Transition Parser]() | -CoreNLP: sentences (list(str)) – input sentences to parse
-Malt: str sentences | CoreNLP: iter(iter(Tree))
Malt:iter(DependencyGraph) | -| Named Entity Recognition | X | | [-NLTK Book Chapter 7 Paragraph 5](< https://www.nltk.org/book/ch07.html>)
[-NE Chunker]()| X | [-NE Chunker]()| list of POS-tagged tokens | NE-tagged parse-tree | +| Tokenization/segmentation | X | [-Casual module]()
[-Destructive module]()
[-Regexp Tokenizer]()
[-Repp Tokenizer]()
[-Simple Tokenizer]()
[-Stanford Segmenter]()
[-TokTok Tokenizer]()
[-Penn Treebank Tokenizer]()| [-NLTK Tokenizer/Sentence Splitter Documentation]()
[-NLTK Book Chapter 3]() | | | str of text | list(str) of token | +| Sentencing | X | [-Punkt Sent Module]()
[-Regexp Tokenizer]()
[-Simple Tokenizer]() | [-NLTK Tokenizer/Sentence Splitter Documentation]()
[-NLTK Book Chapter 3]() |X | [-Punkt Sent Module]() | str of text | list(str) of sentences | +| POS-tagging | X | [-Brill Tagger]()
[-CRF Tagger]()
[-HMM Tagger]()
[-HunPos Tagger]()
[-Perceptron Tagger]()
[-Senna POS Tagger]()
[-Sequential Backoff Tagger]()
[-Stanford Tagger]()
[-TNT Tagger]()| [-NLTK Tagger Documentation]()
[-NLTK Book Chapter 5]() | X| [-Brill Tagger Training]()
[-CRF Tagger]()
[-HMM Tagger]()
[HunPos Tagger]()
[-Perceptron Tagger]()
[-Sequential Backoff Tagger]()
[-Stanford Tagger]()
[-TNT Tagger]() | -tokens (list(str))
-sentences (list(list(str))) | -list(tuple(str, str))
-list(list(tuple(str, str)) ) | +| Constituency parsing | X | [-Early Chart Parser]()
[-Recursive descent Parser]()
[-Shift Reduce Parser]()
[-Stanford Parser]()
| [-NLTK Parser Documentation]()
[-NLTK Book - Chapter 8]() | | | sent (list(str))
sentences (list(list(str))) | -iter(Tree)
-iter(iter(Tree)) | +| Dependency parsing | X | [-CoreNLP Dependency Parser]()
[-Malt Parser]()
[-Nonprojective Dependency Parser]()
[-Projective Dependency Parser]()
[-Transition Parser]() | [-NLTK Parser Documentation]()
[-NLTK Book - Chapter 8 Paragraph 5]() | X | [-Malt Parser]()
[-Nonprojective Dependency Parser]()
[-Projective Dependency Parser]()
[-Transition Parser]() | -CoreNLP: sentences (list(str)) – input sentences to parse
-Malt: str sentences | CoreNLP: iter(iter(Tree))
Malt:iter(DependencyGraph) | +| Named Entity Recognition | X | | [-NLTK Book Chapter 7 Paragraph 5](< https://www.nltk.org/book/ch07.html>)
[-NE Chunker]()| X | [-NE Chunker]()| list of POS-tagged tokens | NE-tagged parse-tree | | Functionalities extensible | | | | | | | | | Can consume own models | X | | | | | | | @@ -283,7 +283,7 @@ The results of the survey are below. #### Feature matrix -
+
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| @@ -315,17 +315,17 @@ The results of the survey are below. #### Feature matrix -
+
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| | Tokenization/segmentation | X | | [Tokenizer Documentaion]( )| | |raw document text | SpaCy's `Doc` type - tokens accessible through `token.text` property | | Sentencing | X | | [Sentencizer Documentation]() | | | SpaCy's `Doc` type | SpaCy's `Doc` type - sentence access through `doc.sents` property | -| POS-tagging | X | | [POS-Tagger Documentation]() |X | [Training Basics]()
[Updating POS-Tagger]() | Spacy's `Doc` type |Spacy's `Doc` type - POS-Tags access through`token.pos_` attribute| +| POS-tagging | X | | [POS-Tagger Documentation]() |X | [Training Basics]()
[Updating POS-Tagger]() | Spacy's `Doc` type |Spacy's `Doc` type - POS-Tags access through`token.pos_` attribute| | Constituency parsing | | | | | | | | -| Dependency parsing | X | | [Parser Documentation]() | X| [Training Basics]()
[Updating Dependency Parser]() | Spacy's `Doc` type | Spacy's `Doc` type - Parse-tags access through `token.dep_` attribute | +| Dependency parsing | X | | [Parser Documentation]() | X| [Training Basics]()
[Updating Dependency Parser]() | Spacy's `Doc` type | Spacy's `Doc` type - Parse-tags access through `token.dep_` attribute | | Named Entity Recognition | X | | [NER Documentation]()| | |SpaCy's `Doc` type | SpaCy's `Doc` type - entity-tags access through `doc.ents` property| -| Functionalities extensible | X | | [Extend Tokenizer]()
[Customize Tokenizer]()
[Custom Component]() | | | | +| Functionalities extensible | X | | [Extend Tokenizer]()
[Customize Tokenizer]()
[Custom Component]() | | | | | Can consume own models | (X) | | [Load different Tokenizer]() | | | | @@ -343,16 +343,16 @@ The results of the survey are below. #### Feature matrix -
+
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| | Tokenization/segmentation | X | [^2] | [Tokenizer Documentation]() | | | `Document` type | `Token` types | | Sentencing | X | [^2] | [Sentence Detector Documentation]() | | | `Document` type | `Sentence` types | -| POS-tagging | X | [^2] | [POS-Tagger Documentation]() | X | [POS Training Documentation]()
[General Training Documentation]()| `Document` type
`Token`types | `POS` Type | +| POS-tagging | X | [^2] | [POS-Tagger Documentation]() | X | [POS Training Documentation]()
[General Training Documentation]()| `Document` type
`Token`types | `POS` Type | | Constituency parsing | | | | | | | | -| Dependency parsing | X | [-Typed Dependency Parser]()
[Untyped Dependency Parser]() | [Dependency Parser Documentation]()| X | [General Training Documentation]() | `Document` type
`POS` type
`Token` type | (unlabeled) `Dependency` types | -| Named Entity Recognition | X | [-NER CRF]()
[-NER DL]()| [NER Documentation](<>)| X | [-NER CRF]()
[-NER DL]() | | | +| Dependency parsing | X | [-Typed Dependency Parser]()
[Untyped Dependency Parser]() | [Dependency Parser Documentation]()| X | [General Training Documentation]() | `Document` type
`POS` type
`Token` type | (unlabeled) `Dependency` types | +| Named Entity Recognition | X | [-NER CRF]()
[-NER DL]()| [NER Documentation](<>)| X | [-NER CRF]()
[-NER DL]() | | | | Functionalities extensible | X | | [Manipulating Pipelines]() | | | `Document` type | `Named_Entity` types | | Can consume own models | X | | [General Concept Documentation]() | | | | | @@ -369,19 +369,19 @@ The results of the survey are below. 2. [X] Uses Maven (`pom.xml` exists) 3. [X] Is available as OSGi bundle (has `MANIFEST.MF`) 4. [X] Is available from a p2 repository: -+ + #### Feature matrix -
+
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | [-Abstract Tokenizer]()
[-CHBT Tokenizer]()
[-Lexer Tokenizer]()
[-NegraPennTokenizer]()
[-PennTreebankTokenizer]()
[-PTBTokenizer]()
[-Robust Tokenizer]()
[-WhitespaceTokenizer]() | [Tokenizer Documentation]() | | | -string of document text
-CoreNLPs `CoreDocument` (Instantiated with string document text) | -list of strings
-list of characteroffsetbegin indices
-list of characteroffsetendindices
`CoreDocument` with previous annotation properties | +| Tokenization/segmentation | X | [-Abstract Tokenizer]()
[-CHBT Tokenizer]()
[-Lexer Tokenizer]()
[-NegraPennTokenizer]()
[-PennTreebankTokenizer]()
[-PTBTokenizer]()
[-Robust Tokenizer]()
[-WhitespaceTokenizer]() | [Tokenizer Documentation]() | | | -string of document text
-CoreNLPs `CoreDocument` (Instantiated with string document text) | -list of strings
-list of characteroffsetbegin indices
-list of characteroffsetendindices
`CoreDocument` with previous annotation properties | | Sentencing | X | | [Sentencer Documentation]() | | | tokenized `CoreDocument` | `CoreDocument` with Sentence List of POS-Tags as property | | POS-tagging | X | | [POS-Tag Documentation]() | X | | tokenized and sentence-splitted `CoreDocument` | `CoreDocument` with String List of POS-Tags as property | -| Constituency parsing | X | [Viterbi Parser]()
[Shift reduce Parser]()
[Iterative CKYPCFG Parser]()
[Fast Factored Parser]()
[Exhaustive PCFG Parser]() | [Constituency Parser Documentation]() | | | tokenized, sentence-splitted (and for some models POS-tagged) `CoreDocument` | `CoreDocument` with TreeAnnotation (exact form depends on chosen parser) | -| Dependency parsing | X |[BiLexPCFGParser]()
[Exhaustive Dependency Parser]() |[DepParse Documentation]() | X | [Train own Model]() |tokenized, sentence-splitted and POS-tagged `CoreDocument` | `CoreDocument` with DependencyAnnotation (exact form depends on chosen parser) | -| Named Entity Recognition | X | [NER Classifier Combiner]()
[-RegexNERAnnotator]() | [NER Documentation]()| X | [NER Training Documentation]() | tokenized, ssplitted, pos-tagged, (lemmatized) `CoreDocument` | `CoreDocument` with `NamedEntityTagAnnotation` or `NormalizedNamedEntityTagAnnotation`| +| Constituency parsing | X | [Viterbi Parser]()
[Shift reduce Parser]()
[Iterative CKYPCFG Parser]()
[Fast Factored Parser]()
[Exhaustive PCFG Parser]() | [Constituency Parser Documentation]() | | | tokenized, sentence-splitted (and for some models POS-tagged) `CoreDocument` | `CoreDocument` with TreeAnnotation (exact form depends on chosen parser) | +| Dependency parsing | X |[BiLexPCFGParser]()
[Exhaustive Dependency Parser]() |[DepParse Documentation]() | X | [Train own Model]() |tokenized, sentence-splitted and POS-tagged `CoreDocument` | `CoreDocument` with DependencyAnnotation (exact form depends on chosen parser) | +| Named Entity Recognition | X | [NER Classifier Combiner]()
[-RegexNERAnnotator]() | [NER Documentation]()| X | [NER Training Documentation]() | tokenized, ssplitted, pos-tagged, (lemmatized) `CoreDocument` | `CoreDocument` with `NamedEntityTagAnnotation` or `NormalizedNamedEntityTagAnnotation`| | Functionalities extensible | X | [Custom annotator]() | | | | | Can consume own models | X | | [Example of including own (caseless) model]() | | | | @@ -399,11 +399,11 @@ The results of the survey are below. #### Feature matrix -
+
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | -Simple tokenizer
-pattern tokenizer | [Tokenization Documentation](<>) | X | [Tokenizer Training Documentation]() | string of raw text | CoNLL format | +| Tokenization/segmentation | X | -Simple tokenizer
-pattern tokenizer | [Tokenization Documentation](<>) | X | [Tokenizer Training Documentation]() | string of raw text | CoNLL format | | Sentencing | X | | | X | [Sentence-Splitter Training Documentation]() | string of raw text | CoNLL format | | POS-tagging | X | | | X | [POS-Tagger Training Documentation]() | String of raw text | CoNLL format | | Constituency parsing | | | | | | | @@ -428,18 +428,18 @@ The results of the survey are below. #### Feature matrix -
+
| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | |---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | | [Tokenization Tutorial]()
[Advanced Tokenization Documentation]() | | | string of raw text | `TextBlob` data type - access through `words` property: WordList of word strings | -| Sentencing | X | | [Sentence-Splitting Tutorial]()
[Advanced Sentence-Splitting Documentation]() | | | string of raw text | `TextBlob` data type - access through `sentences` property: list of sentence objects| -| POS-tagging | X | -PatternTagger
-NLTKTagger | [POS-Tagger Tutorial]()
[POS-Tagger Advanced Usage]() | | | string of raw text | `TextBlob` data type - access through `tags` property: List of word string tag string tuples | -| Constituency parsing | X | | [Parser Tutorial]()
[Parser Advanced Usage]()| | | String of raw text | `TextBlob` data type - access through `parse()` method: TaggedString | -| Dependency parsing | X | |[Parser Tutorial]()
[Parser Advanced Usage]()| | |String of raw text | `TextBlob` data type - access through `parse()` method: TaggedString | +| Tokenization/segmentation | X | | [Tokenization Tutorial]()
[Advanced Tokenization Documentation]() | | | string of raw text | `TextBlob` data type - access through `words` property: WordList of word strings | +| Sentencing | X | | [Sentence-Splitting Tutorial]()
[Advanced Sentence-Splitting Documentation]() | | | string of raw text | `TextBlob` data type - access through `sentences` property: list of sentence objects| +| POS-tagging | X | -PatternTagger
-NLTKTagger | [POS-Tagger Tutorial]()
[POS-Tagger Advanced Usage]() | | | string of raw text | `TextBlob` data type - access through `tags` property: List of word string tag string tuples | +| Constituency parsing | X | | [Parser Tutorial]()
[Parser Advanced Usage]()| | | String of raw text | `TextBlob` data type - access through `parse()` method: TaggedString | +| Dependency parsing | X | |[Parser Tutorial]()
[Parser Advanced Usage]()| | |String of raw text | `TextBlob` data type - access through `parse()` method: TaggedString | | Named Entity Recognition | | | | | | | | | Functionalities extensible | | | | | | | | -| Can consume own models | X | | [Passing models into the Pipeline]()
[Training own data]() | | | | | +| Can consume own models | X | | [Passing models into the Pipeline]()
[Training own data]() | | | | | From 0d140baf4dbefaf2e47d4416615aab53b209a071 Mon Sep 17 00:00:00 2001 From: Stephan Druskat Date: Wed, 24 Feb 2021 19:31:05 +0100 Subject: [PATCH 37/39] Fix tables for use in mdbook --- src/architecture/extensibility/nlp-survey.md | 306 +++++++++---------- 1 file changed, 153 insertions(+), 153 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index b54e52a..f3f16bd 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -11,8 +11,8 @@ Specifically, the following properties were checked: 2. If the library is implemented in Java, does it use the [Apache Maven](https://maven.apache.org/) project management tool? 3. Does the library exist as an OSGi bundle? Specifically, do releases contain an OSGi manifest? 4. Is an OSGi bundle of the library available for consumption from a [p2](https://www.eclipse.org/equinox/p2/) repository? -5. Does it provide the following features? If it does, does it also provide extension mechanisms for the feature? Which are the input and output data types or formats? - 1. Tokenization/segmentation +5. Does it provide the following features? If it does, does it also provide extension mechanisms for the feature? Which are the input and Output types or formats? + 1. Tokenization/
segmentation 2. Sentencing 3. Part-of-speech tagging 4. Contituency parsing @@ -35,16 +35,16 @@ The results of the survey are below.
-| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | -|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | [-SpacyTokenizer]()
[-PretrainedTransformerTokenizer]() | [Tokenizer Documentation]() | | | text as str | list[token] | -| Sentencing | X | -Sentence Splitter
-SpaCy Sentence Splitter | [Sentence-Splitter API]() | | | text as str | list[str] | -| POS-tagging | | | | X | [Train SentenceTaggerPredictor]() | sentence as str | dict[str, numpy.ndarray] | -| Constituency parsing | X | | [Constituency Parser Demo]() | | | | | -| Dependency parsing | X | | [Dependency Parser Demo]() | | | | | -| Named Entity Recognition | | | | X | [Train SentenceTaggerPredictor]() | sentence as str | dict[str, numpy.ndarray] | -| Functionalities extensible | | | | | | | | -| Can consume own models | X | | [Building own models in AllenNLP]() | | | | | +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- | :-------: | --------------------------------------------------------------------------------------------------------------------------- | --------------- | ------------------------ | +| Tokenization/
segmentation | X | [- Spacy Tokenizer]()
[- Pretrained Transformer Tokenizer]() | [Tokenizer Documentation]() | | | text as str | list[token] | +| Sentencing | X | - Sentence Splitter
- SpaCy Sentence Splitter | [Sentence Splitter API]() | | | text as str | list[str] | +| POS-tagging | | | | X | [Train Sentence Tagger Predictor]() | sentence as str | dict[str, numpy.ndarray] | +| Constituency parsing | X | | [Constituency Parser Demo]() | | | | | +| Dependency parsing | X | | [Dependency Parser Demo]() | | | | | +| Named Entity Recognition | | | | X | [Train Sentence Tagger Predictor]() | sentence as str | dict[str, numpy.ndarray] | +| Functionalities extensible | | | | | | | | +| Can consume own models | X | | [Building own models in AllenNLP]() | | | | |
@@ -62,16 +62,16 @@ The results of the survey are below.
-| | Has functionality | Multiple Options | Functionality documentation | Is Trainable | Training documentation | Input data | Output data | -|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation |X|[-Whitespace Tokenizer]()
[-Character Tokenizer]()
[-Maximum Entropy Tokenizer]() |[Tokenizer Documentation]()| X | [Tokenizer Training Documentation]() |text as string | -array of strings
-array of token spans| -| Sentencing|X|[-Newline Sentence-Splitter]()
[-Maximum Entropy Sentence-Splitter]() |[Sentence-Splitter Documentation]()| X | [Sentence-Splitter Training Documentation]() |text as string | -array of strings
-array of sentence spans | -| POS-tagging| X | | [POS-Tagger Documentation]() | x | [POS-Tagger Training Documentation]() | string array of tokens | string array of POS-tags | -| Constituency parsing | X | | [Constituency Parser Documentation]() | X | [Constituency Parser Documentation]()| string of whitespace tokenized sentence | array of OpenNLP's `Parse` type | -| Dependency parsing | | | | | | | | -| Named Entity Recognition | X | | [NER Documentaion]()| X | [NER Training Documentation]() | string array of tokens | array of name spans | -|Functionalities extensible|X| | [Extension writing Documentation]()| | | | -| Can consume own models | X | | [General API description]() | | | | | +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- | :-------: | ---------------------------------------------------------------------------------------------------------------------------- | --------------------------------------- | ------------------------------------------------- | +| Tokenization/
segmentation | X | [- Whitespace Tokenizer]()
[- Character Tokenizer]()
[- Maximum Entropy Tokenizer]() | [Tokenizer Documentation]() | X | [Tokenizer Training docs]() | text as string | - array of strings
- array of token spans | +| Sentencing | X | [- Newline Sentence Splitter]()
[- Maximum Entropy Sentence Splitter]() | [Sentence Splitter Documentation]() | X | [Sentence Splitter Training docs]() | text as string | - array of strings
- array of sentence spans | +| POS-tagging | X | | [POS Tagger Documentation]() | x | [POS Tagger Training docs]() | string array of tokens | string array of POS tags | +| Constituency parsing | X | | [Constituency Parser Documentation]() | X | [Constituency Parser Documentation]() | string of whitespace tokenized sentence | array of OpenNLP's `Parse` type | +| Dependency parsing | | | | | | | | +| Named Entity Recognition | X | | [NER Documentaion]() | X | [NER Training docs]() | string array of tokens | array of name spans | +| Functionalities extensible | X | | [Extension writing Documentation]() | | | | +| Can consume own models | X | | [General API description]() | | | | |
@@ -90,16 +90,16 @@ The results of the survey are below.
-| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | -|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | | [Tokenizer Documentation]() | | | input file in [raw format]() | output file in [line format]() | -| Sentencing | X | | [Sentence-Splitter Documentation]() | | | input file in [raw format]() | output file in [line format]() | -| POS-tagging | X | | | X | [POS-Tagger Documentation]() | input file in [raw format]() | output file in [tab separated values format]() | -| Constituency parsing | | | | | | | | -| Dependency parsing | X | | [DepParse Documentation]() | X | [General Training Documentation]() | input file in [raw format]() | output file in [tab separated values format]() | -| Named Entity Recognition | X | | [NER Documentation]() | X | [General Training Documentation]() | input file in [raw format]() | output file in [tab separated values format]() | -| Functionalities extensible | X | | [Configuration documentation]() | | | | | -| Can consume own models | X | | [How to add models]() | | | | | +| | Avail. | Multiple Options | Funct. 
docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ---------------- | -------------------------------------------------------------------------------------------------------------------------- | :-------: | ------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Tokenization/
segmentation | X | | [Tokenizer Documentation]() | | | input file in [raw format]() | output file in [line format]() | +| Sentencing | X | | [Sentence Splitter Documentation]() | | | input file in [raw format]() | output file in [line format]() | +| POS-tagging | X | | | X | [POS Tagger Documentation]() | input file in [raw format]() | output file in [tab separated values format]() | +| Constituency parsing | | | | | | | | +| Dependency parsing | X | | [Dep Parse Documentation]() | X | [General Training docs]() | input file in [raw format]() | output file in [tab separated values format]() | +| Named Entity Recognition | X | | [NER Documentation]() | X | [General Training docs]() | input file in [raw format]() | output file in [tab separated values format]() | +| Functionalities extensible | X | | [Configuration documentation]() | | | | | +| Can consume own models | X | | [How to add models]() | | | | |
@@ -119,16 +119,16 @@ The results of the survey are below.
-| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | -|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | | [General description]() | | | string of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each annotation | -| Sentencing | | | | | | | | -| POS-tagging | X | | [General description]() | | | string of raw text | `TextAnnotation` data structure with `.getView(Viewname)` for each annotation | -| Constituency parsing | X | | [General description]()
[Parser documentation]( ) | | | string of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each annotation | -| Dependency parsing | X | -Stanford NLP Parser
-CogComp Parser (requires POS-Tagger and Chunker as part of the Pipeline) | [-General description]()
[-Stanford Parser documentation]( ) | | | string of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each annotation | -| Named Entity Recognition | X (isn't part of the Pipeline) | | [NER Documentation]() | X | [Training Documentation]() | string of raw text | `TextAnnotation` data structure with `.getView(Viewname)` for each annotation | -| Functionalities extensible | X | | [Configuration documentation]() | | | | | -| Can consume own models | X | | [Use own Tokenizer]() | | | | | +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----------------------------: | ---------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------: | ----------------------------------------------------------------------------------------------------------------------- | ------------------ | ------------------------------------------------------------------------------------ | +| Tokenization/
segmentation | X | | [General description]() | | | string of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each annotation | +| Sentencing | | | | | | | | +| POS-tagging | X | | [General description]() | | | string of raw text | `TextAnnotation` data structure with `.getView(Viewname)` for each annotation | +| Constituency parsing | X | | [General description]()
[Parser documentation]( ) | | | string of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each annotation | +| Dependency parsing | X | - Stanford NLP Parser
- CogComp Parser (requires POS-Tagger and Chunker as part of the Pipeline) | [- General description]()
[- Stanford Parser documentation]( ) | | | string of raw text | `TextAnnotation` data structure with method `.getView(Viewname)` for each annotation | +| Named Entity Recognition | X (isn't part of the Pipeline) | | [NER Documentation]() | X | [Training docs]() | string of raw text | `TextAnnotation` data structure with `.getView(Viewname)` for each annotation | +| Functionalities extensible | X | | [Configuration documentation]() | | | | | +| Can consume own models | X | | [Use own Tokenizer]() | | | | |
@@ -147,16 +147,16 @@ The results of the survey are below.
-| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | -|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | | [ANNIE Tokenizer Documentation]( )| | | string of document text | output file | -| Sentencing | X | -ANNIE Default Sentence Splitter
-ANNIE Regex Sentence Splitter | [ANNIE Sentence Splitter Documentation]( ) | | | | -| POS-tagging | X | | [ANNIE POS-Tagger Documentation]() | | | | -| Constituency parsing | | | | | | | -| Dependency parsing | | | | | | | -| Named Entity Recognition | x | | [Gazeteer documentation]()
[-Semantic Tagger]() | | | | | -| Functionalities extensible | | | | | | | -| Can consume own models | | | | | | | +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----------------------: | --------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------: | ------------- | ----------------------- | ----------- | +| Tokenization/
segmentation | X | | [ANNIE Tokenizer Documentation]( ) | | | string of document text | output file | +| Sentencing | X | - ANNIE Default Sentence Splitter
- ANNIE Regex Sentence Splitter | [ANNIE Sentence Splitter Documentation]( ) | | | | +| POS-tagging | X | | [ANNIE POS Tagger Documentation]() | | | | +| Constituency parsing | | | | | | | +| Dependency parsing | | | | | | | +| Named Entity Recognition | X | | [Gazetteer documentation]()
[- Semantic Tagger]() | | | | | +| Functionalities extensible | | | | | | | +| Can consume own models | | | | | | |
@@ -174,16 +174,16 @@ The results of the survey are below.
-| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | -|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | [-IndoEuropeanTokenizer]()
[-CharacterTokenizer]()
[-RegExTokenizer]()
[-NGramTokenizer]()
[-LineTokenizerFactory]() | [Lingpipe Book]() Chapter 3 (p.33)
[Tokenizer API]()
[Tokenization API]( )| | | -string of text
-char[] array, startindex, endindex | array of token strings | -| Sentencing | X | | [Sentence-Splitter Tutorial]( )| X | | char[] array of text | sentences as set of chunks | -| POS-tagging | X | [-Chain CRF Tagger]()
[-Classifier Tagger]()
[-HMMDecoder]() | [-Lingpipe Book Chapter 11]()
[-POS-Tagger Tutorial]() | X | [POS-Tagger Tutorial Paragraph: Training Training Part-of-Speech Models]() | list of tokens as strings | [Tagging]() | -| Constituency parsing | | | | | | | -| Dependency parsing | | | | | | | | -| Named Entity Recognition | X | | [NER Documentation]()| | | | | -| Functionalities extensible | X | | [Paragraph 3. Evaluating and Tuning Tagging Models]() | | | | | -| Can consume own models | X | | [Developing and Tuning Sentence Models]() | | | | +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----------------------: | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------: | ----------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------- | ------------------------------------------------------------------------ | +| Tokenization/
segmentation | X | [- IndoEuropeanTokenizer]()
[- CharacterTokenizer]()
[- RegEx Tokenizer]()
[- NGram Tokenizer]()
[- Line Tokenizer Factory]() | [Lingpipe Book]() Chapter 3 (p.33)
[Tokenizer API]()
[Tokenization API]( ) | | | - string of text
- char[] array, startindex, endindex | array of token strings | +| Sentencing | X | | [Sentence Splitter Tutorial]( ) | X | | char[] array of text | sentences as set of chunks | +| POS-tagging | X | [- Chain CRF Tagger]()
[- Classifier Tagger]()
[- HMM Decoder]() | [- Lingpipe Book Chapter 11]()
[- POS-Tagger Tutorial]() | X | [POS Tagger Tutorial Paragraph: Training Part-of-Speech Models]() | list of tokens as strings | [Tagging]() | +| Constituency parsing | | | | | | | +| Dependency parsing | | | | | | | | +| Named Entity Recognition | X | | [NER Documentation]() | | | | | +| Functionalities extensible | X | | [Paragraph 3. Evaluating and Tuning Tagging Models]() | | | | | +| Can consume own models | X | | [Developing and Tuning Sentence Models]() | | | |
@@ -202,16 +202,16 @@ The results of the survey are below.
-| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | -|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | | | | | | | -| Sentencing | | | | | | | -| POS-tagging | X | X | [POS-Tagger Documentation]() | X|[POS-Tagger Training Documentation]() | | | -| Constituency parsing | | | | | | | -| Dependency parsing | X | X | [DepParse Documentation]() | | | - Filepath with POS-tagged Dataset in CONNLL-U format
-list of list of `ConllEntry`, where each entry represents a POS-tagged token and each nested list a sentence | | -| Named Entity Recognition | X | [Neural Tagger- CNNLSTM module]()
[Neural Tagger- IDCNN]()
[Transformer Token Classifier]() | [NER Documentation]() | X | [NER Training Documentation]() | | | -| Trainable models | X | | | | | | -| Can consume own models | X | | | | | | +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----------------------: | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------- | :--------------------------: | -------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------ | +| Tokenization/
segmentation | | | | | | | +| Sentencing | | | | | | | +| POS-tagging | X | X | [POS Tagger Documentation]() | X | [POS Tagger Training docs]() | | | +| Constituency parsing | | | | | | | +| Dependency parsing | X | X | [DepParse Documentation]() | | | - Filepath with POS-tagged Dataset in CoNLL-U format
- list of lists of `ConllEntry`, where each entry represents a POS-tagged token and each nested list a sentence | | +| Named Entity Recognition | X | [Neural Tagger- CNNLSTM module]()
[Neural Tagger- IDCNN]()
[Transformer Token Classifier]() | [NER Documentation]() | X | [NER Training docs]() | | | +| Trainable models | X | | | | | | +| Can consume own models | X | | | | | |
@@ -229,16 +229,16 @@ The results of the survey are below.
-| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | -|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | | [Tokenizer basic information]()
[Tokenizer Demo]() | X | [Basic Training information]() | -string line of text
-raw document text
tsv-format | list of `Token` types | -| Sentencing | | | | | | | | -| POS-tagging | X | | [POS-Tag basic information]() | X | [Basic Training information]() | -string line of text
-raw document text
tsv-format | output-file with annotations| -| Constituency parsing | | | | | | | | -| Dependency parsing | X | | [Dependency Parse basic information]() | X | [Basic Training information]() | -string line of text
-raw document text
tsv-format | output-file with annotations | -| Named Entity Recognition | X | | [NER Documentation]()| X | [Basic Training information]() | -string line of text
-raw document text
tsv-format | output-file with annotations | -| Functionalities extensible | | | | | | | | -| Can consume own models | | | | | | | | +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------: | -------------------------------------------------------------------------------------- | ------------------------------------------------------------ | ---------------------------- | +| Tokenization/
segmentation | X | | [Tokenizer basic information]()
[Tokenizer Demo]() | X | [Basic Training information]() | - string line of text
- raw document text
tsv-format | list of `Token` types | +| Sentencing | | | | | | | | +| POS-tagging | X | | [POS Tag basic information]() | X | [Basic Training information]() | - string line of text
- raw document text
tsv-format | output-file with annotations | +| Constituency parsing | | | | | | | | +| Dependency parsing | X | | [Dependency Parse basic information]() | X | [Basic Training information]() | - string line of text
- raw document text
tsv-format | output-file with annotations | +| Named Entity Recognition | X | | [NER Documentation]() | X | [Basic Training information]() | - string line of text
- raw document text
tsv-format | output-file with annotations | +| Functionalities extensible | | | | | | | | +| Can consume own models | | | | | | | |
@@ -257,16 +257,16 @@ The results of the survey are below.
-| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | -|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | [-Casual module]()
[-Destructive module]()
[-Regexp Tokenizer]()
[-Repp Tokenizer]()
[-Simple Tokenizer]()
[-Stanford Segmenter]()
[-TokTok Tokenizer]()
[-Penn Treebank Tokenizer]()| [-NLTK Tokenizer/Sentence Splitter Documentation]()
[-NLTK Book Chapter 3]() | | | str of text | list(str) of token | -| Sentencing | X | [-Punkt Sent Module]()
[-Regexp Tokenizer]()
[-Simple Tokenizer]() | [-NLTK Tokenizer/Sentence Splitter Documentation]()
[-NLTK Book Chapter 3]() |X | [-Punkt Sent Module]() | str of text | list(str) of sentences | -| POS-tagging | X | [-Brill Tagger]()
[-CRF Tagger]()
[-HMM Tagger]()
[-HunPos Tagger]()
[-Perceptron Tagger]()
[-Senna POS Tagger]()
[-Sequential Backoff Tagger]()
[-Stanford Tagger]()
[-TNT Tagger]()| [-NLTK Tagger Documentation]()
[-NLTK Book Chapter 5]() | X| [-Brill Tagger Training]()
[-CRF Tagger]()
[-HMM Tagger]()
[HunPos Tagger]()
[-Perceptron Tagger]()
[-Sequential Backoff Tagger]()
[-Stanford Tagger]()
[-TNT Tagger]() | -tokens (list(str))
-sentences (list(list(str))) | -list(tuple(str, str))
-list(list(tuple(str, str)) ) | -| Constituency parsing | X | [-Early Chart Parser]()
[-Recursive descent Parser]()
[-Shift Reduce Parser]()
[-Standford Parser]()
| [-NLTK Parser Documentation]()
[-NLTK Book - Chapter 8]() | | | sent (list(str))
sentences (list(list(str))) | -iter(Tree)
-iter(iter(Tree)) | -| Dependency parsing | X | [-CoreNLP Dependency Parser]()
[-Malt Parser]()
[-Nonprojective Dependency Parser]()
[-Projective Dependency Parser]()
[-Transition Parser]() | [-NLTK Parser Documentation]()
[-NLTK Book - Chapter 8 Pargraph 5]() | X | [-Malt Parser]()
[-Nonprojective Dependency Parser]()
[-Projective Dependency Parser]()
[-Transition Parser]() | -CoreNLP: sentences (list(str)) – input sentences to parse
-Malt: str sentences | CoreNLP: iter(iter(Tree))
Malt:iter(DependencyGraph) | -| Named Entity Recognition | X | | [-NLTK Book Chapter 7 Paragraph 5](< https://www.nltk.org/book/ch07.html>)
[-NE Chunker]()| X | [-NE Chunker]()| list of POS-tagged tokens | NE-tagged parse-tree | -| Functionalities extensible | | | | | | | | -| Can consume own models | X | | | | | | | +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------: | 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | ---------------------------------------------------------- | +| Tokenization/
segmentation | X | [- Casual module]()
[- Destructive module]()
[- Regexp Tokenizer]()
[- Repp Tokenizer]()
[- Simple Tokenizer]()
[- Stanford Segmenter]()
[- TokTok Tokenizer]()
[- Penn Treebank Tokenizer]() | [- NLTK Tokenizer/Sentence Splitter Documentation]()
[- NLTK Book Chapter 3]() | | | str of text | list(str) of token | +| Sentencing | X | [- Punkt Sent Module]()
[- Regexp Tokenizer]()
[- Simple Tokenizer]() | [- NLTK Tokenizer/Sentence Splitter Documentation]()
[- NLTK Book Chapter 3]() | X | [- Punkt Sent Module]() | str of text | list(str) of sentences | +| POS-tagging | X | [- Brill Tagger]()
[- CRF Tagger]()
[- HMM Tagger]()
[- HunPos Tagger]()
[- Perceptron Tagger]()
[- Senna POS Tagger]()
[- Sequential Backoff Tagger]()
[- Stanford Tagger]()
[- TNT Tagger]() | [- NLTK Tagger Documentation]()
[- NLTK Book Chapter 5]() | X | [- Brill Tagger Training]()
[- CRF Tagger]()
[- HMM Tagger]()
[- HunPos Tagger]()
[- Perceptron Tagger]()
[- Sequential Backoff Tagger]()
[- Stanford Tagger]()
[- TNT Tagger]() | - tokens (list(str))
- sentences (list(list(str))) | - list(tuple(str, str))
- list(list(tuple(str, str)) ) | +| Constituency parsing | X | [- Early Chart Parser]()
[- Recursive descent Parser]()
[- Shift Reduce Parser]()
[- Stanford Parser]()
| [- NLTK Parser Documentation]()
[- NLTK Book - Chapter 8]() | | | sent (list(str))
sentences (list(list(str))) | - iter(Tree)
- iter(iter(Tree)) | +| Dependency parsing | X | [- CoreNLP Dependency Parser]()
[- Malt Parser]()
[- Nonprojective Dependency Parser]()
[- Projective Dependency Parser]()
[- Transition Parser]() | [- NLTK Parser Documentation]()
[- NLTK Book - Chapter 8 Paragraph 5]() | X | [- Malt Parser]()
[- Nonprojective Dependency Parser]()
[- Projective Dependency Parser]()
[- Transition Parser]() | - CoreNLP: sentences (list(str)) – input sentences to parse
- Malt: str sentences | CoreNLP: iter(iter(Tree))
Malt: iter(DependencyGraph) | +| Named Entity Recognition | X | | [- NLTK Book Chapter 7 Paragraph 5](<https://www.nltk.org/book/ch07.html>)
[- NE Chunker]() | X | [- NE Chunker]() | list of POS-tagged tokens | NE-tagged parse-tree | +| Functionalities extensible | | | | | | | | +| Can consume own models | X | | | | | | |
@@ -285,16 +285,16 @@ The results of the survey are below.
-| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | -|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | |[General documentation]() | | | string of text | nested list : [sentences[chunks[tokens]]] | -| Sentencing | X | | [General documentation]()| | | string of text | nested list : [sentences[chunks[tokens]]] | | | -| POS-tagging | X | | [General documentation]() | | | string of text | nested list : [sentences[chunks[tokens]]] | -| Constituency parsing | | | | | | | | -| Dependency parsing | | | | | | | | -| Named Entity Recognition | | | | | | | | -| Functionalities extensible | | | | | | | -| Can consume own models | | | | | | | | +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ---------------- | ---------------------------------------------------------------- | :-------: | ------------- | -------------- | ----------------------------------------- | +| Tokenization/
segmentation | X | | [General documentation]() | | | string of text | nested list : [sentences[chunks[tokens]]] | +| Sentencing | X | | [General documentation]() | | | string of text | nested list : [sentences[chunks[tokens]]] | | | +| POS-tagging | X | | [General documentation]() | | | string of text | nested list : [sentences[chunks[tokens]]] | +| Constituency parsing | | | | | | | | +| Dependency parsing | | | | | | | | +| Named Entity Recognition | | | | | | | | +| Functionalities extensible | | | | | | | +| Can consume own models | | | | | | | |
@@ -317,17 +317,17 @@ The results of the survey are below.
-| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | -|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | | [Tokenizer Documentaion]( )| | |raw document text | SpaCy's `Doc` type - tokens accessible through `token.text` property | -| Sentencing | X | | [Sentencizer Documentation]() | | | SpaCy's `Doc` type | SpaCy's `Doc` type - sentence access through `doc.sents` property | -| POS-tagging | X | | [POS-Tagger Documentation]() |X | [Training Basics]()
[Updating POS-Tagger]() | Spacy's `Doc` type |Spacy's `Doc` type - POS-Tags access through`token.pos_` attribute| -| Constituency parsing | | | | | | | | -| Dependency parsing | X | | [Parser Documentation]() | X| [Training Basics]()
[Updating Dependencs Parser]() | Spacy's `Doc` type | Spacy's `Doc` type - Parse-tags access through `token.dep_` attribute | -| Named Entity Recognition | X | | [NER Documentation]()| | |SpaCy's `Doc` type | SpaCy's `Doc` type - entity-tags access through `doc.ents` property| -| Functionalities extensible | X | | [Extend Tokenizer]()
[Customize Tokenizer]()
[Custom Component]() | | | | -| Can consume own models | (X) | | [Load different Tokenizer]() - | | | | +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------: | --------------------------------------------------------------------------------------------------------------------------------------------- | ------------------ | --------------------------------------------------------------------- | +| Tokenization/
segmentation | X | | [Tokenizer Documentation]( ) | | | raw document text | SpaCy's `Doc` type - tokens accessible through `token.text` property | +| Sentencing | X | | [Sentencizer Documentation]() | | | SpaCy's `Doc` type | SpaCy's `Doc` type - sentence access through `doc.sents` property | +| POS-tagging | X | | [POS Tagger Documentation]() | X | [Training Basics]()
[Updating POS Tagger]() | SpaCy's `Doc` type | SpaCy's `Doc` type - POS-Tags access through `token.pos_` attribute | +| Constituency parsing | | | | | | | | +| Dependency parsing | X | | [Parser Documentation]() | X | [Training Basics]()
[Updating Dependency Parser]() | SpaCy's `Doc` type | SpaCy's `Doc` type - Parse-tags access through `token.dep_` attribute | +| Named Entity Recognition | X | | [NER Documentation]() | | | SpaCy's `Doc` type | SpaCy's `Doc` type - entity-tags access through `doc.ents` property | +| Functionalities extensible | X | | [Extend Tokenizer]()
[Customize Tokenizer]()
[Custom Component]() | | | | +| Can consume own models | (X) | | [Load different Tokenizer]() | +| | | |
@@ -345,16 +345,16 @@ The results of the survey are below.
-| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | -|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | [^2] | [Tokenizer Documentation]() | | | `Document` type | `Token` types | -| Sentencing | X | [^2] | [Sentence Detector Documentation]() | | | `Document` type | `Sentence` types | -| POS-tagging | X | [^2] | [POS-Tagger Documentation]() | X | [POS Training Documentation]()
[General Training Documentation]()| `Document` type
`Token`types | `POS` Type | -| Constituency parsing | | | | | | | | -| Dependency parsing | X | [-Typed Dependency Parser]()
[Untyped Dependency Parser]() | [Dependency Parser Documentation]()| X | [General Training Documentation]() | `Document` type
`POS` type
`Token` type | (unlabeled) `Dependency` types | -| Named Entity Recognition | X | [-NER CRF]()
[-NER DL]()| [NER Documentation](<>)| X | [-NER CRF]()
[-NER DL]() | | | -| Functionalities extensible | X | | [Manipulating Pipelines]() | | | `Document` type | `Named_Entity` types | -| Can consume own models | X | | [General Concept Documentation]() | | | | | +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------- | :-------: | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------- | ------------------------------ | +| Tokenization/
segmentation | X | [^2] | [Tokenizer Documentation]() | | | `Document` type | `Token` types | +| Sentencing | X | [^2] | [Sentence Detector Documentation]() | | | `Document` type | `Sentence` types | +| POS-tagging | X | [^2] | [POS Tagger Documentation]() | X | [POS Training docs]()
[General Training docs]() | `Document` type
`Token` types | `POS` type | +| Constituency parsing | | | | | | | | +| Dependency parsing | X | [- Typed Dependency Parser]()
[Untyped Dependency Parser]() | [Dependency Parser Documentation]() | X | [General Training docs]() | `Document` type
`POS` type
`Token` type | (unlabeled) `Dependency` types | +| Named Entity Recognition | X | [- NER CRF]()
[- NER DL]() | [NER Documentation](<>) | X | [- NER CRF]()
[- NER DL]() | | | +| Functionalities extensible | X | | [Manipulating Pipelines]() | | | `Document` type | `Named_Entity` types | +| Can consume own models | X | | [General Concept Documentation]() | | | | |
@@ -374,16 +374,16 @@ The results of the survey are below.
-| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | -|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | [-Abstract Tokenizer]()
[-CHBT Tokenizer]()
[-Lexer Tokenizer]()
[-NegraPennTokenizer]()
[-PennTreebankTokenizer]()
[-PTBTokenizer]()
[-Robust Tokenizer]()
[-WhitespaceTokenizer]() | [Tokenizer Documentation]() | | | -string of document text
-CoreNLPs `CoreDocument` (Instatiated with string document text) | -list of strings
-list of characteroffsetbegin indices
-list of characteroffsetendindices
`CoreDocument` with previous annotation properties | -| Sentencing | X | | [Sentencer Documentation]() | | | tokenized `CoreDocument` | `CoreDocument` with Sentence List of POS-Tags as property | -| POS-tagging | X | | [POS-Tag Documentation]() | X | | tokenized and sentence-splitted `CoreDocument` | `CoreDocument` with String List of POS-Tags as property | -| Constituency parsing | X | [Viterbi Parser]()
[Shift reduce Parser]()
[Iterative CKYPCFG Parser]()
[Fast Factored Parser]()
[Exhaustive PCFG Parser]() | [Constituency Parser Documentation]() | | | tokenized, sentence-splitted (and for some models POS-tagged) `CoreDocument` | `CoreDocument` with TreeAnnotation (exact form depends on chosen parser) | -| Dependency parsing | X |[BiLexPCFGParser]()
[Exhaustive Dependency Parser]() |[DepParse Documentation]() | X | [Train own Model]() |tokenized, sentence-splitted and POS-tagged `CoreDocument` | `CoreDocument` with DependencyAnnotation (exact form depends on chosen parser) | -| Named Entity Recognition | X | [NER Classifier Combiner]()
[-RegexNERAnnotator]() | [NER Documentation]()| X | [NER Training Documentation]() | tokenized, ssplitted, pos-tagged, (lemmatized) `CoreDocument` | `CoreDocument` with `NamedEntityTagAnnotation` or `NormalizedNamedEntityTagAnnotation`| -| Functionalities extensible | X | [Custom annotator]() | | | | -| Can consume own models | X | | [Example of including own (caseless) model]() | | | | +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------- | :-------: | ------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- | 
----------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Tokenization/
segmentation | X | [- Abstract Tokenizer]()
[- CHBT Tokenizer]()
[- Lexer Tokenizer]()
[- Negra Penn Tokenizer]()
[- Penn Treebank Tokenizer]()
[- PTB Tokenizer]()
[- Robust Tokenizer]()
[- WhitespaceTokenizer]() | [Tokenizer Documentation]() | | | - string of document text
- CoreNLP's `CoreDocument` (instantiated with string document text) | - list of strings
- list of character offset begin indices
- list of character offset end indices
`CoreDocument` with previous annotation properties | +| Sentencing | X | | [Sentencer Documentation]() | | | tokenized `CoreDocument` | `CoreDocument` with Sentence List of POS-Tags as property | +| POS-tagging | X | | [POS Tag Documentation]() | X | | tokenized and sentence-splitted `CoreDocument` | `CoreDocument` with String List of POS-Tags as property | +| Constituency parsing | X | [Viterbi Parser]()
[Shift reduce Parser]()
[Iterative CKYPCFG Parser]()
[Fast Factored Parser]()
[Exhaustive PCFG Parser]() | [Constituency Parser Documentation]() | | | tokenized, sentence-splitted (and for some models POS-tagged) `CoreDocument` | `CoreDocument` with TreeAnnotation (exact form depends on chosen parser) | +| Dependency parsing | X | [BiLexPCFGParser]()
[Exhaustive Dependency Parser]() | [Dep Parse Documentation]() | X | [Train own Model]() | tokenized, sentence-splitted and POS-tagged `CoreDocument` | `CoreDocument` with DependencyAnnotation (exact form depends on chosen parser) | +| Named Entity Recognition | X | [NER Classifier Combiner]()
[- Regex NER Annotator]() | [NER Documentation]() | X | [NER Training docs]() | tokenized, ssplitted, pos-tagged, (lemmatized) `CoreDocument` | `CoreDocument` with `Named Entity Tag Annotation` or `Normalized Named Entity Tag Annotation` | +| Functionalities extensible | X | [Custom annotator]() | | | | +| Can consume own models | X | | [Example of including own (caseless) model]() | | | |
@@ -401,16 +401,16 @@ The results of the survey are below.
-| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | -|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | -Simple tokenizer
-pattern tokenizer | [Tokenization Documentation](<>) | X | [Tokenizer Training Documentation]() | string of raw text | CoNLL format | -| Sentencing | X | | | X | [Sentence-Splitter Training Documentation]() | string of raw text | CoNLL format | -| POS-tagging | X | | | X | [POS-Tagger Training Documentation]() | String of raw text | CoNLL format | -| Constituency parsing | | | | | | | -| Dependency parsing | X | |[DepParser Documentation (under construction)]() | X | [DepParser Training Documentation]() | string of raw text | CoNLL format | -| Named Entity Recognition | | | | | | | | -| Functionalities extensible | X | | [Advanced Usage]() | | | | | -| Can consume own models | | | | | | | | +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | ------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------- | :-------: | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------ | ------------ | +| Tokenization/
segmentation | X | - Simple tokenizer
- pattern tokenizer | [Tokenization Documentation](<>) | X | [Tokenizer Training docs]() | string of raw text | CoNLL format | +| Sentencing | X | | | X | [Sentence Splitter Training docs]() | string of raw text | CoNLL format | +| POS-tagging | X | | | X | [POS Tagger Training docs]() | String of raw text | CoNLL format | +| Constituency parsing | | | | | | | +| Dependency parsing | X | | [DepParser Documentation (under construction)]() | X | [DepParser Training docs]() | string of raw text | CoNLL format | +| Named Entity Recognition | | | | | | | | +| Functionalities extensible | X | | [Advanced Usage]() | | | | | +| Can consume own models | | | | | | | | @@ -430,16 +430,16 @@ The results of the survey are below.
-| | Has functionality | Multiple Options | Functionality documentation | Is Trainable| Training documentation | Input data | Output data | -|---------------------------|:--------------------------:|--------------------------|-----------------------------|:----------:|--------------------------------------|-------------|-------------| -| Tokenization/segmentation | X | | [Tokenization Tutorial]()
[Advanced Tokenization Documentation]() | | | string of raw text | `TextBlob` data type - access through `words` property: WordList of word strings | -| Sentencing | X | | [Sentence-Splitting Tutorial]()
[Advanced Sentence-Splitting Documentation]() | | | string of raw text | `TextBlob` data type - access through `sentences` property: list of sentence objects| -| POS-tagging | X | -PatternTagger
-NLTKTagger | [POS-Tagger Tutorial]()
[POS-Tagger Advanced Usage]() | | | string of raw text | `TextBlob` data type - access through `tags` property: List of word string tag string tuples | -| Constituency parsing | X | | [Parser Tutorial]()
[Parser Advanced Usage]()| | | String of raw text | `TextBlob` data type - access through `parse()` method: TaggedString | -| Dependency parsing | X | |[Parser Tutorial]()
[Parser Advanced Usage]()| | |String of raw text | `TextBlob` data type - access through `parse()` method: TaggedString | -| Named Entity Recognition | | | | | | | | -| Functionalities extensible | | | | | | | | -| Can consume own models | X | | [Passing models into the Pipeline]()
[Training own data]() | | | | | +| | Avail. | Multiple Options | Funct. docs | Trainable | Training docs | Input | Output | +| ------------------------------ | :----: | -------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------: | ------------- | ------------------ | -------------------------------------------------------------------------------------------- | +| Tokenization/
segmentation | X | | [Tokenization Tutorial]()
[Advanced Tokenization Documentation]() | | | string of raw text | `TextBlob` data type - access through `words` property: WordList of word strings | +| Sentencing | X | | [Sentence Splitting Tutorial]()
[Advanced Sentence Splitting Documentation]() | | | string of raw text | `TextBlob` data type - access through `sentences` property: list of sentence objects | +| POS-tagging | X | - PatternTagger
- NLTKTagger | [POS Tagger Tutorial]()
[POS Tagger Advanced Usage]() | | | string of raw text | `TextBlob` data type - access through `tags` property: List of word string tag string tuples | +| Constituency parsing | X | | [Parser Tutorial]()
[Parser Advanced Usage]() | | | String of raw text | `TextBlob` data type - access through `parse()` method: TaggedString | +| Dependency parsing | X | | [Parser Tutorial]()
[Parser Advanced Usage]() | | | String of raw text | `TextBlob` data type - access through `parse()` method: TaggedString | +| Named Entity Recognition | | | | | | | | +| Functionalities extensible | | | | | | | | +| Can consume own models | X | | [Passing models into the Pipeline]()
[Training own data]() | | | | | From 7e93bbbc9198ed306d1d48510fdaa3199e677b32 Mon Sep 17 00:00:00 2001 From: Stephan Druskat Date: Wed, 24 Feb 2021 20:03:01 +0100 Subject: [PATCH 38/39] Externalize library links and add conclusion - Links to the libraries' websites made it impossible to link to sections internally, therefore all links have been moved to a block below the section headings --- src/architecture/extensibility/nlp-survey.md | 78 ++++++++++++++++---- 1 file changed, 63 insertions(+), 15 deletions(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index f3f16bd..336a821 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -22,7 +22,27 @@ Specifically, the following properties were checked: The results of the survey are below. -### [AllenNLP]() +## Conclusion + +The survey helped us to reduce the number of suitable candidates for integration in Hexatomic to four. +In a first step we eliminated libraries based on their usefulness as all-purpose libraries and feature set, their implementation language, their development status and up-to-dateness, and our own experiences. +Thus, we excluded most Python-based libraries, with the exception of NLTK and SpaCy, of which we eliminated NLTK due to its organically grown ecosystemic nature and the integration impracticalities this would bring with it. +SpaCy remained included as it promised implementations closer to the state of the art. + +This left us with four potential candidates: [Apache OpenNLP](#apache-opennlp), [SpaCy](#spacy), [Spark NLP](#spark-nlp), and [Stanford CoreNLP](#stanford-corenlp). + +Of those, the only candidate in Java that is feature-complete was Stanford CoreNLP: +Apache OpenNLP did not seem to support dependency parsing, SparkNLP as well as SpaCy did not seem to support constituency parsing (directly). + +In the end, we settled on integration of Stanford CoreNLP in Hexatomic. 
+It seems to cater best for the core target group of early versions of Hexatomic, i.e., users with a strong linguistic rather than NLP background. +It also seemed architecturally easiest to integrate, as we will not have to bridge between Java dn Python code at this stage. + +When Stanford CoreNLP is successfully integrated in Hexatomic, we will, however, attempt to integrate another, more state-of-the-art library, for which SpaCy and the unsurveyed [flairNLP](https://github.com/flairNLP/flair) seem to be best suited. + +### AllenNLP + +> [AllenNLP website](https://allennlp.org/) 1. [ ] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java @@ -50,7 +70,9 @@ The results of the survey are below.
-### [Apache OpenNLP]() +### Apache OpenNLP + +> [Apache OpenNLP website](https://opennlp.apache.org/) 1. [x] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java @@ -77,7 +99,9 @@ The results of the survey are below. -### [Clear NLP]() +### Clear NLP + +> [Clear NLP website]() 1. [X] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java @@ -106,7 +130,9 @@ The results of the survey are below. -### [CogComp NLP Pipeline]() +### CogComp NLP Pipeline + +> [CogComp NLP Pipeline website]() 1. [X] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java @@ -134,7 +160,9 @@ The results of the survey are below.
-### [GATE & ANNIE]() +### GATE & ANNIE + +> [GATE & ANNIE website]() 1. [X] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java @@ -161,7 +189,9 @@ The results of the survey are below.
-### [Lingpipe]() +### Lingpipe + +> [Lingpipe website]() 1. [X] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java @@ -189,7 +219,9 @@ The results of the survey are below.
-### [NLP Architect]() +### NLP Architect + +> [NLP Architect website]() 1. [ ] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java @@ -216,7 +248,9 @@ The results of the survey are below.
-### [NLP4J]() +### NLP4J + +> [NLP4J website]() 1. [x] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java @@ -244,7 +278,9 @@ The results of the survey are below.
-### [NLTK]() +### NLTK + +> [NLTK website]() 1. [ ] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java @@ -272,7 +308,9 @@ The results of the survey are below.
-### [Pattern]() +### Pattern + +> [Pattern website]() 1. [ ] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java @@ -304,7 +342,9 @@ The results of the survey are below. -### [SpaCy]() +### SpaCy + +> [SpaCy website]() 1. [ ] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java @@ -332,7 +372,9 @@ The results of the survey are below.
-### [Spark NLP]() +### Spark NLP + +> [Spark NLP website]() 1. [ ] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java @@ -361,7 +403,9 @@ The results of the survey are below. [^2]:Spark NLP provides a [library of pretrained Pipelines]() and a [library of models](). This survey however refers to the general Annotators. -### [Stanford CoreNLP]() +### Stanford CoreNLP + +> [Stanford CoreNLP website]() 1. [X] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java @@ -388,7 +432,9 @@ The results of the survey are below.
-### [Talismane]() +### Talismane + +> [Talismane website]() 1. [X] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java @@ -417,7 +463,9 @@ The results of the survey are below.
-### [TextBlob]() +### TextBlob + +> [TextBlob website]() 1. [ ] Implemented in Java 1. [ ] Not Java, but API can be addressed from Java From 20a8c294f7fc76b953b6aa8f3d8d8608f96bc64d Mon Sep 17 00:00:00 2001 From: Stephan Druskat Date: Wed, 3 Mar 2021 21:16:21 +0100 Subject: [PATCH 39/39] Fix typo Co-authored-by: Thomas Krause --- src/architecture/extensibility/nlp-survey.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/architecture/extensibility/nlp-survey.md b/src/architecture/extensibility/nlp-survey.md index 336a821..36714d2 100644 --- a/src/architecture/extensibility/nlp-survey.md +++ b/src/architecture/extensibility/nlp-survey.md @@ -36,7 +36,7 @@ Apache OpenNLP did not seem to support dependency parsing, SparkNLP as well as S In the end, we settled on integration of Stanford CoreNLP in Hexatomic. It seems to cater best for the core target group of early versions of Hexatomic, i.e., users with a strong linguistic rather than NLP background. -It also seemed architecturally easiest to integrate, as we will not have to bridge between Java dn Python code at this stage. +It also seemed architecturally easiest to integrate, as we will not have to bridge between Java and Python code at this stage. When Stanford CoreNLP is successfully integrated in Hexatomic, we will, however, attempt to integrate another, more state-of-the-art library, for which SpaCy and the unsurveyed [flairNLP](https://github.com/flairNLP/flair) seem to be best suited.