TextSentencer is a simple rule-based tool for segmenting text into sentences.
http://bionlp.dbcls.jp/text_sentencer
Ruby version 1.9.2 or above
Use Ruby's gem command to download and install text_sentencer.
> gem install text_sentencer

> echo "This is a sentence. This is another." | text_sentencer
This is a sentence.
This is another.

or

> text_sentencer filename

or

> text_sentencer < filename

> echo "This is a sentence" | text_sentencer -c custom_rules.json

To get the result in JSON (in PubAnnotation scheme)
> echo "This is a sentence. This is another." | text_sentencer -j
{
  "text": "This is a sentence. This is another.\n",
  "denotations": [
    {
      "span": {"begin": 0, "end": 19},
      "obj": "Sentence"
    },
    {
      "span": {"begin": 20, "end": 36},
      "obj": "Sentence"
    }
  ]
}
#!/usr/bin/env ruby
require 'text_sentencer'
text = "This is a sentence. This is another."
sentencer = TextSentencer.new
annotation = sentencer.annotate(text)
annotation[:denotations].each do |d|
  span = d[:span]
  puts text[span[:begin]...span[:end]]
end

The rule system of text_sentencer consists of four components.
- In a text, every position that matches the break_pattern gets a sentence break.
- In a text, every position that matches the candidate_pattern is regarded as a candidate for a sentence break.
- For each break candidate, each rule in positive_rules is tested. If a matching rule is found, the candidate gets a sentence break.
  - Each rule consists of two regular expressions (in PCRE syntax).
  - The first RE is applied to the string preceding the break candidate. '\Z' is automatically appended to the first RE, to indicate the end of the string.
  - The second RE is applied to the string following the break candidate. '\A' is automatically prepended to the second RE, to indicate the beginning of the string.
- For each break candidate that gets a sentence break by a positive rule, each rule in negative_rules is tested. If a matching rule is found, the sentence break is cancelled.
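
The four steps above can be sketched in plain Ruby as follows. This is a simplified illustration, not the gem's actual implementation: the names RULES, rule_match?, sketch_breaks and sketch_sentences are hypothetical, and the rule set is a small subset chosen for the example.

```ruby
# Hypothetical, simplified re-implementation of the four-step rule logic.
# It is NOT the gem's code; all names below are made up for illustration.
RULES = {
  break_pattern: /([ \t]*\n+)+[ \t]*/,
  candidate_pattern: /[ \t]+/,
  positive_rules: [['[.!?]', '[0-9A-Z]']],
  negative_rules: [['(Dr|Mr)\.', '[A-Z][a-z]']]
}

# True if any [before, after] rule matches around the candidate.
# '\Z' is appended to the first RE and '\A' is prepended to the second.
def rule_match?(rules, before, after)
  rules.any? do |pre, post|
    before =~ /(?:#{pre})\Z/ && after =~ /\A(?:#{post})/
  end
end

# Returns the [begin, end) offsets of every accepted sentence break.
def sketch_breaks(text, rules)
  breaks = []
  # Step 1: every match of break_pattern is a break.
  text.scan(rules[:break_pattern]) { breaks << Regexp.last_match.offset(0) }
  # Steps 2-4: test each candidate against positive, then negative rules.
  text.scan(rules[:candidate_pattern]) do
    b, e = Regexp.last_match.offset(0)
    next if breaks.any? { |bb, be| b >= bb && e <= be } # already inside a break
    before = text[0...b]
    after = text[e..-1]
    next unless rule_match?(rules[:positive_rules], before, after)
    next if rule_match?(rules[:negative_rules], before, after)
    breaks << [b, e]
  end
  breaks.sort
end

# Cuts the text at the accepted breaks.
def sketch_sentences(text, rules)
  sentences = []
  start = 0
  sketch_breaks(text, rules).each do |b, e|
    sentences << text[start...b] if b > start
    start = e
  end
  sentences << text[start..-1] if start < text.length
  sentences
end

# Prints three sentences, one per line; "Dr. Smith" is not split.
puts sketch_sentences("We thank Dr. Smith. He helped us! Thanks.", RULES)
```

In this sketch the space after "Dr." passes the positive rule but is then cancelled by the negative rule, which is the interplay the last list item describes.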
The default rules were obtained by analyzing the GENIA corpus. Therefore, they will work best for PubMed articles.
Note that each string in positive_rules and negative_rules represents a regular expression in Perl syntax. In a Perl RE, a dot ('.') character is a wildcard that matches any single character. To represent a literal dot character, it has to be escaped by a preceding backslash ('\') character. When a RE is stored in a string, the backslash character has to be escaped again. That is why some dot characters are double escaped in some rules, e.g. "(Sr|Jr)\\.".
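
The effect of the double escaping can be checked with Ruby's standard json library. The snippet below is only an illustration under that assumption; it does not show how text_sentencer itself loads its rules, and the sample strings are made up.

```ruby
require 'json'

# What a rule file would contain: "(Sr|Jr)\\."
# (a single-quoted heredoc keeps the backslashes exactly as typed)
json_text = <<'JSON'
["(Sr|Jr)\\.", "[A-Z][a-z]"]
JSON

before_re, _after_re = JSON.parse(json_text)
# After JSON parsing a single backslash remains, i.e. the RE (Sr|Jr)\.
puts before_re

escaped   = Regexp.new(before_re + '\Z')  # dot is a literal dot
unescaped = Regexp.new('(Sr|Jr).\Z')      # dot is a wildcard

literal_hit  = !(escaped =~ 'Smith Jr.').nil?    # true
literal_miss = !(escaped =~ 'Smith Jrx').nil?    # false
wildcard_hit = !(unescaped =~ 'Smith Jrx').nil?  # true: '.' matched 'x'
puts literal_hit, literal_miss, wildcard_hit
```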
{
  // any sequence of whitespace characters with a new line character gets a sentence break
  "break_pattern": "([ \t]*\n+)+[ \t]*",
  // a space or a tab character, or any sequence of them, is a candidate for a sentence break
  "candidate_pattern": "[ \t]+",
  "positive_rules": [
    ["[.!?]", "[0-9A-Z]"],
    ["[:]", "[0-9]"],
    ["[:]", "[A-Z][a-z]"]
  ],
  "negative_rules": [
    // Titles which usually appear before names
    ["(Mrs|Mmes|Mr|Messrs|Ms|Prof|Dr|Drs|Rev|Hon|Sen|St)\\.", "[A-Z][a-z]"],
    // Titles which sometimes appear before names
    ["(Sr|Jr)\\.", "[A-Z][a-z]"],
    // Abbreviations, e.g. middle names
    ["\\b[A-Z][a-z]*\\.", "[0-9A-Z]"],
    // Frequent abbreviations which will never appear at the end of a sentence
    ["(cf|vs)\\.", ""],
    ["e\\.g\\.", ""],
    ["i\\.e\\.", ""],
    ["(Sec|Chap|Fig|Eq)\\.", "[0-9A-Z]"]
  ]
}
Below is an example of custom rules, which simply break at every whitespace sequence that follows a punctuation mark and is followed by a capitalized word.
Note that negative_rules is not defined in the example. In that case, it will be set to be empty.
{
  // a sequence with double new line characters is an indicator of a sentence break
  "break_pattern": "([ \t\n]*\n\n)+[ \t\n]*",
  // any whitespace character is a candidate for a sentence break
  "candidate_pattern": "[ \t\n]+",
  "positive_rules": [
    ["[.!?]", "[0-9A-Z][a-z]"]
  ]
}
Released under the MIT license.