Text-To-Relations: a tool for Information and Relation Extraction
First, Text-To-Relations lets you perform entity recognition by providing a simple interface for creating complex regular expressions.
- If you regularly find yourself writing complex regular expressions, the project can be used as a utility for quickly writing a first draft and easily testing it out.
Then you can combine these entities to form relations with Text-To-Relations's abstract base class and the working example provided.
Text-To-Relations has been tested on:
- Python 3.9.18 and Python 3.11.6 on MacOS Sequoia 15.2
- Python 3.10.12 on Ubuntu 22
We expect it to work without issue in many other environments as well.
Source code at GitHub text_to_relations.
After setting up your virtual environment, install Text-To-Relations with this Pip command.
pip install text_to_relations
If your Python version is less than 3.11, upgrade pip and install typing_extensions:
pip install --upgrade pip
pip install typing_extensions
Text-To-Relations also requires Spacy. Install the Spacy dependencies with these commands. As far as we know, Text-To-Relations is agnostic with regard to the specific Spacy version.
pip install -U spacy
Text-to-Relations has been tested with the en_core_web_lg
Spacy English language model. Install it in this way:
python -m spacy download en_core_web_lg
If you already have the en_core_web_trf
English language model installed, you may be able to use that instead of en_core_web_lg
under Python 3.10 and 3.11. However, it will not work under Python 3.9.
For this example we're going to write rules to extract postage stamp descriptions as found at https://www.mysticstamp.com/.
Here is Python code to create an input string containing the descriptions for several 19th Century U.S. postage stamps.
import inspect
from text_to_relations.relation_extraction.RegexString import RegexString
input = \
"""
# 11A - 1853-55 3¢ George Washington, dull red, type II, imperf
# 17 - 1851 12c Washington imperforate, black
# 12 - 1856 5c Jefferson, red brown, type I, imperforate
# 18 - 1861 1c Franklin, type I, perf 15
# 40 - 1875 1c Franklin, bright blue
# 42 - 1875 5c Jefferson, orange brown
# 62B - 1861 10c Washington, dark green
input = inspect.cleandoc(input)
"""
We'll break the first description up into its constituent parts:
- id: # 11A
- year_issued: 1853-55
- value: 3
- denomination: cents
- theme: George Washington
- color: dull red
- type: II
- perforation_info: imperforate
Let's pretend that our goal is to write rules to successfully process all the data at the Mystic Stamp website. We hope that writing rules for these examples will give us a good start.
With Text-To-Relations, the go-to tool is RegexString, which is essentially a wrapper around your basic regular expression. The wrapper lets you construct highly useful, working regular expressions with little or no fuss--and requiring from you a minimum amount of understanding the arcane details of regular expressions.
The simplest interface is RegexString's constructor, which with default parameters just takes a list of strings which you want to be considered "synonymous" in some way, that is, as different examples of the same kind of thing.
We'll use the type entity to begin our demonstration of RegexString. In the text above you see that there are only two--"type I" and "type II". But let's suppose that, in the complete data on the website:
- First, that sometimes the word type is capitalized and sometimes it is all-lowercase.
- Second, that the types run from Roman numeral I through V.
type_markers = ['type', 'Type']
type_rs = RegexString(type_markers, whole_word=True)
print(f"Regular expression for 'type_rs': {type_rs.get_regex_str()}\n")
roman_numerals = ['I', 'II', 'III', 'IV', 'V']
roman_nums_rs = RegexString(roman_numerals, whole_word=True)
print(f"Regular expression for 'roman_nums_rs': {roman_nums_rs.get_regex_str()}")
If you ran the code above you would see this as output:
Regular expression for 'type_rs': \b(?:type|Type)\b
Regular expression for 'roman_nums_rs': \b(?:III|II|IV|I|V)\b
Both "\btype\b" and "\b(?:III|II|IV|I|V)\b" are valid regular expressions that you could use with the Python re
regular expression library, for example re.finditer()
or any other one of the library's functions.
Both regular expressions begin and end with '\b,' which is regex-speak for "whole word only." This is controlled by the whole_word
parameter, whose default value is False.
The meat of the regex for roman_nums_rs
consists of the five Roman numerals separated by "|" (meaning OR), which will match on any one of the five if found in a piece of text. Some other things to note here:
- The regular expression sorts the OR'd numerals by length so that, for example, if "II" occurs in the text you will get a single match on the entire string rather than two matches on "I." (Note that this is only an issue when
whole_word
is set to False.) - The scary looking "?:" at the start of the block tells the regular expression matcher not to perform grouping on the match. If you don't know what this means--don't worry, you don't need to know. For most natural language processing tasks (as opposed to some text processing tasks), you don't need grouping. Text-To-Relations employs non-capturing by default so that some of the other functionality that builds on RegexString objects can assume the associated regular expression is non-capturing. (If you must have capturing, create the RegeString with
non_capturing=False
to allow capturing.)
Finally we're ready to run our RegexString objects against the stamp descriptions at top.
type_phrase_rs = RegexString.concat_with_word_distances(type_rs,
roman_nums_rs,
min_nbr_words=0,
max_nbr_words=0)
matchStrs = type_phrase_rs.get_match_triples(input)
print("\nType info found in the input:")
print(matchStrs)
RegexString.concat_with_word_distances() takes two RegexString objects and creates a new RegexString which will match text satisfying both of the original regular expressions when they are within a certain amount of word distance. In our case we are saying that we don't want to allow any words between them. (Note that this is a static function belonging to the RegexString class rather than to any particular RegexSting object.)
When we run this code we see this output:
Type info found in the input:
[('type II', 48, 55), ('type I', 149, 155), ('type I', 195, 201)]
get_match_triples( ) runs a MatchString object's regular expression against user-supplied text input. The "triples" it returns are: (1) the matched string; (2) the matched string's starting offset in the input text; and (3) the matched string's ending offset in the input text.
You see that the type_phrase_rs
object which we constructed from smaller pieces managed to find all three "type" expressions in the postage stamp input.
Now let's extract the value of the postage stamp, which will give us a chance to demo RegexString's prepend
parameter. Our description above has six unique values, only one of which uses the old-fashioned cents character, '¢'. These are: 3¢, 12c, 5c, 1c, 1c, 5c, 10c.
cent_symbols = ['c', '¢']
# Use `prepend` to look for one or two digits before the cent symbol.
cent_rs = RegexString(cent_symbols, prepend='\d\d?')
print(f"\nRegular expression for 'cent_rs': {cent_rs.get_regex_str()}")
matchStrs = cent_rs.get_match_triples(input)
print("\nCents info found in the input:")
print(matchStrs)
This outputs:
Regular expression for 'cent_rs': \d\d?(?:c|¢)
Cents info found in the input:
[('3¢', 16, 18), ('12c', 77, 80), ('5c', 124, 126), ('1c', 182, 184), ('1c', 224, 226), ('5c', 262, 264), ('10c', 303, 306)]
As a side note, Text-to-Relations offers at least one other way of accomplishing the same goal: we could have created two RegexString objects and then concatenated them with the RegexString concat( ) function.
Here's one more "Introduction-to-RegexString" example, this time used to extract the perforation information in our postage stamp descriptions. The specific perforation examples we're targeting are: imperf, imperforate, perf 15.
perforation_markers = ['perf', 'imperforate', 'imperf']
perforations_rs = RegexString(perforation_markers)
matchStrs = perforations_rs.get_match_triples(input)
print("\nPerforation info:")
print(matchStrs)
When run, the code above gives us:
Perforation info:
[('imperf', 57, 63), ('imperforate', 92, 103), ('imperforate', 157, 168), ('perf', 203, 207)]
But this isn't quite what we want. If perforated, the description includes the perforation size. Let's try again.
imperforated_markers = ['imperforate', 'imperf']
imperf_rs = RegexString(imperforated_markers)
matchStrs = imperf_rs.get_match_triples(input)
print("\nImperforated:")
print(matchStrs)
perforation_markers = ['perf']
perf_rs = RegexString(perforation_markers, append=' \d\d')
matchStrs = perf_rs.get_match_triples(input)
print("\nPerforated sizes:")
print(matchStrs)
What this produces is much closer to what we expect:
Imperforated:
[('imperf', 57, 63), ('imperforate', 92, 103), ('imperforate', 157, 168)]
Perforated sizes:
[('perf 15', 203, 210)]
Now for some new RegexString functionality. First we'll use the function build_regex_string( ) to create a regular expression that matches on one set of list items on one word, then on another set for the next word in the text--this seems right for colors that are preceded by qualifiers like bright and dark.
Before going further let's have a look at the use of colors in the stamp descriptions--one description doesn't mention a color, one has just "black," and five have a qualifier + color description. The specific strings we're targeting are: dull red, black, red brown, bright blue, orange brown, dark green.
Now let's get started.
colors = ['black', 'blue', 'brown', 'green', 'orange', 'red']
colors_rs = RegexString(colors, whole_word=True)
matchStrs = colors_rs.get_match_triples(input)
print("\nColors alone:")
print(matchStrs)
# Create the qualifiers.
# Add colors to the qualifiers, so that we can ID, e.g. 'orange brown'.
color_qualifiers = ['bright', 'dark', 'dull'] + colors
inputList = [color_qualifiers, 0, colors]
color_phrase_rs = RegexString.build_regex_string(inputList)
matchStrs = color_phrase_rs.get_match_triples(input)
print("\nColor qualifiers + colors:")
print(matchStrs)
Output:
Colors alone:
[('red', 43, 46), ('black', 105, 110), ('red', 138, 141), ('brown', 142, 147), ('blue', 244, 248), ('orange', 276, 282), ('brown', 283, 288), ('green', 324, 329)]
Color qualifiers + colors:
[('dull red', 38, 46), ('red brown', 138, 147), ('bright blue', 237, 248), ('orange brown', 276, 288), ('dark green', 319, 329)]
Oops! We're not getting the colors which aren't preceded by a qualifier. The build_regex_string( ) function may be useful in other circumstances, but it's not the right tool for this particular job. (In truth, I knew this function wouldn't work here before I tried it out, but I wanted to introduce you to the function and I had no other use case for this set of postage stamp descriptions.)
Instead, let's try concat_with_word_distances( ), which we've already used successfully to identify type + roman numeral, a context where the first word was not optional.
color_qualifiers_rs = RegexString(color_qualifiers,
whole_word=True,
optional=True
)
color_phrase_rs = RegexString.concat_with_word_distances(color_qualifiers_rs,
colors_rs,
min_nbr_words=0,
max_nbr_words=0)
matchStrs = color_phrase_rs.get_match_triples(input)
print("\nOptional color qualifiers + colors found in the input:")
print(matchStrs)
Final output:
Optional color qualifiers + colors found in the input:
[('dull red', 38, 46), ('black', 105, 110), ('red brown', 138, 147), ('bright blue', 237, 248), ('orange brown', 276, 288), ('dark green', 319, 329)]
For our final example we're going to extract the Mystic Stamp Company's IDs for the postage stamps in our description. These are: # 11A, # 17, # 12, # 18, # 40, # 42, # 62B.
Our quick first pass uses RegexString with the append
parameter, which completes the regular expression by matching on an empty space followed by one or more digits.
id_symbols = ['#']
id_rs = RegexString(id_symbols, append='\s\d+')
matchStrs = id_rs.get_match_triples(input)
print("\nStamp IDs:")
print(matchStrs)
We see that this gets the digit part of the IDs, but not the capital letter which follows two of them.
Stamp IDs:
[('# 11', 0, 4), ('# 17', 65, 69), ('# 12', 112, 116), ('# 18', 170, 174), ('# 40', 212, 216), ('# 42', 250, 254), ('# 62', 290, 294)]
No problem. With our limited understanding of regular expressions, we know that we can add (\w)? to the end of the regular expression in order to match on an optional word character. (A word character excludes digits and punctuation.)
- If we use \w+ instead of just \w, we could match on additional word characters, in case some of the IDs which we haven't seen have these extra characters.
- And from the talk above regarding non-grouping expressions, we have learned that, generally speaking, we should insert ?: at the start of the optional word character parentheses.
id_rs = RegexString(id_symbols, append='\s\d+(?:\w+)?')
matchStrs = id_rs.get_match_triples(input)
print("\nStamp IDs with letters found in the input:")
print(matchStrs)
Stamp IDs with letters found in the input:
[('# 11A', 0, 5), ('# 17', 65, 69), ('# 12', 112, 116), ('# 18', 170, 174), ('# 40', 212, 216), ('# 42', 250, 254), ('# 62B', 290, 295)]
By relation extraction we mean the identification of relationships between multiple entities. So, to use our postage stamp exercise as an example, creating a StampDescription class with properties such as type, value and perforation_info would qualify as relation extraction.
I have not done that, however. What I have done--and what I can offer as an example of relation extraction--is relation_extraction/extract_min_max_relation.py
. You can find it in the source code at the GitHub Text-to-Relations project.
The Python code there takes you all the way from using RegexString to extract entities to subclassing the ExtractionPhaseABC abstract base class in order to combine those entities into a MinMax object based on text snippets like:
- within the range of 60 to 90 points
- between 170 and 220 pounds.
- 30 to 40 drinks
The process begins by using RegexString to identify these entities:
- Number (for integers)
- Unit_of_Measure on the words points, pounds and drinks.
The process then passes these entities (as objects of the Text-to-Relations Annotation class) to two phases.
- Phase 1 creates a new temporary annotation called RangeMarker on phrases like within the range of and between. Then it looks for a RangeMarker + Number + Number + Unit_of_Measure in close proximity. If any are found, it creates a new MinMax annotation.
- Phase 2 creates a new temporary annotation called ToMarker on to and - (the hyphen character). Then it looks for a Number + ToMarker + Number + Unit_of_Measure in close proximity. If any are found, it creates a new MinMax annotation.
There is a critical snippet of code which is executed after Phase 1 and before Phase 2. You can find this in the run_extraction_phases( ) function of extract_min_max_relation.py
. It is designed to prevent Phase 2 from matching on any of the Number and Unit_of_Measure entities which were used to create MinMax relations in Phase 1, and it accomplishes this by removing from the annotation list all Number and Unit_of_Measure annotations which are enclosed by the MinMax relations extracted in that phase.
SentenceAnn: There is a class called SentenceAnn which subclasses Annotation. Not used in the MinMax extraction discussed here, it could be useful (in combination with Annotation.encloses( )) if you wanted to require that relations occur within a single sentence. Naturally, to succeed here the Spacy sentence-splitting, which is used here, must be correct.