Dependency Model
The boy eats a tasty apple
The boys live in a small village
The boy closes the red door
The red door closes
The green door opens
The girl opens the green door
The boy gives a green apple to the girl
The friendly girl gives a red apple to the hungry boy
The girls give the boy an apple
The sentences are transformed into CoNLL format:
<s>
The DT the 1 2 det
boy NN boy 2 3 nsubj
eats VBZ eat 3 0 ROOT
a DT a 4 6 det
tasty JJ tasty 5 6 amod
apple NN apple 6 3 dobj
</s>
...
After integration into the typetoken model, the corpus format is configured in the same way:
settings['line-machine'] = '([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)'
settings['line-format'] = 'word,pos,lemma,curId,toId,rel'
settings['type'] = 'lemma'
settings['colloc'] = 'lemma'
settings['token'] = 'lemma/fid/lid'
formatter = CorpusFormatter(settings)
and test for an example line.
line = 'boy NN boy 2 3 nsubj'
match = formatter.match_line(line)
type_ = formatter.get_type(match)
token = formatter.get_token(match, 'fname', '1')
print(type_, ',', token)
# boy , boy/fname/1
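For clarity, the line-machine setting is an ordinary regular expression over tab-separated columns. A minimal stand-alone sketch of the matching logic (hypothetical helper functions, not the actual CorpusFormatter API) looks like:

```python
import re

# Simplified stand-ins for CorpusFormatter with the settings above:
LINE_MACHINE = r"([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)"
LINE_FORMAT = ["word", "pos", "lemma", "curId", "toId", "rel"]

def match_line(line):
    return re.match(LINE_MACHINE, line)

def get_type(match):
    # settings['type'] = 'lemma' -> the type string is the lemma column
    return match.group(LINE_FORMAT.index("lemma") + 1)

def get_token(match, fid, lid):
    # settings['token'] = 'lemma/fid/lid' -> lemma plus file id and line id
    return "/".join([get_type(match), fid, lid])

m = match_line("boy\tNN\tboy\t2\t3\tnsubj")
print(get_type(m), ",", get_token(m, "fname", "1"))
# boy , boy/fname/1
```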
Templates are read from a file and stored as PathTemplate objects.
dephan = DepRelHandler(settings)
path_fname = "/path/to/template.txt"
dephan.read_template(path_fname)
The templates in the file look like:
(\w+/NN)\w*:(nsubj):(\w+/V)\w*
(\w+/NN)\w*:(dobj):(\w+/V)\w*
(\w+/NN)\w*:(amod):(\w+/JJ)\w*
(\w+/NN)\w*:(pobj):(\w+/IN)\w*:(prep):(\w+/V)\w*
(\w+/NN)\w*:(nsubj):(\w+/V)\w*:(dobj):(\w+/N.)\w*
(\w+/NN)\w*:(dobj):\w+/(V)\w*:(nsubj):(\w+/N.)\w*
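Each template is essentially a regular expression over a linearized dependency path. A minimal sketch of how a template selects or rejects paths (assuming a simplified `word/POS:rel:word/POS` path encoding; the real PathTemplate internals may differ):

```python
import re

# The first template above, used as a plain regex:
template = re.compile(r"(\w+/NN)\w*:(nsubj):(\w+/V)\w*")

paths = [
    "boy/NN:nsubj:eat/VBZ",     # noun subject of a verb: matches
    "door/NN:nsubj:close/VBZ",  # matches
    "apple/NN:dobj:eat/VBZ",    # wrong relation (dobj): no match
    "door/NN:amod:red/JJ",      # wrong relation and head POS: no match
]
for p in paths:
    print(p, "->", bool(template.match(p)))
```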
The method build_dep_rel() is used in the same way as make_item_freq() and make_col_freq(): you can either process selected files by specifying an fnames file, or process all files in the corpus path.
dep_fnames = "/path/to/toy.fnames"
rels = dephan.build_dep_rel(fnames=dep_fnames)
The resulting matches/relations are stored, for each template, as a list of tuples of words/types.
For example, (\w+/NN)\w*:(nsubj):(\w+/V)\w* has 9 matches (each sentence has a match):
'boy/NN', 'eat/VBZ'
'boy/NNS', 'live/VBP'
'boy/NN', 'close/VBZ'
'door/NN', 'close/VBZ'
'door/NN', 'open/VBZ'
'girl/NN', 'open/VBZ'
'boy/NN', 'give/VBZ'
'girl/NN', 'give/VBZ'
'girl/NNS', 'give/VB'
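One straightforward downstream use of these per-template match lists is counting (dependent, head) pair frequencies. A small sketch with the nine matches above (this aggregation step is an illustration, not part of build_dep_rel() itself):

```python
from collections import Counter

# The 9 matches listed above, as (dependent, head) tuples:
matches = [
    ("boy/NN", "eat/VBZ"), ("boy/NNS", "live/VBP"),
    ("boy/NN", "close/VBZ"), ("door/NN", "close/VBZ"),
    ("door/NN", "open/VBZ"), ("girl/NN", "open/VBZ"),
    ("boy/NN", "give/VBZ"), ("girl/NN", "give/VBZ"),
    ("girl/NNS", "give/VB"),
]

# Frequency of each (noun, verb) pair under this template:
freq = Counter(matches)
print(freq[("boy/NN", "eat/VBZ")])   # 1
print(sum(freq.values()))            # 9
```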
The encow14ax03.xml file is originally 26GB and contains 30,804,289 sentences. Since the current model can only utilize multiple cores across many corpus files, we have to split this large file into smaller ones (each containing 100,000 sentences).
The following is an example sentence (originally there were tags like <nc> between normal lines):
<s url="http://www.koimag.co.uk/forum/map.php?forum=31%26news%26sid=2efb8c5f26bb593cb2645aa079d902fe%26start=30" id="1c996e6cae4e2c7070afa67d03ebc83a15de" bdc="d" date="Wed, 17 Oct 2012 13:13:25 GMT" last-modified="unknown" country="GB" city="_unk_" bpc="b">
he PP he 1 8 nsubj
told VBD tell 2 8 dep
me PP me 3 2 dobj
of IN of 4 3 prep
a DT a 5 6 det
product NN product 6 4 pobj
he PP he 7 8 nsubj
uses VBZ use 8 0 null
to TO to 9 10 aux
control VB control 10 8 xcomp
the DT the 11 12 det
weed NN weed 12 10 dobj
, , , 13 8 punct
its PP$ its 14 15 amod
name NN name 15 18 nsubj
is VBZ be 16 18 cop
" `` " 17 18 punct
QUARTS JJ (unknown) 18 8 conj
" '' " 19 18 punct
and CC and 20 8 cc
the DT the 21 23 det
uk NP Uk 22 23 amod
company NN company 23 24 nsubj
is VBZ be 24 8 conj
, , , 25 24 punct
" `` " 26 24 punct
HYDRA NP (unknown) 27 29 dep
" '' " 28 27 punct
. SENT . 29 24 punct
</s>
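The splitting step described above can be sketched by scanning for the `</s>` end-of-sentence tag and flushing a new output file every 100,000 sentences (hypothetical file naming; the real splitting script is not part of the model):

```python
def split_corpus(infile, prefix, sents_per_chunk=100_000):
    """Split a sentence-tagged corpus file into chunks of whole sentences."""
    chunk, n_sents, part = [], 0, 0
    with open(infile, encoding="utf-8") as fin:
        for line in fin:
            chunk.append(line)
            if line.strip() == "</s>":  # sentence boundary
                n_sents += 1
                if n_sents == sents_per_chunk:
                    with open(f"{prefix}.{part:04d}", "w", encoding="utf-8") as fout:
                        fout.writelines(chunk)
                    chunk, n_sents, part = [], 0, part + 1
    if chunk:  # write the final partial chunk
        with open(f"{prefix}.{part:04d}", "w", encoding="utf-8") as fout:
            fout.writelines(chunk)
```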
Number of matches for each template:
- (\w+/NN)\w*:(nsubj):(\w+/V)\w*: 1,076,066
- (\w+/NN)\w*:(dobj):(\w+/V)\w*: 1,484,833
- (\w+/NN)\w*:(amod):(\w+/JJ)\w*: 234,219
- (\w+/NN)\w*:(pobj):(\w+/IN)\w*:(prep):(\w+/V)\w*: 305,819
- (\w+/NN)\w*:(nsubj):(\w+/V)\w*:(dobj):(\w+/N.)\w*: 2,309,336
- (\w+/NN)\w*:(dobj):\w+/(V)\w*:(nsubj):(\w+/N.)\w*: 1,415,571
Time cost on r2d2 (with 23 cores) is around 1 hour. The time complexity is not exactly proportional to the size of the corpus but rather to the number of sentences.
The output file size is around 280MB when directly saving the Python object. Possible improvements: 1) change the representation of words from strings to integers; 2) save in JSON format.
- Vocab: represent the vocabulary of a corpus and wrap a python dict of string word to integer indices
- TypeTokenMatrix: wrapper matrix class with the matrix object represented by numpy array or scipy sparse matrix, and row item list and column item list
- Multicore template workflow for processing corpus files: procedures of counting frequency list, collocation matrix, tokens have similar template workflow
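The Vocab component above (and the suggested string-to-integer improvement) amounts to a thin wrapper around a dict mapping word strings to integer indices. A minimal sketch (hypothetical interface, not the actual class):

```python
class Vocab:
    """Map word strings to integer indices and back."""

    def __init__(self, words):
        self.word2id = {w: i for i, w in enumerate(words)}
        self.id2word = {i: w for w, i in self.word2id.items()}

    def encode(self, word):
        return self.word2id[word]

    def decode(self, idx):
        return self.id2word[idx]

vocab = Vocab(["boy/NN", "eat/VBZ", "apple/NN"])
print(vocab.encode("eat/VBZ"))   # 1
print(vocab.decode(2))           # apple/NN
```

Storing the integer indices instead of the strings would shrink the saved dependency-relation output considerably.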
(\w+/NN)\w*:(nsubj):(\w+/V)\w* has a match for the first toy sentence: 'boy/NN':(nsubj):'eat/VBZ'.
And (\w+/NN)\w*:(nsubj):(\w+/V)\w*:(dobj):(\w+/N.)\w* has a match for the first toy sentence: 'boy/NN':(nsubj):'eat/VBZ':(dobj):'apple/NN'.
Given a target word eat/VBZ, get all path matches that include such target word. Then for our toy example, we have:
'boy/NN':(nsubj):'eat/VBZ'
'apple/NN':(dobj):'eat/VBZ'
'boy/NN':(nsubj):'eat/VBZ':(dobj):'apple/NN'
Then eat/VBZ has context features 'boy/NN':(nsubj):*, 'apple/NN':(dobj):*, 'boy/NN':(nsubj):*:(dobj):'apple/NN'.
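Turning path matches into context features for a target word can be sketched by replacing the target node with '*' in each matched path (using a simplified string encoding of the paths; the model's internal representation may differ):

```python
def context_features(target, matches):
    """Replace the target node with '*' in every path that contains it."""
    return [path.replace(target, "*") for path in matches if target in path]

# The three path matches for eat/VBZ from the toy example:
matches = [
    "boy/NN:nsubj:eat/VBZ",
    "apple/NN:dobj:eat/VBZ",
    "boy/NN:nsubj:eat/VBZ:dobj:apple/NN",
]
print(context_features("eat/VBZ", matches))
# ['boy/NN:nsubj:*', 'apple/NN:dobj:*', 'boy/NN:nsubj:*:dobj:apple/NN']
```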
The current dependency model can only process corpus files that are separated into sentences.
Open question: is there any corpus without sentence separation that would need the dependency model, and if so, how should it be handled?