Dependency Model

Tao Chen edited this page Feb 8, 2019 · 9 revisions

Toy Example

Sentences

The boy eats a tasty apple
The boys live in a small village
The boy closes the red door
The red door closes
The green door opens
The girl opens the green door
The boy gives a green apple to the girl
The friendly girl gives a red apple to the hungry boy
The girls give the boy an apple

Transformed into CoNLL format:

<s>
The	DT	the	1	2	det
boy	NN	boy	2	3	nsubj
eats	VBZ	eat	3	0	ROOT
a	DT	a	4	6	det
tasty	JJ	tasty	5	6	amod
apple	NN	apple	6	3	dobj
</s>
...
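Each token line has six tab-separated fields: word form, POS tag, lemma, the token's index within the sentence, the index of its head (0 for ROOT), and the dependency relation. A minimal parsing sketch (the Token class is our own illustration, not part of the model):

```python
from typing import NamedTuple

class Token(NamedTuple):
    word: str    # surface form
    pos: str     # part-of-speech tag
    lemma: str   # lemma
    cur_id: int  # token index within the sentence (1-based)
    to_id: int   # index of the head token (0 = ROOT)
    rel: str     # dependency relation to the head

def parse_conll_line(line: str) -> Token:
    word, pos, lemma, cur_id, to_id, rel = line.rstrip("\n").split("\t")
    return Token(word, pos, lemma, int(cur_id), int(to_id), rel)

tok = parse_conll_line("eats\tVBZ\teat\t3\t0\tROOT")
# tok.lemma == 'eat', tok.to_id == 0, tok.rel == 'ROOT'
```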

Processing procedure

Set corpus format

After integration into the type-token model, we can set the corpus format in the same way,

settings["line-machine"] = "([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)"
settings['line-format'] = 'word,pos,lemma,curId,toId,rel'
settings['type'] = 'lemma'
settings['colloc'] = 'lemma'
settings['token'] = 'lemma/fid/lid'
formatter = CorpusFormatter(settings)

and test it on an example line:

line = 'boy\tNN\tboy\t2\t3\tnsubj'
match = formatter.match_line(line)
type_ = formatter.get_type(match)
token = formatter.get_token(match, 'fname', '1')
print(type_, ',', token)
# boy , boy/fname/1
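Internally, match_line applies the line-machine regex, and get_type/get_token pick out the group named in line-format. A rough sketch of the same logic with plain re (a simplification, not the actual CorpusFormatter code):

```python
import re

line_machine = r"([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)"
line_format = ["word", "pos", "lemma", "curId", "toId", "rel"]

line = "boy\tNN\tboy\t2\t3\tnsubj"
match = re.match(line_machine, line)
fields = dict(zip(line_format, match.groups()))

type_ = fields["lemma"]               # settings['type'] = 'lemma'
token = f"{fields['lemma']}/fname/1"  # 'lemma/fid/lid' with fid='fname', lid='1'
print(type_, ",", token)
# boy , boy/fname/1
```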

Read templates

Read the templates from a file; they are stored as PathTemplate objects.

dephan = DepRelHandler(settings)
path_fname = "/path/to/template.txt"
dephan.read_template(path_fname)

The templates in file look like:

(\w+/NN)\w*:(nsubj):(\w+/V)\w*
(\w+/NN)\w*:(dobj):(\w+/V)\w*
(\w+/NN)\w*:(amod):(\w+/JJ)\w*
(\w+/NN)\w*:(pobj):(\w+/IN)\w*:(prep):(\w+/V)\w*
(\w+/NN)\w*:(nsubj):(\w+/V)\w*:(dobj):(\w+/N.)\w*
(\w+/NN)\w*:(dobj):\w+/(V)\w*:(nsubj):(\w+/N.)\w*
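Each template is a regular expression over a path string of the form lemma/POS:rel:lemma/POS (this string representation is our reading of the format; the model itself wraps the patterns in PathTemplate objects). A quick sanity check with plain re:

```python
import re

template = r"(\w+/NN)\w*:(nsubj):(\w+/V)\w*"

# 'boy/NN:nsubj:eat/VBZ' comes from "The boy eats a tasty apple"
assert re.fullmatch(template, "boy/NN:nsubj:eat/VBZ")
# plural nouns (NNS) also match, thanks to the trailing \w*
assert re.fullmatch(template, "boy/NNS:nsubj:live/VBP")
# a dobj path does not match the nsubj template
assert not re.fullmatch(template, "apple/NN:dobj:eat/VBZ")
```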

Match sentences

The method build_dep_rel() is used the same way as make_item_freq() and make_col_freq().
You can either process selected files by specifying an fnames file or process all files in the corpus path.

dep_fnames = "/path/to/toy.fnames"
rels = dephan.build_dep_rel(fnames=dep_fnames)

Result

The resulting matches (relations) are stored, for each template, as a list of tuples of words/types.
For example, (\w+/NN)\w*:(nsubj):(\w+/V)\w* has 9 matches (one per sentence):

'boy/NN', 'eat/VBZ'
'boy/NNS', 'live/VBP'
'boy/NN', 'close/VBZ'
'door/NN', 'close/VBZ'
'door/NN', 'open/VBZ'
'girl/NN', 'open/VBZ'
'boy/NN', 'give/VBZ'
'girl/NN', 'give/VBZ'
'girl/NNS', 'give/VB'
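The matching step above can be sketched as: label each token lemma/POS, build child-to-head path strings from the head-index column, and test each path against a template regex. This is a simplified one-hop sketch; longer paths such as nsubj+dobj chains need multi-hop traversal, which we omit here.

```python
import re

# (word, pos, lemma, cur_id, to_id, rel) rows of "The boy eats a tasty apple"
sent = [
    ("The", "DT", "the", 1, 2, "det"),
    ("boy", "NN", "boy", 2, 3, "nsubj"),
    ("eats", "VBZ", "eat", 3, 0, "ROOT"),
    ("a", "DT", "a", 4, 6, "det"),
    ("tasty", "JJ", "tasty", 5, 6, "amod"),
    ("apple", "NN", "apple", 6, 3, "dobj"),
]

# label each token by lemma/POS, keyed by its sentence index
nodes = {cur: f"{lemma}/{pos}" for _, pos, lemma, cur, _, _ in sent}

# one-hop child->head paths as 'lemma/POS:rel:lemma/POS' strings
paths = [f"{nodes[cur]}:{rel}:{nodes[to]}"
         for _, _, _, cur, to, rel in sent if to != 0]

template = r"(\w+/NN)\w*:(nsubj):(\w+/V)\w*"
hits = [p for p in paths if re.fullmatch(template, p)]
print(hits)
# ['boy/NN:nsubj:eat/VBZ']
```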

A large example

The encow14ax03.xml file is 26GB and contains 30,804,289 sentences. Since the current model can only use multiple cores across many corpus files, we have to split this large file into smaller ones (each containing 100,000 sentences).
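The splitting step can be sketched as follows, assuming sentence boundaries are marked by </s> lines as in the example below; the output naming scheme here is our own convention:

```python
def split_corpus(in_path, out_prefix, sents_per_file=100_000):
    """Split a large sentence-per-<s>-block corpus file into smaller parts."""
    part, n_sents, out = 0, 0, None
    with open(in_path, encoding="utf-8") as fin:
        for line in fin:
            if out is None:
                out = open(f"{out_prefix}.{part:04d}", "w", encoding="utf-8")
            out.write(line)
            # count a sentence each time its closing tag is written
            if line.startswith("</s>"):
                n_sents += 1
                if n_sents >= sents_per_file:
                    out.close()
                    out, part, n_sents = None, part + 1, 0
    if out is not None:
        out.close()
```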

The following is an example sentence (originally there were tags like <nc> between normal lines):

<s url="http://www.koimag.co.uk/forum/map.php?forum=31%26news%26sid=2efb8c5f26bb593cb2645aa079d902fe%26start=30" id="1c996e6cae4e2c7070afa67d03ebc83a15de" bdc="d" date="Wed, 17 Oct 2012 13:13:25 GMT" last-modified="unknown" country="GB" city="_unk_" bpc="b">
he      PP      he      1       8       nsubj
told    VBD     tell    2       8       dep
me      PP      me      3       2       dobj
of      IN      of      4       3       prep
a       DT      a       5       6       det
product NN      product 6       4       pobj
he      PP      he      7       8       nsubj
uses    VBZ     use     8       0       null
to      TO      to      9       10      aux
control VB      control 10      8       xcomp
the     DT      the     11      12      det
weed    NN      weed    12      10      dobj
,       ,       ,       13      8       punct
its     PP$     its     14      15      amod
name    NN      name    15      18      nsubj
is      VBZ     be      16      18      cop
&quot;  ``      &quot;  17      18      punct
QUARTS  JJ      (unknown)       18      8       conj
&quot;  &apos;&apos;    &quot;  19      18      punct
and     CC      and     20      8       cc
the     DT      the     21      23      det
uk      NP      Uk      22      23      amod
company NN      company 23      24      nsubj
is      VBZ     be      24      8       conj
,       ,       ,       25      24      punct
&quot;  ``      &quot;  26      24      punct
HYDRA   NP      (unknown)       27      29      dep
&quot;  &apos;&apos;    &quot;  28      27      punct
.       SENT    .       29      24      punct
</s>

Number of matches for each template:

  • (\w+/NN)\w*:(nsubj):(\w+/V)\w*: 1,076,066
  • (\w+/NN)\w*:(dobj):(\w+/V)\w*: 1,484,833
  • (\w+/NN)\w*:(amod):(\w+/JJ)\w*: 234,219
  • (\w+/NN)\w*:(pobj):(\w+/IN)\w*:(prep):(\w+/V)\w*: 305,819
  • (\w+/NN)\w*:(nsubj):(\w+/V)\w*:(dobj):(\w+/N.)\w*: 2,309,336
  • (\w+/NN)\w*:(dobj):\w+/(V)\w*:(nsubj):(\w+/N.)\w*: 1,415,571

Processing time on r2d2 (with 23 cores) is around 1 hour. The runtime is proportional to the number of sentences rather than to the raw size of the corpus.

The output file is around 280MB when the Python object is saved directly. Possible improvements are 1) representing words as integers instead of strings, and 2) saving in JSON format.
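The first improvement can be sketched as follows: map each word string to an integer index (in the spirit of the Vocab class) and dump the encoded matches as JSON. The variable names are illustrative only:

```python
import json

# toy matches for one template, as produced by build_dep_rel()
matches = [("boy/NN", "eat/VBZ"), ("boy/NNS", "live/VBP"), ("boy/NN", "give/VBZ")]

vocab = {}
def word_id(w):
    # assign the next free integer index to an unseen word
    return vocab.setdefault(w, len(vocab))

encoded = [[word_id(a), word_id(b)] for a, b in matches]
blob = json.dumps({"vocab": vocab, "matches": encoded})
# repeated words like 'boy/NN' are stored once in the vocab,
# so each occurrence costs only an integer in the matches list
```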

Integration

qlvl

  • Vocab: represents the vocabulary of a corpus and wraps a Python dict mapping word strings to integer indices
  • TypeTokenMatrix: a wrapper matrix class whose underlying matrix is a NumPy array or SciPy sparse matrix, together with a row item list and a column item list
  • Multicore template workflow for processing corpus files: the procedures for counting frequency lists, collocation matrices, and tokens follow a similar template workflow

TODO

Transform matches to co-occurrence matrix

(\w+/NN)\w*:(nsubj):(\w+/V)\w* has a match for the first toy sentence: 'boy/NN':(nsubj):'eat/VBZ'.
And (\w+/NN)\w*:(nsubj):(\w+/V)\w*:(dobj):(\w+/N.)\w* has a match for the same sentence: 'boy/NN':(nsubj):'eat/VBZ':(dobj):'apple/NN'.

Given a target word eat/VBZ, get all path matches that include such target word. Then for our toy example, we have:

'boy/NN':(nsubj):'eat/VBZ'
'apple/NN':(dobj):'eat/VBZ'
'boy/NN':(nsubj):'eat/VBZ':(dobj):'apple/NN'

Then eat/VBZ has context features 'boy/NN':(nsubj):*, 'apple/NN':(dobj):*, 'boy/NN':(nsubj):*:(dobj):'apple/NN'.
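That transformation can be sketched as: for every path match containing the target, replace the target slot with '*' to form the context feature, then count target-feature pairs. A minimal sketch; in the real pipeline the counts would go into a TypeTokenMatrix.

```python
from collections import Counter

# path matches from the toy example, as lists of nodes and relations
matches = [
    ["boy/NN", "nsubj", "eat/VBZ"],
    ["apple/NN", "dobj", "eat/VBZ"],
    ["boy/NN", "nsubj", "eat/VBZ", "dobj", "apple/NN"],
]

target = "eat/VBZ"
cooc = Counter()
for path in matches:
    if target in path:
        # replace the target slot with '*' to get the context feature
        feature = ":".join("*" if node == target else node for node in path)
        cooc[(target, feature)] += 1

for (t, f), n in cooc.items():
    print(t, f, n)
# eat/VBZ boy/NN:nsubj:* 1
# eat/VBZ apple/NN:dobj:* 1
# eat/VBZ boy/NN:nsubj:*:dobj:apple/NN 1
```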

Corpus format

The current dependency model can only process file content that is already separated into sentences.
Are there corpora without sentence separation that would need the dependency model? How should we handle them?
