Morphological dictionary and multi-word tokens #99

jeanm · 2019-06-05T11:30:11Z

(First of all, congrats on UDPipe, it's a pleasure to use!)

I've built a morphological generator for an endangered language, and I'm having it save its output in the tab-separated FORM,LEMMA,UPOS,XPOS,FEATS format so that I can also use it with UDPipe.

Is there any way of supporting multi-word tokens in that format, such that UDPipe will take them into account? For French, for example, I am looking for a way to specify that au should be split and tagged like so:

1-2	au	_	_	_	_
1	à	à	ADP	ADP	_
2	le	le	DET	DET	Definite=Def|Gender=Masc|Number=Sing|PronType=Art

If not, I will probably need to extend this format on my own. Do you have any suggestions for a way to do this which could be backwards compatible with the format used by UDPipe?

I was thinking of something like the following:

au   _   _   _   _   SplitForm=à/le|SplitLemma=à/le|SplitUPos=ADP/DET|SplitFeats=_/Definite=Def,Gender=Masc,Number=Sing,PronType=Art

It does seem awfully verbose though...
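
To make the proposal concrete, here is a minimal Python sketch of how a line in that extended format could be expanded back into CoNLL-U-style rows. The Split* keys are only the ones proposed above and are not something UDPipe understands. The function name expand_split_entry is hypothetical, and because the proposal has no SplitXPos key, XPOS is left as "_". The sketch also assumes that "/" and "|" never occur inside a form or lemma.

```python
def expand_split_entry(line):
    """Expand one line of the proposed Split* format into CoNLL-U-style rows."""
    fields = line.split()  # FORM LEMMA UPOS XPOS FEATS + the proposed extra column
    if len(fields) != 6:
        raise ValueError("expected 6 whitespace-separated columns")
    form = fields[0]

    # "Key=a/b|Key2=c/d" -> {"Key": "a/b", "Key2": "c/d"}
    split = dict(item.split("=", 1) for item in fields[5].split("|"))

    forms = split["SplitForm"].split("/")
    lemmas = split["SplitLemma"].split("/")
    upos = split["SplitUPos"].split("/")
    # Inside SplitFeats, features are comma-separated; restore the usual "|".
    feats = [f.replace(",", "|") for f in split["SplitFeats"].split("/")]

    if not (len(forms) == len(lemmas) == len(upos) == len(feats)):
        raise ValueError("Split* values must describe the same number of words")

    # First row: the multi-word token range, then one row per syntactic word.
    rows = ["1-%d\t%s\t_\t_\t_\t_" % (len(forms), form)]
    for i, (w, l, u, f) in enumerate(zip(forms, lemmas, upos, feats), start=1):
        rows.append("%d\t%s\t%s\t%s\t_\t%s" % (i, w, l, u, f))
    return rows
```

Calling it on the au line above yields three rows: the 1-2 range row followed by one row each for à and le.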


foxik · 2019-06-13T09:08:14Z

Currently that would be non-trivial to do (simply because of how the implementation works).

There are two parts of the mentioned problem:

1. The au multi-word token must be split into two words, à and le. Currently UDPipe does that in a very old-fashioned way, by having a dictionary with rules for how the multi-word tokens are split. It would not be difficult to allow adding additional rules, either during training or during inference.
2. Run morphological analysis on the resulting words. UDPipe currently does not distinguish between tokens and multi-word tokens, so the analyses for à are the same regardless of whether it was a token or a part of a multi-word token -- but of course it could be modified.
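
The two parts can be sketched roughly as follows in Python. This is an illustration only, not UDPipe's actual code: SPLIT_RULES, ANALYSES and the function names are made up for the example, and the tables hold just the French au case from this thread.

```python
# Hypothetical split dictionary: multi-word token -> its syntactic words.
SPLIT_RULES = {"au": ["à", "le"]}

# Hypothetical morphological dictionary: word -> list of (lemma, UPOS, FEATS).
ANALYSES = {
    "à": [("à", "ADP", "_")],
    "le": [("le", "DET", "Definite=Def|Gender=Masc|Number=Sing|PronType=Art")],
}

def split_token(token):
    """Part 1: split a multi-word token by dictionary lookup, else keep it whole."""
    return SPLIT_RULES.get(token, [token])

def analyze(word):
    """Part 2: look up analyses; the lookup does not know where the word came from."""
    return ANALYSES.get(word, [])

def process(tokens):
    """Return (token, [(word, analyses), ...]) for each surface token."""
    return [(tok, [(w, analyze(w)) for w in split_token(tok)]) for tok in tokens]
```

Because analyze only sees the word itself, à gets the same analyses whether it was a standalone token or came out of au. Changing that would mean passing the originating token into the lookup.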

I have no suggestions for what the dictionary should look like -- in the future, I would prefer to allow specifying morphological analyses for the words themselves (so that any morphological analysis system can be used, not just a flat plain text file), so I am not planning to extend the flat morphological file myself.

ftyers · 2020-05-22T21:30:38Z

There are some other issues that relate to this: #63 and #50
