(First of all, congrats on UDPipe, it's a pleasure to use!)
I've built a morphological generator for an endangered language, and I'm having it save its output in the tab-separated FORM,LEMMA,UPOS,XPOS,FEATS format so that I can also use it with UDPipe.
Is there any way of supporting multi-word tokens in that format, such that UDPipe will take them into account? For French, for example, I'd like a way to specify that au should be split and tagged like so:
1-2	au	_	_	_	_
1	à	à	ADP	ADP	_
2	le	le	DET	DET	Definite=Def|Gender=Masc|Number=Sing|PronType=Art
If not, I will probably need to extend this format on my own. Do you have any suggestions for a way to do this which could be backwards compatible with the format used by UDPipe?
I was thinking of something like the following:
au _ _ _ _ SplitForm=à/le|SplitLemma=à/le|SplitUPos=ADP/DET|SplitFeats=_/Definite=Def,Gender=Masc,Number=Sing,PronType=Art
It does seem awfully verbose though...
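To make the proposal concrete, here is a minimal Python sketch of how such a line could be expanded back into one row per word. It is only an illustration of the idea, not something UDPipe supports: the function name expand_multiword is hypothetical, and it assumes the Split* keys sit in the last column, that "/" separates the words, and that "," stands in for the usual "|" inside SplitFeats.

```python
def expand_multiword(line):
    """Expand one proposed Split* entry into (ID, FORM, LEMMA, UPOS, FEATS) rows.

    Assumptions (not part of any UDPipe format): the Split* keys are in the
    last whitespace-separated column, "/" separates the component words, and
    "," replaces "|" inside SplitFeats.
    """
    fields = line.split()
    surface_form, split_column = fields[0], fields[-1]

    # "SplitForm=à/le|SplitLemma=à/le|..." -> {"SplitForm": "à/le", ...}
    attrs = dict(item.split("=", 1) for item in split_column.split("|"))

    forms = attrs["SplitForm"].split("/")
    lemmas = attrs["SplitLemma"].split("/")
    upos_tags = attrs["SplitUPos"].split("/")
    # Restore the standard "|" separator inside each word's feature string.
    feats = [f.replace(",", "|") for f in attrs["SplitFeats"].split("/")]

    if not (len(forms) == len(lemmas) == len(upos_tags) == len(feats)):
        raise ValueError("Split* attributes describe different numbers of words")

    # First row is the multi-word token range, followed by one row per word.
    rows = [(f"1-{len(forms)}", surface_form, "_", "_", "_")]
    for index, row in enumerate(zip(forms, lemmas, upos_tags, feats), start=1):
        rows.append((str(index), *row))
    return rows


example = ("au _ _ _ _ SplitForm=à/le|SplitLemma=à/le|SplitUPos=ADP/DET|"
           "SplitFeats=_/Definite=Def,Gender=Masc,Number=Sing,PronType=Art")
for row in expand_multiword(example):
    print("\t".join(row))
```

One limitation of this encoding is visible in the sketch: a form or lemma that itself contains "/" or "," would be split incorrectly, so a real extension would need an escaping rule.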
Currently that would be non-trivial to do (just because of how the implementation works).
There are two parts of the mentioned problem:
The au multi-word token must be split into two words, à and le. Currently UDPipe does that in a very old-fashioned way, by having a dictionary with rules for how the multi-word tokens are split. It would not be difficult to allow adding additional rules, either during training or during inference.
Run morphological analysis on the resulting words. UDPipe currently does not distinguish tokens from words that are part of a multi-word token, so the analyses for à are the same regardless of whether it was a token or a part of a multi-word token, but of course that could be modified.
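The two steps above can be sketched roughly as follows. This is a toy Python illustration of the general idea only, not UDPipe's actual code or data format: SPLIT_RULES, ANALYSES, split_token and analyze are all hypothetical names, and the dictionaries hold just the French example from this thread.

```python
# Hypothetical splitting rules: multi-word token -> its component words.
SPLIT_RULES = {
    "au": ["à", "le"],
}

# Hypothetical morphological dictionary: word form -> (lemma, UPOS, FEATS).
# It is keyed by the word form alone, so a word receives the same analyses
# whether it was a standalone token or part of a multi-word token.
ANALYSES = {
    "à": [("à", "ADP", "_")],
    "le": [("le", "DET", "Definite=Def|Gender=Masc|Number=Sing|PronType=Art")],
}


def split_token(token, rules=SPLIT_RULES):
    """Step 1: split a multi-word token using the rule dictionary."""
    # Tokens without a rule are treated as ordinary single words.
    return rules.get(token, [token])


def analyze(tokens, analyses=ANALYSES):
    """Step 2: look up analyses for every word produced by step 1."""
    result = []
    for token in tokens:
        for word in split_token(token):
            # Unknown words get an empty list of analyses.
            result.append((word, analyses.get(word, [])))
    return result


for word, word_analyses in analyze(["au"]):
    print(word, word_analyses)
```

Adding user-supplied rules would amount to extending the first dictionary. Making the analyses depend on whether a word came from a multi-word token would require keying the second dictionary on more than the word form.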
I have no suggestions for what the dictionary should look like. In the future, I would prefer to allow specifying morphological analyses for the words themselves (so that any morphological analysis system can be used, not just a flat plain text file), so I am not planning to extend the flat morphological file myself.