Development of tools for Sanskrit #12085
Replies: 1 comment 6 replies
-
Hi Raf, thanks for your question and sorry for taking so long to get back to you! At present we offer basic support for Sanskrit, meaning that although we don't yet distribute models, if you set the language of your pipeline to Sanskrit you'll get a properly configured tokenizer, stop words, maybe some lexical attributes (like flags for punctuation or numeric terms), and so on. That's all the relevant discussion/activity I know about, although of course there may be other work I'm not aware of.
The corpus you link to looks like an excellent starting point for developing a spaCy model. The spacy convert command converts .conllu data into spaCy's native binary format, so assuming this corpus really conforms to the CoNLL-U standard, conversion should be very simple. (In practice you may need to make a few small modifications to get everything working together, but that shouldn't be too hard either.) The corpus's license also looks as though it would allow us to publish any model produced with it. There are two other things you'd need to consider before feeding the native binary data into spacy train (see also #3056).
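In case it helps to see how the pieces fit together, here's a minimal sketch (assuming spaCy v3.x; the sample sentence, file names and pipeline components below are placeholders for illustration, not a tested recipe for the DCS data):

```python
# Minimal sketch of the "basic support" described above, assuming spaCy v3.x.
# "sa" is the language code spaCy uses for Sanskrit; the sentence is just an example.
import spacy

nlp = spacy.blank("sa")          # blank pipeline: tokenizer, stop words, lexical attributes
doc = nlp("रामः वनं गच्छति ।")
for token in doc:
    print(token.text, token.is_punct, token.like_num, token.is_stop)

# Converting .conllu files and training would then look roughly like this
# (file names and pipeline components are placeholders):
#
#   python -m spacy convert train.conllu ./corpus --converter conllu --n-sents 10
#   python -m spacy convert dev.conllu   ./corpus --converter conllu --n-sents 10
#   python -m spacy init config config.cfg --lang sa --pipeline tagger,morphologizer,parser
#   python -m spacy train config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
```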
Please let us know if you have any further questions or need anything to get started.
-
Hi!
I wanted to ask how the development of tools for the Sanskrit language is going. I noticed a discussion about it some time ago, but I'm not clear on the specific tasks or the current state of development.
Theoretically, would it be difficult to train a model for Sanskrit using, for example, Oliver Hellwig's .conllu files from the Digital Corpus of Sanskrit (https://github.com/OliverHellwig/sanskrit/tree/master/dcs/data/conllu)? I'd love to experiment with that, but frankly I don't know how difficult it is to get started. Plus, if someone's already working on it, we could collaborate.
Best,
Raf