Development of tools for Sanskrit #12085
Replies: 1 comment 6 replies
-
Hi Raf, thanks for your question and sorry for taking so long to get back to you! At present we offer basic support for Sanskrit, meaning that although we don't yet distribute models, if you set the language of your pipeline to Sanskrit you'll get a properly configured tokenizer, stop words, maybe some lexical attributes (like flags for punctuation or numeric terms), and so on. That's all the relevant discussion/activity I know about, although of course there may be other work I'm not aware of.
The corpus you link to looks like an excellent starting point for developing a spaCy model. The spacy convert command converts .conllu data into spaCy's native binary format, so assuming this corpus really conforms to the CoNLL-U standard, conversion should be very simple. (In practice you may need to make a few small modifications to get everything working together, but that shouldn't be too hard either.) The corpus's license also looks as though it would allow us to publish any model produced with it. There are two other things you'd need to consider before feeding the native binary data into spacy train (see also #3056).
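In case it helps to see how the pieces fit together, here's a minimal sketch (assuming spaCy v3.x; the sample sentence, file names and pipeline components below are placeholders for illustration, not a tested recipe for the DCS data):

```python
# Minimal sketch of the "basic support" described above, assuming spaCy v3.x.
# "sa" is the language code spaCy uses for Sanskrit; the sentence is just an example.
import spacy

nlp = spacy.blank("sa")          # blank pipeline: tokenizer, stop words, lexical attributes
doc = nlp("रामः वनं गच्छति ।")
for token in doc:
    print(token.text, token.is_punct, token.like_num, token.is_stop)

# Converting .conllu files and training would then look roughly like this
# (file names and pipeline components are placeholders):
#
#   python -m spacy convert train.conllu ./corpus --converter conllu --n-sents 10
#   python -m spacy convert dev.conllu   ./corpus --converter conllu --n-sents 10
#   python -m spacy init config config.cfg --lang sa --pipeline tagger,morphologizer,parser
#   python -m spacy train config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
```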
Please let us know if you have any further questions or need anything to get started.
-
Hi!
I wanted to ask how the development of tools for the Sanskrit language is going. I noticed a discussion about it some time ago, but I'm not clear on the specific tasks or the current state of development.
Theoretically, would it be difficult to train a model for Sanskrit using, for example, Oliver Hellwig's .conllu files from the Digital Corpus of Sanskrit (https://github.com/OliverHellwig/sanskrit/tree/master/dcs/data/conllu)? I'd love to experiment with that, but frankly I don't know how difficult it is to get started. Plus, if someone's already working on it, we could collaborate.
Best,
Raf