- To train the Indic tokenizer and obtain the final tokenizer, follow the tokenizer_setup directory
- To evaluate the resulting tokenizer, follow the tokenizer_evaluation directory
- To obtain embeddings using WECHSEL, follow Wechsel_Setup
- To initialize the model's word embedding layer, follow InitializationWordEmbed (the WECHSEL sketch after this list illustrates both steps)
- Results are available at https://docs.google.com/spreadsheets/d/1npkCffkNyztbPZokK9vis19zvzzT07l-uWnN06aiOeQ/edit#gid=868636088
- Meeting notes, to-do lists, observations, etc. are at https://docs.google.com/document/d/1dOegfXg8v5NBYXlCZgLDnkLBjP1YD_6K47kHh_5ojd0/edit
- seed_data_test_split.py -> splits the seed dataset into train (90%) and test (10%) sets (see the split sketch after this list)
- merge_training_seed.py -> merges the training data
- tokenizer_specification.py -> reports how two tokenizers are related, e.g. the number of intersecting tokens and the average tokenization length per sentence (see the comparison sketch after this list)
- combine_tokenizer.py -> combines two tokenizers (this combination is the one used for the extended version; see the sketch after this list)
- train_tokenizer.py -> trains a tokenizer from scratch (see the from-scratch sketch after this list)
- MPT_inference.py and IndicMPT_inference.py -> calculate the perplexity score from inference alone, with no training (see the perplexity sketch after this list)
- MPT_train.py and IndicMPT_train.py -> train the LoRA adapter and the model's word embedding layer (see the LoRA sketch after this list)
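
A minimal sketch of the WECHSEL step, based on the wechsel package's documented interface. The checkpoint, tokenizer path, language codes ("en"/"hi"), and bilingual dictionary below are placeholder assumptions; Wechsel_Setup holds the actual configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from wechsel import WECHSEL, load_embeddings

# Placeholder checkpoints/languages; see Wechsel_Setup for the real ones.
source_tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")
target_tokenizer = AutoTokenizer.from_pretrained("path/to/indic_tokenizer")
model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)

wechsel = WECHSEL(
    load_embeddings("en"),         # fastText embeddings for the source language
    load_embeddings("hi"),         # ... and for the target language (Hindi assumed)
    bilingual_dictionary="hindi",  # word-translation dictionary shipped with wechsel
)

# Compute target-language embeddings aligned with the source embedding space.
target_embeddings, info = wechsel.apply(
    source_tokenizer,
    target_tokenizer,
    model.get_input_embeddings().weight.detach().numpy(),
)

# Initialize the model's word embedding layer with the WECHSEL embeddings.
model.resize_token_embeddings(len(target_tokenizer))
model.get_input_embeddings().weight.data = torch.from_numpy(target_embeddings)
```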
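
A sketch of the 90/10 split done by seed_data_test_split.py, assuming a plain-text file with one example per line; the file paths and random seed are hypothetical.

```python
import random

# Hypothetical paths; seed_data_test_split.py defines the real ones.
with open("seed_data.txt", encoding="utf-8") as f:
    lines = f.readlines()

random.seed(42)  # fix the seed so the split is reproducible
random.shuffle(lines)

cut = int(0.9 * len(lines))  # 90% train, 10% test
with open("seed_train.txt", "w", encoding="utf-8") as f:
    f.writelines(lines[:cut])
with open("seed_test.txt", "w", encoding="utf-8") as f:
    f.writelines(lines[cut:])
```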
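
A sketch of the two statistics tokenizer_specification.py reports: the vocabulary intersection and the average tokenization length per sentence. The tokenizer checkpoints and evaluation sentences are placeholders.

```python
from transformers import AutoTokenizer

# Placeholder tokenizers; compare the actual pair under study.
tok_a = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")
tok_b = AutoTokenizer.from_pretrained("path/to/indic_tokenizer")

# Intersecting tokens: overlap of the two vocabularies.
shared = set(tok_a.get_vocab()) & set(tok_b.get_vocab())
print(f"intersecting tokens: {len(shared)}")

# Average tokenization length per sentence for each tokenizer.
sentences = ["इस वाक्य को टोकन में बाँटा जाएगा।", "A second sample sentence."]
for name, tok in (("tok_a", tok_a), ("tok_b", tok_b)):
    avg = sum(len(tok.encode(s)) for s in sentences) / len(sentences)
    print(f"{name}: {avg:.2f} tokens/sentence on average")
```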
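
A simplistic way to combine two tokenizers via add_tokens, with placeholder paths. Note that this registers the new pieces as "added tokens" rather than merging the underlying BPE merges, so combine_tokenizer.py may instead operate on the vocab/merges files directly.

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")           # placeholder
indic = AutoTokenizer.from_pretrained("path/to/indic_tokenizer")  # placeholder

# Append every Indic token the base vocabulary does not already contain.
new_tokens = [t for t in indic.get_vocab() if t not in base.get_vocab()]
num_added = base.add_tokens(new_tokens)
print(f"added {num_added} tokens; extended vocab size = {len(base)}")

base.save_pretrained("extended_tokenizer")
```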
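
A minimal BPE-from-scratch sketch using the Hugging Face tokenizers library; the vocabulary size, special tokens, and training file are assumptions, and train_tokenizer.py holds the actual algorithm and settings.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,                         # assumed size
    special_tokens=["<unk>", "<s>", "</s>"],  # assumed special tokens
)
tokenizer.train(files=["seed_train.txt"], trainer=trainer)  # assumed corpus file
tokenizer.save("indic_tokenizer.json")
```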
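
A sketch of inference-only perplexity: run the frozen model over held-out text with labels equal to the inputs, then exponentiate the token-weighted mean negative log-likelihood. The checkpoint and test sentences are placeholders.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)
model.eval()

texts = ["held-out sentence one", "held-out sentence two"]  # placeholder test set

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in texts:
        enc = tok(text, return_tensors="pt")
        # Passing labels=input_ids makes the model shift them internally and
        # return the mean cross-entropy over the predicted positions.
        out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].size(1) - 1  # number of predicted positions
        total_nll += out.loss.item() * n
        total_tokens += n

print(f"perplexity: {math.exp(total_nll / total_tokens):.2f}")
```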
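
A sketch of making only the LoRA adapter and the word embedding layer trainable with peft. The MPT module names ("Wqkv" for the fused attention projection, "wte" for the embeddings) and the LoRA hyperparameters are assumptions; the actual values live in MPT_train.py / IndicMPT_train.py.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)

config = LoraConfig(
    r=8,                      # assumed LoRA rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["Wqkv"],  # MPT's fused attention projection (assumed)
    modules_to_save=["wte"],  # keep the word embedding layer fully trainable
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only adapter + embeddings are trainable

# From here the model can be fine-tuned with the usual transformers Trainer.
```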