- To train the Indic tokenizer and obtain the final tokenizer, follow the tokenizer_setup directory
- To evaluate the resulting tokenizer, follow the tokenizer_evaluation directory
- To obtain embeddings using WECHSEL, follow Wechsel_Setup
- To initialize the model's word embedding layer, follow InitializationWordEmbed (the WECHSEL sketch after this list illustrates both steps)
- Results are available at https://docs.google.com/spreadsheets/d/1npkCffkNyztbPZokK9vis19zvzzT07l-uWnN06aiOeQ/edit#gid=868636088
- Meeting notes, to-do lists, observations, etc. are at https://docs.google.com/document/d/1dOegfXg8v5NBYXlCZgLDnkLBjP1YD_6K47kHh_5ojd0/edit
- seed_data_test_split.py -> splits the seed dataset into train (90%) and test (10%) sets (see the split sketch after this list)
- merge_training_seed.py -> merges the training data
- tokenizer_specification.py -> reports how two tokenizers are related, e.g. the number of intersecting tokens and the average tokenization length per sentence (see the comparison sketch after this list)
- combine_tokenizer.py -> combines two tokenizers (this combination is the one used for the extended version; see the sketch after this list)
- train_tokenizer.py -> trains a tokenizer from scratch (see the from-scratch sketch after this list)
- MPT_inference.py and IndicMPT_inference.py -> calculate the perplexity score from inference alone, with no training (see the perplexity sketch after this list)
- MPT_train.py and IndicMPT_train.py -> train the LoRA adapter and the model's word embedding layer (see the LoRA sketch after this list)
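
A minimal sketch of the WECHSEL step, based on the wechsel package's documented interface. The checkpoint, tokenizer path, language codes ("en"/"hi"), and bilingual dictionary below are placeholder assumptions; Wechsel_Setup holds the actual configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from wechsel import WECHSEL, load_embeddings

# Placeholder checkpoints/languages; see Wechsel_Setup for the real ones.
source_tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")
target_tokenizer = AutoTokenizer.from_pretrained("path/to/indic_tokenizer")
model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)

wechsel = WECHSEL(
    load_embeddings("en"),         # fastText embeddings for the source language
    load_embeddings("hi"),         # ... and for the target language (Hindi assumed)
    bilingual_dictionary="hindi",  # word-translation dictionary shipped with wechsel
)

# Compute target-language embeddings aligned with the source embedding space.
target_embeddings, info = wechsel.apply(
    source_tokenizer,
    target_tokenizer,
    model.get_input_embeddings().weight.detach().numpy(),
)

# Initialize the model's word embedding layer with the WECHSEL embeddings.
model.resize_token_embeddings(len(target_tokenizer))
model.get_input_embeddings().weight.data = torch.from_numpy(target_embeddings)
```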
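
A sketch of the 90/10 split done by seed_data_test_split.py, assuming a plain-text file with one example per line; the file paths and random seed are hypothetical.

```python
import random

# Hypothetical paths; seed_data_test_split.py defines the real ones.
with open("seed_data.txt", encoding="utf-8") as f:
    lines = f.readlines()

random.seed(42)  # fix the seed so the split is reproducible
random.shuffle(lines)

cut = int(0.9 * len(lines))  # 90% train, 10% test
with open("seed_train.txt", "w", encoding="utf-8") as f:
    f.writelines(lines[:cut])
with open("seed_test.txt", "w", encoding="utf-8") as f:
    f.writelines(lines[cut:])
```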
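
A sketch of the two statistics tokenizer_specification.py reports: the vocabulary intersection and the average tokenization length per sentence. The tokenizer checkpoints and evaluation sentences are placeholders.

```python
from transformers import AutoTokenizer

# Placeholder tokenizers; compare the actual pair under study.
tok_a = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")
tok_b = AutoTokenizer.from_pretrained("path/to/indic_tokenizer")

# Intersecting tokens: overlap of the two vocabularies.
shared = set(tok_a.get_vocab()) & set(tok_b.get_vocab())
print(f"intersecting tokens: {len(shared)}")

# Average tokenization length per sentence for each tokenizer.
sentences = ["इस वाक्य को टोकन में बाँटा जाएगा।", "A second sample sentence."]
for name, tok in (("tok_a", tok_a), ("tok_b", tok_b)):
    avg = sum(len(tok.encode(s)) for s in sentences) / len(sentences)
    print(f"{name}: {avg:.2f} tokens/sentence on average")
```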
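
A simplistic way to combine two tokenizers via add_tokens, with placeholder paths. Note that this registers the new pieces as "added tokens" rather than merging the underlying BPE merges, so combine_tokenizer.py may instead operate on the vocab/merges files directly.

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")           # placeholder
indic = AutoTokenizer.from_pretrained("path/to/indic_tokenizer")  # placeholder

# Append every Indic token the base vocabulary does not already contain.
new_tokens = [t for t in indic.get_vocab() if t not in base.get_vocab()]
num_added = base.add_tokens(new_tokens)
print(f"added {num_added} tokens; extended vocab size = {len(base)}")

base.save_pretrained("extended_tokenizer")
```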
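
A minimal BPE-from-scratch sketch using the Hugging Face tokenizers library; the vocabulary size, special tokens, and training file are assumptions, and train_tokenizer.py holds the actual algorithm and settings.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,                         # assumed size
    special_tokens=["<unk>", "<s>", "</s>"],  # assumed special tokens
)
tokenizer.train(files=["seed_train.txt"], trainer=trainer)  # assumed corpus file
tokenizer.save("indic_tokenizer.json")
```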
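
A sketch of inference-only perplexity: run the frozen model over held-out text with labels equal to the inputs, then exponentiate the token-weighted mean negative log-likelihood. The checkpoint and test sentences are placeholders.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)
model.eval()

texts = ["held-out sentence one", "held-out sentence two"]  # placeholder test set

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in texts:
        enc = tok(text, return_tensors="pt")
        # Passing labels=input_ids makes the model shift them internally and
        # return the mean cross-entropy over the predicted positions.
        out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].size(1) - 1  # number of predicted positions
        total_nll += out.loss.item() * n
        total_tokens += n

print(f"perplexity: {math.exp(total_nll / total_tokens):.2f}")
```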
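
A sketch of making only the LoRA adapter and the word embedding layer trainable with peft. The MPT module names ("Wqkv" for the fused attention projection, "wte" for the embeddings) and the LoRA hyperparameters are assumptions; the actual values live in MPT_train.py / IndicMPT_train.py.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)

config = LoraConfig(
    r=8,                      # assumed LoRA rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["Wqkv"],  # MPT's fused attention projection (assumed)
    modules_to_save=["wte"],  # keep the word embedding layer fully trainable
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only adapter + embeddings are trainable

# From here the model can be fine-tuned with the usual transformers Trainer.
```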