
Vocabulary Adaptation for MPT and BLOOM Models

Tokenizer-Embedding Pipeline

  1. To train an Indic tokenizer and obtain the final tokenizer, follow the tokenizer_setup directory (a tokenizer-training sketch follows this list).
  2. To evaluate the resulting tokenizer, follow the tokenizer_evaluation directory.
  3. To obtain embeddings using WECHSEL, follow the Wechsel_Setup directory.
  4. To initialize the model's word embedding layer, follow the InitializationWordEmbed directory (an embedding-transfer sketch also follows this list).
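For step 1, here is a minimal sketch of training a BPE tokenizer from scratch with the Hugging Face `tokenizers` library. The corpus file, vocabulary size, and special tokens are illustrative assumptions, not the actual settings used in tokenizer_setup:

```python
# Train a BPE tokenizer on an Indic corpus from scratch.
# "indic_corpus.txt" and vocab_size=32000 are placeholder assumptions.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
)
tokenizer.train(files=["indic_corpus.txt"], trainer=trainer)
tokenizer.save("indic_tokenizer.json")
```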
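For steps 3 and 4, the `wechsel` package maps the source model's embedding matrix onto the new vocabulary using fastText word vectors and a bilingual dictionary. Below is a minimal sketch following wechsel's documented usage; the MPT checkpoint, the English/Hindi fastText IDs, the "hindi" dictionary name, and the tokenizer path are assumptions, not this repo's actual configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from wechsel import WECHSEL, load_embeddings

# Source model and tokenizer (placeholder checkpoint).
source_tokenizer = AutoTokenizer.from_pretrained("mosaicml/mpt-7b")
model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)

# New Indic tokenizer produced in steps 1-2 (placeholder path).
target_tokenizer = AutoTokenizer.from_pretrained("path/to/indic_tokenizer")

wechsel = WECHSEL(
    load_embeddings("en"),   # fastText vectors, source language
    load_embeddings("hi"),   # fastText vectors, target language
    bilingual_dictionary="hindi",
)

# Transfer the source embedding matrix to the target vocabulary.
target_embeddings, info = wechsel.apply(
    source_tokenizer,
    target_tokenizer,
    model.get_input_embeddings().weight.detach().numpy(),
)

# Resize the embedding layer to the new vocabulary, then load the weights.
model.resize_token_embeddings(len(target_tokenizer))
model.get_input_embeddings().weight.data = torch.from_numpy(target_embeddings)
```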

Results

  1. Results are available at https://docs.google.com/spreadsheets/d/1npkCffkNyztbPZokK9vis19zvzzT07l-uWnN06aiOeQ/edit#gid=868636088
  2. Meeting notes, to-do lists, observations, etc. are at https://docs.google.com/document/d/1dOegfXg8v5NBYXlCZgLDnkLBjP1YD_6K47kHh_5ojd0/edit

File specification

  1. seed_data_test_split.py -> splits the seed dataset into train (90%) and test (10%) sets
  2. merge_training_seed.py -> merges the training data
  3. tokenizer_specification.py -> reports how two tokenizers relate, e.g. intersecting tokens and average tokenization length per sentence
  4. combine_tokenizer.py -> combines two tokenizers (the one used for the extended version)
  5. train_tokenizer.py -> trains a tokenizer from scratch
  6. MPT_inference.py and IndicMPT_inference.py -> compute the perplexity score from inference alone (no training); see the perplexity sketch after this list
  7. MPT_train.py and IndicMPT_train.py -> train the LoRA adapter and the model's word embedding layer; see the LoRA sketch after this list
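For item 6, perplexity by pure inference can be computed with the standard sliding-window negative log-likelihood recipe from the Hugging Face docs. A minimal sketch; the checkpoint name, context length, and stride are illustrative assumptions, not what the repo's scripts use:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mosaicml/mpt-7b"  # placeholder; the scripts load MPT / IndicMPT checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
model.eval()

def perplexity(text: str, max_length: int = 2048, stride: int = 512) -> float:
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    nlls, prev_end = [], 0
    for begin in range(0, input_ids.size(1), stride):
        end = min(begin + max_length, input_ids.size(1))
        ids = input_ids[:, begin:end]
        labels = ids.clone()
        labels[:, : prev_end - begin] = -100  # score only tokens not already scored
        with torch.no_grad():
            loss = model(ids, labels=labels).loss
        nlls.append(loss * (end - prev_end))
        prev_end = end
        if end == input_ids.size(1):
            break
    return torch.exp(torch.stack(nlls).sum() / prev_end).item()
```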
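For item 7, a minimal sketch of attaching LoRA adapters with `peft` while also unfreezing the word embedding layer so the two train together. The target module name "Wqkv" and embedding name "wte" follow MPT's layer naming, but both, like the checkpoint, are assumptions rather than this repo's exact setup:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b", trust_remote_code=True)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["Wqkv"],  # MPT's fused query/key/value projection
)
model = get_peft_model(model, lora_config)

# Also train the (re-initialized) token embeddings alongside the adapters.
for name, param in model.named_parameters():
    if "wte" in name:
        param.requires_grad = True

model.print_trainable_parameters()
```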
