I'd like to train a llama3.1 model with a variable sequence length, as described in Meta's paper: start training with one sequence length and, after a certain number of consumed tokens, switch to a larger sequence length and continue training.
I see several possible approaches:
Approach 1: train on dataset A with sequence length A for some number of iterations and save a checkpoint, then resume from that checkpoint on dataset B with sequence length B. At runtime seq_length is fixed and can be taken from args, so this requires minimal code changes.
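A minimal sketch of this two-phase workflow, assuming the standard Megatron-style CLI flags (`--data-path`, `--seq-length`, `--train-iters`, `--save`, `--load`); the phase values and paths here are illustrative, not taken from any real config:

```python
# Hypothetical two-phase schedule. In practice each phase would be a separate
# launch of the Megatron training script; this just assembles the arg lists.
phases = [
    {"data_path": "dataset_A", "seq_length": 4096, "train_iters": 10_000},
    {"data_path": "dataset_B", "seq_length": 8192, "train_iters": 20_000},
]

def build_args(phase, load_ckpt=None):
    """Assemble the CLI argument list for one training phase."""
    args = [
        "--data-path", phase["data_path"],
        "--seq-length", str(phase["seq_length"]),
        "--train-iters", str(phase["train_iters"]),
        "--save", "checkpoints",
    ]
    if load_ckpt:
        # Phase 2 resumes from the checkpoint written by phase 1.
        args += ["--load", load_ckpt]
    return args

cmds = [build_args(phases[0]), build_args(phases[1], load_ckpt="checkpoints")]
```

The appeal is that each phase sees a single fixed seq_length, so nothing inside the training loop has to change.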
Approach 2: use BlendedMegatronDataset, which builds one dataset from several component datasets. Currently, according to BlendedMegatronDatasetConfig, all components share the same seq_len; this could be changed so that seq_len is a list instead of an int. The high-level dataset would then yield variable-sized samples: some from dataset A with seq_len A and some from dataset B with seq_len B. In this approach seq_length is not fixed; it could come from a global function that returns the sequence length based on the index of the sample or the length of the sample itself.
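A rough sketch of what a per-dataset seq_len blend could look like. This is not the real BlendedMegatronDataset API (which takes a single sequence length in its config); the class, field names, and toy data below are all hypothetical:

```python
import random

class VariableSeqBlendedDataset:
    """Hypothetical blend of child datasets, each with its own seq_len."""

    def __init__(self, datasets, seq_lengths, weights):
        assert len(datasets) == len(seq_lengths) == len(weights)
        self.datasets = datasets
        self.seq_lengths = seq_lengths  # one seq_len per child dataset
        self.weights = weights          # blend weights per child dataset

    def __len__(self):
        return sum(len(d) for d in self.datasets)

    def sample(self, rng):
        # Pick a child dataset according to the blend weights, then return
        # one of its samples together with that dataset's seq_len.
        idx = rng.choices(range(len(self.datasets)), weights=self.weights)[0]
        ds = self.datasets[idx]
        return ds[rng.randrange(len(ds))], self.seq_lengths[idx]

rng = random.Random(0)
blended = VariableSeqBlendedDataset(
    datasets=[["a1", "a2"], ["b1", "b2", "b3"]],
    seq_lengths=[4096, 8192],
    weights=[0.5, 0.5],
)
item, seq_len = blended.sample(rng)
```

Returning the seq_len alongside each sample is one way to realize the "global function" idea: any consumer that needs the sequence length can read it off the sample instead of from a fixed arg.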
Approach 3: a low-level dataset comprised of samples with variable sequence lengths. It would include a vector recording where the sequence length changes, and at runtime the sequence length would be read from the sample itself. Since the sequence length is needed in several functions at runtime, seq_length would not be fixed, as in the previous approach.
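The "vector with the changes in sequence length" could be a sorted list of (start index, seq_len) change points, with a binary-search lookup at runtime. A small sketch, with a made-up schedule (the sample-count boundaries and lengths below are placeholders, not values from the paper):

```python
import bisect

# Hypothetical schedule: (first sample index, seq_len) pairs, sorted by index.
SEQ_LEN_SCHEDULE = [(0, 4096), (100_000, 8192), (150_000, 131_072)]
_starts = [start for start, _ in SEQ_LEN_SCHEDULE]

def seq_length_for_sample(sample_index):
    """Return the sequence length in effect for the given sample index."""
    # Find the last change point at or before sample_index.
    pos = bisect.bisect_right(_starts, sample_index) - 1
    return SEQ_LEN_SCHEDULE[pos][1]
```

Any function that currently reads args.seq_length could instead call such a lookup with the current sample (or iteration) index.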
This discussion was converted from issue #1047 on September 04, 2024 18:24.
Any thoughts?