Dataloading should fail if requested Mixture is not feasible #1248
ahmeda14960 wants to merge 2 commits into main
| if source is None: | ||
| logger.warning(f"Skipping {split} because no source was provided") | ||
| return None | ||
| if split == "train": |
Why is "train" special? Should we always fail if we don't have the source?
ok yeah so this isn't right. we have lots of sources where there is no training set (e.g. paloma)
Oh, I was thinking we might not have a validation set while training but still want to continue training. For example, we might not have validation URLs for each source in common pile, in which case this would fail.
For paloma, we always use it with a data config that has training URLs, correct? So this seems fine.
What are you thinking instead? Requiring validation for each train source?
LGTM, but I'll defer to David -- I don't have enough context!
Right now, if we request a mixture through LMMixtureDataset that includes a dataset with neither raw URLs nor a finished cache, Levanter just skips that dataset and leaves it out of the mixture. This can lead to bad silent errors, so instead we will complain loudly and fail.
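
To make the intended behavior concrete, here is a minimal sketch of a feasibility check. It is not the code in this PR: `SourceInfo` and `check_mixture_feasible` are hypothetical names, and treating zero-weight sources (for example, eval-only sets such as paloma) as exempt is an assumption that tries to address the concern raised in the review thread.

```python
from dataclasses import dataclass


@dataclass
class SourceInfo:
    """Hypothetical stand-in for a dataset source; not Levanter's real config class."""

    has_train_urls: bool = False
    has_finished_cache: bool = False


def check_mixture_feasible(weights: dict[str, float], sources: dict[str, SourceInfo]) -> None:
    """Raise if any source with nonzero training weight can be neither built nor loaded."""
    missing = []
    for name, weight in weights.items():
        if weight <= 0:
            # Assumption: zero-weight sources are eval-only and need no training data.
            continue
        info = sources.get(name)
        if info is None or not (info.has_train_urls or info.has_finished_cache):
            missing.append(name)
    if missing:
        # Fail loudly instead of silently dropping the dataset from the mixture.
        raise ValueError(
            f"Mixture is not feasible; no training URLs or finished cache for: {sorted(missing)}"
        )


sources = {"web": SourceInfo(has_train_urls=True), "paloma": SourceInfo()}

# Passes: paloma has zero weight, so it is never sampled for training.
check_mixture_feasible({"web": 1.0, "paloma": 0.0}, sources)

# Raises ValueError: paloma is requested with positive weight but has no data.
try:
    check_mixture_feasible({"web": 0.5, "paloma": 0.5}, sources)
except ValueError as err:
    print(err)
```

Keying the failure on the mixture weight rather than on the split name is one possible answer to the "why is train special" question; whether it fits Levanter's actual config semantics would need to be confirmed against the code.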