Dataloading should fail if requested Mixture is not feasible #1248

Open
ahmeda14960 wants to merge 2 commits into main from fix_subtle_tokenize

@ahmeda14960
Contributor

Right now, if we request a mixture through LMMixtureDataset that includes a dataset with neither raw URLs nor a finished cache, Levanter just skips that dataset and leaves it out of the mixture. This can lead to bad silent errors, so instead we will complain loudly and fail.
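
To make the intended behavior concrete, here is a minimal sketch of a fail-fast feasibility check. It does not use Levanter's actual classes: `ComponentSpec`, `check_mixture_feasible`, and the field names are hypothetical stand-ins, and treating only components with a positive weight as "requested" is an assumption on my part.

```python
from dataclasses import dataclass, field


@dataclass
class ComponentSpec:
    """Hypothetical stand-in for one dataset entry in a mixture config."""
    name: str
    train_urls: list[str] = field(default_factory=list)
    cache_finished: bool = False


def check_mixture_feasible(components, weights):
    """Raise if a requested component has neither raw URLs nor a finished cache.

    Assumption: a component counts as "requested" only if its weight is > 0.
    """
    missing = sorted(
        c.name
        for c in components
        if weights.get(c.name, 0.0) > 0
        and not c.train_urls
        and not c.cache_finished
    )
    if missing:
        # Fail loudly instead of silently dropping the dataset from the mixture.
        raise ValueError(
            f"Mixture is not feasible; no URLs or finished cache for: {missing}"
        )
```

The point of raising here, rather than logging and skipping, is that a dropped component silently changes the effective mixture weights of everything that remains.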

@ahmeda14960 ahmeda14960 requested review from dlwh and rjpower October 10, 2025 01:06
Comment thread src/levanter/data/text.py
if source is None:
logger.warning(f"Skipping {split} because no source was provided")
return None
if split == "train":
Collaborator

Why is "train" special, should we always fail if we don't have the source?

Member

OK, yeah, so this isn't right. We have lots of sources where there is no training set (e.g. paloma).

Contributor Author

Oh, I was thinking we might not have a validation set but would still want to continue training. For example, we might not have validation URLs for each source in Common Pile, in which case this would fail.

For paloma, we always use it with a data config that has training URLs, correct? So this seems fine.

What are you thinking instead? Require validation for each train source?
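
For the sake of discussion, here is a sketch of one alternative that nobody has proposed yet: fail only when a source provides no split at all, so validation-only sources (like paloma) and train-only sources both pass. The function name and the `name -> source or None` dict shape are hypothetical and not Levanter's API.

```python
def check_each_source_has_a_split(train_sources, validation_sources):
    """Raise if any named source has neither a train nor a validation source.

    Both arguments are assumed to map source name -> source object, with
    None (or a missing key) meaning "no source for this split".
    """
    names = set(train_sources) | set(validation_sources)
    empty = sorted(
        name
        for name in names
        if train_sources.get(name) is None
        and validation_sources.get(name) is None
    )
    if empty:
        raise ValueError(f"No train or validation source for: {empty}")
```

This is weaker than requiring validation for each train source, so it would not catch a source that is missing only its training data.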

@rjpower
Collaborator

rjpower commented Oct 10, 2025

LGTM, but I'll defer to David -- I don't have enough context!
