Adds mixing loader for FSL datasets #70

Merged: undfined merged 61 commits into main from undfined/mixing-loader on Nov 1, 2024


@undfined undfined commented Oct 19, 2024

This change set introduces a new mechanism for building a NumpyFSLDataset from a predetermined ratio of tokens per source. The UX is to supply a SourceMixtureDatasetConfig to NumpyFSLDatasetMixtureConfig. At data-loading time, the config counts the tokens in all the sources/files provided, validates that the intended ratios can be met from the source files, and builds the mixture of paths and the number of instances to retain from each file. That mixture is then used to construct a typical NumpyFSLDataset with some modifications.

Many tests were added for the new mixture class, along with some downstream tests in numpy_dataset.
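
For readers unfamiliar with the idea, the three steps in the description (count tokens per source, validate that the ratios are achievable, decide how many instances to keep per file) can be sketched roughly as below. This is an illustrative sketch only, not the code in this PR: `SourceSpec`, `build_mixture`, the pre-counted `file_token_counts`, and the choice to take the same fraction from every file in a source are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SourceSpec:
    """Hypothetical stand-in for one source in the mixture (not an olmo_core class)."""

    name: str
    target_ratio: float  # desired fraction of the final token budget
    file_token_counts: Dict[str, int]  # path -> tokens already counted in that file


def build_mixture(
    sources: List[SourceSpec], max_tokens: int, sequence_length: int
) -> Dict[str, int]:
    """Return a mapping of path -> number of fixed-length instances to retain."""
    total_ratio = sum(s.target_ratio for s in sources)
    if abs(total_ratio - 1.0) > 1e-6:
        raise ValueError(f"target ratios must sum to 1.0, got {total_ratio}")

    instances_per_path: Dict[str, int] = {}
    for source in sources:
        available = sum(source.file_token_counts.values())
        needed = int(max_tokens * source.target_ratio)

        # Validation step: the source must hold enough tokens to meet its ratio.
        if needed > available:
            raise ValueError(
                f"source '{source.name}' needs {needed} tokens "
                f"but only {available} are available"
            )

        # Assumption for this sketch: take the same fraction from every file.
        fraction = needed / available if available else 0.0
        for path, count in source.file_token_counts.items():
            instances_per_path[path] = int(count * fraction) // sequence_length

    return instances_per_path


# Example with made-up token counts: 75% "web", 25% "code", 8000 tokens total.
sources = [
    SourceSpec("web", 0.75, {"a.npy": 8000, "b.npy": 4000}),
    SourceSpec("code", 0.25, {"c.npy": 4000}),
]
mixture = build_mixture(sources, max_tokens=8000, sequence_length=100)
print(mixture)
```

In this sketch, a request that a source cannot satisfy fails early with a `ValueError` rather than silently producing a skewed mixture, which mirrors the validation step the description mentions.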

@undfined undfined changed the title from "Adds Mixing loader" to "Adds mixing loader for FSL datasets" Oct 25, 2024
@undfined undfined requested a review from epwalsh October 25, 2024 22:54

@epwalsh epwalsh left a comment

Not a complete review yet but here are some high-level comments

src/olmo_core/data/source_mixture.py (5 threads, outdated, resolved)
pyproject.toml (1 thread, outdated, resolved)
src/examples/train_with_mixture.py (1 thread, outdated, resolved)
src/olmo_core/data/types.py (1 thread, outdated, resolved)
src/olmo_core/data/numpy_dataset.py (2 threads, outdated, resolved)

@epwalsh epwalsh left a comment

Looks really good! A couple more minor comments

src/olmo_core/data/numpy_dataset.py (2 threads, outdated, resolved)

@epwalsh epwalsh left a comment

A few more comments

src/olmo_core/data/utils.py (1 thread, resolved)
src/olmo_core/data/source_mixture.py (3 threads, outdated, resolved)
CHANGELOG.md (1 thread, outdated, resolved)
src/olmo_core/data/numpy_dataset.py (2 threads, outdated, resolved)

@epwalsh epwalsh left a comment

LGTM!

@undfined undfined merged commit 4928f82 into main Nov 1, 2024
14 checks passed
@undfined undfined deleted the undfined/mixing-loader branch November 1, 2024 18:22
@undfined undfined restored the undfined/mixing-loader branch November 1, 2024 18:23
2 participants