Adds mixing loader for FSL datasets #70
Not a complete review yet but here are some high-level comments
Looks really good! A couple more minor comments
A few more comments
LGTM!
This change set introduces a new mechanism for building a `NumpyFSLDataset` from a predetermined ratio of tokens per source. The UX is to supply a `SourceMixtureDatasetConfig` class to `NumpyFSLDatasetMixtureConfig`. At data loading time, it counts tokens for all the sources/files provided, validates that the intended ratios can be met from the source files, and builds the mixture of paths and the number of instances to retain from each file. That mixture is then used to construct a typical `NumpyFSLDataset`, with some modifications involving `SourceMixtureDataset`.

Lots of tests were added for the new mixture class, along with some downstream tests in numpy_dataset.
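To make the count / validate / build flow described above concrete, here is a minimal sketch in plain Python. It is not the code from this PR: `SourceSpec`, `plan_mixture`, and all of their parameters are hypothetical names invented for illustration, and the per-file token counts are passed in directly rather than read from disk. The real `SourceMixtureDatasetConfig` may handle rounding, ordering, and repetition differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SourceSpec:
    """Hypothetical stand-in for one source in the mixture (not the PR's class)."""

    name: str
    target_ratio: float  # fraction of the total token budget for this source
    file_token_counts: Dict[str, int]  # path -> number of tokens in that file


def plan_mixture(
    sources: List[SourceSpec], max_tokens: int, sequence_length: int
) -> Dict[str, int]:
    """Return a mapping of path -> number of instances to retain from that file."""
    total_ratio = sum(s.target_ratio for s in sources)
    if abs(total_ratio - 1.0) > 1e-6:
        raise ValueError(f"target ratios must sum to 1.0, got {total_ratio}")

    plan: Dict[str, int] = {}
    for source in sources:
        needed = int(max_tokens * source.target_ratio)
        available = sum(source.file_token_counts.values())
        # Validate that the intended ratio can be met from the source files.
        if available < needed:
            raise ValueError(
                f"source {source.name!r} has {available} tokens, needs {needed}"
            )
        # Walk the files in order, taking tokens until the budget is filled.
        remaining = needed
        for path, count in source.file_token_counts.items():
            take = min(count, remaining)
            instances = take // sequence_length
            if instances > 0:
                plan[path] = instances
            remaining -= take
            if remaining <= 0:
                break
    return plan
```

With a 1,600-token budget, a sequence length of 100, and a 75/25 split, this sketch would keep 10 instances from the first file of the larger source, 2 from its second file, and 4 from the smaller source; asking for more tokens than a source holds raises a `ValueError` instead of silently skewing the ratio.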