Minor fixes to mason + caching #595
Nice addition on the secret. The dataset_transformation.py might have some issues.
split = dc.dataset_split
if split in transformed_datasets:
    transformed_datasets[split] = concatenate_datasets([transformed_datasets[split], dataset])
else:
    transformed_datasets[split] = dataset
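For readers following the thread, here is a minimal sketch of what this hunk appears to do: it groups the incoming datasets by their configured split and concatenates the ones that share a split. Plain Python lists stand in for `datasets.Dataset` objects, and `group_by_split` and the `(split, rows)` pairs are hypothetical names used only for illustration, not the PR's actual code.

```python
def group_by_split(configs):
    """Group rows by split name, concatenating rows that share a split.

    `configs` is a list of (split, rows) pairs. Lists stand in for
    Dataset objects, and `+` stands in for concatenate_datasets.
    """
    transformed = {}
    for split, rows in configs:
        if split in transformed:
            transformed[split] = transformed[split] + rows
        else:
            transformed[split] = list(rows)
    return transformed


# Four hypothetical datasets requested as "train, test, train, test".
mixed = group_by_split([
    ("train", ["a1", "a2"]),
    ("test", ["b1"]),
    ("train", ["c1"]),
    ("test", ["d1", "d2"]),
])
print(mixed)
```

Under these assumptions the result holds two separate entries, one per split. A caller that reads only one of them never sees the rows stored under the other, which is the concern raised in the next comment.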
Here is a use case: say I want to mix the "train, test, train, test" splits of 4 datasets. Would this implementation only load the "train, train" splits of the first and third datasets?
What do you think about making a default split like
open-instruct/open_instruct/dataset_transformation.py
Lines 801 to 808 in 7d1c5d9
# NOTE: the cached dataset is always train split
DEFAULT_SPLIT_FOR_CACHED_DATASET = "train"
# Check if the revision exists
if revision_exists(repo_name, config_hash, repo_type="dataset"):
    print(f"✅ Found cached dataset at https://huggingface.co/datasets/{repo_name}/tree/{config_hash}")
    # Use the split from the first dataset config as default
    return load_dataset(repo_name, split=DEFAULT_SPLIT_FOR_CACHED_DATASET, revision=config_hash)
and
open-instruct/open_instruct/dataset_transformation.py
Lines 858 to 859 in 7d1c5d9
print(f"✅ Found cached dataset at https://huggingface.co/datasets/{repo_name}/tree/{config_hash}")
return load_dataset(repo_name, split=DEFAULT_SPLIT_FOR_CACHED_DATASET, revision=config_hash)
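The default-split suggestion above can be sketched roughly as follows: every configured dataset goes into one mix regardless of which source split it came from, and the mix is stored under a single fixed split name. This is only an illustration under stated assumptions. Lists stand in for `datasets.Dataset` objects, and `mix_into_default_split` is a hypothetical helper, not a function from the repository.

```python
DEFAULT_SPLIT_FOR_CACHED_DATASET = "train"


def mix_into_default_split(configs):
    """Combine all (source_split, rows) pairs into one default split.

    The source split only selects which rows to load. It does not
    decide where the rows end up in the mixed result.
    """
    mixed = []
    for _source_split, rows in configs:
        mixed.extend(rows)
    return {DEFAULT_SPLIT_FOR_CACHED_DATASET: mixed}


# The same hypothetical "train, test, train, test" request as in the use case.
cached = mix_into_default_split([
    ("train", ["a1", "a2"]),
    ("test", ["b1"]),
    ("train", ["c1"]),
    ("test", ["d1", "d2"]),
])
print(cached)
```

With this layout, loading the default split returns rows from all four datasets, so nothing is dropped when the cache is read back with a single `split` argument.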
Ah, I guess I was thinking of this differently; this setup makes more sense.
Okay, changed to this and tested it for my use case, and it works!
Did you push the commit?
lol im dumb
mason.py