
Minor fixes to mason + caching #595

Merged Mar 7, 2025 (2 commits)

Conversation

hamishivi
Collaborator

  1. Add the ability to manually specify extra secrets for mason.py
  2. Respect the split argument for datasets. The old code broke if you used a non-train split for the eval dataset. This fixes it by instead constructing a dataset dict following the splits specified in the dataset config.
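Point 1 could be sketched roughly as follows; the `--secret` flag name and the `NAME=VALUE` format are illustrative assumptions, not the actual mason.py interface.

```python
# Hypothetical sketch of point 1: a repeatable CLI flag for passing
# extra secrets to mason.py. The flag name and NAME=VALUE format are
# assumptions for illustration, not the real mason.py interface.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--secret",
    action="append",
    default=[],
    help="extra secret to mount, as NAME=VALUE; may be repeated",
)

args = parser.parse_args(
    ["--secret", "HF_TOKEN=hf-token", "--secret", "WANDB_API_KEY=wandb-key"]
)
extra_secrets = dict(s.split("=", 1) for s in args.secret)
print(extra_secrets)  # {'HF_TOKEN': 'hf-token', 'WANDB_API_KEY': 'wandb-key'}
```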

@hamishivi hamishivi requested a review from vwxyzjn March 6, 2025 05:41

@vwxyzjn vwxyzjn left a comment


Nice addition on the secret. The dataset_transformation.py might have some issues.

Comment on lines 813 to 817

```python
split = dc.dataset_split
if split in transformed_datasets:
    transformed_datasets[split] = concatenate_datasets([transformed_datasets[split], dataset])
else:
    transformed_datasets[split] = dataset
```
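For reference, the merge logic above accumulates datasets per split; a minimal stdlib sketch of the same behavior, with plain lists standing in for HF `Dataset` objects and `+` playing the role of `concatenate_datasets`:

```python
# Sketch of the per-split merge, with plain lists standing in for
# HF Dataset objects and list concatenation for concatenate_datasets.
def merge_by_split(datasets_with_splits):
    transformed = {}
    for dataset, split in datasets_with_splits:
        if split in transformed:
            transformed[split] = transformed[split] + dataset
        else:
            transformed[split] = dataset
    return transformed

# Mixing "train, test, train, test" across 4 datasets keeps every split:
mixed = merge_by_split([([1], "train"), ([2], "test"), ([3], "train"), ([4], "test")])
print(mixed)  # {'train': [1, 3], 'test': [2, 4]}
```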
Collaborator

Here is a use case: say I want to mix the "train, test, train, test" splits of 4 datasets. Would this implementation only load the "train, train" splits of the first and third datasets?

What do you think about making a default split, like

```python
# NOTE: the cached dataset is always train split
DEFAULT_SPLIT_FOR_CACHED_DATASET = "train"
# Check if the revision exists
if revision_exists(repo_name, config_hash, repo_type="dataset"):
    print(f"✅ Found cached dataset at https://huggingface.co/datasets/{repo_name}/tree/{config_hash}")
    # Use the split from the first dataset config as default
    return load_dataset(repo_name, split=DEFAULT_SPLIT_FOR_CACHED_DATASET, revision=config_hash)
```

and

```python
print(f"✅ Found cached dataset at https://huggingface.co/datasets/{repo_name}/tree/{config_hash}")
return load_dataset(repo_name, split=DEFAULT_SPLIT_FOR_CACHED_DATASET, revision=config_hash)
```

Collaborator Author

Ah, I guess I was thinking of this differently; this setup makes more sense.

Collaborator Author

Okay changed to this, and tested for my use case and it works!

Collaborator

Did you push the commit?

Collaborator Author

lol im dumb

@hamishivi hamishivi requested a review from vwxyzjn March 6, 2025 17:50
@vwxyzjn vwxyzjn merged commit c102c42 into main Mar 7, 2025
3 checks passed