
Minor fixes to mason + caching #595

Merged Mar 7, 2025 (2 commits)

Conversation

hamishivi
Collaborator

  1. Add the ability to manually specify extra secrets for mason.py
  2. Respect the split argument for datasets. The old code broke if you used a non-train split for the eval dataset. This fixes it by instead constructing a dataset dict following the splits specified in the dataset config.
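Point 1 could be sketched roughly as follows; the `--secret` flag name and the `NAME=VALUE` format are illustrative assumptions, not the actual mason.py interface.

```python
# Hypothetical sketch of point 1: a repeatable CLI flag for passing
# extra secrets to mason.py. The flag name and NAME=VALUE format are
# assumptions for illustration, not the real mason.py interface.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--secret",
    action="append",
    default=[],
    help="extra secret to mount, as NAME=VALUE; may be repeated",
)

args = parser.parse_args(
    ["--secret", "HF_TOKEN=hf-token", "--secret", "WANDB_API_KEY=wandb-key"]
)
extra_secrets = dict(s.split("=", 1) for s in args.secret)
print(extra_secrets)  # {'HF_TOKEN': 'hf-token', 'WANDB_API_KEY': 'wandb-key'}
```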

@hamishivi hamishivi requested a review from vwxyzjn March 6, 2025 05:41

@vwxyzjn vwxyzjn left a comment


Nice addition on the secret. The dataset_transformation.py might have some issues.

Comment on lines 813 to 817

```python
split = dc.dataset_split
if split in transformed_datasets:
    transformed_datasets[split] = concatenate_datasets([transformed_datasets[split], dataset])
else:
    transformed_datasets[split] = dataset
```
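For reference, the merge logic above accumulates datasets per split; a minimal stdlib sketch of the same behavior, with plain lists standing in for HF `Dataset` objects and `+` playing the role of `concatenate_datasets`:

```python
# Sketch of the per-split merge, with plain lists standing in for
# HF Dataset objects and list concatenation for concatenate_datasets.
def merge_by_split(datasets_with_splits):
    transformed = {}
    for dataset, split in datasets_with_splits:
        if split in transformed:
            transformed[split] = transformed[split] + dataset
        else:
            transformed[split] = dataset
    return transformed

# Mixing "train, test, train, test" across 4 datasets keeps every split:
mixed = merge_by_split([([1], "train"), ([2], "test"), ([3], "train"), ([4], "test")])
print(mixed)  # {'train': [1, 3], 'test': [2, 4]}
```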
Collaborator

Here is a use case: say I want to mix the "train, test, train, test" splits of 4 datasets. Would this implementation only load the "train, train" splits of the first and third datasets?

What do you think about making a default split, like

```python
# NOTE: the cached dataset is always train split
DEFAULT_SPLIT_FOR_CACHED_DATASET = "train"
# Check if the revision exists
if revision_exists(repo_name, config_hash, repo_type="dataset"):
    print(f"✅ Found cached dataset at https://huggingface.co/datasets/{repo_name}/tree/{config_hash}")
    # Use the split from the first dataset config as default
    return load_dataset(repo_name, split=DEFAULT_SPLIT_FOR_CACHED_DATASET, revision=config_hash)
```

and

```python
print(f"✅ Found cached dataset at https://huggingface.co/datasets/{repo_name}/tree/{config_hash}")
return load_dataset(repo_name, split=DEFAULT_SPLIT_FOR_CACHED_DATASET, revision=config_hash)
```

Collaborator Author

Ah, I guess I was thinking of this differently; this setup makes more sense.

Collaborator Author

Okay changed to this, and tested for my use case and it works!

Collaborator

Did you push the commit?

Collaborator Author

lol im dumb

@hamishivi hamishivi requested a review from vwxyzjn March 6, 2025 17:50
@vwxyzjn vwxyzjn merged commit c102c42 into main Mar 7, 2025
3 checks passed