Inconsistent Behavior Between load_dataset and load_from_disk When Loading Sharded Datasets #7372

Open
gaohongkui opened this issue Jan 16, 2025 · 0 comments

Comments

@gaohongkui

Description

I encountered inconsistent behavior between load_dataset and load_from_disk when loading a sharded dataset from a directory that had been saved to twice. Here is a minimal example to reproduce the issue:

Code 1: Using load_dataset

from datasets import Dataset, load_dataset

# First save with max_shard_size=10
Dataset.from_dict({"id": range(1000)}).train_test_split(test_size=0.1).save_to_disk("my_sharded_datasetdict", max_shard_size=10)

# Second save with max_shard_size=10
Dataset.from_dict({"id": range(500)}).train_test_split(test_size=0.1).save_to_disk("my_sharded_datasetdict", max_shard_size=10)

# Load the DatasetDict
loaded_datasetdict = load_dataset("my_sharded_datasetdict")
print(loaded_datasetdict)

Output:

  • train has 1350 samples.
  • test has 150 samples.

Code 2: Using load_from_disk

from datasets import Dataset, load_from_disk

# First save with max_shard_size=10
Dataset.from_dict({"id": range(1000)}).train_test_split(test_size=0.1).save_to_disk("my_sharded_datasetdict", max_shard_size=10)

# Second save with max_shard_size=10
Dataset.from_dict({"id": range(500)}).train_test_split(test_size=0.1).save_to_disk("my_sharded_datasetdict", max_shard_size=10)

# Load the DatasetDict
loaded_datasetdict = load_from_disk("my_sharded_datasetdict")
print(loaded_datasetdict)

Output:

  • train has 450 samples.
  • test has 50 samples.

Expected Behavior

I expected both load_dataset and load_from_disk to load the same dataset, as they are pointing to the same directory. However, the results differ significantly:

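  • load_dataset seems to merge the shards from both saves, resulting in a combined dataset (900 + 450 = 1350 train samples, 100 + 50 = 150 test samples).
  • load_from_disk loads only the most recently saved dataset (450 train, 50 test samples) and ignores the shards left over from the first save.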
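
One possible explanation, which I have not confirmed against the library source, is that the second save_to_disk call leaves the first save's .arrow shards in place, that load_from_disk reads only the shards listed in each split's state.json, and that load_dataset picks up every .arrow file it finds. The sketch below checks for such leftover files using only the standard library. The helper name stale_arrow_files is hypothetical, and the assumed state.json layout (a "_data_files" list of {"filename": ...} entries) is my reading of the on-disk format, not documented API.

```python
import json
from pathlib import Path


def stale_arrow_files(split_dir):
    """Return the .arrow files in split_dir that state.json does not list.

    Hypothetical helper. Assumes state.json stores its shards as
    {"_data_files": [{"filename": "..."}, ...]}.
    """
    split_dir = Path(split_dir)
    state = json.loads((split_dir / "state.json").read_text())
    # Shard names the most recent save recorded for this split.
    listed = {entry["filename"] for entry in state.get("_data_files", [])}
    # Every .arrow file physically present in the split directory.
    on_disk = {path.name for path in split_dir.glob("*.arrow")}
    return sorted(on_disk - listed)
```

Running stale_arrow_files("my_sharded_datasetdict/train") after the two saves above should return a non-empty list if this explanation is right.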

Questions

  1. Is this behavior intentional? If so, could you clarify the difference between load_dataset and load_from_disk in the documentation?
  2. If this is not intentional, could this be considered a bug?
  3. What is the recommended way to handle cases where multiple datasets are saved to the same directory?
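
For reference, the workaround I am considering for question 3 is to clear the target directory before each save so that no files from an earlier save can remain. This is only a sketch: save_fresh is a hypothetical helper of mine, not part of the datasets API, and it permanently deletes whatever is already at the given path.

```python
import shutil
from pathlib import Path


def save_fresh(dataset, path, **save_kwargs):
    """Delete `path` if it exists, then call dataset.save_to_disk(path).

    Hypothetical helper. `dataset` is any object exposing save_to_disk,
    such as a Dataset or DatasetDict. Everything under `path` is removed.
    """
    target = Path(path)
    if target.exists():
        # Remove files left by a previous save so they cannot be picked up later.
        shutil.rmtree(target)
    dataset.save_to_disk(str(target), **save_kwargs)
```

With this, the second save in the examples above would become save_fresh(dataset_dict, "my_sharded_datasetdict", max_shard_size=10), where dataset_dict is the result of train_test_split. Saving each dataset to its own directory would avoid the problem without deleting anything.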

Thank you for your time and effort in maintaining this great library! I look forward to your feedback.
