I encountered an inconsistency in behavior between load_dataset and load_from_disk when loading sharded datasets. Here is a minimal example to reproduce the issue:
Code 1: Using load_dataset
    from datasets import Dataset, load_dataset

    # First save with max_shard_size=10
    Dataset.from_dict({"id": range(1000)}).train_test_split(test_size=0.1).save_to_disk("my_sharded_datasetdict", max_shard_size=10)

    # Second save with max_shard_size=10
    Dataset.from_dict({"id": range(500)}).train_test_split(test_size=0.1).save_to_disk("my_sharded_datasetdict", max_shard_size=10)

    # Load the DatasetDict
    loaded_datasetdict = load_dataset("my_sharded_datasetdict")
    print(loaded_datasetdict)
Output:
train has 1350 samples.
test has 150 samples.
Code 2: Using load_from_disk
    from datasets import Dataset, load_from_disk

    # First save with max_shard_size=10
    Dataset.from_dict({"id": range(1000)}).train_test_split(test_size=0.1).save_to_disk("my_sharded_datasetdict", max_shard_size=10)

    # Second save with max_shard_size=10
    Dataset.from_dict({"id": range(500)}).train_test_split(test_size=0.1).save_to_disk("my_sharded_datasetdict", max_shard_size=10)

    # Load the DatasetDict
    loaded_datasetdict = load_from_disk("my_sharded_datasetdict")
    print(loaded_datasetdict)
Output:
train has 450 samples.
test has 50 samples.
Expected Behavior
I expected both load_dataset and load_from_disk to load the same dataset, as they are pointing to the same directory. However, the results differ significantly:
load_dataset seems to merge all shards, resulting in a combined dataset.
load_from_disk only loads the last saved dataset, ignoring previous shards.
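To check whether leftover shard files explain the difference, a small helper like the one below could compare the `.arrow` files present in a split directory with the ones recorded in that split's `state.json`. This is only a sketch: the function name `find_unreferenced_shards` is hypothetical, and it assumes `state.json` contains a `_data_files` list of `{"filename": ...}` entries, which I have not verified for every `datasets` version.

```python
import json
from pathlib import Path


def find_unreferenced_shards(split_dir):
    """Return the .arrow files in split_dir that state.json does not list."""
    split_dir = Path(split_dir)
    state = json.loads((split_dir / "state.json").read_text())
    # Assumption: state.json stores a "_data_files" list of {"filename": ...} entries.
    referenced = {entry["filename"] for entry in state.get("_data_files", [])}
    # Every Arrow shard physically present in the split directory.
    on_disk = {path.name for path in split_dir.glob("*.arrow")}
    # Files on disk that the saved state does not know about.
    return sorted(on_disk - referenced)
```

If the explanation above is right, calling it on `my_sharded_datasetdict/train` after the second save should list the shards left behind by the first save.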
Questions
Is this behavior intentional? If so, could you clarify the difference between load_dataset and load_from_disk in the documentation?
If this is not intentional, could this be considered a bug?
What is the recommended way to handle cases where multiple datasets are saved to the same directory?
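Pending an answer, one workaround I can think of is to clear the target directory before every save, so that no shards from an earlier save remain. The sketch below uses only the standard library; `reset_output_dir` is a hypothetical helper name, not part of `datasets`, and I am not claiming this is the recommended approach.

```python
import shutil
from pathlib import Path


def reset_output_dir(path):
    """Delete path if it exists, so a later save starts from an empty location."""
    path = Path(path)
    if path.exists():
        # Removes every file from previous saves, including old shards.
        shutil.rmtree(path)
    return path
```

The returned path could then be passed on, for example `save_to_disk(str(reset_output_dir("my_sharded_datasetdict")), max_shard_size=10)`. Note that this deletes everything under the directory, so it is only suitable when the earlier save is meant to be replaced.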
Thank you for your time and effort in maintaining this great library! I look forward to your feedback.