Summary
Large dataset uploads crash or hang due to memory exhaustion. This appears to be the root cause of several long-standing issues.
Related Issues
This is the root cause of:
- #5990 - Pushing a large dataset on the hub consistently hangs (46 comments, open since 2023)
- #7400 - 504 Gateway Timeout when uploading large dataset to Hugging Face Hub
- #6686 - Question: Is there any way for uploading a large image dataset?
Context
Discovered while uploading the Aphasia Recovery Cohort (ARC) neuroimaging dataset (~270 GB, 902 sessions) to the Hugging Face Hub using the Nifti() feature.
Working implementation with the workaround: arc-aphasia-bids
Root Cause
In _push_parquet_shards_to_hub (arrow_dataset.py), the additions list accumulates every CommitOperationAdd, each holding the full Parquet bytes of its shard, in memory (simplified):

```python
additions = []
for shard in shards:
    parquet_content = shard.to_parquet_bytes()  # ~300 MB per shard
    shard_addition = CommitOperationAdd(path_or_fileobj=parquet_content)
    api.preupload_lfs_files(additions=[shard_addition])
    additions.append(shard_addition)  # THE BUG: bytes stay in memory forever
```

For a 902-shard dataset: 902 × ~300 MB ≈ 270 GB of RAM requested → OOM/hang.
The bytes are held until the final create_commit() call, preventing garbage collection.
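The retention can be seen in isolation with huggingface_hub alone. A minimal sketch, scaled down to 10 × 50 MB payloads instead of 902 × ~300 MB shards:

```python
from huggingface_hub import CommitOperationAdd

additions = []
for i in range(10):
    payload = bytes(50 * 1024 * 1024)  # stand-in for one Parquet shard (~300 MB in the real run)
    additions.append(
        CommitOperationAdd(path_in_repo=f"data/shard-{i:05d}.parquet", path_or_fileobj=payload)
    )
    # Each CommitOperationAdd keeps a reference to its payload, so no shard is
    # ever garbage-collected and resident memory grows by one shard per iteration.

print(f"Held in RAM: {sum(len(a.path_or_fileobj) for a in additions) / 1024**2:.0f} MiB")
```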
Reproduction
```python
from datasets import load_dataset

# Any large dataset with embedded files (Image, Audio, Nifti, etc.)
ds = load_dataset("imagefolder", data_dir="path/to/large/dataset")
ds.push_to_hub("repo-id", num_shards=500)  # watch memory grow until the crash
```
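To make the growth visible while the reproduction runs, the resident memory can be logged from a background thread. A sketch; psutil is an extra dependency and the 5-second interval is arbitrary:

```python
import threading
import time

import psutil  # extra dependency, used only for monitoring

def log_rss(interval_s: float = 5.0) -> None:
    """Print this process's resident set size every interval_s seconds."""
    proc = psutil.Process()
    while True:
        print(f"RSS: {proc.memory_info().rss / 1024**3:.1f} GiB")
        time.sleep(interval_s)

threading.Thread(target=log_rss, daemon=True).start()
ds.push_to_hub("repo-id", num_shards=500)  # per the root cause above, RSS keeps climbing as shards accumulate
```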
Workaround
Process one shard at a time, upload it via HfApi.upload_file(path=...), and delete it before the next iteration:
```python
from pathlib import Path

from huggingface_hub import HfApi
import pyarrow.parquet as pq

api = HfApi()
for i in range(num_shards):
    shard = ds.shard(num_shards=num_shards, index=i, contiguous=True)

    # Write to disk, not memory
    local_path = Path(f"train-{i:05d}-of-{num_shards:05d}.parquet")  # temporary local file
    shard.to_parquet(local_path)

    # Upload from the file path (streams from disk)
    api.upload_file(
        path_or_fileobj=str(local_path),
        path_in_repo=f"data/train-{i:05d}-of-{num_shards:05d}.parquet",
        repo_id=repo_id,
        repo_type="dataset",
    )

    # Clean up before the next iteration
    local_path.unlink()
    del shard
```

Memory usage stays constant (~1-2 GB) instead of growing linearly.
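For completeness, a sketch of the setup the loop above assumes; repo_id, num_shards, and ds are placeholders, and the repo name is hypothetical:

```python
from datasets import load_dataset
from huggingface_hub import HfApi

api = HfApi()
repo_id = "your-username/your-dataset"  # hypothetical target repo
num_shards = 902                        # chosen so each shard is roughly 300 MB
ds = load_dataset("imagefolder", data_dir="path/to/large/dataset", split="train")

# Create the dataset repo once, before the per-shard upload loop
api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
```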
Suggested Fix
After preupload_lfs_files succeeds for each shard, release the bytes (see the sketch after this list):
- Clear path_or_fileobj from the CommitOperationAdd after preupload
- Or write each shard to a temp file and pass the file path instead of bytes
- Or commit incrementally instead of batching all additions
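A minimal sketch of the first option, assuming (to be verified in huggingface_hub) that after preupload_lfs_files the final create_commit only needs each addition's path and precomputed upload metadata, not the raw bytes. iter_shard_parquet_bytes is a hypothetical stand-in for however datasets produces the shard payloads, and the repo name is a placeholder:

```python
from huggingface_hub import CommitOperationAdd, HfApi

api = HfApi()
repo_id = "your-username/your-dataset"  # hypothetical target repo

additions = []
for i, shard_bytes in enumerate(iter_shard_parquet_bytes()):  # hypothetical shard generator
    addition = CommitOperationAdd(
        path_in_repo=f"data/train-{i:05d}.parquet",
        path_or_fileobj=shard_bytes,
    )
    api.preupload_lfs_files(repo_id, additions=[addition], repo_type="dataset")
    # Release the ~300 MB of shard bytes as soon as the LFS upload has succeeded;
    # only the lightweight path and upload metadata are kept until the final commit.
    addition.path_or_fileobj = b""
    additions.append(addition)

api.create_commit(
    repo_id,
    operations=additions,
    commit_message="Upload dataset shards",
    repo_type="dataset",
)
```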
Environment
- datasets version: main branch (post-0.22.0)
- Platform: macOS 14.x ARM64
- Python: 3.13
- PyArrow: 18.1.0
- Dataset: 902 shards, ~270 GB total embedded NIfTI files