
push_to_hub OOM: _push_parquet_shards_to_hub accumulates all shard bytes in memory #7893

@The-Obstacle-Is-The-Way

Description

Summary

push_to_hub on large datasets crashes or hangs with out-of-memory errors because _push_parquet_shards_to_hub keeps every shard's Parquet bytes in memory until the final commit. This appears to be the root cause of several long-standing issues.

Related Issues

This is the root cause of:

Context

Discovered while uploading the Aphasia Recovery Cohort (ARC) neuroimaging dataset (~270 GB, 902 sessions) to the Hugging Face Hub using the Nifti() feature.

Working implementation with workaround: arc-aphasia-bids

Root Cause

In _push_parquet_shards_to_hub (arrow_dataset.py), the additions list accumulates one CommitOperationAdd per shard, each holding that shard's full Parquet bytes in memory (simplified):

additions = []
for shard in shards:
    buffer = BytesIO()
    shard.to_parquet(buffer)
    parquet_content = buffer.getvalue()  # ~300 MB per shard
    shard_addition = CommitOperationAdd(path_in_repo=shard_path_in_repo, path_or_fileobj=parquet_content)
    api.preupload_lfs_files(repo_id=repo_id, additions=[shard_addition], repo_type="dataset")
    additions.append(shard_addition)  # THE BUG: the bytes stay referenced here until create_commit()

For a 902-shard dataset that is roughly 902 × 300 MB ≈ 270 GB of RAM, so the process is OOM-killed or hangs once the machine starts swapping.

The bytes are held until the final create_commit() call, which prevents the already-uploaded shard payloads from being garbage-collected during the loop.
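
The retention can be illustrated with huggingface_hub alone, independent of datasets (a rough sketch with placeholder sizes and paths; no Hub calls involved):

from huggingface_hub import CommitOperationAdd

additions = []
for i in range(5):
    payload = bytes(300 * 1024 * 1024)  # ~300 MB stand-in for one shard's Parquet bytes
    additions.append(
        CommitOperationAdd(path_in_repo=f"data/shard-{i:05d}.parquet", path_or_fileobj=payload)
    )
    # `payload` is rebound on the next iteration, but each bytes object stays reachable
    # through its CommitOperationAdd in `additions`, so the process holds ~300 MB more per shard.

print(f"{sum(len(op.path_or_fileobj) for op in additions) / 1024**3:.1f} GiB still referenced")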

Reproduction

from datasets import load_dataset

# Any large dataset with embedded files (Image, Audio, Nifti, etc.)
ds = load_dataset("imagefolder", data_dir="path/to/large/dataset")
ds.push_to_hub("repo-id", num_shards=500)  # Watch memory grow until crash
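
To watch the growth without waiting for the crash, one option is to sample the process RSS from a background thread while push_to_hub runs (a rough sketch; psutil is an extra dependency and not part of the report):

import threading

import psutil

def log_rss(stop_event, interval_s=10.0):
    proc = psutil.Process()
    while not stop_event.is_set():
        print(f"RSS: {proc.memory_info().rss / 1024**3:.1f} GiB")
        stop_event.wait(interval_s)

stop = threading.Event()
threading.Thread(target=log_rss, args=(stop,), daemon=True).start()
try:
    ds.push_to_hub("repo-id", num_shards=500)  # RSS climbs by roughly one shard per iteration
finally:
    stop.set()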

Workaround

Process one shard at a time: write it to Parquet on disk, upload it with HfApi.upload_file(path_or_fileobj=...), then delete the local file before the next iteration:

from pathlib import Path

from huggingface_hub import HfApi

api = HfApi()
repo_id = "username/repo-id"        # target dataset repo (placeholder)
num_shards = 902
local_path = Path("shard.parquet")  # scratch file reused for every shard

for i in range(num_shards):
    shard = ds.shard(num_shards=num_shards, index=i, contiguous=True)

    # Write to disk, not memory
    shard.to_parquet(str(local_path))

    # Upload from a file path (huggingface_hub streams it from disk)
    api.upload_file(
        path_or_fileobj=str(local_path),
        path_in_repo=f"data/train-{i:05d}-of-{num_shards:05d}.parquet",
        repo_id=repo_id,
        repo_type="dataset",
    )

    # Clean up before the next iteration
    local_path.unlink()
    del shard

Memory usage stays roughly constant (~1-2 GB) instead of growing linearly with the number of shards.

Suggested Fix

After preupload_lfs_files succeeds for each shard, release the bytes:

  1. Clear path_or_fileobj on each CommitOperationAdd after its preupload succeeds (see the sketch after this list)
  2. Or write to temp file and pass file path instead of bytes
  3. Or commit incrementally instead of batching all additions
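
A minimal sketch of option 1, under the assumption that once preupload_lfs_files has uploaded a shard to LFS storage, create_commit only needs the operation's metadata (the hash and size captured when the CommitOperationAdd was constructed) and not the raw bytes; this would need to be verified against current huggingface_hub behavior before landing:

# Inside the _push_parquet_shards_to_hub loop (sketch, not a tested patch):
shard_addition = CommitOperationAdd(path_in_repo=shard_path_in_repo, path_or_fileobj=parquet_content)
api.preupload_lfs_files(repo_id=repo_id, additions=[shard_addition], repo_type="dataset")

# Option 1: drop the ~300 MB payload now that the LFS upload is done, keeping only
# the lightweight operation object for the final create_commit().
shard_addition.path_or_fileobj = b""
del parquet_content

additions.append(shard_addition)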

Environment

  • datasets version: main branch (post-0.22.0)
  • Platform: macOS 14.x ARM64
  • Python: 3.13
  • PyArrow: 18.1.0
  • Dataset: 902 shards, ~270 GB total embedded NIfTI files
