
push_to_hub OOM: _push_parquet_shards_to_hub accumulates all shard bytes in memory #7893

@The-Obstacle-Is-The-Way

Description

Summary

push_to_hub on large datasets crashes or hangs with out-of-memory errors because _push_parquet_shards_to_hub keeps every shard's Parquet bytes in memory until the final commit. This appears to be the root cause of several long-standing issues.

Related Issues

This is the root cause of:

Context

Discovered while uploading the Aphasia Recovery Cohort (ARC) neuroimaging dataset (~270 GB, 902 sessions) to the Hugging Face Hub using the Nifti() feature.

Working implementation with workaround: arc-aphasia-bids

Root Cause

In _push_parquet_shards_to_hub (arrow_dataset.py), the additions list accumulates one CommitOperationAdd per shard, each holding that shard's full Parquet bytes in memory (simplified):

additions = []
for shard in shards:
    buffer = BytesIO()
    shard.to_parquet(buffer)
    parquet_content = buffer.getvalue()  # ~300 MB per shard
    shard_addition = CommitOperationAdd(path_in_repo=shard_path_in_repo, path_or_fileobj=parquet_content)
    api.preupload_lfs_files(repo_id=repo_id, additions=[shard_addition], repo_type="dataset")
    additions.append(shard_addition)  # THE BUG: the bytes stay referenced here until create_commit()

For a 902-shard dataset that is roughly 902 × 300 MB ≈ 270 GB of RAM, so the process is OOM-killed or hangs once the machine starts swapping.

The bytes are held until the final create_commit() call, which prevents the already-uploaded shard payloads from being garbage-collected during the loop.
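
The retention can be illustrated with huggingface_hub alone, independent of datasets (a rough sketch with placeholder sizes and paths; no Hub calls involved):

from huggingface_hub import CommitOperationAdd

additions = []
for i in range(5):
    payload = bytes(300 * 1024 * 1024)  # ~300 MB stand-in for one shard's Parquet bytes
    additions.append(
        CommitOperationAdd(path_in_repo=f"data/shard-{i:05d}.parquet", path_or_fileobj=payload)
    )
    # `payload` is rebound on the next iteration, but each bytes object stays reachable
    # through its CommitOperationAdd in `additions`, so the process holds ~300 MB more per shard.

print(f"{sum(len(op.path_or_fileobj) for op in additions) / 1024**3:.1f} GiB still referenced")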

Reproduction

from datasets import load_dataset

# Any large dataset with embedded files (Image, Audio, Nifti, etc.)
ds = load_dataset("imagefolder", data_dir="path/to/large/dataset")
ds.push_to_hub("repo-id", num_shards=500)  # Watch memory grow until crash
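
To watch the growth without waiting for the crash, one option is to sample the process RSS from a background thread while push_to_hub runs (a rough sketch; psutil is an extra dependency and not part of the report):

import threading

import psutil

def log_rss(stop_event, interval_s=10.0):
    proc = psutil.Process()
    while not stop_event.is_set():
        print(f"RSS: {proc.memory_info().rss / 1024**3:.1f} GiB")
        stop_event.wait(interval_s)

stop = threading.Event()
threading.Thread(target=log_rss, args=(stop,), daemon=True).start()
try:
    ds.push_to_hub("repo-id", num_shards=500)  # RSS climbs by roughly one shard per iteration
finally:
    stop.set()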

Workaround

Process one shard at a time: write it to Parquet on disk, upload it with HfApi.upload_file(path_or_fileobj=...), then delete the local file before the next iteration:

from pathlib import Path

from huggingface_hub import HfApi

api = HfApi()
repo_id = "username/repo-id"        # target dataset repo (placeholder)
num_shards = 902
local_path = Path("shard.parquet")  # scratch file reused for every shard

for i in range(num_shards):
    shard = ds.shard(num_shards=num_shards, index=i, contiguous=True)

    # Write to disk, not memory
    shard.to_parquet(str(local_path))

    # Upload from a file path (huggingface_hub streams it from disk)
    api.upload_file(
        path_or_fileobj=str(local_path),
        path_in_repo=f"data/train-{i:05d}-of-{num_shards:05d}.parquet",
        repo_id=repo_id,
        repo_type="dataset",
    )

    # Clean up before the next iteration
    local_path.unlink()
    del shard

Memory usage stays roughly constant (~1-2 GB) instead of growing linearly with the number of shards.

Suggested Fix

After preupload_lfs_files succeeds for each shard, release the bytes:

  1. Clear path_or_fileobj on each CommitOperationAdd after its preupload succeeds (see the sketch after this list)
  2. Or write to temp file and pass file path instead of bytes
  3. Or commit incrementally instead of batching all additions
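
A minimal sketch of option 1, under the assumption that once preupload_lfs_files has uploaded a shard to LFS storage, create_commit only needs the operation's metadata (the hash and size captured when the CommitOperationAdd was constructed) and not the raw bytes; this would need to be verified against current huggingface_hub behavior before landing:

# Inside the _push_parquet_shards_to_hub loop (sketch, not a tested patch):
shard_addition = CommitOperationAdd(path_in_repo=shard_path_in_repo, path_or_fileobj=parquet_content)
api.preupload_lfs_files(repo_id=repo_id, additions=[shard_addition], repo_type="dataset")

# Option 1: drop the ~300 MB payload now that the LFS upload is done, keeping only
# the lightweight operation object for the final create_commit().
shard_addition.path_or_fileobj = b""
del parquet_content

additions.append(shard_addition)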

Environment

  • datasets version: main branch (post-0.22.0)
  • Platform: macOS 14.x ARM64
  • Python: 3.13
  • PyArrow: 18.1.0
  • Dataset: 902 shards, ~270 GB total embedded NIfTI files
