Conversation

@The-Obstacle-Is-The-Way

Summary

Fixes memory accumulation in `_push_parquet_shards_to_hub_single` that causes OOM when uploading large datasets with many shards.

Root cause: the current implementation stores all parquet shard bytes in memory via `BytesIO`, accumulating them in the `additions` list. For N shards of ~300MB each, this requires N × 300MB of RAM.

Fix: write each parquet shard to a temp file instead of `BytesIO` and pass the file path to `CommitOperationAdd`. Delete the temp file after `preupload_lfs_files` completes (LFS uploads only; regular uploads need the file until `create_commit`).
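The shape of the change, as a hedged sketch: the helper name, repo paths, and loop details below are illustrative, not the actual `_push_parquet_shards_to_hub_single` code. The point is that each shard lives on disk and only a *path* reaches `CommitOperationAdd`, so RAM stays bounded by roughly one shard at a time (temp-file cleanup is shown in the try...finally sketch further down).

```python
import os
import tempfile

from huggingface_hub import CommitOperationAdd, HfApi


def push_shards_via_temp_files(shards, repo_id: str, api: HfApi) -> list:
    """Write each shard to disk and hand a *path* (not bytes) to the Hub client."""
    additions = []
    for index, shard in enumerate(shards):
        fd, temp_path = tempfile.mkstemp(suffix=".parquet")
        os.close(fd)
        shard.to_parquet(temp_path)  # datasets.Dataset.to_parquet writes straight to disk

        addition = CommitOperationAdd(
            path_in_repo=f"data/train-{index:05d}.parquet",  # illustrative path
            path_or_fileobj=temp_path,  # a path: huggingface_hub streams it from disk
        )
        # Upload the LFS content now so the bytes never accumulate in RAM.
        api.preupload_lfs_files(repo_id, additions=[addition], repo_type="dataset")
        additions.append(addition)
    return additions
```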

Changes

  • Replace `BytesIO` with `tempfile.NamedTemporaryFile` in `_push_parquet_shards_to_hub_single`
  • Use the file path in `CommitOperationAdd.path_or_fileobj` instead of bytes
  • Delete the temp file after upload (LFS mode only; regular uploads keep the file for `create_commit`)
  • Add try...finally for safe cleanup even on errors
  • Remove the unused `BytesIO` import

Test plan

  • Added `tests/test_push_to_hub_memory.py` with 4 tests:
    • test_push_to_hub_uses_file_path_not_bytes_in_commit_operation
    • test_push_to_hub_cleans_up_temp_files_for_lfs_uploads
    • test_push_to_hub_keeps_temp_files_for_regular_uploads
    • test_push_to_hub_uploaded_size_still_calculated
  • All tests pass locally
  • `ruff check` passes

Context

This was discovered while uploading a 270GB neuroimaging dataset (ARC) with 902 shards. The process was killed by OOM after accumulating ~270GB in the `additions` list.

Fixes #7893
Related: #5990, #7400

When uploading large datasets with many shards, the previous implementation
held all parquet shard bytes in memory via the `additions` list. This caused
OOM for datasets with hundreds of shards (e.g., 902 shards × 300MB = 270GB).

Changes:
- Write parquet to temp file instead of BytesIO
- Pass file path to CommitOperationAdd (huggingface_hub streams from disk)
- Delete temp file after preupload_lfs_files completes
- Remove unused BytesIO import

Memory usage now stays constant (~1-2 shard sizes) instead of growing
linearly with the number of shards.

Fixes: #5

Wrap the fallible upload operations in try...finally to guarantee temp file
cleanup even if preupload_lfs_files() throws (network timeout, 500 error,
KeyboardInterrupt).

Without this, a flaky upload of 900 shards could leave hundreds of
~300MB temp files in /tmp, potentially filling the disk.
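
A hedged sketch of that cleanup logic, combined with the LFS-only deletion rule refined in the review notes below. Names are illustrative, not the PR's exact code; `_upload_mode` is the private attribute that huggingface_hub sets on each `CommitOperationAdd` during `preupload_lfs_files`.

```python
import os
import tempfile

from huggingface_hub import CommitOperationAdd, HfApi


def upload_shard_with_cleanup(shard, path_in_repo: str, repo_id: str, api: HfApi) -> CommitOperationAdd:
    """Upload one shard via a temp file, guaranteeing the file is never leaked."""
    fd, temp_path = tempfile.mkstemp(suffix=".parquet")
    os.close(fd)
    keep_for_commit = False
    try:
        shard.to_parquet(temp_path)
        addition = CommitOperationAdd(path_in_repo=path_in_repo, path_or_fileobj=temp_path)
        api.preupload_lfs_files(repo_id, additions=[addition], repo_type="dataset")
        # Regular (non-LFS) additions are read again by create_commit, so only
        # LFS additions (already uploaded to the Hub) can lose their local copy.
        keep_for_commit = getattr(addition, "_upload_mode", None) != "lfs"
    finally:
        # Runs on success *and* on network timeouts, 500s, or KeyboardInterrupt,
        # so a flaky run never strands ~300MB files in /tmp.
        if not keep_for_commit and os.path.exists(temp_path):
            os.unlink(temp_path)
    return addition
```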
Addresses CodeRabbit's CRITICAL review feedback:

1. Only delete temp file when _upload_mode == "lfs"
   - For LFS uploads, content is already on the Hub after preupload_lfs_files
   - For regular uploads, create_commit still needs to read from disk
   - Deleting temp files unconditionally broke non-LFS uploads

2. Replace XXXX placeholder with actual issue reference

3. Update tests to properly simulate huggingface_hub behavior:
   - Mock preupload_lfs_files to set _upload_mode attribute
   - Add test for regular (non-LFS) upload behavior
   - Mock upload_info to make size assertion meaningful
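
A sketch of that mocking approach, reusing the hypothetical `upload_shard_with_cleanup` helper from the sketch above (the real tests in `tests/test_push_to_hub_memory.py` exercise `push_to_hub` itself): the mocked `preupload_lfs_files` marks each operation the way the real client does, so the LFS cleanup branch can be asserted.

```python
import os
from unittest.mock import MagicMock

from datasets import Dataset


def fake_preupload_lfs_files(repo_id, additions, **kwargs):
    # Simulate what the real client records on each operation after uploading.
    for op in additions:
        op._upload_mode = "lfs"


def test_lfs_temp_file_is_removed_and_path_is_used():
    api = MagicMock()
    api.preupload_lfs_files.side_effect = fake_preupload_lfs_files

    shard = Dataset.from_dict({"x": list(range(10))})
    addition = upload_shard_with_cleanup(shard, "data/train-00000.parquet", "user/repo", api)

    assert isinstance(addition.path_or_fileobj, str)      # a file path, not BytesIO
    assert not os.path.exists(addition.path_or_fileobj)   # cleaned up after preupload
```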


Development

Successfully merging this pull request may close these issues.

push_to_hub OOM: _push_parquet_shards_to_hub accumulates all shard bytes in memory
