Conversation

@The-Obstacle-Is-The-Way

Summary

Fixes memory accumulation in `_push_parquet_shards_to_hub_single` that causes OOM when uploading large datasets with many shards.

Root cause: the current implementation stores all parquet shard bytes in memory via `BytesIO`, accumulating them in the `additions` list. For N shards of ~300MB each, this requires N × 300MB of RAM.

Fix: write each parquet shard to a temp file instead of `BytesIO` and pass the file path to `CommitOperationAdd`. Delete the temp file after `preupload_lfs_files` completes (LFS uploads only; regular uploads need the file until `create_commit`).
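The shape of the change, as a hedged sketch: the helper name, repo paths, and loop details below are illustrative, not the actual `_push_parquet_shards_to_hub_single` code. The point is that each shard lives on disk and only a *path* reaches `CommitOperationAdd`, so RAM stays bounded by roughly one shard at a time (temp-file cleanup is shown in the try...finally sketch further down).

```python
import os
import tempfile

from huggingface_hub import CommitOperationAdd, HfApi


def push_shards_via_temp_files(shards, repo_id: str, api: HfApi) -> list:
    """Write each shard to disk and hand a *path* (not bytes) to the Hub client."""
    additions = []
    for index, shard in enumerate(shards):
        fd, temp_path = tempfile.mkstemp(suffix=".parquet")
        os.close(fd)
        shard.to_parquet(temp_path)  # datasets.Dataset.to_parquet writes straight to disk

        addition = CommitOperationAdd(
            path_in_repo=f"data/train-{index:05d}.parquet",  # illustrative path
            path_or_fileobj=temp_path,  # a path: huggingface_hub streams it from disk
        )
        # Upload the LFS content now so the bytes never accumulate in RAM.
        api.preupload_lfs_files(repo_id, additions=[addition], repo_type="dataset")
        additions.append(addition)
    return additions
```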

Changes

  • Replace `BytesIO` with `tempfile.NamedTemporaryFile` in `_push_parquet_shards_to_hub_single`
  • Use the file path in `CommitOperationAdd.path_or_fileobj` instead of bytes
  • Delete the temp file after upload (LFS mode only; regular uploads keep the file for `create_commit`)
  • Add try...finally for safe cleanup even on errors
  • Remove the unused `BytesIO` import

Test plan

  • Added `tests/test_push_to_hub_memory.py` with 4 tests:
    • test_push_to_hub_uses_file_path_not_bytes_in_commit_operation
    • test_push_to_hub_cleans_up_temp_files_for_lfs_uploads
    • test_push_to_hub_keeps_temp_files_for_regular_uploads
    • test_push_to_hub_uploaded_size_still_calculated
  • All tests pass locally
  • `ruff check` passes

Context

This was discovered while uploading a 270GB neuroimaging dataset (ARC) with 902 shards. The process was killed by OOM after accumulating ~270GB in the `additions` list.

Fixes #7893
Related: #5990, #7400

When uploading large datasets with many shards, the previous implementation
held all parquet shard bytes in memory via the `additions` list. This caused
OOM for datasets with hundreds of shards (e.g., 902 shards × 300MB = 270GB).

Changes:
- Write parquet to temp file instead of BytesIO
- Pass file path to CommitOperationAdd (huggingface_hub streams from disk)
- Delete temp file after preupload_lfs_files completes
- Remove unused BytesIO import

Memory usage now stays constant (~1-2 shard sizes) instead of growing
linearly with the number of shards.

Fixes: #5

Wrap the fallible upload operations in try...finally to guarantee temp file
cleanup even if preupload_lfs_files() throws (network timeout, 500 error,
KeyboardInterrupt).

Without this, a flaky upload of 900 shards could leave hundreds of
~300MB temp files in /tmp, potentially filling the disk.
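
A hedged sketch of that cleanup logic, combined with the LFS-only deletion rule refined in the review notes below. Names are illustrative, not the PR's exact code; `_upload_mode` is the private attribute that huggingface_hub sets on each `CommitOperationAdd` during `preupload_lfs_files`.

```python
import os
import tempfile

from huggingface_hub import CommitOperationAdd, HfApi


def upload_shard_with_cleanup(shard, path_in_repo: str, repo_id: str, api: HfApi) -> CommitOperationAdd:
    """Upload one shard via a temp file, guaranteeing the file is never leaked."""
    fd, temp_path = tempfile.mkstemp(suffix=".parquet")
    os.close(fd)
    keep_for_commit = False
    try:
        shard.to_parquet(temp_path)
        addition = CommitOperationAdd(path_in_repo=path_in_repo, path_or_fileobj=temp_path)
        api.preupload_lfs_files(repo_id, additions=[addition], repo_type="dataset")
        # Regular (non-LFS) additions are read again by create_commit, so only
        # LFS additions (already uploaded to the Hub) can lose their local copy.
        keep_for_commit = getattr(addition, "_upload_mode", None) != "lfs"
    finally:
        # Runs on success *and* on network timeouts, 500s, or KeyboardInterrupt,
        # so a flaky run never strands ~300MB files in /tmp.
        if not keep_for_commit and os.path.exists(temp_path):
            os.unlink(temp_path)
    return addition
```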
Addresses CodeRabbit's CRITICAL review feedback:

1. Only delete temp file when _upload_mode == "lfs"
   - For LFS uploads, content is already on the Hub after preupload_lfs_files
   - For regular uploads, create_commit still needs to read from disk
   - Deleting temp files unconditionally broke non-LFS uploads

2. Replace XXXX placeholder with actual issue reference

3. Update tests to properly simulate huggingface_hub behavior:
   - Mock preupload_lfs_files to set _upload_mode attribute
   - Add test for regular (non-LFS) upload behavior
   - Mock upload_info to make size assertion meaningful
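
A sketch of that mocking approach, reusing the hypothetical `upload_shard_with_cleanup` helper from the sketch above (the real tests in `tests/test_push_to_hub_memory.py` exercise `push_to_hub` itself): the mocked `preupload_lfs_files` marks each operation the way the real client does, so the LFS cleanup branch can be asserted.

```python
import os
from unittest.mock import MagicMock

from datasets import Dataset


def fake_preupload_lfs_files(repo_id, additions, **kwargs):
    # Simulate what the real client records on each operation after uploading.
    for op in additions:
        op._upload_mode = "lfs"


def test_lfs_temp_file_is_removed_and_path_is_used():
    api = MagicMock()
    api.preupload_lfs_files.side_effect = fake_preupload_lfs_files

    shard = Dataset.from_dict({"x": list(range(10))})
    addition = upload_shard_with_cleanup(shard, "data/train-00000.parquet", "user/repo", api)

    assert isinstance(addition.path_or_fileobj, str)      # a file path, not BytesIO
    assert not os.path.exists(addition.path_or_fileobj)   # cleaned up after preupload
```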


Development

Successfully merging this pull request may close these issues.

push_to_hub OOM: _push_parquet_shards_to_hub accumulates all shard bytes in memory
