Add bulk download engine with concurrent workers #743
Closed
Foundation for the bulk operations framework. BaseWorker defines the interface that operation-specific workers (download, upload, etc.) must implement. WorkerResult and VerifyResult are the return types. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
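For readers without the diff, the interface described above might look roughly like the sketch below. The method names (estimate_size, execute, verify) come from the DownloadWorker commit in this PR. The dataclass fields and the EchoWorker subclass are hypothetical, added only to make the example run.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerResult:
    # Field names are illustrative assumptions, not taken from the PR.
    identifier: str
    success: bool
    bytes_transferred: int = 0
    error: Optional[str] = None


@dataclass
class VerifyResult:
    # Field names are illustrative assumptions, not taken from the PR.
    identifier: str
    complete: bool
    missing_files: int = 0


class BaseWorker(ABC):
    """Interface that operation-specific workers must implement."""

    @abstractmethod
    def estimate_size(self, identifier: str) -> Optional[int]:
        """Return the expected size in bytes, or None if unknown."""

    @abstractmethod
    def execute(self, identifier: str, destdir: str) -> WorkerResult:
        """Perform the operation for one item."""

    @abstractmethod
    def verify(self, identifier: str, destdir: str) -> VerifyResult:
        """Check that a previously processed item is complete."""


class EchoWorker(BaseWorker):
    """Hypothetical no-op worker used only to exercise the interface."""

    def estimate_size(self, identifier):
        return 0

    def execute(self, identifier, destdir):
        return WorkerResult(identifier=identifier, success=True)

    def verify(self, identifier, destdir):
        return VerifyResult(identifier=identifier, complete=True)
```

Because BaseWorker is abstract, instantiating it directly raises TypeError; only subclasses that implement all three methods can be created.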
Optional callback invoked with byte count after each chunk write. Defaults to None (no-op). Enables TUI progress bars in the bulk download engine without changing existing behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
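A minimal sketch of the callback pattern this commit describes, assuming a generic chunk-writing loop. The function name write_chunks and the parameter name progress_callback are hypothetical; the actual signature in the library may differ.

```python
import io
from typing import BinaryIO, Callable, Iterable, Optional


def write_chunks(
    chunks: Iterable[bytes],
    out: BinaryIO,
    progress_callback: Optional[Callable[[int], None]] = None,
) -> int:
    """Write chunks to out, reporting each chunk's byte count if asked."""
    total = 0
    for chunk in chunks:
        out.write(chunk)
        total += len(chunk)
        # Default of None keeps the existing behavior: no reporting at all.
        if progress_callback is not None:
            progress_callback(len(chunk))
    return total


# Usage: collect per-chunk byte counts, as a progress bar would.
seen = []
written = write_chunks([b"ab", b"cde"], io.BytesIO(), seen.append)
```

Passing the byte count of each chunk, rather than a running total, lets the caller decide how to aggregate progress across files.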
Thread-safe, append-only JSONL job log for tracking bulk download progress. Supports resume via should_skip() with rules: completed and permanently-skipped items are skipped on re-run, failed and retryable items are retried. Includes status() summary and context manager protocol. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
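The resume rules above can be sketched as follows. This is not the PR's implementation: the record fields (identifier, status) and the status strings are assumptions, chosen to match the categories the commit message names.

```python
import json
import threading
from collections import Counter


class JobLog:
    """Append-only JSONL log; the latest record per identifier wins."""

    # Assumed status names; the PR only describes the categories.
    SKIP_ON_RESUME = {"completed", "skipped"}

    def __init__(self, path):
        self._lock = threading.Lock()
        self._latest = {}
        try:
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    if line.strip():
                        rec = json.loads(line)
                        self._latest[rec["identifier"]] = rec["status"]
        except FileNotFoundError:
            pass  # first run: nothing to resume from
        self._fh = open(path, "a", encoding="utf-8")

    def record(self, identifier, status):
        line = json.dumps({"identifier": identifier, "status": status})
        with self._lock:
            self._fh.write(line + "\n")
            self._fh.flush()
            self._latest[identifier] = status

    def should_skip(self, identifier):
        # Completed and permanently skipped items are not retried;
        # failed, retryable, and unseen items are.
        with self._lock:
            return self._latest.get(identifier) in self.SKIP_ON_RESUME

    def status(self):
        with self._lock:
            return dict(Counter(self._latest.values()))

    def close(self):
        self._fh.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()
```

Holding the lock around both the write and the in-memory update keeps should_skip() consistent with what is on disk when several workers record results at once.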
Manages multiple destination directories with free space detection via os.statvfs, configurable margin, reservation tracking to prevent over-commit, and mark_full() for ENOSPC recovery. Thread-safe via Lock. Includes parse_size() helper for human-readable sizes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
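A rough sketch of the routing logic described in this commit, under stated assumptions: parse_size() here treats K/M/G/T as binary multiples, the pool picks the first directory with room, and the free_fn parameter is a hypothetical hook added so the logic can be exercised without touching a real filesystem. The default uses os.statvfs, which is only available on Unix-like systems.

```python
import os
import re
import threading

_EXPONENTS = {"": 0, "K": 1, "M": 2, "G": 3, "T": 4}


def parse_size(text):
    """Parse sizes such as '512', '1.5K', '10G', or '10GiB' into bytes."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?)(?:i?B)?\s*", text, re.IGNORECASE)
    if m is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(m.group(1)) * 1024 ** _EXPONENTS[m.group(2).upper()])


def statvfs_free(path):
    """Bytes available to unprivileged users on the filesystem at path."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


class DiskPool:
    def __init__(self, destdirs, margin=0, free_fn=statvfs_free):
        self._dirs = list(destdirs)
        self._margin = margin
        self._free_fn = free_fn
        self._reserved = {d: 0 for d in self._dirs}
        self._full = set()
        self._lock = threading.Lock()

    def reserve(self, size):
        """Return a directory with room for size bytes, or None."""
        with self._lock:
            for d in self._dirs:
                if d in self._full:
                    continue
                # Subtract outstanding reservations to avoid over-commit.
                available = self._free_fn(d) - self._margin - self._reserved[d]
                if size <= available:
                    self._reserved[d] += size
                    return d
            return None

    def release(self, destdir, size):
        with self._lock:
            self._reserved[destdir] = max(0, self._reserved[destdir] - size)

    def mark_full(self, destdir):
        """Exclude a directory after an ENOSPC error."""
        with self._lock:
            self._full.add(destdir)
```

Reservations matter because several workers may pass the free-space check before any of them has written a byte; tracking promised space closes that gap.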
Plain text UI for non-TTY environments. UIEvent dataclass carries event data (kind, identifier, progress, errors). PlainUI dispatches events to handlers with timestamped output format [HH:MM:SS] [idx/total] identifier: message. Includes print_summary() and _format_bytes() helper. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
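The output format quoted above can be illustrated with a small sketch. The helper names format_line and format_bytes are hypothetical stand-ins; the rounding and unit choices are assumptions, since the PR does not spell out what _format_bytes() produces.

```python
from datetime import datetime


def format_bytes(n):
    """Render a byte count with binary multiples (assumed convention)."""
    value = float(n)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if value < 1024 or unit == "TB":
            if unit == "B":
                return f"{value:.0f} {unit}"
            return f"{value:.1f} {unit}"
        value /= 1024


def format_line(idx, total, identifier, message, now=None):
    """Build a line in the form [HH:MM:SS] [idx/total] identifier: message."""
    if now is None:
        now = datetime.now()
    return f"[{now:%H:%M:%S}] [{idx}/{total}] {identifier}: {message}"
```

Accepting the timestamp as an optional argument keeps the formatter deterministic under test while still defaulting to the current time in normal use.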
Operation-specific worker for bulk downloads. Uses per-thread sessions for thread safety. Implements estimate_size, execute, and verify from BaseWorker interface. Handles dark items, exceptions, and accepts both str and Path for destdir. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
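One common way to get per-thread sessions is threading.local, sketched below. Whether the PR uses this exact mechanism is an assumption; the class name PerThreadSessions and the factory argument are hypothetical.

```python
import threading


class PerThreadSessions:
    """Lazily create one session object per thread."""

    def __init__(self, factory):
        self._factory = factory  # callable that builds a new session
        self._local = threading.local()

    def get(self):
        session = getattr(self._local, "session", None)
        if session is None:
            # First call on this thread: build and cache a session.
            session = self._factory()
            self._local.session = session
        return session
```

Each worker thread calls get() and reuses its own session for every item it processes, so no session object is shared across threads.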
Core engine that ties together worker, job log, disk pool, and UI. Supports concurrent execution via ThreadPoolExecutor with semaphore-gated submission, resume from joblog, job-level retry, graceful shutdown via request_stop(), and pause/resume flow control. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
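Semaphore-gated submission can be sketched as below: the submitting loop blocks until a worker slot is free, so pending items are never queued ahead of the workers and a stop request takes effect quickly. The function name run_all and its parameters are hypothetical; retry, resume, and pause handling are omitted.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_all(items, work, num_workers, stop_event=None):
    """Run work(item) for each item with at most num_workers in flight."""
    if stop_event is None:
        stop_event = threading.Event()
    slots = threading.Semaphore(num_workers)
    results = {}
    lock = threading.Lock()

    def run_one(item):
        try:
            outcome = work(item)
        except Exception as exc:  # record the failure instead of losing it
            outcome = exc
        finally:
            slots.release()  # free the slot for the next submission
        with lock:
            results[item] = outcome

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for item in items:
            slots.acquire()  # wait for a free worker slot
            if stop_event.is_set():
                slots.release()
                break  # graceful stop: submit nothing further
            pool.submit(run_one, item)
    # Leaving the with-block waits for in-flight items to finish.
    return results
```

Items already running when the stop event is set are allowed to complete; only new submissions are suppressed.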
16 tests covering DownloadWorker.verify() including complete items, missing files, no item directory, empty items, error handling, subdirectory files, and result field validation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Curses-based fallback TUI. Separated state tracking (TUIState, fully testable) from rendering (CursesTUI). Shows worker status with progress bars, item counts, and recent completions. Runs renderer in dedicated daemon thread. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds --workers, --joblog, --destdirs, --disk-margin, --no-disk-check, --status, and --verify flags to ia download. Dispatches to bulk engine when --workers > 1 with multi-item input (--search, --itemlist, stdin). Existing single-item download path unchanged. Includes commands.py with bulk_download, bulk_status, and bulk_verify entry points. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests 3 items with 2 workers using mocked HTTP, verifies joblog entries, on-disk files, resume semantics, and single-worker mode. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New bulk-download.rst with comprehensive guide covering CLI options, job log format, multi-disk routing, resume semantics, verification, and troubleshooting. Updated cli.rst with bulk download section and examples. Updated index.rst and parallel.rst with cross-references. Added HISTORY.rst entry for the feature. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rich-based TUI with styled output, progress bars, and color-coded status. Optional: pip install internetarchive[ui]. Reuses TUIState from curses TUI for consistent state tracking. Falls back to curses if rich is not installed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add --ui/--no-ui flags and _select_ui() to resolve the UI backend: rich → curses → plain text, based on TTY detection and user flags. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
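The fallback order can be sketched as a small pure function. This is an illustration, not the PR's _select_ui(): the parameter names are hypothetical, and treating a non-TTY as plain text even when --ui is passed is an assumption the commit message does not confirm.

```python
import importlib.util


def select_ui(ui_flag, is_tty, has_rich=None, has_curses=None):
    """Pick a UI backend name: 'rich', 'curses', or 'plain'.

    ui_flag is True for --ui, False for --no-ui, None when neither is given.
    """
    if has_rich is None:
        has_rich = importlib.util.find_spec("rich") is not None
    if has_curses is None:
        has_curses = importlib.util.find_spec("curses") is not None
    # Assumption: without a TTY there is nothing to draw on, so use plain text.
    if ui_flag is False or not is_tty:
        return "plain"
    if has_rich:
        return "rich"
    if has_curses:
        return "curses"
    return "plain"
```

Passing the availability checks in as arguments makes each branch of the fallback chain testable without installing or uninstalling rich.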
The TUI displays progress via its own event system. Without this, Item.download(verbose=True) prints per-file lines to stdout that scroll over the TUI display. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two critical bugs:

1. Every UIEvent had worker=0 hardcoded, so the TUI showed only one worker active while the rest appeared idle. Fix: map thread IDs to stable worker indices (0..N-1).
2. estimate_size() was called synchronously in the main submission loop, making an HTTP call per item that serialized the pipeline. Workers starved while waiting for the next item to be submitted. Fix: move estimate_size + disk routing into worker threads so items are submitted as fast as semaphore slots open.

Also removed a duplicate estimate_size() call in _run_one that doubled the HTTP requests per item.

UI improvements:

- Summary shows elapsed time and throughput (bytes/s)
- Worker rows show estimated item size and per-item elapsed time
- Recent items show elapsed time, file count, and size
- Items counter now shows total done (completed+failed+skipped)
- Increased refresh rate to 4/s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
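The first fix, mapping thread IDs to stable worker indices, might look like the sketch below. The class name WorkerIndexMap is hypothetical. It assumes pool threads live for the whole run, as ThreadPoolExecutor threads do, since thread identifiers can be reused after a thread exits.

```python
import threading


class WorkerIndexMap:
    """Assign each calling thread a stable index in first-seen order."""

    def __init__(self):
        self._lock = threading.Lock()
        self._indices = {}

    def index(self):
        tid = threading.get_ident()
        with self._lock:
            if tid not in self._indices:
                # The next unused index: 0, 1, 2, ... up to N-1.
                self._indices[tid] = len(self._indices)
            return self._indices[tid]
```

A worker would call index() when building each UI event, so the display can attribute progress to the correct row instead of always updating row 0.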
Owner
Author
Closing in favor of a fresh, incremental approach. The bulk download engine and TUI will be split into smaller, focused PRs. Code preserved locally on |
Summary
- Adds bulk download to ia download with concurrent workers, job logging, multi-disk routing, and resume support
- New flags: --workers, --joblog, --destdirs, --disk-margin, --no-disk-check, --status, --verify
- Optional Rich-based TUI (pip install internetarchive[ui])
- Bulk operations framework (BaseWorker ABC) designed for future bulk upload/metadata support

Test plan

- ruff check clean
- ia download --search '...' -w 4 --joblog test.jsonl
- Resume with --joblog
- --status and --verify flags
- Multi-disk routing with --destdirs

🤖 Generated with Claude Code