Add bulk download engine with concurrent workers#743

Closed
jjjake wants to merge 17 commits into master from bulk-download

Conversation

@jjjake
Owner

@jjjake jjjake commented Feb 7, 2026

Summary

  • Add built-in bulk download engine to ia download with concurrent workers, job logging, multi-disk routing, and resume support
  • New CLI flags: --workers, --joblog, --destdirs, --disk-margin, --no-disk-check, --status, --verify
  • Three UI backends: plain text (default), curses TUI (stdlib), Rich TUI (optional via pip install internetarchive[ui])
  • Operation-agnostic architecture (BaseWorker ABC) designed for future bulk upload/metadata support

Test plan

  • 214 new tests all passing (unit, integration, CLI)
  • No regressions in existing test suite
  • ruff check clean
  • Pre-commit hooks (black, mypy, codespell) all pass
  • Manual test with real ia download --search '...' -w 4 --joblog test.jsonl
  • Verify resume works: interrupt and re-run with same --joblog
  • Test --status and --verify flags
  • Test multi-disk with --destdirs

🤖 Generated with Claude Code

jjjake and others added 17 commits February 9, 2026 10:26
Foundation for the bulk operations framework. BaseWorker defines
the interface that operation-specific workers (download, upload,
etc.) must implement. WorkerResult and VerifyResult are the
return types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
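
For readers without the diff, a minimal sketch of what such an interface could look like follows. Only the names BaseWorker, WorkerResult, VerifyResult, estimate_size, execute, and verify come from this PR; every field name, every signature, and the EchoWorker class are hypothetical illustrations, not the PR's code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Tuple


@dataclass
class WorkerResult:
    # Field names are assumptions made for this sketch.
    identifier: str
    success: bool
    bytes_transferred: int = 0
    error: Optional[str] = None


@dataclass
class VerifyResult:
    # Field names are assumptions made for this sketch.
    identifier: str
    complete: bool
    missing_files: Tuple[str, ...] = ()


class BaseWorker(ABC):
    """Contract an operation-specific worker would implement."""

    @abstractmethod
    def estimate_size(self, identifier: str) -> Optional[int]:
        """Return the expected size in bytes, or None if unknown."""

    @abstractmethod
    def execute(self, identifier: str, destdir: Path) -> WorkerResult:
        """Perform the operation for one item."""

    @abstractmethod
    def verify(self, identifier: str, destdir: Path) -> VerifyResult:
        """Check that a previously processed item is complete."""


class EchoWorker(BaseWorker):
    """Toy worker (hypothetical) that does no I/O; it only shows the contract."""

    def estimate_size(self, identifier):
        return len(identifier)

    def execute(self, identifier, destdir):
        return WorkerResult(identifier, True, bytes_transferred=len(identifier))

    def verify(self, identifier, destdir):
        return VerifyResult(identifier, complete=True)
```

Because the three methods are abstract, instantiating BaseWorker directly raises TypeError, which is what forces a future upload or metadata worker to supply all of them.
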
Optional callback invoked with byte count after each chunk
write. Defaults to None (no-op). Enables TUI progress bars
in the bulk download engine without changing existing behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
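
The callback pattern described here can be sketched in isolation. The function name copy_chunks and its parameters are hypothetical; the sketch only illustrates "call an optional hook with the byte count after each chunk write" and is not the library's download code.

```python
import io


def copy_chunks(src, dst, chunk_size=8192, progress_callback=None):
    """Copy src to dst in chunks, reporting each chunk's size if asked."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
        # With the default of None nothing is called, so callers that
        # do not pass a callback see no change in behavior.
        if progress_callback is not None:
            progress_callback(len(chunk))
    return total


# Usage: collect per-chunk byte counts, as a progress bar would.
seen = []
copy_chunks(io.BytesIO(b"x" * 10), io.BytesIO(), chunk_size=4,
            progress_callback=seen.append)
```
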
Thread-safe, append-only JSONL job log for tracking bulk download
progress. Supports resume via should_skip() with rules: completed
and permanently-skipped items are skipped on re-run, failed and
retryable items are retried. Includes status() summary and context
manager protocol.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
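
A rough sketch of the resume rule follows, assuming one JSON object per line with hypothetical "identifier" and "status" keys and hypothetical status names. The actual record format is documented in the PR's bulk-download.rst and may differ.

```python
import json
import threading


class JobLog:
    """Append-only JSONL log; the latest record per identifier wins."""

    # Assumed status vocabulary: these two are skipped on re-run,
    # anything else (for example "failed") is retried.
    SKIP_STATUSES = frozenset({"completed", "skipped"})

    def __init__(self, path):
        self._lock = threading.Lock()
        self._last = {}
        try:
            # Replay an existing log so a re-run knows what is done.
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    if line.strip():
                        rec = json.loads(line)
                        self._last[rec["identifier"]] = rec["status"]
        except FileNotFoundError:
            pass
        self._fh = open(path, "a", encoding="utf-8")

    def record(self, identifier, status):
        with self._lock:
            entry = {"identifier": identifier, "status": status}
            self._fh.write(json.dumps(entry) + "\n")
            self._fh.flush()
            self._last[identifier] = status

    def should_skip(self, identifier):
        with self._lock:
            return self._last.get(identifier) in self.SKIP_STATUSES

    def status(self):
        """Return a count of identifiers per latest status."""
        with self._lock:
            counts = {}
            for value in self._last.values():
                counts[value] = counts.get(value, 0) + 1
            return counts

    def close(self):
        self._fh.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()
```

Appending rather than rewriting means an interrupted run leaves at worst one partial trailing line, and a single lock around each write keeps concurrent workers from interleaving records.
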
Manages multiple destination directories with free space detection
via os.statvfs, configurable margin, reservation tracking to prevent
over-commit, and mark_full() for ENOSPC recovery. Thread-safe via
Lock. Includes parse_size() helper for human-readable sizes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
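
To make the reservation idea concrete, here is a small sketch. The class name DiskPool, the reserve/release methods, the first-fit routing policy, and the choice of binary units in parse_size are all assumptions; only os.statvfs, the margin, reservation tracking, mark_full(), and parse_size() are named in this PR.

```python
import os
import re
import threading

_MULTIPLIERS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Parse sizes such as '512', '10G', or '1.5 MiB' (binary units assumed)."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?)(?:I?B)?\s*", text.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(match.group(1)) * _MULTIPLIERS[match.group(2)])


def statvfs_free(path):
    """Free bytes available to unprivileged users (Unix only)."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


class DiskPool:
    def __init__(self, destdirs, margin=0, free_bytes=statvfs_free):
        self._dirs = list(destdirs)
        self._margin = margin
        self._free_bytes = free_bytes  # injectable so it can be faked in tests
        self._reserved = {d: 0 for d in self._dirs}
        self._full = set()
        self._lock = threading.Lock()

    def reserve(self, size):
        """Return the first directory with room for size bytes, else None."""
        with self._lock:
            for d in self._dirs:
                if d in self._full:
                    continue
                # Subtract in-flight reservations so two workers cannot
                # both claim the same free space.
                usable = self._free_bytes(d) - self._margin - self._reserved[d]
                if usable >= size:
                    self._reserved[d] += size
                    return d
            return None

    def release(self, destdir, size):
        with self._lock:
            self._reserved[destdir] = max(0, self._reserved[destdir] - size)

    def mark_full(self, destdir):
        """Exclude a directory, for example after a write fails with ENOSPC."""
        with self._lock:
            self._full.add(destdir)
```
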
Plain text UI for non-TTY environments. UIEvent dataclass carries
event data (kind, identifier, progress, errors). PlainUI dispatches
events to handlers with timestamped output format
[HH:MM:SS] [idx/total] identifier: message. Includes print_summary()
and _format_bytes() helper.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
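
The output format above can be illustrated with a short sketch. The UIEvent fields shown here, the format_line function, and the unit labels in format_bytes are hypothetical; only the "[HH:MM:SS] [idx/total] identifier: message" shape is taken from the commit message.

```python
import time
from dataclasses import dataclass


@dataclass
class UIEvent:
    # Hypothetical subset of fields; the PR's dataclass carries more.
    kind: str
    identifier: str
    index: int = 0
    total: int = 0
    message: str = ""


def format_bytes(n):
    """Render a byte count with a binary unit suffix."""
    value = float(n)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if value < 1024:
            return f"{value:.1f} {unit}"
        value /= 1024
    return f"{value:.1f} TiB"


def format_line(event, now=None):
    """Build one timestamped line; now is seconds since the epoch."""
    stamp = time.strftime("%H:%M:%S", time.localtime(now))
    return f"[{stamp}] [{event.index}/{event.total}] {event.identifier}: {event.message}"


event = UIEvent("completed", "item-1", index=2, total=3,
                message=f"done ({format_bytes(1536)})")
print(format_line(event))
```
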
Operation-specific worker for bulk downloads. Uses per-thread
sessions for thread safety. Implements estimate_size, execute,
and verify from BaseWorker interface. Handles dark items, exceptions,
and accepts both str and Path for destdir.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
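
One common way to get per-thread sessions is threading.local, sketched below. Whether the PR uses this exact mechanism is not stated; PerThreadSessions and its factory argument are hypothetical names.

```python
import threading


class PerThreadSessions:
    """Lazily create one session object per thread."""

    def __init__(self, factory):
        self._factory = factory          # callable that builds a new session
        self._local = threading.local()  # attribute storage private to each thread

    def get(self):
        session = getattr(self._local, "session", None)
        if session is None:
            session = self._factory()
            self._local.session = session
        return session
```

Each worker thread calls get() and always receives its own object, so session state is never shared across threads and no locking is needed around requests.
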
Core engine that ties together worker, job log, disk pool, and UI.
Supports concurrent execution via ThreadPoolExecutor with semaphore-
gated submission, resume from joblog, job-level retry, graceful
shutdown via request_stop(), and pause/resume flow control.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
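
Semaphore-gated submission can be sketched as follows. The function run_bulk and its parameters are hypothetical and leave out the job log, disk pool, retry, and pause handling; the point is only how a semaphore keeps the submission loop from running ahead of the workers.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_bulk(identifiers, process, workers=4, stop_event=None):
    """Run process(identifier) for each item with at most `workers` in flight."""
    if stop_event is None:
        stop_event = threading.Event()
    slots = threading.Semaphore(workers)
    results = {}
    results_lock = threading.Lock()

    def task(identifier):
        try:
            outcome = process(identifier)
        except Exception as exc:
            # Keep the exception as the outcome so one bad item
            # does not abort the run.
            outcome = exc
        finally:
            slots.release()  # free a slot so the next item can be submitted
        with results_lock:
            results[identifier] = outcome

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for identifier in identifiers:
            slots.acquire()  # blocks while every worker is busy
            if stop_event.is_set():
                slots.release()
                break
            pool.submit(task, identifier)
    # Leaving the with block waits for in-flight tasks to finish.
    return results
```

Gating on the semaphore means the input iterable is consumed lazily, so a large search result is not queued all at once, and a stop request takes effect as soon as the next slot opens.
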
16 tests covering DownloadWorker.verify() including complete items,
missing files, no item directory, empty items, error handling,
subdirectory files, and result field validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Curses-based fallback TUI. Separated state tracking (TUIState,
fully testable) from rendering (CursesTUI). Shows worker status
with progress bars, item counts, and recent completions. Runs
renderer in dedicated daemon thread.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds --workers, --joblog, --destdirs, --disk-margin, --no-disk-check,
--status, and --verify flags to ia download. Dispatches to bulk engine
when --workers > 1 with multi-item input (--search, --itemlist, stdin).
Existing single-item download path unchanged. Includes commands.py with
bulk_download, bulk_status, and bulk_verify entry points.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests 3 items with 2 workers using mocked HTTP, verifies joblog
entries, on-disk files, resume semantics, and single-worker mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New bulk-download.rst with comprehensive guide covering CLI options,
job log format, multi-disk routing, resume semantics, verification,
and troubleshooting. Updated cli.rst with bulk download section and
examples. Updated index.rst and parallel.rst with cross-references.
Added HISTORY.rst entry for the feature.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rich-based TUI with styled output, progress bars, and color-coded
status. Optional: pip install internetarchive[ui]. Reuses TUIState
from curses TUI for consistent state tracking. Falls back to curses
if rich is not installed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add --ui/--no-ui flags and _select_ui() to resolve the
UI backend: rich → curses → plain text, based on TTY
detection and user flags.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
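
The fallback order can be sketched as a small pure function. This is not the PR's _select_ui(); the name select_ui, its parameters, and the way the --ui/--no-ui flags map onto want_tui are assumptions left to the caller.

```python
import importlib
import sys


def _importable(name):
    """Return True if the named module can actually be imported."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


def select_ui(want_tui, is_tty=None, has_module=_importable):
    """Return 'rich', 'curses', or 'plain'."""
    if is_tty is None:
        is_tty = sys.stdout.isatty()
    # A full-screen UI only makes sense on an interactive terminal.
    if not want_tui or not is_tty:
        return "plain"
    if has_module("rich"):
        return "rich"
    if has_module("curses"):
        return "curses"
    return "plain"
```

Passing is_tty and has_module as arguments keeps the decision testable without a real terminal or an installed rich package.
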
The TUI displays progress via its own event system. Without
this, Item.download(verbose=True) prints per-file lines to
stdout that scroll over the TUI display.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two critical bugs:

1. Every UIEvent had worker=0 hardcoded, so the TUI showed only
   one worker active while the rest appeared idle.
   Fix: map thread IDs to stable worker indices (0..N-1).

2. estimate_size() was called synchronously in the main submission
   loop, making an HTTP call per item that serialized the pipeline.
   Workers starved while waiting for the next item to be submitted.
   Fix: move estimate_size + disk routing into worker threads so
   items are submitted as fast as semaphore slots open.

Also removed a duplicate estimate_size() call in _run_one that
doubled the HTTP requests per item.

UI improvements:
- Summary shows elapsed time and throughput (bytes/s)
- Worker rows show estimated item size and per-item elapsed time
- Recent items show elapsed time, file count, and size
- Items counter now shows total done (completed+failed+skipped)
- Increased refresh rate to 4/s

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
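
The first fix, mapping thread IDs to stable worker indices, can be sketched like this. WorkerIndexMap is a hypothetical name, and the sketch assumes threads come from a fixed-size pool that lives for the whole run, so that thread IDs are not recycled.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class WorkerIndexMap:
    """Assign each calling thread a stable index in first-seen order."""

    def __init__(self):
        self._lock = threading.Lock()
        self._indices = {}

    def current(self):
        ident = threading.get_ident()
        with self._lock:
            if ident not in self._indices:
                # The next unused index; with N pool threads this
                # yields the range 0..N-1.
                self._indices[ident] = len(self._indices)
            return self._indices[ident]


# Usage: every task reports which worker row it belongs to.
index_map = WorkerIndexMap()
with ThreadPoolExecutor(max_workers=3) as pool:
    rows = set(pool.map(lambda _: index_map.current(), range(20)))
```
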
@jjjake
Owner Author

jjjake commented Feb 10, 2026

Closing in favor of a fresh, incremental approach. The bulk download engine and TUI will be split into smaller, focused PRs. Code preserved locally on bulk-download-archive branch.

@jjjake jjjake closed this Feb 10, 2026
@jjjake jjjake deleted the bulk-download branch February 10, 2026 01:56