chore: progress bar and scheduler polish follow-ups from PR #456 (#462)

@andreatgretel

Follow-up items from #456 review (Nabin, Greptile, Codex). All non-blocking polish.

Progress bar / reporting

  • ProgressSnapshot dataclass - Replace the 8-tuple returned by ProgressTracker.get_snapshot() with a named frozen dataclass. Callers currently unpack it positionally and discard the fields they don't need into _-prefixed throwaway variables.
  • Gate StickyProgressBar on reporter existence - async_scheduler.py creates a StickyProgressBar even when there are no CELL_BY_CELL columns (no reporter). Skip creation when reporter is None.
  • Type _make_wrapper properly - _wrapped_handlers in StickyProgressBar is typed as list[tuple[StreamHandler, object]]. Use Callable[[logging.LogRecord], None] for the emit reference.
  • Cache terminal size in _redraw - shutil.get_terminal_size() is called on every bar update and typically ends in a system call. Cache the result with a short TTL so that high-throughput runs don't query the terminal on every redraw.
  • _compute_stats_width rate overflow - The sample string uses 9999.9 rec/s, which could be exceeded at very high throughput. Low priority.
  • Race in StickyProgressBar.__exit__ - After _clear_bars() releases the lock, _active is still True and handlers are still wrapped. A concurrent log emit can re-_redraw() bars that were just cleared, leaving ghost lines on the terminal. Fix: set _active = False inside the lock and add an _active guard in _redraw() so threads that sneak past the lock after teardown don't redraw.
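
A minimal sketch of the ProgressSnapshot item above. The issue only says the tuple has eight elements, so every field name below is a hypothetical placeholder, not the real layout of ProgressTracker.get_snapshot().

```python
from dataclasses import FrozenInstanceError, astuple, dataclass


@dataclass(frozen=True)
class ProgressSnapshot:
    # All eight field names are assumptions made for illustration only.
    completed: int
    total: int
    succeeded: int
    failed: int
    skipped: int
    in_flight: int
    elapsed_s: float
    rate_per_s: float


snap = ProgressSnapshot(
    completed=40, total=100, succeeded=38, failed=1,
    skipped=1, in_flight=4, elapsed_s=8.0, rate_per_s=5.0,
)

# Callers read the fields they need by name instead of unpacking
# eight positions and discarding most of them.
percent = 100.0 * snap.completed / snap.total

# frozen=True makes the snapshot immutable after construction.
try:
    snap.completed = 41
    is_frozen = False
except FrozenInstanceError:
    is_frozen = True
```

Named fields also mean that adding a ninth value later does not break existing call sites, which positional unpacking would.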

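The __exit__ race fix could look roughly like the sketch below. StickyBarSketch is a stand-in class written for this example, and the counters replace real terminal output; only the _active flag, the lock, and the guard in _redraw() come from the item above.

```python
import threading


class StickyBarSketch:
    """Hypothetical stand-in for StickyProgressBar; no terminal I/O."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = False
        self.redraws = 0  # counts redraws instead of drawing
        self.clears = 0   # counts clears instead of erasing lines

    def __enter__(self):
        with self._lock:
            self._active = True
        return self

    def _redraw(self):
        with self._lock:
            # Guard: a thread that acquires the lock after teardown
            # must not bring the bars back.
            if not self._active:
                return
            self.redraws += 1

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self.clears += 1
            # Flip the flag while still holding the lock, so no redraw
            # can slip in between clearing and deactivation.
            self._active = False
        # Handler unwrapping would follow here, outside the lock.
        return False


bar = StickyBarSketch()
with bar:
    bar._redraw()  # counted: the bar is active
bar._redraw()      # ignored: simulates a late log emit after teardown
```
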
Scheduler accounting

  • Double-counted skips on non-retryable seed failure (low likelihood) - When a seed task fails non-retryably, _execute_task_inner_impl calls _drop_row_group which records skips for all CELL_BY_CELL columns via _record_skipped_tasks_for_row. Then _run_seeds_complete_check fires (because is_column_complete_for_rg counts dropped rows as done) and calls _record_skipped_tasks_for_row again for the same rows - the is_complete guard only checks _completed, not _dropped, so skips are double-counted. Fix: snapshot dropped rows before on_seeds_complete and only record skips for newly-dropped rows. Unlikely in practice - seed columns are typically samplers or simple from_scratch generators that rarely hit LLM APIs. Non-retryable errors (validation, parsing, auth) are config bugs caught during development, not runtime transients. The common failure modes (rate limits, timeouts, 500s, connection errors) are all classified as retryable.
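
The proposed fix can be sketched as a set difference. Everything here is a simplified model: record_skips, drop_row_group, and the column names are hypothetical, and the extra row dropped during the completion step is an assumption used to show why a snapshot is needed.

```python
from collections import Counter

skip_counts = Counter()      # (row, column) -> number of recorded skips
dropped = set()              # stands in for the scheduler's dropped-row set
columns = ["col_a", "col_b"]  # hypothetical CELL_BY_CELL columns


def record_skips(rows):
    for row in rows:
        for col in columns:
            skip_counts[(row, col)] += 1


def drop_row_group(rows):
    # First recording site: the non-retryable seed failure path.
    dropped.update(rows)
    record_skips(rows)


drop_row_group({0, 1})

# Snapshot before the seeds-complete step runs.
already_dropped = set(dropped)

# Assumption: the completion step may drop further rows.
dropped.add(2)

# Second recording site: only rows dropped since the snapshot.
newly_dropped = dropped - already_dropped
record_skips(newly_dropped)
```

Recording skips for all of `dropped` at the second site would count rows 0 and 1 twice, which is the bug described above.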

Scheduler lifecycle

  • Straggler logs after early shutdown - When _early_shutdown is triggered, _main_dispatch_loop salvages deferred tasks and checkpoints, then breaks. But workers that were already in-flight before the flag was set are never cancelled or awaited - _cancel_workers() is only called in the CancelledError path, not the normal early-shutdown exit. Those orphaned worker coroutines continue running (finishing HTTP requests, retrying, etc.) and their log calls trickle in after log_final() / progress bar teardown. Fix: after _main_dispatch_loop returns, call _cancel_workers() (or at minimum await remaining workers) before log_final() so no stragglers outlive the scheduler.
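
A rough illustration of the ordering the fix asks for, using plain asyncio tasks. The worker and run_scheduler functions are invented for this sketch; the real _cancel_workers() may differ, but the idea is to cancel and await in-flight tasks before the final log line.

```python
import asyncio


async def worker(name, log):
    try:
        await asyncio.sleep(10)  # stands in for a slow HTTP request
        log.append(f"{name} finished")
    except asyncio.CancelledError:
        log.append(f"{name} cancelled")
        raise


async def run_scheduler():
    log = []
    workers = [asyncio.create_task(worker(f"w{i}", log)) for i in range(3)]
    await asyncio.sleep(0)  # let the workers start and block

    # ... the dispatch loop breaks here because of early shutdown ...

    for task in workers:
        task.cancel()
    # Await the cancelled tasks so none of them can log after this point.
    await asyncio.gather(*workers, return_exceptions=True)

    log.append("log_final")  # stands in for log_final() and bar teardown
    return log


log = asyncio.run(run_scheduler())
```

Without the cancel-and-gather step, the three workers would still be pending when "log_final" is appended, which is the straggler situation described above.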

Slow request diagnostics

  • Warn on slow HTTP requests - Add periodic warnings in HttpModelClient._apost when a request has been pending longer than a threshold (e.g. 30s). Useful for diagnosing streaming responses that trickle data without timing out.
  • Scheduler stall warning - Add a timeout on _wake_event.wait() in the main dispatch loop that logs a warning with in-flight/active/deferred counts when no progress is made for 30s.
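
One way the stall warning might be wired up is asyncio.wait_for around the event wait. The helper name, the get_counts callback, and the short timeout in the demo are assumptions for illustration; the item above suggests 30s in practice.

```python
import asyncio
import logging

logger = logging.getLogger("scheduler_sketch")


async def wait_for_wake(wake_event, stall_timeout_s, get_counts):
    """Wait for wake_event, warning each time the timeout elapses first."""
    stalls = 0
    while True:
        try:
            await asyncio.wait_for(wake_event.wait(), timeout=stall_timeout_s)
        except asyncio.TimeoutError:
            stalls += 1
            counts = get_counts()
            logger.warning(
                "no progress for %.2fs (in_flight=%d active=%d deferred=%d)",
                stalls * stall_timeout_s,
                counts["in_flight"], counts["active"], counts["deferred"],
            )
            continue
        wake_event.clear()
        return stalls


async def demo():
    wake_event = asyncio.Event()
    loop = asyncio.get_running_loop()
    # Wake the loop after 50 ms, later than the 20 ms stall timeout.
    loop.call_later(0.05, wake_event.set)
    counts = {"in_flight": 2, "active": 1, "deferred": 0}
    return await wait_for_wake(wake_event, 0.02, lambda: counts)


stalls = asyncio.run(demo())
```

The loop keeps waiting after each warning, so a stall is reported but never turned into a failure.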

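For the slow HTTP request warning, a possible shape is a wrapper that polls the pending request with asyncio.wait, which returns on timeout without cancelling the task. warn_while_pending and fake_request are hypothetical names; nothing here reflects the actual HttpModelClient._apost code.

```python
import asyncio
import logging
import time

logger = logging.getLogger("http_sketch")


class ListHandler(logging.Handler):
    """Collects records so the demo can inspect the warnings."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


async def warn_while_pending(coro, label, threshold_s):
    task = asyncio.ensure_future(coro)
    started = time.monotonic()
    try:
        while True:
            # asyncio.wait leaves the task running when the timeout expires.
            done, _ = await asyncio.wait({task}, timeout=threshold_s)
            if done:
                return task.result()
            logger.warning("%s still pending after %.2fs",
                           label, time.monotonic() - started)
    except asyncio.CancelledError:
        task.cancel()  # do not leave the request running if we are cancelled
        raise


async def fake_request():
    await asyncio.sleep(0.05)  # stands in for a slow, trickling response
    return "ok"


handler = ListHandler()
logger.addHandler(handler)
result = asyncio.run(warn_while_pending(fake_request(), "POST /hypothetical", 0.02))
```
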