Follow-up items from #456 review (Nabin, Greptile, Codex). All non-blocking polish.
Progress bar / reporting
- `ProgressSnapshot` dataclass - Replace the 8-tuple returned by `ProgressTracker.get_snapshot()` with a named frozen dataclass. Callers currently unpack it positionally, with `_`-prefixed throwaways.
- Gate `StickyProgressBar` on reporter existence - `async_scheduler.py` creates a `StickyProgressBar` even when there are no CELL_BY_CELL columns (no reporter). Skip creation when the reporter is `None`.
- Type `_make_wrapper` properly - `_wrapped_handlers` in `StickyProgressBar` is typed as `list[tuple[StreamHandler, object]]`. Use `Callable[[logging.LogRecord], None]` for the emit reference.
- Cache terminal size in `_redraw` - `shutil.get_terminal_size()` is a syscall called on every bar update. Cache it with a short TTL so high-throughput runs do not pay for it on every update.
- `_compute_stats_width` rate overflow - The sample string uses `9999.9 rec/s`, which could be exceeded at very high throughput. Low priority.
- Race in `StickyProgressBar.__exit__` - After `_clear_bars()` releases the lock, `_active` is still `True` and handlers are still wrapped. A concurrent log emit can re-`_redraw()` bars that were just cleared, leaving ghost lines on the terminal. Fix: set `_active = False` inside the lock and add an `_active` guard in `_redraw()` so threads that sneak past the lock after teardown don't redraw.
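The race fix in the last item can be sketched roughly as follows. This is a minimal illustration, not the actual `StickyProgressBar`: the class `BarSketch`, the `drawn` list (standing in for terminal output), and `_clear_bars_locked` are hypothetical names introduced here. The idea is to flip `_active` while the lock is still held and to check it at the top of `_redraw()`.

```python
import threading


class BarSketch:
    """Hypothetical stand-in for StickyProgressBar; only teardown is modeled."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active = False
        self.drawn: list[str] = []  # stands in for bar lines on the terminal

    def __enter__(self) -> "BarSketch":
        with self._lock:
            self._active = True
        return self

    def __exit__(self, *exc_info: object) -> None:
        with self._lock:
            # Flip the flag while still holding the lock, so no redraw can
            # run between clearing the bars and marking the bar inactive.
            self._active = False
            self._clear_bars_locked()

    def _clear_bars_locked(self) -> None:
        # Caller must hold self._lock.
        self.drawn.clear()

    def _redraw(self) -> None:
        with self._lock:
            # Guard: a log emit that arrives after teardown must not redraw.
            if not self._active:
                return
            self.drawn.append("bar")
```

With this shape, a wrapped handler that calls `_redraw()` after `__exit__` has run becomes a no-op, so it cannot leave ghost lines even if the handlers are unwrapped slightly later.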
Scheduler accounting
- Double-counted skips on non-retryable seed failure (low likelihood) - When a seed task fails non-retryably, `_execute_task_inner_impl` calls `_drop_row_group`, which records skips for all CELL_BY_CELL columns via `_record_skipped_tasks_for_row`. Then `_run_seeds_complete_check` fires (because `is_column_complete_for_rg` counts dropped rows as done) and calls `_record_skipped_tasks_for_row` again for the same rows. The `is_complete` guard only checks `_completed`, not `_dropped`, so skips are double-counted. Fix: snapshot dropped rows before `on_seeds_complete` and only record skips for newly-dropped rows. Unlikely in practice: seed columns are typically samplers or simple from_scratch generators that rarely hit LLM APIs. Non-retryable errors (validation, parsing, auth) are config bugs caught during development, not runtime transients. The common failure modes (rate limits, timeouts, 500s, connection errors) are all classified as retryable.
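The proposed fix amounts to a set difference around the callback. The sketch below is a simplified model under stated assumptions: `SkipAccounting` and its methods are hypothetical, and it assumes that `on_seeds_complete` may mark rows as dropped without recording skips itself. The real scheduler's bookkeeping is more involved.

```python
class SkipAccounting:
    """Hypothetical model of skip bookkeeping for a single row group."""

    def __init__(self, rows, cell_columns):
        self.rows = set(rows)
        self.cell_columns = list(cell_columns)  # CELL_BY_CELL columns
        self.dropped = set()
        self.skipped = 0

    def record_skips(self, rows):
        # One skipped task per (row, CELL_BY_CELL column) pair.
        self.skipped += len(rows) * len(self.cell_columns)

    def mark_dropped(self, rows):
        # Marks rows as dropped WITHOUT recording skips (assumed callback behavior).
        self.dropped |= set(rows)

    def drop_row_group(self):
        # Non-retryable seed failure: drop every remaining row and record skips now.
        newly = self.rows - self.dropped
        self.dropped |= newly
        self.record_skips(newly)

    def seeds_complete_check(self, on_seeds_complete):
        before = set(self.dropped)       # snapshot taken before the callback
        on_seeds_complete(self)
        newly = self.dropped - before    # only rows dropped by the callback
        self.record_skips(newly)
```

Rows already dropped by `drop_row_group` appear in the snapshot, so the later check contributes nothing for them, while rows dropped inside the callback are still counted exactly once.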
Scheduler lifecycle
- Straggler logs after early shutdown - When `_early_shutdown` is triggered, `_main_dispatch_loop` salvages deferred tasks and checkpoints, then breaks. But workers that were already in flight before the flag was set are never cancelled or awaited: `_cancel_workers()` is only called in the `CancelledError` path, not the normal early-shutdown exit. Those orphaned worker coroutines continue running (finishing HTTP requests, retrying, etc.) and their log calls trickle in after `log_final()` / progress bar teardown. Fix: after `_main_dispatch_loop` returns, call `_cancel_workers()` (or at minimum await remaining workers) before `log_final()` so no stragglers outlive the scheduler.
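A rough sketch of the ordering the fix asks for, using plain `asyncio`. The names `worker`, `cancel_workers`, and `run_scheduler` are illustrative and are not the scheduler's real functions. The `events` list stands in for log output so the ordering is visible.

```python
import asyncio


async def worker(name: str, events: list) -> None:
    try:
        await asyncio.sleep(3600)  # stands in for a slow HTTP request or retry loop
        events.append(f"{name} finished")
    except asyncio.CancelledError:
        events.append(f"{name} cancelled")
        raise  # re-raise so the task is actually marked cancelled


async def cancel_workers(tasks: set) -> None:
    for task in tasks:
        task.cancel()
    # return_exceptions=True collects each CancelledError instead of raising it
    # here, and gather() does not return until every worker has unwound.
    await asyncio.gather(*tasks, return_exceptions=True)


async def run_scheduler() -> list:
    events: list = []
    tasks = {asyncio.create_task(worker(f"w{i}", events)) for i in range(3)}
    await asyncio.sleep(0)  # let the workers start and become in-flight

    # Early shutdown: the dispatch loop has exited while workers are still running.
    await cancel_workers(tasks)

    # Only now emit the final log, so nothing a worker logs can arrive after it.
    events.append("log_final")
    return events
```

Because `cancel_workers` awaits the cancelled tasks rather than only requesting cancellation, `log_final` is guaranteed to be the last entry.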
Slow request diagnostics
- Warn on slow HTTP requests - Add periodic warnings in `HttpModelClient._apost` when a request has been pending longer than a threshold (e.g. 30s). Useful for diagnosing streaming responses that trickle data without timing out.
- Scheduler stall warning - Add a timeout on `_wake_event.wait()` in the main dispatch loop that logs a warning with in-flight/active/deferred counts when no progress is made for 30s.
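The stall warning could look something like the sketch below, which wraps an `asyncio.Event` wait in `asyncio.wait_for`. The function name `wait_for_wake`, the logger name, and the `counts` dict are assumptions made for illustration. The real dispatch loop would supply its own counters.

```python
import asyncio
import logging

logger = logging.getLogger("scheduler_sketch")  # hypothetical logger name


async def wait_for_wake(wake_event: asyncio.Event, stall_timeout: float, counts: dict) -> int:
    """Wait for wake_event, warning each time stall_timeout passes without a wake.

    Returns the number of stall warnings emitted before the event was set.
    """
    warnings = 0
    while True:
        try:
            await asyncio.wait_for(wake_event.wait(), timeout=stall_timeout)
        except asyncio.TimeoutError:
            warnings += 1
            logger.warning(
                "No scheduler progress for %.1fs (in_flight=%d active=%d deferred=%d)",
                warnings * stall_timeout,
                counts["in_flight"],
                counts["active"],
                counts["deferred"],
            )
            continue  # keep waiting; the warning is diagnostic only
        wake_event.clear()
        return warnings
```

In the dispatch loop this would replace a bare `await _wake_event.wait()`, with `stall_timeout=30.0`. The timeout only triggers a log line and never aborts the wait, so scheduling behavior is unchanged.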