v0.8.66: Remove engine/TUI channel backpressure from sub-agent status storms #3802

Description

@Hmbown

Problem

High sub-agent fanout can put pressure on the bounded engine event channel and bounded engine op channel. Because many paths use .send().await, the engine or TUI event loop can stall when the receiver is not draining fast enough.

Parent: #3800

Verified evidence

  • crates/tui/src/core/engine.rs: tx_op/rx_op is mpsc::channel(32) and tx_event/rx_event is mpsc::channel(256).
  • EngineHandle::send awaits self.tx_op.send(op).
  • crates/tui/src/tui/ui.rs: after an event drain batch, Op::ListSubAgents is sent via engine_handle.send(...).await.
  • Many engine status/tool events use tx_event.send(...).await.
  • Sub-agent completion UI events use try_send, which avoids blocking but can drop the UI completion event under event-channel pressure.
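The drop described in the last bullet can be reproduced in isolation. The following is a minimal sketch, not the project's code: it uses std's synchronous bounded channel as a stand-in for the async channels above, and the `Event`, `SendOutcome`, `try_emit`, and `simulate_storm` names are hypothetical. It fills a bounded channel with status events and then attempts a nonblocking completion send.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

// Hypothetical event type; the real engine event enum is not shown in the issue.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Status(u32),
    AgentComplete(u32),
}

#[derive(Debug, PartialEq)]
pub enum SendOutcome {
    Delivered,
    DroppedFull,
    ReceiverGone,
}

/// Nonblocking send that reports why an event was not delivered.
pub fn try_emit(tx: &SyncSender<Event>, ev: Event) -> SendOutcome {
    match tx.try_send(ev) {
        Ok(()) => SendOutcome::Delivered,
        Err(TrySendError::Full(_)) => SendOutcome::DroppedFull,
        Err(TrySendError::Disconnected(_)) => SendOutcome::ReceiverGone,
    }
}

/// Fill a bounded channel with status events while nothing drains it,
/// then attempt to send a completion event without blocking.
pub fn simulate_storm(capacity: usize, agent: u32) -> (SendOutcome, Vec<Event>) {
    let (tx, rx): (SyncSender<Event>, Receiver<Event>) = sync_channel(capacity);
    for i in 0..capacity as u32 {
        let _ = try_emit(&tx, Event::Status(i));
    }
    let outcome = try_emit(&tx, Event::AgentComplete(agent));
    drop(tx); // close the channel so the drain below terminates
    let received: Vec<Event> = rx.iter().collect();
    (outcome, received)
}
```

With a capacity of 4, `simulate_storm(4, 7)` reports `SendOutcome::DroppedFull` and the receiver only ever sees the four status events. A blocking send in the same position would instead wait for the receiver, which is the stall described under Problem.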

Critical analysis

Dropping Event::AgentComplete via try_send does not necessarily lose parent-turn completion, because parent completion is signaled through a separate path. The bug is still serious: the UI/status stream can fail to converge under pressure, and the engine and TUI can backpressure each other through awaited sends.

Desired behavior

Sub-agent status storms should be coalesced or degraded without blocking input polling or parent-turn progress. Critical correctness events must be durable or recoverable; noncritical status refreshes can be lossy if the next state snapshot repairs them.
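One way to read "lossy if the next state snapshot repairs them" is a latest-value slot: the producer overwrites the pending snapshot instead of queueing every update. The sketch below assumes a hypothetical `SubAgentSnapshot` type and `LatestSlot` wrapper; neither exists in the codebase as far as this issue shows.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical snapshot of sub-agent state; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct SubAgentSnapshot {
    pub running: u32,
    pub finished: u32,
}

/// Holds at most one pending snapshot. Publishing never waits for the
/// consumer to make progress; it only takes a short-lived lock.
#[derive(Clone, Default)]
pub struct LatestSlot {
    inner: Arc<Mutex<Option<SubAgentSnapshot>>>,
}

impl LatestSlot {
    pub fn new() -> Self {
        Self::default()
    }

    /// Store the newest snapshot. Returns true if an unread snapshot
    /// was overwritten, i.e. an update was coalesced away.
    pub fn publish(&self, snap: SubAgentSnapshot) -> bool {
        let mut guard = self.inner.lock().unwrap();
        guard.replace(snap).is_some()
    }

    /// Take the newest snapshot, if any, leaving the slot empty.
    pub fn take(&self) -> Option<SubAgentSnapshot> {
        self.inner.lock().unwrap().take()
    }
}
```

Under this scheme a storm of N updates costs the consumer one read, and the value it reads is always the most recent state, so skipped intermediate updates are repaired by construction. It is only suitable for refresh-style state, not for events that must each be observed.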

Suggested implementation options

  • Add nonblocking/coalesced send paths for refresh-style ops such as ListSubAgents.
  • Classify engine events into critical vs refresh/status. Critical events should not be silently dropped; refresh/status events should coalesce to latest state.
  • Add backlog metrics/logs for op/event channel occupancy around sub-agent fanout.
  • Consider increasing channel capacity only after coalescing; capacity alone is not a fix.
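As a rough illustration of the first option, a refresh request can be guarded by an in-flight flag so that at most one ListSubAgents op is queued at a time and a full channel never blocks the caller. This is a sketch under assumptions: `RefreshRequester`, `RefreshOutcome`, and `mark_handled` are hypothetical names, `Op` is reduced to a single variant, and std's bounded channel stands in for the real op channel.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{SyncSender, TrySendError};
use std::sync::Arc;

// Reduced stand-in for the engine op enum.
#[derive(Debug, PartialEq)]
pub enum Op {
    ListSubAgents,
}

#[derive(Debug, PartialEq)]
pub enum RefreshOutcome {
    Queued,     // op was placed on the channel
    Coalesced,  // an earlier request is still pending; nothing sent
    Deferred,   // channel full; caller should retry on a later tick
    EngineGone, // receiver dropped
}

pub struct RefreshRequester {
    tx: SyncSender<Op>,
    in_flight: Arc<AtomicBool>,
}

impl RefreshRequester {
    /// Returns the requester plus the shared flag the engine side clears.
    pub fn new(tx: SyncSender<Op>) -> (Self, Arc<AtomicBool>) {
        let flag = Arc::new(AtomicBool::new(false));
        (Self { tx, in_flight: Arc::clone(&flag) }, flag)
    }

    /// Never blocks: either queues one op, coalesces, or defers.
    pub fn request(&self) -> RefreshOutcome {
        if self.in_flight.swap(true, Ordering::AcqRel) {
            return RefreshOutcome::Coalesced;
        }
        match self.tx.try_send(Op::ListSubAgents) {
            Ok(()) => RefreshOutcome::Queued,
            Err(TrySendError::Full(_)) => {
                // Nothing was queued, so allow a later retry.
                self.in_flight.store(false, Ordering::Release);
                RefreshOutcome::Deferred
            }
            Err(TrySendError::Disconnected(_)) => {
                self.in_flight.store(false, Ordering::Release);
                RefreshOutcome::EngineGone
            }
        }
    }
}

/// Called by the engine side once it has handled the queued refresh.
pub fn mark_handled(flag: &AtomicBool) {
    flag.store(false, Ordering::Release);
}
```

The TUI loop would call `request()` after each drain batch and treat `Coalesced` and `Deferred` as normal outcomes rather than errors. Counting those outcomes also gives a cheap occupancy signal for the metrics bullet.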

Acceptance criteria

  • ListSubAgents refresh requests cannot block the TUI event loop when tx_op is full.
  • High-frequency status events do not block the engine indefinitely on tx_event.send().await.
  • Completion/state convergence is verified even when UI refresh events are dropped/coalesced.
  • Tests or a harness simulate full op/event channels during sub-agent fanout.
  • The 20-agent release gate in #3800 (v0.8.66: Release gate for multi sub-agent fanout freeze) remains responsive while event volume spikes.

Security / policy guardrails

Backpressure fixes must classify events by criticality:

  • Critical events must not be silently dropped: approval prompts, tool call completions/results, fatal errors, user-input requests, cancellation, parent completion, and terminal sub-agent state.
  • Refresh/status events may be coalesced or skipped only when a later snapshot repairs the visible state.
  • If try_send or nonblocking send is used, dropped events need bounded diagnostics so support can distinguish intentional coalescing from data loss.
  • Increasing channel capacity alone is not a security fix; it can hide unbounded event storms and memory pressure.
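The guardrails above could be sketched roughly as follows. Everything here is an assumption for illustration: the `Event` variants are a small hypothetical subset, `Emitter` is not an existing type, and the choice to park critical events in an in-memory backlog (rather than await channel space) is one possible design, acceptable only if critical events are rare compared with refresh events.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{SyncSender, TrySendError};

// Hypothetical subset of engine events.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    StatusRefresh { agent: u32 },
    ApprovalPrompt { agent: u32 },
    AgentTerminal { agent: u32 },
    FatalError(String),
}

#[derive(Debug, PartialEq)]
pub enum Criticality {
    Critical,
    Refresh,
}

pub fn classify(ev: &Event) -> Criticality {
    match ev {
        Event::StatusRefresh { .. } => Criticality::Refresh,
        Event::ApprovalPrompt { .. }
        | Event::AgentTerminal { .. }
        | Event::FatalError(_) => Criticality::Critical,
    }
}

pub struct Emitter {
    tx: SyncSender<Event>,
    backlog: VecDeque<Event>, // critical events waiting for channel space
    pub dropped_refresh: u64, // diagnostic counter for intentional drops
}

impl Emitter {
    pub fn new(tx: SyncSender<Event>) -> Self {
        Self { tx, backlog: VecDeque::new(), dropped_refresh: 0 }
    }

    pub fn backlog_len(&self) -> usize {
        self.backlog.len()
    }

    /// Move parked critical events onto the channel, oldest first.
    pub fn flush(&mut self) {
        while let Some(ev) = self.backlog.pop_front() {
            match self.tx.try_send(ev) {
                Ok(()) => {}
                Err(TrySendError::Full(ev)) => {
                    self.backlog.push_front(ev);
                    break;
                }
                Err(TrySendError::Disconnected(_)) => {
                    self.backlog.clear(); // receiver gone; nothing to deliver to
                    break;
                }
            }
        }
    }

    /// Never blocks. Critical events are parked when the channel is full;
    /// refresh events are dropped and counted.
    pub fn emit(&mut self, ev: Event) {
        self.flush();
        let blocked = !self.backlog.is_empty();
        match classify(&ev) {
            Criticality::Critical => {
                if blocked {
                    self.backlog.push_back(ev); // keep critical ordering
                } else if let Err(TrySendError::Full(ev)) = self.tx.try_send(ev) {
                    self.backlog.push_back(ev);
                }
            }
            Criticality::Refresh => {
                // Dropped if critical events are parked, the channel is full,
                // or the receiver is gone.
                if blocked || self.tx.try_send(ev).is_err() {
                    self.dropped_refresh += 1;
                }
            }
        }
    }
}
```

The `dropped_refresh` counter is the "bounded diagnostics" piece: it lets a log line or metric show that drops were refresh-class and intentional. A real implementation would also need to cap or alert on backlog growth so that the backlog does not become the hidden unbounded queue the last bullet warns about.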

Metadata

Assignees

No one assigned

    Labels

    bug: Something isn't working
    release-blocker: Must be fixed before the next release
    reliability: Reliability, flaky behavior, retries, fallbacks, and robustness
    subagents: Sub-agent orchestration, lifecycle, and completion handling
    tui: Terminal UI behavior, rendering, or interaction
    v0.8.66: Targeting v0.8.66

    Projects

    Status
    Done
