Orchestrator deadlock in executing with stale local:// assignment causes repeated create_goal BLOCKED sleep loop #259

@bpnace

Summary

The agent gets stuck with an active goal in the executing phase when a task remains assigned to a dead local worker (local://...).
The parent loop then repeatedly calls create_goal, receives BLOCKED, and enters backoff sleep, so the goal makes no progress.

Observed behavior

  • The goal stays active (example: 01KJXDGJ2212GXQDGVGE4H9WD8)
  • orchestrator.state.phase = executing
  • task_graph for active goal:
    • assigned: 1 (assigned to dead local://...)
    • blocked: 9
  • The children table still lists the same local:// worker as running
  • The logs repeat this sequence:
    • Recovering stale task from dead worker
    • then the parent calls create_goal -> BLOCKED
    • then the parent sleeps with exponential backoff

Expected behavior

  1. A dead local:// worker assignment is recovered once and does not re-enter the same stale loop.
  2. Stale local worker rows are marked dead/unhealthy and excluded from reassignment.
  3. The parent loop does not attempt create_goal while the active goal is in progress.
  4. Goal execution resumes (reassignment to a live worker, or a self-assignment fallback) instead of repeatedly sleeping.
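
To make the expected behavior concrete, here is a rough sketch in Python. It is not the project's actual code: the Task and Worker shapes, the "dead" status value, the SELF marker, the is_alive callback, and both function names are hypothetical, chosen only to illustrate points 1 to 4.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

SELF = "self"  # hypothetical marker for the self-assignment fallback


@dataclass
class Task:
    id: str
    status: str                     # e.g. "assigned" or "blocked"
    assignee: Optional[str] = None  # worker address, e.g. "local://..."


@dataclass
class Worker:
    address: str
    status: str = "running"         # "running" or "dead" (assumed values)


def recover_stale_assignments(
    tasks: List[Task],
    workers: List[Worker],
    is_alive: Callable[[str], bool],
) -> int:
    """Mark dead local workers, reassign their tasks, return the number recovered."""
    # Point 2: flag dead local:// rows so they are never chosen again.
    for w in workers:
        if (w.address.startswith("local://")
                and w.status == "running"
                and not is_alive(w.address)):
            w.status = "dead"

    live = [w.address for w in workers if w.status == "running"]
    dead = {w.address for w in workers if w.status == "dead"}
    known = {w.address for w in workers}

    recovered = 0
    for t in tasks:
        if t.status != "assigned" or t.assignee is None:
            continue
        # A task is stale if its worker is dead or has no row at all.
        orphaned = t.assignee.startswith("local://") and t.assignee not in known
        if t.assignee in dead or orphaned:
            # Point 4: prefer a live worker, otherwise fall back to self.
            t.assignee = live[0] if live else SELF
            recovered += 1
    return recovered


def may_create_goal(active_goal_phase: Optional[str]) -> bool:
    """Point 3: do not create a new goal while one is executing."""
    return active_goal_phase != "executing"
```

Because the dead worker row is updated in the same pass that reassigns the task, a second call finds nothing left to recover, which is the "recovered once" property from point 1.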

Reproduction (high-level)

  1. Start with an active goal in executing.
  2. Ensure one task is assigned to a local:// worker that no longer exists.
  3. Restart the runtime.
  4. Observe the repeated stale-recovery + create_goal BLOCKED + backoff sleep loop.

Notes

This appears to be separate from the planning/classifying fallback logic: here the tasks exist, but the assignment points at stale local worker state.
