epic: non-SWE work flows without harness friction (umbrella) #943

@akaszubski

Mission

Operationalize PROJECT.md's "macro alignment with micro flexibility" and CLAUDE.md's "process applies WHEN doing software development": distinguish SWE from non-SWE work at session start, route the harness accordingly, and let the user see and override what's happening.

Single destination: When the user is doing SWE work, the harness enforces. When they are not, it stays out of the way. The system measures the difference and recalibrates without human intervention.

The disconnect being closed

PROJECT.md says "macro alignment with micro flexibility"; CLAUDE.md says "process applies WHEN doing software development". Implementation today is the opposite: micro-rigid (every Write/Edit gated on path) and macro-blind (no upfront SWE-vs-not detection). The cost is measurable: 1,577 PreToolUse deny events were logged in Apr–May, and bypass primitives (touch /tmp/skip_*, AUTONOMOUS_DEV_BYPASS=1, SKIP_AGENT_COMPLETENESS_GATE=1, --skip-review) appear in 7+ sessions.

A 13-class intent classifier (lib/intent_classifier.py, 9 SWE classes from #971 + 4 non-SWE classes from #1023) and a per-session mode artifact (/tmp/session_mode_<sid>.json from Phase D #998) already exist. Phase E (#999) wired hooks to consult it, but it ships default-off. Phase 2 (#961, PR #1037) added classifier-gated plan-critic + research skip. The plumbing is built; it needs to land.
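To make the "consult the artifact, but default-off" behaviour concrete, here is a minimal sketch of how a hook might decide whether to enforce. It is not the code from #999: the `intent` key in the artifact, the helper names, and the fail-closed policy are all assumptions, and only the five class names listed under M2 are used.

```python
import json
import os
from pathlib import Path

# Classes the M2 exit criterion names explicitly; the 4 classes from #1023
# are not listed in this issue, so they are left out of the sketch.
NON_SWE_SKIP = {"doc", "config", "typo", "status_query", "conversation"}


def session_mode_path(sid: str, base: str = "/tmp") -> Path:
    """Location of the per-session mode artifact named in this issue."""
    return Path(base) / f"session_mode_{sid}.json"


def should_enforce(sid: str, base: str = "/tmp", env=None) -> bool:
    """Return True if a gate should run for this session (hypothetical helper)."""
    env = os.environ if env is None else env
    # Default-off: unless the flag is set, behave as today and always enforce.
    if env.get("INTENT_CLASSIFIER_ENFORCE", "false").lower() != "true":
        return True
    try:
        mode = json.loads(session_mode_path(sid, base).read_text())
    except (OSError, json.JSONDecodeError):
        return True  # fail closed: a missing or corrupt artifact means enforce
    if not isinstance(mode, dict):
        return True
    # "intent" is an assumed key name for the classifier's result.
    return mode.get("intent") not in NON_SWE_SKIP
```

Failing closed keeps the SWE path unchanged when the artifact is absent, which is one way to honour the "no regression on intent=implement" requirement.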

Milestones

M0 — Framework reliability (PREREQUISITE)

Without this, telemetry lies, classifier rollout is unsafe, and every long pipeline is a crash risk.

Exit criterion: zero /tmp/implement_pipeline_state.json collisions across 14 days; crash-resume works without manual record_agent_completion() calls.
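One possible way to remove collisions on the shared state file is to key it by session and write it atomically. The sketch below illustrates that idea only; the per-session file name and the helper functions are hypothetical, and the actual M0 design may differ.

```python
import json
import os
import tempfile
from pathlib import Path


def state_path(sid: str, base: str = "/tmp") -> Path:
    # Hypothetical per-session name; the issue only names the shared file.
    return Path(base) / f"implement_pipeline_state_{sid}.json"


def write_state(sid: str, state: dict, base: str = "/tmp") -> Path:
    """Write state to a temp file, then rename it over the target."""
    target = state_path(sid, base)
    fd, tmp = tempfile.mkstemp(dir=base, prefix=target.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())
        # os.replace is atomic on POSIX when both paths share a filesystem,
        # so a reader never sees a half-written file.
        os.replace(tmp, target)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return target


def read_state(sid: str, base: str = "/tmp"):
    """Return the saved state, or None if this session has none yet."""
    try:
        return json.loads(state_path(sid, base).read_text())
    except FileNotFoundError:
        return None
```

A crash between the write and the rename leaves the previous state intact, which is the property crash-resume depends on.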

M1 — Hook telemetry observability

Measure before tuning. Every M2/M3/M4 claim is unfalsifiable without this.

Exit criterion: top-5 slowest hooks + top-5 most-blocked gates published; baseline JSONL committed.
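As an illustration of what the published baseline could be derived from, the following sketch aggregates hook events from JSONL. The record fields (`hook`, `duration_ms`, `decision`) are assumptions about the telemetry schema, not the format the harness actually emits.

```python
import json
from collections import Counter, defaultdict
from statistics import mean


def summarize(lines, top_n=5):
    """Rank hooks by mean duration and by number of deny decisions."""
    durations = defaultdict(list)
    denies = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        ev = json.loads(line)
        durations[ev["hook"]].append(ev["duration_ms"])
        if ev.get("decision") == "deny":
            denies[ev["hook"]] += 1
    ranked = sorted(
        ((mean(v), hook) for hook, v in durations.items()), reverse=True
    )[:top_n]
    return {
        "slowest": [(hook, avg) for avg, hook in ranked],
        "most_blocked": denies.most_common(top_n),
    }
```

Ranking by mean is one choice; a p95 ranking would surface hooks that are usually fast but occasionally stall.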

M2 — Intent-aware gating: turn it on safely

This is the work. Most of the user's friction lives here.

Phase track (Phase 1 #971 + Phase D #998 + Phase E #999 already shipped):

Hard-floor classifier wrap:

New umbrellas:

Exit criterion: INTENT_CLASSIFIER_ENFORCE=true ships as the default for doc/config/typo/status_query/conversation plus the 4 non-SWE classes from #1023; ≥80% reduction in decision=deny events on non-SWE classes; no regression on intent=implement / security_critical.
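The ≥80% target can be checked mechanically once deny events carry an intent label. A minimal sketch, assuming each event is a dict with hypothetical `intent` and `decision` fields:

```python
from collections import Counter


def deny_counts(events):
    """Count deny decisions per intent class."""
    return Counter(e["intent"] for e in events if e.get("decision") == "deny")


def reduction(baseline: Counter, current: Counter, classes) -> float:
    """Fractional drop in deny events over the given classes."""
    before = sum(baseline[c] for c in classes)
    after = sum(current[c] for c in classes)
    if before == 0:
        raise ValueError("baseline has no deny events for these classes")
    return (before - after) / before
```

For example, a baseline of 10 denies on `doc` that falls to 1 gives a reduction of 0.9, which clears the 0.80 threshold. The counts here are illustrative, not measured data.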

M3 — User-gated overrides

Stop silent bypasses. Make blocks visible and auditable.

Exit criterion: silent .bypass file marker deprecated; every override emits a structured audit event; deletion-only diffs no longer require all 9 pipeline agents.
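A structured audit event could look something like the sketch below, which appends one JSON line per override. The field names, the `gate_override` event type, and the requirement for a non-empty reason are assumptions made for illustration, not the schema the M3 issues define.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class OverrideEvent:
    session_id: str
    gate: str
    reason: str
    actor: str = "user"
    ts: float = 0.0
    event: str = "gate_override"  # hypothetical event type


def record_override(log_path, session_id, gate, reason, now=None):
    """Append an auditable override record; refuse silent overrides."""
    if not reason.strip():
        raise ValueError("an override needs a stated reason")
    ev = OverrideEvent(
        session_id=session_id,
        gate=gate,
        reason=reason.strip(),
        ts=time.time() if now is None else now,
    )
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(ev), sort_keys=True) + "\n")
    return ev
```

Unlike a marker file, every record here states which gate was overridden, in which session, and why.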

M4 — Self-tuning (closed-loop closer)

The actuator. Without this, "gets better every week" stays aspirational.

Exit criterion: classifier skip-rule changes ship without human approval if benchmark improves; revert if regresses; baseline updated on success.
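The ship-or-revert rule reduces to a single comparison against the stored baseline. In this sketch the function name and the `min_gain` margin are hypothetical, and a tie is treated as "revert" because the criterion does not say what happens when the score is unchanged.

```python
def tune_step(baseline_score: float, candidate_score: float, min_gain: float = 0.0):
    """Decide whether a skip-rule change ships, and return the new baseline.

    Higher scores are assumed to be better.
    """
    if candidate_score > baseline_score + min_gain:
        # Improvement: ship without human approval and raise the baseline.
        return "ship", candidate_score
    # Regression or no measurable gain: revert and keep the old baseline.
    return "revert", baseline_score
```

A non-zero `min_gain` would guard against shipping on benchmark noise.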

M5 — Real-world use validation

Prove it works for what the user actually wants it to do.

Exit criterion: ≥7 consecutive days of pipeline runs in a non-harness repo with zero manual recovery sessions and zero SKIP_AGENT_COMPLETENESS_GATE=1-class bypasses.
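Checking this criterion means finding a run of consecutive calendar days that each had pipeline runs and no recoveries or bypasses. A minimal sketch, assuming a hypothetical per-day summary with `runs`, `manual_recoveries`, and `bypasses` counts:

```python
from datetime import date, timedelta


def longest_clean_streak(days: dict) -> int:
    """Longest run of consecutive clean days.

    A day is clean if it had at least one pipeline run and no manual
    recoveries or bypasses. Days missing from the mapping break the streak.
    """
    best = run = 0
    prev_clean = None  # date of the previous clean day, if any
    for d in sorted(days):
        s = days[d]
        clean = (
            s["runs"] > 0 and s["manual_recoveries"] == 0 and s["bypasses"] == 0
        )
        if clean and prev_clean is not None and d - prev_clean == timedelta(days=1):
            run += 1
        elif clean:
            run = 1
        else:
            run = 0
        prev_clean = d if clean else None
        best = max(best, run)
    return best
```

The criterion is met when the function returns 7 or more. Requiring `runs > 0` prevents idle days from counting as validation.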

Dependency graph

M0 (framework reliability) ──┐
                             ├──> M2 (intent-aware gating) ──> M3 (overrides) ──> M4 (self-tuning) ──> M5 (real use)
M1 (telemetry baseline)    ──┘                                      ▲                       ▲
                                                                    │                       │
                #1042 hook audit runs in M2 ────────────────────────┘                       │
                #1043 calibration runs in M2 ───────────────────────────────────────────────┘

M0 and M1 can run in parallel. Everything else is strictly sequenced.
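The milestone ordering can also be expressed as data and checked with the standard library's `graphlib`. This sketch encodes only the milestone edges from the graph (not the #1042/#1043 feeds), and the `waves` helper is a hypothetical name.

```python
from graphlib import TopologicalSorter

# Each milestone maps to the set of milestones it depends on.
DEPS = {
    "M2": {"M0", "M1"},
    "M3": {"M2"},
    "M4": {"M3"},
    "M5": {"M4"},
}


def waves(deps):
    """Group milestones into waves; items in one wave can run in parallel."""
    ts = TopologicalSorter(deps)
    ts.prepare()  # raises graphlib.CycleError if the graph has a cycle
    out = []
    while ts.is_active():
        ready = sorted(ts.get_ready())
        out.append(ready)
        ts.done(*ready)
    return out
```

With the edges above, the first wave contains M0 and M1 together and every later wave contains a single milestone, matching the statement that only M0 and M1 run in parallel.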

Out of scope (parallel track, not blocking)

These are internal cleanup, not on the friction critical path. They proceed in parallel.

Original empirical analysis

(Preserved from original issue body — still relevant evidence.)

Empirical analysis of 132 archived sessions over 7 days showed hooks fire on every Claude Code invocation as if it might be /implement, regardless of actual user intent. Strategic, exploratory, and triage tasks accumulated hundreds of blocks per session.

Top 8 most-blocked sessions:

| Session | Prompt | Blocks |
|---|---|---|
| 09a1e592 | (local-command-caveat session) | 243 |
| e7fd9310 | "bug? from other claude" | 192 |
| 1bc3fea4 | "what open issues do we want to work on in order of priority" | 175 |
| 3c57a2ee | (local-command-caveat session) | 159 |
| 44114e96 | /implement #784 | 105 |
| 0be49a64 | /implement --batch --issues 851,863,860 | 103 |
| 2502327a | "what open issues do we have?" | 101 |
| 46261312 | "look at this claude session in realign. why do i need to run critic..." | 97 |

Sessions whose first prompt is not /implement accumulate hundreds of blocks. The classifier exists and works; it just isn't consulted at the friction-critical sites.

Current status (2026-05-04)

| Milestone | Status |
|---|---|
| M0 | In planning (#1041 umbrella created today) |
| M1 | Scoped (#1012, #1022 ready) |
| M2 | Phase 2 (#961) PR #1037 open; #1042 + #1043 umbrellas created today |
| M3 | Scoped (6 issues) |
| M4 | Scoped (#964 + #1026) |
| M5 | Scoped (#1044 umbrella created today) |

Next concrete move: land #1037 (Phase 2), then start M0 (#1041 + #1036).
