diff --git a/.github/workflows/installers.yml b/.github/workflows/installers.yml index 7c8d595..7e667e9 100644 --- a/.github/workflows/installers.yml +++ b/.github/workflows/installers.yml @@ -84,6 +84,17 @@ jobs: done exit $rc + # 1d) Framework-compliance checks for the dev-team plugin: skill:// anchor + # resolution, index.json file integrity, review-agent output-discipline + # wiring, and the test-after stance (no removed TDD identifiers). + compliance: + name: dev-team framework compliance + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - name: Validate framework compliance + run: node scripts/ci-framework-compliance.mjs + # 2) Real end-to-end install on every OS (bootstraps bun/omp/plugins, then # verifies OMP launches and lists all plugins). e2e: diff --git a/README.fr.md b/README.fr.md index a991c34..8d7ec16 100644 --- a/README.fr.md +++ b/README.fr.md @@ -8,7 +8,7 @@ met en place OMP et vous guide à travers chacun d'eux. | Plugin | Rôle | |---|---| -| **[`dev-team`](plugins/dev-team/)** | **Équipe de dev agentique** — un orchestrateur + 32 agents spécialistes/critiques, le workflow `/specs` → `/plan` → `/build` → `/pr`, **TDD strict** et points de contrôle humains, ~78 skills, et des extensions « garde-fou » bloquantes. Portage de [bdfinst/agentic-dev-team](https://github.com/bdfinst/agentic-dev-team) (Bryan Finster). Tiers 100 % cloud ; gardez le tier « small » à haut volume bon marché. | +| **[`dev-team`](plugins/dev-team/)** | **Équipe de dev agentique** — un orchestrateur + 32 agents spécialistes/critiques, le workflow `/specs` → `/plan` → `/build` → `/pr`, un **plan gate strict** (test-after, tests requis) et points de contrôle humains, ~78 skills, et des extensions « garde-fou » bloquantes. Portage de [bdfinst/agentic-dev-team](https://github.com/bdfinst/agentic-dev-team) (Bryan Finster). Tiers 100 % cloud ; gardez le tier « small » à haut volume bon marché. | | **[`copilot-preset`](plugins/copilot-preset/)** | **Préréglage modèles GitHub Copilot** — route OMP (et les tiers de dev-team) via `github-copilot` pour tourner sur une licence Copilot. Config seulement : mapping tier→modèle, comparatif tarifaire (crédits IA post-juin 2026), et MAI-Code-1-Flash câblé. | | **[`token-diet`](plugins/token-diet/)** | **Réduction agressive des tokens** — ctx-wire (compression transparente de la sortie des commandes + scrub des secrets), CodeGraph (requêtes de graphe de symboles via MCP au lieu de grep+read), un skill « caveman » de sortie laconique, et un skill « yagni » de code minimal — par-dessus la compaction/`astGrep` natives d'OMP. | | **[`azure-devops-fs`](plugins/azure-devops-fs/)** | **Azure DevOps comme un système de fichiers** — lecture repos/fichiers/PR/diffs via URIs `ado://` (paginé), **gates/policies** de PR + CI (builds/logs/run), création/checkout/push/complete de PR, commentaires/votes. Propulsé par l'**Azure CLI** (`az` + extension azure-devops), auth PAT, cache SQLite ; fonctionne derrière les proxys TLS d'entreprise. | diff --git a/README.md b/README.md index 73b67f3..ea689e2 100644 --- a/README.md +++ b/README.md @@ -8,7 +8,7 @@ you through them. | Plugin | What it does | |---|---| -| **[`dev-team`](plugins/dev-team/)** | **Agentic dev team** — orchestrator + 32 specialist/critic agents, the `/specs` → `/plan` → `/build` → `/pr` workflow, **strict TDD** and human gates, ~78 skills, and blocking guard extensions. Port of [bdfinst/agentic-dev-team](https://github.com/bdfinst/agentic-dev-team) (Bryan Finster). All-cloud tiers; keep the high-volume small tier cheap. | +| **[`dev-team`](plugins/dev-team/)** | **Agentic dev team** — orchestrator + 32 specialist/critic agents, the `/specs` → `/plan` → `/build` → `/pr` workflow, a **forced plan gate** (test-after, tests required) and human gates, ~78 skills, and blocking guard extensions. Port of [bdfinst/agentic-dev-team](https://github.com/bdfinst/agentic-dev-team) (Bryan Finster). All-cloud tiers; keep the high-volume small tier cheap. | | **[`copilot-preset`](plugins/copilot-preset/)** | **GitHub Copilot model preset** — route OMP (and the dev-team tiers) through `github-copilot` to run on a Copilot license. Config-only: tier→model mapping, post-June-2026 AI-credit pricing comparison, and MAI-Code-1-Flash wired in. | | **[`token-diet`](plugins/token-diet/)** | **Aggressive token reduction** — ctx-wire (transparent command-output compression + secret scrub), CodeGraph (MCP symbol/call-graph queries instead of grep+read), a caveman terse-output skill, and a yagni minimal-code skill — layered on OMP's native compaction/`astGrep`. | | **[`azure-devops-fs`](plugins/azure-devops-fs/)** | **Azure DevOps as a filesystem** — read repos/files/PRs/diffs via `ado://` URIs (paginated), PR **gates/policies** + CI (builds/logs/run), create/checkout/push/complete PRs, comment/vote. Backed by the **Azure CLI** (`az` + the azure-devops extension), PAT auth, SQLite read cache; works behind corporate TLS proxies. | diff --git a/REVIEW.md b/REVIEW.md index 739cc53..3f67a63 100644 --- a/REVIEW.md +++ b/REVIEW.md @@ -24,7 +24,7 @@ cohérence docs/registres — pas des cassures de build.** ### 1. Les « guards » de sécurité sont du théâtre de sécurité (le plus important) -Les 6 guards de `dev-team` (destructive, path, freeze, tdd, review-gate, +Les 6 guards de `dev-team` (destructive, path, freeze, spec, review-gate, careful) sont des hooks *PreToolUse consultatifs* basés sur du matching de sous-chaîne / glob. Ils donnent une **fausse confiance** : ils sont contournables trivialement, par accident comme volontairement. @@ -34,7 +34,7 @@ contournables trivialement, par accident comme volontairement. | destructive-guard | `rm -rf`, drop/truncate, force-push, kill… | `find -delete`, `git clean -fdx`, `> f`, `truncate -s0`, `bash -c …`, obfuscation par variable ; **warn-only hors `/careful on`** ; la SAFE-list court-circuite tout (`rm -rf node_modules/../../etc`) | Très faible | | path-guard | édition de `.env`/`*.pem`/`*.key`/`id_rsa`… | aucune couverture `bash` (`tee`/`>`/`sed -i`) ; regex **sensible à la casse** → `ID_RSA`/`.PEM` passent ; lectures non gardées ; pass silencieux si le shape d'edit ne matche pas | Faible | | freeze-guard | écriture sur globs gelés | **aucune branche `bash`** ; écraser `.omp/state/freeze.json` suffit | Faible | -| tdd-guard (.feature) | modif de specs BDD | `BASH_WRITE_RE` étroit (rate `python -c`/`ed`/heredoc) ; opt-out `OMP_ALLOW_FEATURE_EDITS=1` | Faible–moyenne | +| spec-guard (.feature) | modif de specs BDD | `BASH_WRITE_RE` étroit (rate `python -c`/`ed`/heredoc) ; opt-out `OMP_ALLOW_FEATURE_EDITS=1` | Faible–moyenne | | review-gate | `git commit` avant approve | `--no-verify` **explicitement autorisé** ; `bash -c 'git commit'` ; le `--no-verify` est détecté par `includes` donc un message de commit le déclenche | Faible | | careful-mode | active le blocage | fichier d'état modifiable par l'agent ; **OFF par défaut** | Faible | diff --git a/docs/upstream-v7.7-7.9-extraction.md b/docs/upstream-v7.7-7.9-extraction.md new file mode 100644 index 0000000..e932233 --- /dev/null +++ b/docs/upstream-v7.7-7.9-extraction.md @@ -0,0 +1,105 @@ +# Extraction from upstream agentic-dev-team (v7.7–v7.9) + +Survey of `bdfinst/agentic-dev-team` releases since our v7.6 extraction, and what +we pulled into our OMP port — filtered through omp-dev-team's choices: +**test-after with refactoring (no TDD), quality first, cost efficiency.** + +## Upstream evolution since v7.6 + +| Release | Highlight | +|---|---| +| v7.7.0 | harness fixes from session-review/audit; **closed learning loop**; **when-tdd-pays** experiment fixtures; ambiguity-resolution protocol for `/specs` + `/ship` gate | +| v7.8.0 | **craftsmanship-axis review rules** — use-the-platform, comment hygiene; named shipped AC references | +| v7.9.0 | **deterministic status + finding grouping** for doc/naming review agents | + +## Extracted (respecting omp choices) + +1. **Deterministic status + finding grouping (v7.9).** New shared knowledge file + `skills/dev-team-knowledge/review-output-discipline.md`: + - **Deterministic status** — an agent's `status` is a pure function of the + highest-severity finding, never of volume. + - **Finding grouping** — Enumerate → Classify → Group; consolidate same-kind + findings into ~3–5 concept-level findings per file; keep `error` findings + individual. + + Wired as a one-line anchored reference into **all 17** finding-emitting review + agents (upstream changed only doc/naming — we factored it into one shared file + instead of copy-pasting; this is the DRY, cost-efficient win and propagates + determinism + token savings to every review). Added to `index.json`. + +2. **Comment hygiene (v7.8) → `doc-review`.** Tracker-ID references in shipped + comments (`JIRA-123`, `#456`), detached/orphaned doc comments; **capped at + `suggestion`** (never raise status above `warn`); durable external standards + (`RFC-2119`, `ISO-4217`, CVE) are explicitly not flagged. + +3. **Use-the-platform (v7.8) → `refactor-opportunity-review`.** Reinvented + built-ins (`min`/`max`/`sum`/`clamp`/`copy`), reinvented helpers, open-coded + idioms repeated 3+ times — mapped by concept, honoring language **and version** + (e.g. Go <1.21 has no builtin `min`/`max`). Framing de-TDD'd to test-after. + +4. **Closed learning loop (v7.7) → `feedback-learning` skill.** We already had + post-task reflection + recurring-correction detection (3+); the missing + "closed" half was a persistent queue + disposition. Added a + `metrics/pending-review.jsonl` queue (system proposals enqueued, never + dropped) and a **session-review** disposition flow (`review` keyword) that + previews, then approves (apply + log + stamp `approved`) or rejects (stamp + `rejected`). Project-local only — plugin-cache-safe. Asynchronous/batched by + design, which keeps it cheap. + +## Test-after reinforcement (omp north star) + +Beyond extraction, a pass to make **test-after with refactoring** explicit and to +remove residual TDD framing the earlier plan-gate-over-tdd move had left behind: + +- **Refactor after green, every step** — promoted from "optional" to a deliberate + always-taken pass (the `refactor-opportunity-review` lens) in `skills/build`, + `prompts/implementer.md`, and the orchestrator's Phase 3. Changes are made only + when there's a real opportunity, but the pass is always taken — the *refactoring* + half of test-after-with-refactoring. +- **Residual TDD traces reframed** to test-after: `triage` skill + command (RED/GREEN + fix plan → regression-test + fix + refactor), `qa-engineer` (ATDD → acceptance + scenarios; unit tests follow, test-after), `plan` (TDD step/traceability → + build step / step-to-scenario), `progress-guardian` (flagged "tests not written + first" → flags missing tests, order-agnostic), `mutation-testing` / + `quality-gate-pipeline` / `init-dev-team` (RED-GREEN labels dropped, semantics + kept), `test-design-reviewer` ("First/written-first" rubric → "Timely/ships with + the implementation"), plus the root `README`/`README.fr` ("strict TDD" → forced + plan gate, test-after) and `REVIEW.md` (stale `tdd-guard` → `spec-guard`). + +## Deliberately NOT extracted (with rationale) + +- **when-tdd-pays experiment fixtures (v7.7).** Upstream is re-litigating where + test-first pays. omp-dev-team has already made the call (test-after + plan gate); + importing TDD experiment fixtures would reintroduce exactly what we removed. +- **`/ship` gate + ambiguity-resolution protocol (v7.7).** The ambiguity protocol + is reasonable, but it is wired to a `/ship` command we don't have; our `/specs` + already runs a consistency gate. Candidate for a focused follow-up if a gap shows. +- **Bibliographic TDD citations kept as-is.** e.g. `testability-patterns.md` cites + *Growing Object-Oriented Software, Guided by Tests* ("outside-in TDD") — that is + the book's actual subject; rewriting a citation would misrepresent the source. + +## Verified + +`ci-validate-json` 23/23 · all 10 dev-team extensions compile · unit suite green · +both `review-output-discipline` anchors (`#deterministic-status`, +`#finding-grouping`) resolve from all 17 wired agents · no prescriptive TDD / +test-first / RED-GREEN traces remain outside historical `docs/` and the one book +citation. + +## CI: framework-compliance checks + +What earlier extraction PRs verified by hand is now enforced by CI — +`scripts/ci-framework-compliance.mjs` (pure Node, new `compliance` job): + +- **Anchor resolution** — every `skill://dev-team-knowledge/.md#` + reference resolves (file exists; anchor matches a heading slug or a registered + `index.json` anchor). Catches exactly the rename/typo failure mode this work + risked across 17 agents. +- **index.json integrity** — every keyed file exists. +- **Review-agent wiring** — every finding-emitting agent references + `review-output-discipline.md` (allowlist: `progress-guardian`). +- **Test-after stance** — the deliberately-removed TDD identifiers (`tdd-first`, + `tdd-guard`, `test-driven-development`, `RED-GREEN`) can't creep back in, + outside a small rationale/historical allowlist. + +Currently: 26 anchor refs + 39 index files checked — 0 violations. diff --git a/plugins/dev-team/.claude-plugin/plugin.json b/plugins/dev-team/.claude-plugin/plugin.json index 430bf3c..e4d7ff9 100644 --- a/plugins/dev-team/.claude-plugin/plugin.json +++ b/plugins/dev-team/.claude-plugin/plugin.json @@ -1,6 +1,6 @@ { "name": "dev-team", - "version": "1.2.0", + "version": "1.3.0", "description": "Agentic dev team for Oh-My-Pi (ported from bdfinst/agentic-dev-team): orchestrator + 32 specialist/critic agents, the /specs -> /plan -> /build -> /pr workflow, a forced plan gate (scope -> plan -> build -> review) with tests required, human gates, and blocking guard extensions.", "author": { "name": "outofrange-consulting" }, "license": "MIT", diff --git a/plugins/dev-team/agents/a11y-review.md b/plugins/dev-team/agents/a11y-review.md index d8b35c2..c25f5e7 100644 --- a/plugins/dev-team/agents/a11y-review.md +++ b/plugins/dev-team/agents/a11y-review.md @@ -86,6 +86,10 @@ Focus management: Code style, naming, test coverage, performance (handled by other agents) +## Output discipline + +Derive `status` from the highest-severity finding, never from volume (`skill://dev-team-knowledge/review-output-discipline.md#deterministic-status`), and group same-kind findings — enumerate → classify → group — into ~3–5 concept-level findings per file, keeping `error` findings individual (`skill://dev-team-knowledge/review-output-discipline.md#finding-grouping`). + ## Self-Challenge After producing findings, run the adversarial challenge pass from `skill://dev-team-knowledge/adversarial-review-protocol.md#a11y-review` (the shared challenger loop + the a11y-review challenge questions; ≤3 rounds). Append a confidence level (High/Medium/Low) to the `summary` field. diff --git a/plugins/dev-team/agents/arch-review.md b/plugins/dev-team/agents/arch-review.md index 4ec6775..ab217e4 100644 --- a/plugins/dev-team/agents/arch-review.md +++ b/plugins/dev-team/agents/arch-review.md @@ -95,6 +95,10 @@ Grep for patterns that architecture documentation explicitly bans: - Direct `fetch`/`axios`/`HttpClient` calls outside designated HTTP adapter layer - Direct DB client calls outside designated repository layer +## Output discipline + +Derive `status` from the highest-severity finding, never from volume (`skill://dev-team-knowledge/review-output-discipline.md#deterministic-status`), and group same-kind findings — enumerate → classify → group — into ~3–5 concept-level findings per file, keeping `error` findings individual (`skill://dev-team-knowledge/review-output-discipline.md#finding-grouping`). + ## Self-Challenge After producing findings, run the adversarial challenge pass from `skill://dev-team-knowledge/adversarial-review-protocol.md#arch-review` (arch-review challenge questions). Append confidence level (High/Medium/Low) to the `summary` field. diff --git a/plugins/dev-team/agents/complexity-review.md b/plugins/dev-team/agents/complexity-review.md index 49e77d3..b08e54d 100644 --- a/plugins/dev-team/agents/complexity-review.md +++ b/plugins/dev-team/agents/complexity-review.md @@ -67,6 +67,10 @@ Cognitive load: - Too many concepts per function - Non-obvious control flow +## Output discipline + +Derive `status` from the highest-severity finding, never from volume (`skill://dev-team-knowledge/review-output-discipline.md#deterministic-status`), and group same-kind findings — enumerate → classify → group — into ~3–5 concept-level findings per file, keeping `error` findings individual (`skill://dev-team-knowledge/review-output-discipline.md#finding-grouping`). + ## Self-Challenge After producing findings, run the adversarial challenge pass from `skill://dev-team-knowledge/adversarial-review-protocol.md#structure-review`. Use the structure-review challenge questions (the nearest applicable section — no complexity-specific section exists). Append confidence level (High/Medium/Low) to the `summary` field. diff --git a/plugins/dev-team/agents/concurrency-review.md b/plugins/dev-team/agents/concurrency-review.md index c3228ef..6fbf5f1 100644 --- a/plugins/dev-team/agents/concurrency-review.md +++ b/plugins/dev-team/agents/concurrency-review.md @@ -89,6 +89,10 @@ Resource ordering: Code style, naming, domain modeling, security, complexity (handled by other agents) +## Output discipline + +Derive `status` from the highest-severity finding, never from volume (`skill://dev-team-knowledge/review-output-discipline.md#deterministic-status`), and group same-kind findings — enumerate → classify → group — into ~3–5 concept-level findings per file, keeping `error` findings individual (`skill://dev-team-knowledge/review-output-discipline.md#finding-grouping`). + ## Self-Challenge After producing findings, run the adversarial challenge pass from `skill://dev-team-knowledge/adversarial-review-protocol.md#concurrency-review` (the shared challenger loop + the concurrency-review challenge questions; ≤3 rounds). Append a confidence level (High/Medium/Low) to the `summary` field. diff --git a/plugins/dev-team/agents/doc-review.md b/plugins/dev-team/agents/doc-review.md index 0049db4..d4b2981 100644 --- a/plugins/dev-team/agents/doc-review.md +++ b/plugins/dev-team/agents/doc-review.md @@ -64,11 +64,23 @@ Return `{"status": "skip", "issues": [], "summary": "No documentation files foun - `docs/agent-architecture.md` references a configuration or governance detail that is no longer current - Agent or skill files changed without corresponding update to `CLAUDE.md` registry tables +### Comment hygiene + +- **Tracker-ID references in shipped comments** — issue/epic/ticket IDs in code comments (`JIRA-123`, `PROJ-789`, `#456`, `closes GH-12`). The comment should explain *intent*; the tracker ID belongs in the commit message, not the source. Flag with a `suggestedFix` that rewrites the comment as a purpose statement. +- **Detached / orphaned doc comments** — a JSDoc/docstring/XML-doc block separated from its symbol by blank lines or other statements, or attached to the wrong symbol (so tooling associates it incorrectly). +- **Do NOT flag durable external standards** — `RFC-2119`, `ISO-4217`, `RFC 5322`, CVE IDs, and similar stable references are legitimate; they are not tracker IDs. + +Comment-hygiene and tracker-ID findings are **capped at `suggestion`**: on their own they never raise status above `warn`. + ## Ignore Code correctness, naming conventions, test quality (handled by other agents) Doc style preferences (sentence case vs title case, oxford comma) — flag only when docs are wrong, not when they differ in style +## Output discipline + +Derive `status` from the highest-severity finding, never from volume (`skill://dev-team-knowledge/review-output-discipline.md#deterministic-status`), and group same-kind findings — enumerate → classify → group — into ~3–5 concept-level findings per file, keeping `error` findings individual (`skill://dev-team-knowledge/review-output-discipline.md#finding-grouping`). + ## Self-Challenge After producing findings, run the adversarial challenge pass from `skill://dev-team-knowledge/adversarial-review-protocol.md#doc-review` (the shared challenger loop + the doc-review challenge questions; ≤3 rounds). Append a confidence level (High/Medium/Low) to the `summary` field. diff --git a/plugins/dev-team/agents/domain-review.md b/plugins/dev-team/agents/domain-review.md index 7a9d043..9ebcffa 100644 --- a/plugins/dev-team/agents/domain-review.md +++ b/plugins/dev-team/agents/domain-review.md @@ -86,6 +86,10 @@ Whole-file load: each linked skill is loaded in full when invoked. - [Ubiquitous Language](skill://ubiquitous-language) — invoke when the user asks to "build the glossary", "extract domain terms", or "document the ubiquitous language". Also invoke when domain-review findings show pervasive terminology inconsistency (3+ different names for the same concept across the codebase). +## Output discipline + +Derive `status` from the highest-severity finding, never from volume (`skill://dev-team-knowledge/review-output-discipline.md#deterministic-status`), and group same-kind findings — enumerate → classify → group — into ~3–5 concept-level findings per file, keeping `error` findings individual (`skill://dev-team-knowledge/review-output-discipline.md#finding-grouping`). + ## Self-Challenge After producing findings, run the adversarial challenge pass from `skill://dev-team-knowledge/adversarial-review-protocol.md#domain-review` (domain-review challenge questions). Append confidence level (High/Medium/Low) to the `summary` field. diff --git a/plugins/dev-team/agents/js-fp-review.md b/plugins/dev-team/agents/js-fp-review.md index 1794b11..4a674f6 100644 --- a/plugins/dev-team/agents/js-fp-review.md +++ b/plugins/dev-team/agents/js-fp-review.md @@ -76,6 +76,10 @@ Impure patterns: Code structure, naming, tests, domain modeling, security (handled by other agents) +## Output discipline + +Derive `status` from the highest-severity finding, never from volume (`skill://dev-team-knowledge/review-output-discipline.md#deterministic-status`), and group same-kind findings — enumerate → classify → group — into ~3–5 concept-level findings per file, keeping `error` findings individual (`skill://dev-team-knowledge/review-output-discipline.md#finding-grouping`). + ## Self-Challenge After producing findings, run the adversarial challenge pass from `skill://dev-team-knowledge/adversarial-review-protocol.md#js-fp-review` (the shared challenger loop + the js-fp-review challenge questions; ≤3 rounds). Append a confidence level (High/Medium/Low) to the `summary` field. diff --git a/plugins/dev-team/agents/naming-review.md b/plugins/dev-team/agents/naming-review.md index b7852a2..ffddc02 100644 --- a/plugins/dev-team/agents/naming-review.md +++ b/plugins/dev-team/agents/naming-review.md @@ -61,6 +61,10 @@ Consistency: Structure, tests, domain modeling (handled by other agents) +## Output discipline + +Derive `status` from the highest-severity finding, never from volume (`skill://dev-team-knowledge/review-output-discipline.md#deterministic-status`), and group same-kind findings — enumerate → classify → group — into ~3–5 concept-level findings per file, keeping `error` findings individual (`skill://dev-team-knowledge/review-output-discipline.md#finding-grouping`). + ## Self-Challenge After producing findings, run the adversarial challenge pass from `skill://dev-team-knowledge/adversarial-review-protocol.md#naming-review` (the shared challenger loop + the naming-review challenge questions; ≤3 rounds). Append a confidence level (High/Medium/Low) to the `summary` field. diff --git a/plugins/dev-team/agents/orchestrator.md b/plugins/dev-team/agents/orchestrator.md index 7ff5355..d9a4599 100644 --- a/plugins/dev-team/agents/orchestrator.md +++ b/plugins/dev-team/agents/orchestrator.md @@ -147,11 +147,12 @@ Every non-trivial task follows three explicit phases. Each phase runs in minimal ### Phase 3: Implement -- **Goal**: Execute the plan. Write code, run tests, verify at each step. +- **Goal**: Execute the plan. Write code, run its tests, verify, and refactor at each step (test-after with refactoring). - **Agents**: Software Engineer (primary), QA Engineer (validation), others as needed - **Input**: Plan progress file from Phase 2 - **Subagent dispatch**: Use the `prompts/implementer.md` template when dispatching implementation subagents via the `task` tool. For parallel implementation of independent units, use `isolation: "worktree"` on the `task` tool to give each subagent its own git worktree — this prevents file conflicts when multiple units are implemented concurrently. -- **Tests required (not test-first)**: every behavior change ships with tests, written in whatever order fits (no enforced RED-GREEN-REFACTOR). A unit is done only when `/impl-verify` reports its strict build + tests green; the orchestrator requires that verdict as evidence (`tests-required` rule). +- **Tests required (not test-first)**: every behavior change ships with tests, written in whatever order fits — order is not enforced. A unit is done only when `/impl-verify` reports its strict build + tests green; the orchestrator requires that verdict as evidence (`tests-required` rule). +- **Refactor after green (every unit)**: once a unit's tests pass, take a deliberate refactor pass — structure, naming, duplication, use-the-platform (the `refactor-opportunity-review` lens) — keeping `/impl-verify` green, before the inline review. This is the *refactoring* half of test-after-with-refactoring; the pass is always taken, changes are made only when there's a real opportunity. - **Output**: Working code that passes all tests, acceptance criteria, and code review - **Three-stage inline review**: After each discrete unit of work completes, run spec-compliance first, then quality, then browser verification for UI changes: 1. **Stage 1 — Spec compliance**: Run `spec-compliance-review` using the `prompts/spec-reviewer.md` template. Does the code match the spec? If fail → fix before proceeding to Stage 2. diff --git a/plugins/dev-team/agents/performance-review.md b/plugins/dev-team/agents/performance-review.md index a31a6d8..274164c 100644 --- a/plugins/dev-team/agents/performance-review.md +++ b/plugins/dev-team/agents/performance-review.md @@ -69,6 +69,10 @@ Algorithmic: Code structure, naming, tests, domain modeling, security, concurrency (handled by other agents) +## Output discipline + +Derive `status` from the highest-severity finding, never from volume (`skill://dev-team-knowledge/review-output-discipline.md#deterministic-status`), and group same-kind findings — enumerate → classify → group — into ~3–5 concept-level findings per file, keeping `error` findings individual (`skill://dev-team-knowledge/review-output-discipline.md#finding-grouping`). + ## Self-Challenge After producing findings, run the adversarial challenge pass from `skill://dev-team-knowledge/adversarial-review-protocol.md#performance-review` (the shared challenger loop + the performance-review challenge questions; ≤3 rounds). Append a confidence level (High/Medium/Low) to the `summary` field. diff --git a/plugins/dev-team/agents/progress-guardian.md b/plugins/dev-team/agents/progress-guardian.md index 859a22c..20bd23b 100644 --- a/plugins/dev-team/agents/progress-guardian.md +++ b/plugins/dev-team/agents/progress-guardian.md @@ -36,7 +36,7 @@ Plan adherence: - Steps executed out of order without justification - Steps skipped entirely - Work done that doesn't map to any plan step -- Tests not written before implementation (RED before GREEN) +- A behavior-change step marked done without its tests (`tests-required` — order doesn't matter, presence and a green `/impl-verify` do) Commit discipline: diff --git a/plugins/dev-team/agents/qa-engineer.md b/plugins/dev-team/agents/qa-engineer.md index 3c8e1c3..65742bf 100644 --- a/plugins/dev-team/agents/qa-engineer.md +++ b/plugins/dev-team/agents/qa-engineer.md @@ -1,6 +1,6 @@ --- name: qa-engineer -description: Acceptance test driven development, test generation, quality metrics, and regression testing +description: Acceptance-scenario-based testing, test generation, quality metrics, and regression testing tools: read, search, find, edit, write, bash model: claude-sonnet-4-6 thinking-level: medium @@ -10,7 +10,7 @@ thinking-level: medium ## Technical Responsibilities -- Acceptance test driven development: per-slice Gherkin scenarios (authored in `/plan`) define behavior before implementation begins +- Acceptance scenarios: per-slice Gherkin scenarios (authored in `/plan`) describe the target behavior the build must satisfy; implementation and its unit tests follow (test-after, not test-first) - Test case generation (unit, integration, e2e) derived from the plan's slice scenarios — judged against `skill://dev-team-knowledge/test-automation-principles.md` (goals + principles rubric); DB tests follow `skill://dev-team-knowledge/database-test-patterns.md` (Fake vs real DB, isolation, teardown) - Run `/impl-verify` to execute the stack's strict build + tests and report the bounded PASS/FAIL/HALT verdict as test evidence — never weaken a gate to go green (`no-disable-analyzers`) - Automated testing framework setup and maintenance diff --git a/plugins/dev-team/agents/refactor-opportunity-review.md b/plugins/dev-team/agents/refactor-opportunity-review.md index 8bf8b0b..79e0916 100644 --- a/plugins/dev-team/agents/refactor-opportunity-review.md +++ b/plugins/dev-team/agents/refactor-opportunity-review.md @@ -1,6 +1,6 @@ --- name: refactor-opportunity-review -description: Assesses refactoring opportunities after tests pass (TDD REFACTOR phase), distinguishing semantic duplication from structural similarity +description: Assesses refactoring opportunities after tests pass (the test-after refactoring step), distinguishing semantic duplication from structural similarity tools: read, search, find model: claude-sonnet-4-6 thinking-level: medium @@ -46,6 +46,13 @@ Return `{"status": "skip", "issues": [], "summary": "No refactoring candidates i - Primitive obsession: repeated primitive combinations that should be a type - Dead code: unreachable branches, unused variables, commented-out code +### Use-the-platform (suggestions) + +- **Reinvented built-ins**: hand-rolled `min`/`max`/`sum`/`clamp`/`copy` (and similar) when the language standard library already provides them. Check the language **and version** before flagging (e.g. Go <1.21 has no builtin `min`/`max`; older targets may lack a stdlib helper). +- **Reinvented helpers**: duplicated inline computation when a named function already exists in scope — point to the existing one. +- **Open-coded idioms**: the same non-trivial expression repeated 3+ times inline (e.g. a tolerance comparison) that should be a named predicate/helper. +- Map by *concept*, not syntax — honor language-specific constraints rather than matching tokens. + ### Nice (later) - Structural similarity that isn't semantic duplication (leave alone) @@ -64,7 +71,11 @@ Before flagging duplication, ask: "If the business rule changes, would both copi ## Ignore -Naming (naming-review), test quality (test-review), architecture (arch-review), security (security-review). This agent focuses exclusively on refactoring opportunities within the TDD cycle. +Naming (naming-review), test quality (test-review), architecture (arch-review), security (security-review). This agent focuses exclusively on refactoring opportunities during the refactoring step, once tests pass. + +## Output discipline + +Derive `status` from the highest-severity finding, never from volume (`skill://dev-team-knowledge/review-output-discipline.md#deterministic-status`), and group same-kind findings — enumerate → classify → group — into ~3–5 concept-level findings per file, keeping `error` findings individual (`skill://dev-team-knowledge/review-output-discipline.md#finding-grouping`). ## Self-Challenge diff --git a/plugins/dev-team/agents/security-review.md b/plugins/dev-team/agents/security-review.md index a992fc0..7cf3166 100644 --- a/plugins/dev-team/agents/security-review.md +++ b/plugins/dev-team/agents/security-review.md @@ -170,6 +170,10 @@ Input: When a finding is an untrusted-input or declared-schema boundary, a `suggestedFix` may cross-reference the matching test technique: parser/deserializer hardening → `skill://dev-team-knowledge/testing-techniques/fuzz.md`; payload-shape conformance → `skill://dev-team-knowledge/testing-techniques/schema-validation.md`. +## Output discipline + +Derive `status` from the highest-severity finding, never from volume (`skill://dev-team-knowledge/review-output-discipline.md#deterministic-status`), and group same-kind findings — enumerate → classify → group — into ~3–5 concept-level findings per file, keeping `error` findings individual (`skill://dev-team-knowledge/review-output-discipline.md#finding-grouping`). + ## Self-Challenge After producing findings, run the adversarial challenge pass from `skill://dev-team-knowledge/adversarial-review-protocol.md#security-review` (security-review challenge questions). Append confidence level (High/Medium/Low) to the `summary` field. diff --git a/plugins/dev-team/agents/spec-compliance-review.md b/plugins/dev-team/agents/spec-compliance-review.md index 9332960..b582377 100644 --- a/plugins/dev-team/agents/spec-compliance-review.md +++ b/plugins/dev-team/agents/spec-compliance-review.md @@ -82,6 +82,10 @@ Return `{"status": "skip", "issues": [], "summary": "No spec artifacts found"}` - Plan deviation → `warning` (may be justified) - Cosmetic divergence from the plan that meets every criterion (naming, file placement, ordering) → `suggestion` +## Output discipline + +Derive `status` from the highest-severity finding, never from volume (`skill://dev-team-knowledge/review-output-discipline.md#deterministic-status`), and group same-kind findings — enumerate → classify → group — into ~3–5 concept-level findings per file, keeping `error` findings individual (`skill://dev-team-knowledge/review-output-discipline.md#finding-grouping`). + ## Self-Challenge After producing findings, run the adversarial challenge pass from `skill://dev-team-knowledge/adversarial-review-protocol.md#spec-compliance-review` (the shared challenger loop + the spec-compliance-review challenge questions; ≤3 rounds). Append a confidence level (High/Medium/Low) to the `summary` field. diff --git a/plugins/dev-team/agents/structure-review.md b/plugins/dev-team/agents/structure-review.md index d67f041..0f5710a 100644 --- a/plugins/dev-team/agents/structure-review.md +++ b/plugins/dev-team/agents/structure-review.md @@ -72,6 +72,10 @@ Design smells: - For SRP violations and coupling issues, map to the smell → pattern table in `skill://dev-team-knowledge/design-smells.md#design-smells-pattern-mapping`. Every finding should name the smell, quote the code, and include a refactor sketch. - For method-level issues (nesting, long methods, flag arguments), check Object Calisthenics rules 1-2 and 7 in `skill://dev-team-knowledge/object-calisthenics.md`. Whole-file load: the nine-rule catalog is short enough that the agent reads the whole file rather than picking specific rule anchors. +## Output discipline + +Derive `status` from the highest-severity finding, never from volume (`skill://dev-team-knowledge/review-output-discipline.md#deterministic-status`), and group same-kind findings — enumerate → classify → group — into ~3–5 concept-level findings per file, keeping `error` findings individual (`skill://dev-team-knowledge/review-output-discipline.md#finding-grouping`). + ## Self-Challenge After producing findings, run the adversarial challenge pass from `skill://dev-team-knowledge/adversarial-review-protocol.md#structure-review` (structure-review challenge questions). Append confidence level (High/Medium/Low) to the `summary` field. diff --git a/plugins/dev-team/agents/svelte-review.md b/plugins/dev-team/agents/svelte-review.md index 5ce2445..b692470 100644 --- a/plugins/dev-team/agents/svelte-review.md +++ b/plugins/dev-team/agents/svelte-review.md @@ -88,6 +88,10 @@ Lifecycle issues: Generic array mutation style (handled by js-fp-review), race conditions in non-reactive paths (handled by concurrency-review), accessibility (handled by a11y-review), code structure, naming, domain modeling, security, complexity (handled by other agents) +## Output discipline + +Derive `status` from the highest-severity finding, never from volume (`skill://dev-team-knowledge/review-output-discipline.md#deterministic-status`), and group same-kind findings — enumerate → classify → group — into ~3–5 concept-level findings per file, keeping `error` findings individual (`skill://dev-team-knowledge/review-output-discipline.md#finding-grouping`). + ## Self-Challenge After producing findings, run the adversarial challenge pass from `skill://dev-team-knowledge/adversarial-review-protocol.md#svelte-review` (the shared challenger loop + the svelte-review challenge questions; ≤3 rounds). Append a confidence level (High/Medium/Low) to the `summary` field. diff --git a/plugins/dev-team/agents/test-review.md b/plugins/dev-team/agents/test-review.md index 43584af..144032c 100644 --- a/plugins/dev-team/agents/test-review.md +++ b/plugins/dev-team/agents/test-review.md @@ -101,6 +101,10 @@ Testability blockers: - Mocking of concrete classes (not interfaces) — flag as warning; extract an interface for the dependency - Tests using reflection into private members as primary strategy — flag as warning; the public API surface needs expanding +## Output discipline + +Derive `status` from the highest-severity finding, never from volume (`skill://dev-team-knowledge/review-output-discipline.md#deterministic-status`), and group same-kind findings — enumerate → classify → group — into ~3–5 concept-level findings per file, keeping `error` findings individual (`skill://dev-team-knowledge/review-output-discipline.md#finding-grouping`). + ## Self-Challenge After producing findings, run the adversarial challenge pass from `skill://dev-team-knowledge/adversarial-review-protocol.md#test-review` (test-review challenge questions). Append confidence level (High/Medium/Low) to the `summary` field. diff --git a/plugins/dev-team/agents/test-smell-review.md b/plugins/dev-team/agents/test-smell-review.md index 8ebd887..f48fa9c 100644 --- a/plugins/dev-team/agents/test-smell-review.md +++ b/plugins/dev-team/agents/test-smell-review.md @@ -82,6 +82,10 @@ Pyramid placement (load `skill://dev-team-knowledge/test-pyramid.md`): - Unit test doing real I/O (mis-layered → Slow Tests); E2E asserting a single edge case (belongs at unit); suite-level ice-cream-cone / hourglass / cupcake shape +## Output discipline + +Derive `status` from the highest-severity finding, never from volume (`skill://dev-team-knowledge/review-output-discipline.md#deterministic-status`), and group same-kind findings — enumerate → classify → group — into ~3–5 concept-level findings per file, keeping `error` findings individual (`skill://dev-team-knowledge/review-output-discipline.md#finding-grouping`). + ## Self-Challenge After producing findings, run the test-review challenge pass in `skill://dev-team-knowledge/adversarial-review-protocol.md#test-smell-review`. Append confidence level (High/Medium/Low) to the `summary` field. diff --git a/plugins/dev-team/agents/token-efficiency-review.md b/plugins/dev-team/agents/token-efficiency-review.md index 7d2ced0..3890b9d 100644 --- a/plugins/dev-team/agents/token-efficiency-review.md +++ b/plugins/dev-team/agents/token-efficiency-review.md @@ -112,6 +112,10 @@ CLAUDE.md, rules, and skills must follow LLM-native patterns. Flag violations: Code correctness, security, logic (handled by other agents) +## Output discipline + +Derive `status` from the highest-severity finding, never from volume (`skill://dev-team-knowledge/review-output-discipline.md#deterministic-status`), and group same-kind findings — enumerate → classify → group — into ~3–5 concept-level findings per file, keeping `error` findings individual (`skill://dev-team-knowledge/review-output-discipline.md#finding-grouping`). + ## Self-Challenge After producing findings, run the adversarial challenge pass from `skill://dev-team-knowledge/adversarial-review-protocol.md#token-efficiency-review` (the shared challenger loop + the token-efficiency-review challenge questions; ≤3 rounds). Append a confidence level (High/Medium/Low) to the `summary` field. diff --git a/plugins/dev-team/commands/triage.md b/plugins/dev-team/commands/triage.md index 4d361b0..637c935 100644 --- a/plugins/dev-team/commands/triage.md +++ b/plugins/dev-team/commands/triage.md @@ -8,4 +8,4 @@ Arguments: `$ARGUMENTS` (bug description / issue reference). 1. `read skill://triage` and follow it (uses `skill://systematic-debugging`). 2. Write a triage record to `.triage/.md` with root-cause analysis and a - TDD fix plan. Do not fix yet unless asked — triage produces the plan. + fix plan (regression test + fix + refactor). Do not fix yet unless asked — triage produces the plan. diff --git a/plugins/dev-team/prompts/implementer.md b/plugins/dev-team/prompts/implementer.md index e39ca7a..da3b050 100644 --- a/plugins/dev-team/prompts/implementer.md +++ b/plugins/dev-team/prompts/implementer.md @@ -29,9 +29,9 @@ Run **`/impl-verify`** (detects the stack, runs the strict build + tests, bounde - **FAIL** → fix the *cause* and re-run; never silence the gate (`no-disable-analyzers`). If a regression in unrelated tests appears, revert and re-approach — do not "fix" other tests to accommodate your change. - **HALT** (fix budget spent) → escalate to the orchestrator. -### 4. Refactor (optional) +### 4. Refactor (after green — the test-after refactoring step) -Once green, improve structure, naming, and duplication without changing behavior; re-run `/impl-verify` and keep it green. If a refactor suggests a structural change beyond the step's scope, log it as a follow-up and stop — do not expand scope mid-step. +Once green, take a deliberate refactor pass: improve structure, naming, and duplication, and replace reinvented built-ins with the platform (the `refactor-opportunity-review` lens), without changing behavior; re-run `/impl-verify` and keep it green. Always take the pass — this is the *refactoring* half of test-after-with-refactoring; making changes is conditional on finding a real opportunity. If a refactor suggests a structural change beyond the step's scope, log it as a follow-up and stop — do not expand scope mid-step. ### 5. Verification evidence diff --git a/plugins/dev-team/rules/dev-team-operating-manual.md b/plugins/dev-team/rules/dev-team-operating-manual.md index 7803e13..55f3acc 100644 --- a/plugins/dev-team/rules/dev-team-operating-manual.md +++ b/plugins/dev-team/rules/dev-team-operating-manual.md @@ -21,7 +21,7 @@ cannot name the friction it removes does not ship. ## Where the detail lives (load on demand — do not inline it here) - **Orchestration pipeline** — Research → Plan → Implement, human gates, the - five plan-review personas, the three-stage inline review, ATDD, and the + five plan-review personas, the three-stage inline review, acceptance scenarios, and the decision log: owned by the **orchestrator** agent prompt (`## Three-Phase Workflow` / `## Decision Log`). Don't restate it in always-loaded context. - **Model routing** (tier → model) — source of truth is diff --git a/plugins/dev-team/skills/build/SKILL.md b/plugins/dev-team/skills/build/SKILL.md index f9df6a9..ae64f64 100644 --- a/plugins/dev-team/skills/build/SKILL.md +++ b/plugins/dev-team/skills/build/SKILL.md @@ -78,7 +78,7 @@ For each step within a slice, dispatch implementation following the implementer 1. **Implement** — Write the step's code **and its tests** (in whichever order fits — test-first is not required), covering the slice scenario it traces to. Don't add behavior beyond what the step and its tests require. 2. **Verify (hard gate)** — Run **`/impl-verify`** (strict stack build + tests, bounded fix counter). **PASS** → proceed. **FAIL** → fix the *cause* and re-run; never silence the gate (`no-disable-analyzers`). **HALT** (fix budget spent) → escalate to the human. Paste the verdict as evidence. -3. **Refactor (optional)** — Clean up structure, naming, duplication without changing behavior; re-run `/impl-verify` and keep it green. +3. **Refactor (after green — the test-after refactoring step)** — Take a deliberate refactor pass on **every** step once tests are green: scan for structure, naming, duplication, and reinvented built-ins (the `refactor-opportunity-review` lens), apply behavior-preserving cleanups, then re-run `/impl-verify` and keep it green. The pass is always taken; making changes is conditional on finding a real opportunity. This is the *refactoring* half of omp's test-after-with-refactoring — don't skip it. Refactors beyond the step's scope are logged as follow-ups, not done inline. 4. **Inline review checkpoint** — Route review depth based on the step's **Complexity** classification: - **trivial**: Skip inline review. The final `/code-review` (step 6) covers all modified files. - **standard**: Run `/review-agent spec-compliance-review` against changed files. If it passes, run quality review agents relevant to what changed. If review finds actionable issues (error/warning with high/medium confidence), auto-fix and re-run failed agents (up to 5 iterations per the review-fix loop in `agents/orchestrator.md`). Escalate to user if the loop doesn't converge. diff --git a/plugins/dev-team/skills/cd-test-architecture/SKILL.md b/plugins/dev-team/skills/cd-test-architecture/SKILL.md index 6ead772..50fcbe9 100644 --- a/plugins/dev-team/skills/cd-test-architecture/SKILL.md +++ b/plugins/dev-team/skills/cd-test-architecture/SKILL.md @@ -142,4 +142,4 @@ Write to `reports/cd-test-architecture-.md` (or chat for a single component - Pairs with `test-design-advisor` (unit/module design) and the `test-smell-review` / `test-review` agents (per-file findings). This skill sets the application-level target those operate within. - For under-tested/legacy components, the characterization-baseline-then-refactor procedure is the **`legacy-code`** skill (Feathers' algorithm: change points → test points/seams → break dependencies → characterization tests → refactor under green). Defer the mechanics to it. - Use the **`domain-driven-design`** and **`domain-analysis`** skills to suggest the target structure for the post-baseline refactor — where bounded contexts, ports, and seams should land — so refactoring improves the domain model, not just testability. -- Hand the migration path to `/plan` or `/build` for TDD implementation. This skill stops at the architecture and plan. +- Hand the migration path to `/plan` or `/build` for implementation. This skill stops at the architecture and plan. diff --git a/plugins/dev-team/skills/dev-team-knowledge/index.json b/plugins/dev-team/skills/dev-team-knowledge/index.json index b539554..02047f7 100644 --- a/plugins/dev-team/skills/dev-team-knowledge/index.json +++ b/plugins/dev-team/skills/dev-team-knowledge/index.json @@ -529,6 +529,16 @@ "anchor": "how-this-connects-to-the-rest-of-the-toolkit" } }, + "plugins/dev-team/skills/dev-team-knowledge/review-output-discipline.md": { + "Deterministic status": { + "summary": "Agent `status` is a pure function of the highest-severity finding (error->fail, warning/suggestion->warn, none->pass), never of volume; capped findings can lower but never raise status.", + "anchor": "deterministic-status" + }, + "Finding grouping": { + "summary": "Enumerate -> Classify -> Group: consolidate same-kind findings into one concept-level finding (~3-5 per file); keep error-severity findings individual.", + "anchor": "finding-grouping" + } + }, "plugins/dev-team/skills/dev-team-knowledge/review-rubric.md": { "Health Score Calculation": { "summary": "Collect the status from each agent: `pass`, `warn`, `fail`, `skip`.", diff --git a/plugins/dev-team/skills/dev-team-knowledge/review-output-discipline.md b/plugins/dev-team/skills/dev-team-knowledge/review-output-discipline.md new file mode 100644 index 0000000..04352c7 --- /dev/null +++ b/plugins/dev-team/skills/dev-team-knowledge/review-output-discipline.md @@ -0,0 +1,53 @@ +# Review Output Discipline + +Shared output contract for every blocking review agent. Read it before emitting +the findings JSON. Two rules — a deterministic status, and grouped findings — +make reviews reproducible and cheap to read. Each agent's own `Status:` / +`Severity:` lines explain what each severity *means* for that agent; this file +fixes how severity maps to status and how findings are consolidated. + +## Deterministic status + +`status` is a pure function of the **highest-severity finding** — never of how +many findings there are. Volume never elevates the tier. + +| status | condition | +|--------|-----------| +| `fail` | at least one `error` finding | +| `warn` | at least one `warning` or `suggestion` finding (no `error`) | +| `pass` | no findings | +| `skip` | nothing in this agent's scope in the changed files | + +So 1 error → `fail`; 50 suggestions and 0 errors → `warn`; 0 findings → `pass`. +Two findings of the same severity never combine into a higher status. + +Some findings are **capped**: an agent may declare that a class of finding +(e.g. comment hygiene, tracker-ID references) is capped at `suggestion`, so on +its own it can never raise status above `warn`. A cap lowers severity; it never +raises it. + +> Note vs upstream: we keep `warning → warn` (not `warning → fail`) so the +> orchestrator's `review-rubric.md` warn tier stays meaningful and escalation +> stays calibrated to omp's gates. The determinism is the same; only the +> warning threshold differs. + +## Finding grouping + +Report **distinct problems at the concept level**, not one finding per +occurrence. Run three phases before emitting: + +1. **Enumerate** — list every candidate in scope (identifiers, doc targets, + refactor sites, …). Enumerate *all* of them before judging any. +2. **Classify** — apply the detection rules; assign a severity only to what is + actually flagged. +3. **Group** — consolidate related violations into one finding. All occurrences + of one idiom become a single finding (e.g. "non-standard abbreviations: `usr`, + `cfg`, `tmp`" → one finding; all magic numbers → one cluster). Cite the + representative `file`/`line` and list the rest in the `message`. + +Target roughly **3–5 findings per file**. Keep `error`-severity findings +(misleading names, semantic duplication, actively wrong docs) **individual** — +each is independently actionable. Group only the lower-severity, same-kind noise. + +Grouping cuts review tokens and the human's triage time without losing signal: +fewer, denser findings that each map to one fix. diff --git a/plugins/dev-team/skills/dev-team-knowledge/test-strategy.md b/plugins/dev-team/skills/dev-team-knowledge/test-strategy.md index 59bb5a2..e482a3a 100644 --- a/plugins/dev-team/skills/dev-team-knowledge/test-strategy.md +++ b/plugins/dev-team/skills/dev-team-knowledge/test-strategy.md @@ -37,7 +37,7 @@ How the test is written and driven. | Strategy | What it is | Use when | Cost / risk | |---|---|---|---| -| **Scripted Test** | Test programs written by hand (the test code you write in any xUnit) | **Default**, and the only option that can *drive* development (TDD) — there's nothing to record before the code exists | Needs programming skill; not for non-technical authors | +| **Scripted Test** | Test programs written by hand (the test code you write in any xUnit) | **Default**, and the only option that can be authored before the code exists — there's nothing to record before the code exists | Needs programming skill; not for non-technical authors | | **Data-Driven Test** | A common interpreter holds the logic; cases live as rows in an external data table (one line per case) | Many cases vary only by input/expected data; lets non-programmers add cases (e.g. Fit/Fitnesse) | Interpreter is itself code to maintain; failures can be opaque; easy to over-build | | **Recorded Test** | Capture interactions with a running SUT (usually through the UI) and replay them | Regression-testing a **finished, stable** app where little will change | Brittle to UI/behavior change; can't drive development; high re-record cost | | **Test Automation Framework** | The harness that discovers, runs, verifies, and reports tests; you supply only test-specific logic | You **use** one (xUnit, etc.) — rarely build one | Building your own is almost always wasted effort | diff --git a/plugins/dev-team/skills/feedback-learning/SKILL.md b/plugins/dev-team/skills/feedback-learning/SKILL.md index d562224..260ae66 100644 --- a/plugins/dev-team/skills/feedback-learning/SKILL.md +++ b/plugins/dev-team/skills/feedback-learning/SKILL.md @@ -1,6 +1,6 @@ --- name: feedback-learning -description: Capture amend/learn/remember/forget keywords from the user and update agent or skill configurations. Invoke immediately when the user issues any of these trigger words — parse the change, preview a diff, apply it, and log it to the audit trail. +description: Capture amend/learn/remember/forget/review keywords from the user and update agent or skill configurations. Invoke immediately when the user issues any of these trigger words — parse the change, preview a diff, apply it, and log it to the audit trail. `review` dispositions the system-proposed pending-review queue (the closed learning loop). role: orchestrator user-invocable: true --- @@ -17,8 +17,9 @@ Procedure for capturing user feedback, updating configurations dynamically, and | **learn** | Teach something new | `learn: our API uses kebab-case URLs` | | **remember** | Persist a preference across sessions | `remember: always run tests before completing tasks` | | **forget** | Remove a previous preference | `forget: the kebab-case URL convention` | +| **review** | Disposition the pending-review queue | `review pending` / `session review` | -All four follow the same processing flow. The distinction is semantic (helping the user express intent), not mechanical. +`amend`/`learn`/`remember`/`forget` follow the same processing flow — the distinction is semantic (helping the user express intent), not mechanical. `review` is the disposition surface for system-proposed changes (see [Session review](#session-review-disposition)). ## Where Changes Are Written @@ -116,7 +117,7 @@ amend: rollback all changes from today ## Learning Loop -After task completion, the orchestrator captures learnings in two ways: +After task completion, the orchestrator captures learnings in two ways, then **queues every system proposal to an explicit decision** (the closed loop): ### Post-task reflection @@ -145,6 +146,45 @@ The orchestrator also watches for patterns across tasks: When a pattern is detected (minimum 3 occurrences), propose the change with rationale. User approves or rejects. If approved, apply and log with `trigger: "system"`. +### Pending-review queue (closing the loop) + +A system-initiated proposal is only useful if it reaches a decision. When the user is not present to disposition it immediately, **enqueue** it to `metrics/pending-review.jsonl` (append-only, one JSON object per line) rather than dropping it. This closes the loop: every proposal is tracked from generation to an explicit accept/reject. + +```json +{ + "id": "2026-06-25T14:30:00Z-software-engineer-fp", + "proposed_at": "2026-06-25T14:30:00Z", + "trigger": "system", + "source": "recurring-correction", + "type": "amend", + "description": "Software engineer keeps being corrected to prefer functional patterns (3 occurrences)", + "target_file": "CLAUDE.md", + "target_section": "Agent Overrides > Software Engineer", + "proposed_value": "- Prefer functional programming patterns over OOP", + "evidence": ["task-12", "task-15", "task-19"], + "status": "pending" +} +``` + +| Field | Description | +| --- | --- | +| `id` | Stable unique id (timestamp + slug) | +| `source` | What generated it: `recurring-correction` or `post-task-reflection` | +| `evidence` | Task/finding ids that motivated it (≥3 for a recurring correction) | +| `status` | `pending` until dispositioned, then `approved` or `rejected` | + +**No duplicates:** before appending, scan the queue for an open `pending` entry with the same `target_file` + `target_section` + intent and skip if one exists. + +### Session review (disposition) + +This is where the user closes the loop. On request ("review pending", "session review") list every `pending` entry with its evidence, then for each: + +1. **Preview** the proposed edit as a diff. +2. **Approve** → apply via the resolution table, log to `config-changelog.jsonl` with `approved_by: "user"`, then stamp the queue entry `status: "approved"`, `reviewed_at`, `approved_by`. +3. **Reject** → do not apply; stamp the queue entry `status: "rejected"`, `reviewed_at`, `rejected_by`, and an optional `reason`. + +Disposition **updates the entry in place** so the queue stays an accurate ledger of what was proposed and what became of it (distinct from the append-only `config-changelog.jsonl`). Approved structural changes still follow the approval rules above. **Cost note:** batch the queue in one review pass instead of interrupting mid-task — the loop is asynchronous by design, which is what keeps it cheap. + ## Constraints - Never edit plugin cache files — all changes go to project-local files diff --git a/plugins/dev-team/skills/init-dev-team/SKILL.md b/plugins/dev-team/skills/init-dev-team/SKILL.md index b5c6231..6ac29a8 100644 --- a/plugins/dev-team/skills/init-dev-team/SKILL.md +++ b/plugins/dev-team/skills/init-dev-team/SKILL.md @@ -500,7 +500,7 @@ Mutation testing: Java (pitest-maven): ✓ configured [or: ✗ skipped | ✗ manual steps needed] C# (Stryker.NET): ✓ [or: ✗ skipped | ✗ failed] -The mutation gate will now block zero-kill tests after each RED→GREEN transition. +The mutation gate will now block zero-kill tests after each failing→passing transition. Run a test suite to verify: when tests go from failing to passing, the gate should analyze them within 60 seconds. ``` diff --git a/plugins/dev-team/skills/mutation-testing/SKILL.md b/plugins/dev-team/skills/mutation-testing/SKILL.md index b0330e3..719f35d 100644 --- a/plugins/dev-team/skills/mutation-testing/SKILL.md +++ b/plugins/dev-team/skills/mutation-testing/SKILL.md @@ -59,7 +59,7 @@ For each survivor, classify and act: 2. **Check for equivalence** — does the mutation actually change observable behavior? Common equivalent patterns: dead code or unreachable branches; commutative-operation reorderings; conditions redundant with other guards; logging/debug-only code. 3. **Find related tests** — which tests cover this code; what do they assert. 4. **Classify** — missing assertion, missing test, boundary gap, or equivalent. -5. **Write the fix test** with RED-GREEN discipline: must fail against the mutant and pass against the original. +5. **Write the fix test** so it fails against the mutant and passes against the original. ### Weak vs strong test patterns diff --git a/plugins/dev-team/skills/plan/SKILL.md b/plugins/dev-team/skills/plan/SKILL.md index 54b1d61..f076283 100644 --- a/plugins/dev-team/skills/plan/SKILL.md +++ b/plugins/dev-team/skills/plan/SKILL.md @@ -55,7 +55,7 @@ When authoring each slice's Gherkin, cover: - **Edge cases** — empty collections, boundary values, concurrent access, idempotency. - **Error scenarios** — specify observable error behavior, not just "should fail". -Keep scenarios implementation-independent (no databases, selectors, or internal data structures in step text) and deterministic. Every acceptance criterion from the spec must be covered by at least one scenario across the slices. Each TDD step traces back to one or more scenarios in its slice. +Keep scenarios implementation-independent (no databases, selectors, or internal data structures in step text) and deterministic. Every acceptance criterion from the spec must be covered by at least one scenario across the slices. Each build step traces back to one or more scenarios in its slice. ### 3. Create the plan @@ -81,7 +81,7 @@ Write the plan file using this structure: ## Slices A slice is a vertically deliverable increment. Each slice carries the Gherkin -scenario(s) that define its behavior, followed by the TDD steps that satisfy them. +scenario(s) that define its behavior, followed by the build steps that satisfy them. Steps are numbered `.` (1.1, 1.2, 2.1, …). ### Slice 1: @@ -221,7 +221,7 @@ Before presenting to the user, dispatch **five plan review personas in parallel* | Reviewer | Template | Model | Focus | |----------|----------|-------|-------| -| Acceptance Test Critic | `${CLAUDE_PLUGIN_ROOT}/prompts/plan-review-acceptance.md` | `sonnet` | Per-slice Gherkin quality (determinism, isolation, implementation-independence), scenario gaps, error paths, criteria coverage, TDD traceability | +| Acceptance Test Critic | `${CLAUDE_PLUGIN_ROOT}/prompts/plan-review-acceptance.md` | `sonnet` | Per-slice Gherkin quality (determinism, isolation, implementation-independence), scenario gaps, error paths, criteria coverage, step-to-scenario traceability | | Design & Architecture Critic | `${CLAUDE_PLUGIN_ROOT}/prompts/plan-review-design.md` | `sonnet` | Coupling, abstractions, structural risks, pattern adherence | | UX Critic | `${CLAUDE_PLUGIN_ROOT}/prompts/plan-review-ux.md` | `sonnet` | User journey, error UX, cognitive load, accessibility | | Strategic Critic | `${CLAUDE_PLUGIN_ROOT}/prompts/plan-review-strategic.md` | `sonnet` | Problem fit, scope, slice boundaries, risk, opportunity cost | diff --git a/plugins/dev-team/skills/quality-gate-pipeline/SKILL.md b/plugins/dev-team/skills/quality-gate-pipeline/SKILL.md index 3e84701..c89beec 100644 --- a/plugins/dev-team/skills/quality-gate-pipeline/SKILL.md +++ b/plugins/dev-team/skills/quality-gate-pipeline/SKILL.md @@ -81,7 +81,7 @@ When a signal fires: **Pause** → **Verify** (use tools) → **Correct** → ** | Task Type | Additional Evidence | |-----------|-------------------| -| Bug fix | Red-green cycle: failing test → passing test | +| Bug fix | Regression test reproduces the bug, then passes after the fix | | New feature | Feature working via test output or demo command | | Refactor | Same test count, same pass count | | Config change | Config loads without error | diff --git a/plugins/dev-team/skills/test-design-advisor/SKILL.md b/plugins/dev-team/skills/test-design-advisor/SKILL.md index 3c0d12a..2dbe6c5 100644 --- a/plugins/dev-team/skills/test-design-advisor/SKILL.md +++ b/plugins/dev-team/skills/test-design-advisor/SKILL.md @@ -136,4 +136,4 @@ A concise advisory report (to chat for a single unit, or to `reports/test-design - Pairs with the `test-smell-review` agent (which *detects* smells) and the `test-design-reviewer` skill (which *scores* an existing suite). This skill *designs* tests forward. - For application-level test architecture (CD pipeline alignment, deterministic config-free CI gate, per-component UI/service/batch patterns), defer to the `cd-test-architecture` skill and its knowledge files (`cd-test-architecture.md`, `component-test-patterns.md`). This skill stays at unit/module altitude. -- Hand the refactor sequence to `/plan` or `/build` for TDD implementation. This skill stops at the design. +- Hand the refactor sequence to `/plan` or `/build` for implementation. This skill stops at the design. diff --git a/plugins/dev-team/skills/test-design-reviewer/SKILL.md b/plugins/dev-team/skills/test-design-reviewer/SKILL.md index dfa85b6..347f0a6 100644 --- a/plugins/dev-team/skills/test-design-reviewer/SKILL.md +++ b/plugins/dev-team/skills/test-design-reviewer/SKILL.md @@ -26,7 +26,7 @@ Each property is scored 1-10: | 5 | **Necessary** | 1.0 | Does it verify behavior that matters? Not testing framework code, not duplicating another test, covers a real scenario or edge case. | | 6 | **Granular** | 1.0 | Does it fail with a clear, specific message? Pinpoints the failure location, doesn't require debugging to understand what broke. | | 7 | **Fast** | 0.8 | Does it run quickly enough for the feedback loop? Unit tests <100ms, integration tests <5s, E2E tests <30s. | -| 8 | **First** | 1.0 | Was it written before or alongside the implementation (TDD)? Evidence: test commit predates implementation, test names describe behavior not implementation. | +| 8 | **Timely** | 1.0 | Does it ship with the implementation in the same change (test-after, not deferred)? Evidence: test and implementation in the same commit/PR, test names describe behavior not implementation. | ## Scoring @@ -70,7 +70,7 @@ Average the per-test scores. Report the distribution (how many Exemplary, Good, ### Top Issues 1. **Maintainability** (avg 5.2): 4 tests coupled to implementation details — use behavior-based assertions 2. **Repeatability** (avg 6.0): 2 tests use `Date.now()` — inject time dependency -3. **First** (avg 6.5): Test names describe implementation ("calls handleSubmit") not behavior ("submits form data") +3. **Timely** (avg 6.5): tests deferred or named after implementation ("calls handleSubmit") instead of behavior ("submits form data") ### Per-Test Scores (lowest first) | Test | Score | Weakest Property | Suggestion | diff --git a/plugins/dev-team/skills/triage/SKILL.md b/plugins/dev-team/skills/triage/SKILL.md index 26e3896..1553628 100644 --- a/plugins/dev-team/skills/triage/SKILL.md +++ b/plugins/dev-team/skills/triage/SKILL.md @@ -2,7 +2,7 @@ name: triage description: >- Investigate a bug, find its root cause, and write a portable triage record to - .triage/.md with a TDD fix plan. Use when the user reports a bug and + .triage/.md with a fix plan (regression test + fix + refactor). Use when the user reports a bug and wants it triaged, says "triage this", "investigate and write it up", or wants a hands-off bug investigation that produces an actionable record. argument-hint: "" @@ -14,7 +14,7 @@ allowed-tools: read, find, search, bash, write, task Role: worker. -Investigate a bug hands-off, find root cause, and write a TDD fix plan to a +Investigate a bug hands-off, find root cause, and write a fix plan to a portable triage record at `.triage/.md` — no issue-tracker dependency. ## Worker constraints @@ -47,14 +47,18 @@ related source files and dependencies, existing tests (covered vs missing), recent changes to affected files (`git log`), error handling in the code path, and similar patterns elsewhere that work correctly. -### 3. Design TDD Fix Plan +### 3. Design the Fix Plan -Create an ordered list of RED-GREEN cycles, each a vertical slice: +Create an ordered list of fix steps, each a vertical slice: -- **RED**: A specific test capturing broken/missing behavior -- **GREEN**: The minimal code change to make that test pass +- **Regression test**: a specific test capturing the broken/missing behavior +- **Fix**: the minimal code change that makes that test pass -Tests verify behavior at the public interface, not implementation details. +Tests are required but order is not enforced (test-after) — for a bug, a +failing-first regression test is encouraged because it proves both the defect +and the fix, but it isn't mandatory. Tests verify behavior at the public +interface, not implementation details. Close every plan with a **refactor** +checkpoint once the tests pass (the test-after refactoring step). ### 4. Write the Triage Record @@ -106,15 +110,15 @@ status: open [What code path is involved, why it fails, contributing factors. Describe modules and behaviors, not file paths — the record should survive refactors.] -## TDD Fix Plan +## Fix Plan -1. **RED**: Write a test that [expected behavior] - **GREEN**: [Minimal change to pass] +1. **Test**: Write a regression test that [expected behavior] + **Fix**: [Minimal change to pass] -2. **RED**: Write a test that [next behavior] - **GREEN**: [Minimal change to pass] +2. **Test**: Write a regression test that [next behavior] + **Fix**: [Minimal change to pass] -**REFACTOR**: [Any cleanup after all tests pass] +**Refactor**: [Any cleanup once all tests pass] ## Acceptance Criteria @@ -124,8 +128,8 @@ modules and behaviors, not file paths — the record should survive refactors.] - [ ] No regressions introduced ``` -At least one RED/GREEN cycle is required. **If no root cause was determined,** -the entire `## TDD Fix Plan` body is exactly: +At least one test + fix step is required. **If no root cause was determined,** +the entire `## Fix Plan` body is exactly: ``` Root cause not determined — manual investigation required diff --git a/scripts/ci-framework-compliance.mjs b/scripts/ci-framework-compliance.mjs new file mode 100644 index 0000000..86d8886 --- /dev/null +++ b/scripts/ci-framework-compliance.mjs @@ -0,0 +1,104 @@ +// Framework-compliance checks for the dev-team plugin (cross-platform; used by CI). +// +// These automate what prior extraction PRs verified by hand: +// A. Every `skill://dev-team-knowledge/.md#` reference resolves +// — the target file exists AND the anchor matches a heading slug OR a +// registered anchor in index.json (the project's hand-authored anchor +// registry, which doesn't always follow strict GitHub slug rules). +// B. Every file keyed in index.json exists. +// C. Every finding-emitting review agent references the shared +// review-output-discipline contract (deterministic status + grouping). +// D. Deliberately-removed TDD identifiers don't creep back in (test-after is +// a framework choice) — outside a small rationale/historical allowlist. +// +// Pure Node, no dependencies. Exit non-zero on any violation. +import { readFileSync, readdirSync, statSync } from "node:fs"; +import { join } from "node:path"; + +const ROOT = "plugins/dev-team"; +const KDIR = join(ROOT, "skills/dev-team-knowledge"); +let failures = 0; +const fail = (check, msg) => { console.error(`FAIL [${check}] ${msg}`); failures++; }; + +function walk(dir, exts, out = []) { + for (const e of readdirSync(dir)) { + if (e === "node_modules" || e === ".git") continue; + const p = join(dir, e), s = statSync(p); + if (s.isDirectory()) walk(p, exts, out); + else if (exts.some((x) => e.endsWith(x))) out.push(p); + } + return out; +} +function exists(p) { try { statSync(p); return true; } catch { return false; } } + +// GitHub-style heading slugger (lowercase, strip punctuation except word/space/ +// hyphen, each whitespace -> hyphen with no collapse, duplicate suffix -N). +function headingSlugs(file) { + const counts = new Map(), set = new Set(); + for (const line of readFileSync(file, "utf8").split(/\r?\n/)) { + const m = /^#{1,6}\s+(.*)$/.exec(line); + if (!m) continue; + const base = m[1].trim().toLowerCase().replace(/[^\w\s-]/g, "").replace(/\s/g, "-"); + let s = base; + if (counts.has(base)) { const n = counts.get(base) + 1; counts.set(base, n); s = `${base}-${n}`; } + else counts.set(base, 0); + set.add(s); + } + return set; +} + +const index = JSON.parse(readFileSync(join(KDIR, "index.json"), "utf8")); +const indexAnchorsFor = (file) => { + const entry = index[join(KDIR, file).replaceAll("\\", "/")]; + return entry ? new Set(Object.values(entry).map((v) => v.anchor)) : new Set(); +}; + +// ---- A. Anchor references resolve ---- +const refRe = /skill:\/\/dev-team-knowledge\/([A-Za-z0-9._-]+\.md)#([A-Za-z0-9-]+)/g; +const refs = new Set(); +for (const f of walk(ROOT, [".md"])) { + const t = readFileSync(f, "utf8"); let m; + while ((m = refRe.exec(t))) refs.add(`${m[1]}#${m[2]}`); +} +for (const r of [...refs].sort()) { + const [file, anchor] = r.split("#"); + const fp = join(KDIR, file); + if (!exists(fp)) { fail("anchor", `${r} — target file missing`); continue; } + if (!headingSlugs(fp).has(anchor) && !indexAnchorsFor(file).has(anchor)) + fail("anchor", `${r} — anchor matches no heading and is not in index.json`); +} + +// ---- B. index.json keyed files exist ---- +for (const rel of Object.keys(index)) + if (!exists(rel)) fail("index", `${rel} — keyed in index.json but missing`); + +// ---- C. Review-agent output-discipline wiring ---- +const CONTRACT = '"status": "pass|warn|fail|skip"'; +const ALLOW_NO_DISCIPLINE = new Set(["progress-guardian.md"]); // not a lexical finding agent +for (const a of readdirSync(join(ROOT, "agents")).filter((f) => f.endsWith(".md"))) { + const t = readFileSync(join(ROOT, "agents", a), "utf8"); + if (t.includes(CONTRACT) && !t.includes("review-output-discipline.md") && !ALLOW_NO_DISCIPLINE.has(a)) + fail("wiring", `${a} — emits the findings contract but doesn't reference review-output-discipline.md`); +} + +// ---- D. No deliberately-removed TDD identifiers ---- +const FORBIDDEN = [/tdd-first/i, /tdd-guard/i, /test-driven-development/i, /RED→GREEN/, /RED-GREEN/]; +const ALLOW_TDD = new Set([ + "plugins/dev-team/skills/testing-discipline/SKILL.md", // explains why ordering is dropped + "plugins/dev-team/extensions/spec-guard.ts", // comment: formerly tdd-guard + "plugins/dev-team/README.md", // rationale: test-first not enforced + "plugins/dev-team/README.fr.md", +]); +for (const f of walk(ROOT, [".md", ".ts"])) { + if (ALLOW_TDD.has(f.replaceAll("\\", "/"))) continue; + const t = readFileSync(f, "utf8"); + for (const re of FORBIDDEN) { + const m = re.exec(t); + if (m) { fail("test-after", `${f} — forbidden TDD identifier "${m[0]}" (test-after is the framework choice)`); break; } + } +} + +console.log( + `Framework compliance: ${refs.size} anchor refs, ${Object.keys(index).length} index files checked — ${failures} violation(s).` +); +process.exit(failures ? 1 : 0);