test(e2e): expand agent-pane coverage, add release report, fix harness flakiness by vanzue · Pull Request #356 · microsoft/intelligent-terminal

vanzue · 2026-06-24T14:00:07Z

What

Expands and hardens the ItE2E end-to-end suite, adds a release-report generator, and
fixes the test-harness flakiness that was producing spurious failures. Net result on the full
Feature run: 0 failed (down from 9), with honest skips only.

All changes are under test/e2e/ (PowerShell harness + tests) — no product code is touched.

Coverage added / strengthened

Strengthened weak agent-pane tests (were token assertions):
- /model now opens and asserts the real model picker (not just "menu renders").
- Open/hide exercised at all four pane positions (was right+bottom only).
- Tab-close now asserts the pre-warmed helper is actually torn down (descendant wta.exe count).
- Real Shift+Enter via a win32-input-mode sequence (new Send-AgentShiftEnter), instead of substituting a plain Enter.
New suites:
- Feature.ShellIntegration.Tests.ps1 — §3 OSC 133 marks (success/failure) + non-integrated cmd.exe safety (deterministic).
- Feature.AgentProposedCommand.Tests.ps1 — §2 agent-proposed command Insert/Run into the shell pane via the non-autofix chat path.
- Feature.AgentMatrix.Tests.ps1 — §2/§3 Claude/Codex/Gemini chat + autofix through the ACP adapter, per-CLI auth-gated (runs only when the CLI is installed and authenticated, else skips with the reason recorded).
- §2 "View switch preserves input" (draft survives a session-view round-trip) in Feature.SessionList.Tests.ps1.

Release report generator

test/e2e/New-ReleaseReport.ps1 (+ release-coverage-map.psd1) turns doc/release-check-list.md
into a clean, human-facing report driven purely by test results:

- all [UT✓]/[E2E]/[MANUAL] jargon stripped,
- [x] = verified by automation (UT or E2E), ⚠️ AUTOMATION FAILED = a test ran and failed, plain [ ] = verify manually,
- conservative mapping (unmapped → manual, never a false [x]); merges multiple results files.

Latest run: 104 verified / 0 failed / 131 manual (of 235).
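
The conservative mapping described above can be sketched roughly as follows. This is a minimal illustration, not the actual New-ReleaseReport.ps1 logic: the function name `Resolve-ChecklistStatus`, the map shape (checklist title to test-name regex), and the result objects are all assumptions made for the example.

```powershell
# Hypothetical sketch: classify one checklist item from test results.
# The map shape and the result objects are assumptions, not the real script's data model.
function Resolve-ChecklistStatus {
    param(
        [string]$Title,
        [hashtable]$CoverageMap,   # checklist title -> test-name regex (assumed shape)
        [object[]]$Results         # objects with Name and Result ('Passed' / 'Failed' / 'Skipped')
    )
    if (-not $CoverageMap.ContainsKey($Title)) {
        return '[ ]'               # unmapped: always manual, never a false [x]
    }
    $pattern = $CoverageMap[$Title]
    $hits = @($Results | Where-Object { $_.Name -match $pattern })
    if ($hits.Count -eq 0) { return '[ ]' }
    if (@($hits | Where-Object { $_.Result -eq 'Failed' }).Count -gt 0) {
        return 'AUTOMATION FAILED' # a mapped test ran and failed
    }
    if (@($hits | Where-Object { $_.Result -ne 'Passed' }).Count -eq 0) {
        return '[x]'               # every mapped test passed
    }
    return '[ ]'                   # e.g. skipped: still needs a manual check
}

$map = @{ 'Slash command works' = 'opens the session view' }
$results = @(
    [pscustomobject]@{ Name = '/sessions opens the session view'; Result = 'Passed' }
)
$mapped   = Resolve-ChecklistStatus -Title 'Slash command works' -CoverageMap $map -Results $results
$unmapped = Resolve-ChecklistStatus -Title 'Some unmapped item'  -CoverageMap $map -Results $results
```

Here `$mapped` comes out as `[x]` and `$unmapped` as `[ ]`. Treating a skipped test as manual is a choice made for this sketch; the real generator may handle skips differently.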

Stability fixes (the flaky-failure root causes)

- Stale-instance cleanup: Start-Terminal now closes any leftover IT window (store + dev) before launching, so a prior test's crashed BeforeAll can't leave a window that the single-instance launch attaches to in a broken state (new-tab → CreateTab E_FAIL 0x80004005) or that collides on the shared per-brand COM CLSID.
- Deterministic Wait-AgentReady: judges readiness by the user-visible connected input placeholder ("Ask anything, / for commands..", rendered only in ConnectionState::Connected), not by an internal session-registry artifact, and returns the instant it's observed. Fixes the "agent-pane readiness" timeout flake on initial connect and on reconnect after /restart or a settings change.
- WSL-autofix Describe now skips (try/catch) instead of failing when a build can't create a wsl.exe tab; Shift+Enter injection is retried under load.
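
The "returns the instant it's observed" behaviour is the usual poll-until-deadline pattern. A minimal sketch, assuming a hypothetical `Wait-Until` helper (not a function from the ItE2E module) and a stand-in condition in place of the real pane-buffer check:

```powershell
# Hypothetical helper: poll a condition until it holds or a deadline passes.
function Wait-Until {
    param(
        [scriptblock]$Condition,
        [int]$TimeoutMs = 5000,
        [int]$IntervalMs = 100
    )
    $sw = [System.Diagnostics.Stopwatch]::StartNew()
    do {
        if (& $Condition) { return $true }   # return as soon as the condition is observed
        Start-Sleep -Milliseconds $IntervalMs
    } while ($sw.ElapsedMilliseconds -lt $TimeoutMs)
    return $false                            # timed out: the caller decides to fail or skip
}

# Stand-in for "the connected placeholder is visible in the pane buffer":
# the condition becomes true on the third poll.
$state = @{ Polls = 0 }
$ready = Wait-Until -TimeoutMs 2000 -IntervalMs 10 -Condition {
    $state.Polls++
    $state.Polls -ge 3
}
```

Unlike a fixed Start-Sleep, the wait costs only as long as the condition takes to become true, and the timeout bounds the worst case.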

Verification

| Full Feature run | pass | fail | skip |
| --- | --- | --- | --- |
| before stale-cleanup | 80 | 9 | 8 |
| after stale-cleanup | 91 | 3 | 3 |
| after readiness fix | 89 | 0 | 8 |

The 8 skips are honest/environment-gated (gemini unauthenticated, WSL has no in-distro shell
integration, wta sessions list identity-gated, and autofix LLM-variance where the agent
returns an explanation instead of a card).

Follow-ups (not in this PR)

Filed #346: autofix card not dismissed by Esc (intermittent; E2E-discovered).
The autofix LLM-variance skips (agent returns "explain", not a card) are non-deterministic and
slow; the planned mock ACP agent — Form B (doc/specs/mock-acp-agent.md,
mock-acp-agent.exe as a custom: agent) would make that surface deterministic.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…oposed-command suites
- Remove redundant fixed Start-Sleep in AutofixPane (rely on poll-based Assert-Pane).
- Strengthen /model picker, all-four pane positions, helper-cleanup, real Shift+Enter (new Send-AgentShiftEnter win32-input-mode helper).
- Add Feature.ShellIntegration.Tests.ps1 (OSC 133 marks + cmd.exe missing-integration safety).
- Add Feature.AgentProposedCommand.Tests.ps1 (non-autofix chat Insert/Run recommendation card).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Match the post-merge convention so the new ShellIntegration and AgentProposedCommand suites honor ITE2E_PACKAGE instead of hardcoding Store. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add a deterministic SessionList test: a typed-but-unsubmitted draft survives a round-trip through the session view (open + Esc back to chat). No LLM involved. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…-gated) Cover §2 Claude/Codex/Gemini chat through the IT agent pane's ACP adapters. Each per-CLI Context runs only when the CLI is installed AND authenticated (print-mode auth probe at discovery), else skips with the reason recorded. Verified live: Claude + Codex pass, Gemini skips (installed but unauthenticated here). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…x/Gemini) New-ReleaseReport.ps1 turns the release checklist into a clean human-facing report driven purely by test results: tags (UT/E2E/MANUAL) stripped; [x]=automation passed, 'AUTOMATION FAILED'=test failed, plain [ ]=not covered, verify manually. Mapping is title-substring + a curated override map (release-coverage-map.psd1), conservative by design (unmapped -> manual, never a false [x]). Also extend the agent matrix with a per-CLI autofix case (§3 Autofix with Claude/Codex/Gemini). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- New-ReleaseReport.ps1 accepts multiple -ResultsXml (later overrides earlier per test name), so an isolated re-run of a flaky suite layers onto the full run.
- release-coverage-map.psd1: map a few passing tests whose names differ from their checklist titles (Focus hotkey, Model control/changes).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
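
The "later overrides earlier per test name" rule can be illustrated with a small merge. This sketch works on in-memory objects rather than real results XML; the function name `Merge-TestResults`, the object shape, and the test names are assumptions for illustration only.

```powershell
# Hypothetical sketch: merge several result sets; a later set overrides an earlier one by test name.
function Merge-TestResults {
    param([object[]]$ResultSets)   # each element is itself an array of result objects
    $merged = @{}
    foreach ($set in $ResultSets) {
        foreach ($r in $set) {
            $merged[$r.Name] = $r.Result   # a later set replaces the earlier outcome
        }
    }
    return $merged
}

$fullRun = @(
    [pscustomobject]@{ Name = 'agent restart reconnects';   Result = 'Failed' },
    [pscustomobject]@{ Name = 'draft survives view switch'; Result = 'Passed' }
)
$rerun = @(
    [pscustomobject]@{ Name = 'agent restart reconnects';   Result = 'Passed' }
)

$merged = Merge-TestResults -ResultSets @($fullRun, $rerun)
$merged['agent restart reconnects']     # Passed (taken from the re-run)
$merged['draft survives view switch']   # Passed (only present in the full run)
```

Tests that appear only in the earlier set keep their original outcome, which is what lets an isolated re-run layer onto a full run.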

…b E_FAIL flake) A prior test whose AfterAll/Stop-Terminal didn't run (e.g. a BeforeAll that threw) leaves an IT window behind. The single-instance AUMID launch then hands off to that stale, often half-initialised window (Launched=false), so the harness drives a broken instance where new-tab returns CreateTab E_FAIL (0x80004005); and because the store and dev packages share one per-brand COM CLSID, a stale window of the OTHER package steals wtcli's CoCreateInstance and misroutes every call. Start-Terminal now calls Stop-StaleItInstances first, closing every leftover IT window (store + dev, matched by *IntelligentTerminal* install location only — never the user's stock WT) so each launch is deterministic and freshly-owned. Verified: a simulated leftover is cleaned and the fresh instance's new-tab wsl.exe succeeds. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- WSL autofix Describe: wrap the WSL-pane setup in try/catch so a build that can't create a wsl.exe tab via the protocol (stale dev pkg predating OSC 9001 -> CreateTab E_FAIL) SKIPS via the per-It guards instead of failing the Describe in BeforeAll.
- Shift+Enter on a live session row: skip if no selectable row; retry the raw win32-input keystroke up to 3x while polling for the view to dismiss (the injection can drop under load).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…st map
- Generator: an originally-ticked [x] item is verified by an automated unit test, so credit it as passed (unless a mapped E2E test failed). Unit tests are automation; the human needn't re-verify them.
- Map: drop backticks from keys (the report strips them from titles, so backticked keys never matched -> false manual); add /model, Shift+Enter, Autofix-with-Copilot, FRE auto-error on-variants, session-mgmt-choice-persists, packaging/logging name mismatches.
- Net on the last full run: 79 -> 104 verified, 156 -> 130 manual.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…tion) The agent-pane-readiness flake: Wait-AgentReady matched the helper-log 'acp_initialize' marker, which fires several seconds BEFORE the helper writes its session origin ('recording agent-pane session origin') -> the jsonl that Get-AgentPaneSession reads. So it returned ready too early and the next agent-pane call (Send-AgentKey/Open-SessionList) raced a not-yet-written record and timed out. The agent_status connected/failed event is NOT broadcast to wtcli listen (verified), so events can't be used. Wait-AgentReady now polls Get-AgentPaneSession (the exact precondition every primitive needs: a recorded, running pane session) and returns the instant it resolves — deterministic, not a fixed delay — for both the initial connect and a reconnect after /restart or a settings-driven rebuild (newest running record wins). A logged auth/fatal failure short-circuits. The AgentRestart test now waits for reconnect-readiness after the settings change before driving the menu. Verified: 3/3 consecutive green (the test previously flaked on a 20s timeout). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… registry Per review: gating Wait-AgentReady on Get-AgentPaneSession (which reads the agent-pane-sessions.jsonl session registry) is verifying a feature with that same feature — if the registry breaks, the gate false-readies or hangs and masks the bug. Wait-AgentReady now matches the agent pane buffer for the connected input placeholder ('Ask anything, / for commands..'), which the TUI renders ONLY in ConnectionState::Connected (ui/input.rs:62; the connecting/disconnected placeholders are distinct strings). That is the user-visible ground truth of 'ready to chat', independent of the session-tracking feature, and still returns the instant it's observed (deterministic). Auth/fatal log markers short-circuit. Verified: AgentRestart 2/2 green (initial connect + reconnect). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR expands and hardens the PowerShell-driven ItE2E end-to-end suite under test/e2e/, adds a release-report generator that maps checklist items to test outcomes, and reduces harness/test flakiness by replacing fixed sleeps with polling and by proactively cleaning up stale Intelligent Terminal instances before launch.

Changes:

- Added new E2E suites for shell integration (OSC 133), agent-proposed command Insert/Run, and an auth-gated multi-agent (Claude/Codex/Gemini) matrix.
- Hardened existing agent-pane coverage (pane positions, /model picker, draft preservation, real Shift+Enter injection) and reduced fixed-delay sleeps in favor of polling.
- Added New-ReleaseReport.ps1 + a coverage map to generate a checklist-like release report from NUnit/Pester results.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| test/e2e/tests/Feature.ShellIntegration.Tests.ps1 | New suite validating OSC 133 success/failure marks and cmd.exe “no shell integration” safety. |
| test/e2e/tests/Feature.SessionList.Tests.ps1 | Adds a deterministic assertion that draft input survives a session-view round-trip. |
| test/e2e/tests/Feature.AutofixPane.Tests.ps1 | Replaces fixed sleeps with polling; wraps WSL setup to skip cleanly when unsupported. |
| test/e2e/tests/Feature.AgentRestart.Tests.ps1 | Adds readiness gating around settings-driven reconnect and uses real Shift+Enter injection with retries. |
| test/e2e/tests/Feature.AgentProposedCommand.Tests.ps1 | New suite covering Insert/Run recommendation cards via the non-autofix chat path. |
| test/e2e/tests/Feature.AgentPaneInteraction.Tests.ps1 | Expands open/hide coverage across all pane positions; strengthens /model assertions; verifies helper teardown on tab close. |
| test/e2e/tests/Feature.AgentMatrix.Tests.ps1 | New auth-gated Claude/Codex/Gemini chat + autofix coverage through the ACP adapter. |
| test/e2e/release-coverage-map.psd1 | New checklist-title → test-name regex mapping for release report generation. |
| test/e2e/README.md | Updates suite inventory and status/coverage description to reflect new tests and gating behavior. |
| test/e2e/New-ReleaseReport.ps1 | New script to generate a human-facing release checklist report from NUnit/Pester results + mapping. |
| test/e2e/ItE2E/Public/Harness.ps1 | Adds stale-instance cleanup and makes Start-Terminal always clear stale IT instances before config/launch. |
| test/e2e/ItE2E/Public/AgentInput.ps1 | Adds `Send-AgentShiftEnter` using win32-input-mode raw sequences. |
| test/e2e/ItE2E/Public/Agent.ps1 | Reworks `Wait-AgentReady` to gate on user-visible connected placeholder instead of internal artifacts. |
| test/e2e/ItE2E/ItE2E.psm1 | Exports new public harness/input helpers (`Stop-StaleItInstances`, `Send-AgentShiftEnter`). |

…ilot review) Match 'Ask anything … for commands' in order on one line instead of either fragment anywhere in the captured scrollback, so stray transcript/help text can't false-positive the readiness gate. Verified: AgentRestart 2/2 green. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
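
The difference between "either fragment anywhere" and "both fragments, in order, on one line" can be seen with two regexes. The patterns and sample buffers below are assumptions chosen to illustrate the idea; the harness's exact pattern is not shown in this PR text.

```powershell
# Assumed patterns for illustration; the real harness regex may differ.
$loose  = 'Ask anything|for commands'     # either fragment, anywhere in the buffer
$strict = 'Ask anything.*for commands'    # both, in order, on one line ('.' does not cross newlines by default)

# Stray transcript/help text: both fragments occur, but on different lines.
$scrollback = "you can Ask anything here`nhelp: type / for commands"
# A buffer that actually contains the connected placeholder.
$connected  = "some transcript`nAsk anything, / for commands.."

$looseOnScrollback  = $scrollback -match $loose    # True  (false positive)
$strictOnScrollback = $scrollback -match $strict   # False
$strictOnConnected  = $connected  -match $strict   # True
```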

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

…pilot review) Split the Insert and Run cases into separate Describes, each with its own fresh terminal (matching Feature.AutofixPane). With the shared terminal, a prior card's 'Run command'/'Insert in Terminal' text lingered in the scrollback and could co-occur with the next case's marker (echoed in the prompt) to false-positive the card-readiness check before a fresh card rendered. Verified: both cases 2/2 green. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

…ss comment (Copilot review)
- AgentMatrix: report a PRECISE per-agent skip reason (not installed vs installed-but-unauthenticated vs package missing) via Set-ItResult instead of a boolean Context -Skip, so CI shows why; no terminal is launched when skipping. Re-checks package presence in BeforeAll because a script-scoped var from BeforeDiscovery does not persist into the run phase (only the -ForEach data does).
- AgentMatrix: chat assertion uses a word-boundary match instead of a bare '7'.
- Harness: correct the Stop-StaleItInstances comment: it makes -ColdStart redundant, but -ShowFre still controls whether the FRE overlay is shown.
Verified: Claude/Codex chat+autofix pass; Gemini skips with "installed but not authenticated".
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
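
Why the bare '7' assertion was weak can be shown in a few lines. The reply strings here are hypothetical examples, not actual agent output:

```powershell
# Hypothetical agent replies used only to illustrate the matching behaviour.
$unrelated = 'Build finished in 17 seconds'
$answer    = 'The answer is 7.'

$bareOnUnrelated     = $unrelated -match '7'       # True: the substring matches inside "17"
$boundaryOnUnrelated = $unrelated -match '\b7\b'   # False: no word boundary between "1" and "7"
$boundaryOnAnswer    = $answer    -match '\b7\b'   # True
```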

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated no new comments.

…te checklist
Per design decision: Copilot is the primary agent and its full behaviour (chat, autofix, insert/run, permission, render, slash, sessions) is covered in depth by the copilot-only suites. All built-in agents share the same agent-pane -> helper -> master -> agent-CLI (ACP) path; the only per-agent difference is the spawned command. So we stop re-testing every behaviour per agent.
- Feature.AgentMatrix.Tests.ps1: collapsed from a per-agent (Claude/Codex/Gemini) x chat+autofix matrix into ONE consolidated test case that, for each installed+authenticated non-Copilot agent, does a single connect + chat round-trip in its own fresh terminal; skips when none is available.
- doc/release-check-list.md: collapsed the per-agent items (Claude/Codex/Gemini chat, autofix, delegate, installed, hook-install; and the custom-agent behavioural items) into single consolidated items, keeping Copilot as the primary and the config/selection/tracking items. Total 235 -> 220 items.
- release-coverage-map.psd1 + README updated to match.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…tention, /sessions)
Adds three high-confidence, deterministic E2E cases (no mock/agent variance) plus one mapping fix, expanding genuine release-checklist coverage:
- §9 "WT_COM_CLSID is injected": read $env:WT_COM_CLSID back from a shell pane and assert a braced CLSID, proving WT injects protocol discovery into panes.
- §10 "Old log cleanup is safe": seed a sentinel in the running version's log dir + a stale other-version dir, restart the build, assert the running version's logs survive and the stale version dir is pruned wholesale.
- §4 "Slash command works": /sessions opens the session view (the command-menu path, complementing the existing button path).
- §10 "Early startup failures are logged": coverage-map override (the test "...would be logged" already exists and passes).
All three new cases validated live against the Store package.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

…ader ctor throws (Copilot review)
- Feature.AutofixPane: CardShown and the two other card-detection predicates (lines 30/102/271) hard-coded the English "Run command|Insert in Terminal" labels, so they'd mis-skip/fail on non-en-US machines. Added an exported Get-RecommendationCardRegex helper (EITHER button label, localized across all bundled locales via Get-WtaLocalizedTextRegex, en-US fallback) and routed all three through it. Verified it matches the en-US card line; the variance-skip path still works live.
- Wait-AgentReady: if the StreamReader ctor throws, $fs was left undisposed (file-handle leak). Wrapped the reader in a nested try/finally so $fs is always disposed (double-dispose after the reader closes it is a safe no-op).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
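
The nested try/finally shape described for the file-handle fix looks roughly like this. It is a sketch under assumptions: `Read-SharedFile` is a hypothetical name, and the real Wait-AgentReady reads a helper log in its own way.

```powershell
# Hypothetical sketch of the nested try/finally disposal pattern.
function Read-SharedFile {
    param([string]$Path)
    $fs = [System.IO.FileStream]::new(
        $Path,
        [System.IO.FileMode]::Open,
        [System.IO.FileAccess]::Read,
        [System.IO.FileShare]::ReadWrite)   # tolerate a concurrent writer
    try {
        $reader = [System.IO.StreamReader]::new($fs)
        try {
            return $reader.ReadToEnd()
        } finally {
            $reader.Dispose()               # also closes the underlying $fs
        }
    } finally {
        $fs.Dispose()                       # runs even if the reader ctor threw; otherwise a no-op
    }
}

$tmp = [System.IO.Path]::GetTempFileName()
Set-Content -Path $tmp -Value 'connected'
$text = Read-SharedFile -Path $tmp
Remove-Item -Path $tmp
```

The outer `finally` is what covers the gap: without it, an exception between opening the stream and entering the reader's `try` would leave the handle open.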

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

…tRegex (Copilot review) Double-quoted YAML scalars were captured raw, so a value containing \" (e.g. setup.subtitle.* "Your agent \"%{agent}\" …") kept the literal backslashes — the generated regex then looked for backslashes absent from the rendered UI text, breaking locale-robust assertions for such keys. Now the double-quoted branch unescapes \" \\ \n \t \r (\x -> x) before the value is regex-escaped. Verified: setup.subtitle.copilot_missing no longer yields a regex containing \" (the escape is resolved to a literal "), while the keys the tests actually use (no backslashes) are unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
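
The unescape-then-regex-escape order can be sketched as below. `ConvertFrom-YamlDoubleQuoted` and the sample value are hypothetical; the sketch handles only the escapes the commit message lists (`\"`, `\\`, `\n`, `\t`, `\r`, and `\x -> x`), not the full YAML escape set.

```powershell
# Hypothetical sketch: resolve backslash escapes in a raw double-quoted YAML scalar.
function ConvertFrom-YamlDoubleQuoted {
    param([string]$Raw)
    $evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
        param($m)
        switch -CaseSensitive ($m.Groups[1].Value) {
            'n'     { "`n" }
            't'     { "`t" }
            'r'     { "`r" }
            default { $m.Groups[1].Value }   # \" -> ", \\ -> \, \x -> x
        }
    }
    return [regex]::Replace($Raw, '\\(.)', $evaluator)
}

$raw      = 'Agent \"demo\" ready'           # illustrative raw scalar, not a real locale key
$rendered = 'Agent "demo" ready'             # what the UI would show

$rawPattern   = [regex]::Escape($raw)
$fixedPattern = [regex]::Escape((ConvertFrom-YamlDoubleQuoted -Raw $raw))

$matchesRaw   = $rendered -match $rawPattern     # False: pattern demands literal backslashes
$matchesFixed = $rendered -match $fixedPattern   # True
```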

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

…ll (Copilot review) Four BeforeAll blocks piped Wait-AgentReady to Out-Null, discarding its boolean — so an auth/fatal connect failure would proceed in a not-ready state and surface later as opaque card-polling failures. Assert | Should -BeTrue with a clear -Because in all four (AutofixPane card-render + AutofixPane WSL setup, AgentProposedCommand Insert + Run), so a failed/again-auth connect fails immediately and attributably. (The WSL one is inside the best-effort try/catch, so a readiness failure there is logged and degrades to a skip via the existing per-It $wslShell guards.) Verified live: the Insert BeforeAll assertion passes when copilot connects. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

…lure mark (Copilot review)
- Get-SessionViewRenderRegex: matched the FULL localized agents.footer_hint, but the TUI end-truncates that hint to the pane width (agents_view.rs render_footer_hint -> trunc), so the full line may never appear and Open/Test-SessionListShown could time out. Every bundled locale leads the hint with the invariant nav arrows "↑ ↓" (en "(↑ ↓ to navigate …)", zh "(↑ ↓ 导航 …)"), and being at the start they survive truncation, so match those (en-US footer words kept as an extra fallback). Verified live: the rendered footer matches; the slash-/sessions path is green.
- Feature.ShellIntegration failure-mark test: Wait-WtCommandFailure listened to the global vt_sequence stream, so an unrelated OSC 133;D mark could satisfy it. The event's `pane_id` equals the pane session_id (Get-ActivePane.session_id), whereas its `tab_id` is a GUID and Get-ActivePane/Get-WtTabs expose tab_id only as a numeric INDEX, so added a -PaneId filter to Wait-WtCommandFailure and scoped the assertion to the active pane. Verified live: passes scoped.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

…ish reconnect probe (Copilot review)
- Feature.AutofixPane Run-action: the card-detection predicate still hard-coded the English "Run command" label (missed in the earlier sweep). Routed it through Get-RecommendationCardRegex like the other card-detection sites so it's locale-robust.
- Feature.AgentRestart: removed the post-/restart `Test-Until … -match 'Ask anything|Copilot|Agent'` reconnect probe: it matched hard-coded English (not locale-robust) and was redundant with the Wait-AgentReady | Should -BeTrue gate immediately after, which is the deterministic reconnect-and-ready signal. Verified live: the restart case still passes (37s).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

…opilot restart (Copilot review) The override regex 'connects and answers' for "Non-Copilot agents chat works" also matched the Copilot restart test name "(/restart reconnects and answers)" — "reconnects and answers" contains "connects and answers" — which could credit the checklist item from the wrong test in the report. Anchored on "non-Copilot agent.*connects and answers" so it uniquely matches the AgentMatrix case. Verified: matches the AgentMatrix name, does NOT match the Copilot restart name. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
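
The substring collision is easy to reproduce. The two patterns come from the commit message above; the full test names are illustrative stand-ins built around the fragments it quotes.

```powershell
# Test names are illustrative; only the quoted fragments come from the commit message.
$restartTest = 'Copilot agent (/restart reconnects and answers)'
$matrixTest  = 'non-Copilot agent connects and answers'

$loose    = 'connects and answers'
$anchored = 'non-Copilot agent.*connects and answers'

$looseOnRestart    = $restartTest -match $loose      # True: "reconnects" contains "connects"
$anchoredOnRestart = $restartTest -match $anchored   # False
$anchoredOnMatrix  = $matrixTest  -match $anchored   # True
```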

Copilot

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.

vanzue and others added 12 commits June 24, 2026 11:30

Merge remote-tracking branch 'origin/main'

efb5c12

test(e2e): align new suites to Get-ItTestPackage package selector

a8a29e9

test(e2e): cover §2 view-switch draft-input preservation

a895452

Copilot AI review requested due to automatic review settings June 24, 2026 14:00

Copilot started reviewing on behalf of vanzue June 24, 2026 14:00

Copilot AI reviewed Jun 24, 2026

Comment thread test/e2e/ItE2E/Public/Agent.ps1 Outdated

vanzue requested a review from Copilot June 24, 2026 14:15

Copilot started reviewing on behalf of vanzue June 24, 2026 14:15

Copilot AI reviewed Jun 24, 2026

Comment thread test/e2e/tests/Feature.AgentProposedCommand.Tests.ps1

vanzue requested a review from Copilot June 24, 2026 14:35

Copilot started reviewing on behalf of vanzue June 24, 2026 14:36

Copilot AI reviewed Jun 24, 2026

Comment thread test/e2e/ItE2E/Public/Harness.ps1 Outdated

Comment thread test/e2e/tests/Feature.AgentMatrix.Tests.ps1 Outdated

Comment thread test/e2e/tests/Feature.AgentMatrix.Tests.ps1 Outdated

vanzue requested a review from Copilot June 24, 2026 15:00

Copilot started reviewing on behalf of vanzue June 24, 2026 15:01

Copilot AI reviewed Jun 24, 2026

vanzue and others added 2 commits June 24, 2026 23:38

Copilot AI review requested due to automatic review settings June 25, 2026 00:24

vanzue requested a review from Copilot June 25, 2026 07:16

Copilot started reviewing on behalf of vanzue June 25, 2026 07:16