test(e2e): expand agent-pane coverage, add release report, fix harness flakiness #356
Merged
…oposed-command suites
- Remove redundant fixed Start-Sleep in AutofixPane (rely on poll-based Assert-Pane).
- Strengthen /model picker, all-four pane positions, helper-cleanup, real Shift+Enter (new Send-AgentShiftEnter win32-input-mode helper).
- Add Feature.ShellIntegration.Tests.ps1 (OSC 133 marks + cmd.exe missing-integration safety).
- Add Feature.AgentProposedCommand.Tests.ps1 (non-autofix chat Insert/Run recommendation card).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Match the post-merge convention so the new ShellIntegration and AgentProposedCommand suites honor ITE2E_PACKAGE instead of hardcoding Store. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a deterministic SessionList test: a typed-but-unsubmitted draft survives a round-trip through the session view (open + Esc back to chat). No LLM involved. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-gated) Cover §2 Claude/Codex/Gemini chat through the IT agent pane's ACP adapters. Each per-CLI Context runs only when the CLI is installed AND authenticated (print-mode auth probe at discovery), else skips with the reason recorded. Verified live: Claude + Codex pass, Gemini skips (installed but unauthenticated here). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…x/Gemini) New-ReleaseReport.ps1 turns the release checklist into a clean human-facing report driven purely by test results: tags (UT/E2E/MANUAL) stripped; [x]=automation passed, 'AUTOMATION FAILED'=test failed, plain [ ]=not covered, verify manually. Mapping is title-substring + a curated override map (release-coverage-map.psd1), conservative by design (unmapped -> manual, never a false [x]). Also extend the agent matrix with a per-CLI autofix case (§3 Autofix with Claude/Codex/Gemini). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- New-ReleaseReport.ps1 accepts multiple -ResultsXml (later overrides earlier per test name), so an isolated re-run of a flaky suite layers onto the full run.
- release-coverage-map.psd1: map a few passing tests whose names differ from their checklist titles (Focus hotkey, Model control/changes).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
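The "later overrides earlier per test name" layering can be pictured with a minimal sketch. `Merge-TestOutcome` and the sample test names are hypothetical and are not taken from `New-ReleaseReport.ps1`; the sketch assumes each results file has already been parsed into objects with `Name` and `Result` properties.

```powershell
# Hypothetical helper (not the PR's code): merge several runs so that a later
# run overrides an earlier one for the same test name.
function Merge-TestOutcome {
    param(
        # Each element is one run: an array of objects with Name and Result.
        [Parameter(Mandatory)] [object[][]] $Runs
    )
    $merged = [ordered]@{}
    foreach ($run in $Runs) {
        foreach ($case in $run) {
            $merged[$case.Name] = $case.Result   # last writer wins
        }
    }
    return $merged
}

# Full run with one flaky failure, then an isolated re-run of that test.
$fullRun = @(
    [pscustomobject]@{ Name = 'opens the session view';   Result = 'Failed' },
    [pscustomobject]@{ Name = 'renders the model picker'; Result = 'Passed' }
)
$rerun = @(
    [pscustomobject]@{ Name = 'opens the session view'; Result = 'Passed' }
)
$outcome = Merge-TestOutcome -Runs $fullRun, $rerun
```

Tests that appear only in the full run keep their original result, so the re-run adds to the picture without discarding anything.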
…b E_FAIL flake) A prior test whose AfterAll/Stop-Terminal didn't run (e.g. a BeforeAll that threw) leaves an IT window behind. The single-instance AUMID launch then hands off to that stale, often half-initialised window (Launched=false), so the harness drives a broken instance where new-tab returns CreateTab E_FAIL (0x80004005); and because the store and dev packages share one per-brand COM CLSID, a stale window of the OTHER package steals wtcli's CoCreateInstance and misroutes every call. Start-Terminal now calls Stop-StaleItInstances first, closing every leftover IT window (store + dev, matched by *IntelligentTerminal* install location only — never the user's stock WT) so each launch is deterministic and freshly-owned. Verified: a simulated leftover is cleaned and the fresh instance's new-tab wsl.exe succeeds. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- WSL autofix Describe: wrap the WSL-pane setup in try/catch so a build that can't create a wsl.exe tab via the protocol (stale dev pkg predating OSC 9001 -> CreateTab E_FAIL) SKIPS via the per-It guards instead of failing the Describe in BeforeAll.
- Shift+Enter on a live session row: skip if no selectable row; retry the raw win32-input keystroke up to 3x while polling for the view to dismiss (the injection can drop under load).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…st map
- Generator: an originally-ticked [x] item is verified by an automated unit test, so credit it as passed (unless a mapped E2E test failed). Unit tests are automation; the human needn't re-verify them.
- Map: drop backticks from keys (the report strips them from titles, so backticked keys never matched -> false manual); add /model, Shift+Enter, Autofix-with-Copilot, FRE auto-error on-variants, session-mgmt-choice-persists, packaging/logging name mismatches.
- Net on the last full run: 79 -> 104 verified, 156 -> 130 manual.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tion)
The agent-pane-readiness flake: Wait-AgentReady matched the helper-log 'acp_initialize'
marker, which fires several seconds BEFORE the helper writes its session origin
('recording agent-pane session origin') -> the jsonl that Get-AgentPaneSession reads. So it
returned ready too early and the next agent-pane call (Send-AgentKey/Open-SessionList) raced
a not-yet-written record and timed out. The agent_status connected/failed event is NOT
broadcast to wtcli listen (verified), so events can't be used.
Wait-AgentReady now polls Get-AgentPaneSession (the exact precondition every primitive needs:
a recorded, running pane session) and returns the instant it resolves — deterministic, not a
fixed delay — for both the initial connect and a reconnect after /restart or a settings-driven
rebuild (newest running record wins). A logged auth/fatal failure short-circuits. The
AgentRestart test now waits for reconnect-readiness after the settings change before driving
the menu. Verified: 3/3 consecutive green (the test previously flaked on a 20s timeout).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… registry
Per review: gating Wait-AgentReady on Get-AgentPaneSession (which reads the
agent-pane-sessions.jsonl session registry) is verifying a feature with that same feature —
if the registry breaks, the gate false-readies or hangs and masks the bug.
Wait-AgentReady now matches the agent pane buffer for the connected input placeholder
('Ask anything, / for commands..'), which the TUI renders ONLY in ConnectionState::Connected
(ui/input.rs:62; the connecting/disconnected placeholders are distinct strings). That is the
user-visible ground truth of 'ready to chat', independent of the session-tracking feature, and
still returns the instant it's observed (deterministic). Auth/fatal log markers short-circuit.
Verified: AgentRestart 2/2 green (initial connect + reconnect).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
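The "return the instant it's observed" behaviour described above is ordinary poll-until-match with a timeout. The sketch below shows only that pattern. `Wait-UntilMatch` and `$fakeBuffer` are hypothetical names, the "connecting" text is a stand-in, and the real `Wait-AgentReady` reads the agent pane buffer and also short-circuits on auth/fatal log markers, which is omitted here.

```powershell
# Hypothetical sketch of a poll-based readiness gate (not the harness code).
function Wait-UntilMatch {
    param(
        [Parameter(Mandatory)] [scriptblock] $GetText,   # returns the current pane text
        [Parameter(Mandatory)] [string] $Pattern,        # regex that signals "ready"
        [int] $TimeoutMs = 20000,
        [int] $IntervalMs = 200
    )
    $timer = [System.Diagnostics.Stopwatch]::StartNew()
    do {
        # Return as soon as the marker is observed; no fixed delay.
        if ((& $GetText) -match $Pattern) { return $true }
        Start-Sleep -Milliseconds $IntervalMs
    } while ($timer.ElapsedMilliseconds -lt $TimeoutMs)
    return $false
}

# Stand-in for the pane buffer: shows the connected placeholder on the third poll.
$script:polls = 0
$fakeBuffer = {
    $script:polls++
    if ($script:polls -ge 3) { 'Ask anything, / for commands..' } else { 'connecting (stand-in text)' }
}
$ready = Wait-UntilMatch -GetText $fakeBuffer -Pattern 'Ask anything' -TimeoutMs 5000 -IntervalMs 50
```

Because the function returns a boolean rather than throwing, a caller can assert on it directly and fail with an attributable message when the gate times out.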
Pull request overview
This PR expands and hardens the PowerShell-driven ItE2E end-to-end suite under test/e2e/, adds a release-report generator that maps checklist items to test outcomes, and reduces harness/test flakiness by replacing fixed sleeps with polling and by proactively cleaning up stale Intelligent Terminal instances before launch.
Changes:
- Added new E2E suites for shell integration (OSC 133), agent-proposed command Insert/Run, and an auth-gated multi-agent (Claude/Codex/Gemini) matrix.
- Hardened existing agent-pane coverage (pane positions, /model picker, draft preservation, real Shift+Enter injection) and reduced fixed-delay sleeps in favor of polling.
- Added `New-ReleaseReport.ps1` + a coverage map to generate a checklist-like release report from NUnit/Pester results.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| test/e2e/tests/Feature.ShellIntegration.Tests.ps1 | New suite validating OSC 133 success/failure marks and cmd.exe “no shell integration” safety. |
| test/e2e/tests/Feature.SessionList.Tests.ps1 | Adds a deterministic assertion that draft input survives a session-view round-trip. |
| test/e2e/tests/Feature.AutofixPane.Tests.ps1 | Replaces fixed sleeps with polling; wraps WSL setup to skip cleanly when unsupported. |
| test/e2e/tests/Feature.AgentRestart.Tests.ps1 | Adds readiness gating around settings-driven reconnect and uses real Shift+Enter injection with retries. |
| test/e2e/tests/Feature.AgentProposedCommand.Tests.ps1 | New suite covering Insert/Run recommendation cards via the non-autofix chat path. |
| test/e2e/tests/Feature.AgentPaneInteraction.Tests.ps1 | Expands open/hide coverage across all pane positions; strengthens /model assertions; verifies helper teardown on tab close. |
| test/e2e/tests/Feature.AgentMatrix.Tests.ps1 | New auth-gated Claude/Codex/Gemini chat + autofix coverage through the ACP adapter. |
| test/e2e/release-coverage-map.psd1 | New checklist-title → test-name regex mapping for release report generation. |
| test/e2e/README.md | Updates suite inventory and status/coverage description to reflect new tests and gating behavior. |
| test/e2e/New-ReleaseReport.ps1 | New script to generate a human-facing release checklist report from NUnit/Pester results + mapping. |
| test/e2e/ItE2E/Public/Harness.ps1 | Adds stale-instance cleanup and makes Start-Terminal always clear stale IT instances before config/launch. |
| test/e2e/ItE2E/Public/AgentInput.ps1 | Adds Send-AgentShiftEnter using win32-input-mode raw sequences. |
| test/e2e/ItE2E/Public/Agent.ps1 | Reworks Wait-AgentReady to gate on user-visible connected placeholder instead of internal artifacts. |
| test/e2e/ItE2E/ItE2E.psm1 | Exports new public harness/input helpers (Stop-StaleItInstances, Send-AgentShiftEnter). |
…ilot review) Match 'Ask anything … for commands' in order on one line instead of either fragment anywhere in the captured scrollback, so stray transcript/help text can't false-positive the readiness gate. Verified: AgentRestart 2/2 green. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
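The difference between "either fragment anywhere" and "both fragments, in order, on one line" is easy to see on a small sample. The two patterns and the sample scrollback below are illustrative assumptions, not the regex that ships in `Wait-AgentReady`.

```powershell
# Illustrative patterns only; the harness's actual regex may differ.
$either  = 'Ask anything|for commands'          # either fragment, anywhere in the text
$ordered = 'Ask anything[^\r\n]*for commands'   # both fragments, in order, on one line

# Stray transcript/help text that happens to contain both fragments on different lines.
$scrollback = "Type /help for commands`nAsk anything about the build"
# The connected input placeholder quoted in the commit message.
$connected  = 'Ask anything, / for commands..'

$looseFalsePositive = $scrollback -match $either    # loose pattern is fooled
$strictOnScrollback = $scrollback -match $ordered   # ordered pattern is not
$strictOnConnected  = $connected  -match $ordered   # real placeholder still matches
```

Excluding `\r` and `\n` between the fragments is what keeps the match on a single line.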
…pilot review) Split the Insert and Run cases into separate Describes, each with its own fresh terminal (matching Feature.AutofixPane). With the shared terminal, a prior card's 'Run command'/'Insert in Terminal' text lingered in the scrollback and could co-occur with the next case's marker (echoed in the prompt) to false-positive the card-readiness check before a fresh card rendered. Verified: both cases 2/2 green. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ss comment (Copilot review)
- AgentMatrix: report a PRECISE per-agent skip reason (not installed vs installed-but-unauthenticated vs package missing) via Set-ItResult instead of a boolean Context -Skip, so CI shows why; no terminal is launched when skipping. Re-checks package presence in BeforeAll because a script-scoped var from BeforeDiscovery does not persist into the run phase (only the -ForEach data does).
- AgentMatrix: chat assertion uses a word-boundary match instead of a bare '7'.
- Harness: correct the Stop-StaleItInstances comment - it makes -ColdStart redundant, but -ShowFre still controls whether the FRE overlay is shown.
Verified: Claude/Codex chat+autofix pass; Gemini skips with "installed but not authenticated".
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
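Why a bare `'7'` is too loose for a chat assertion can be shown in a few lines. The sample replies are made up for illustration; only the contrast between `'7'` and a word-boundary pattern comes from the commit message.

```powershell
# Made-up agent replies; only the two patterns matter here.
$noisyReply   = 'Done in 17 steps; see section 70.'
$correctReply = '3 + 4 = 7'

$bareOnNoisy      = $noisyReply   -match '7'       # matches the 7 inside 17 and 70
$boundedOnNoisy   = $noisyReply   -match '\b7\b'   # requires 7 as a standalone token
$boundedOnCorrect = $correctReply -match '\b7\b'
```

`\b` asserts a transition between a word character and a non-word character, so digits embedded in a longer number do not qualify.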
…te checklist
Per design decision: Copilot is the primary agent and its full behaviour (chat, autofix, insert/run, permission, render, slash, sessions) is covered in depth by the copilot-only suites. All built-in agents share the same agent-pane -> helper -> master -> agent-CLI (ACP) path; the only per-agent difference is the spawned command. So we stop re-testing every behaviour per agent.
- Feature.AgentMatrix.Tests.ps1: collapsed from a per-agent (Claude/Codex/Gemini) x chat+autofix matrix into ONE consolidated test case that, for each installed+authenticated non-Copilot agent, does a single connect + chat round-trip in its own fresh terminal; skips when none is available.
- doc/release-check-list.md: collapsed the per-agent items (Claude/Codex/Gemini chat, autofix, delegate, installed, hook-install; and the custom-agent behavioural items) into single consolidated items, keeping Copilot as the primary and the config/selection/tracking items. Total 235 -> 220 items.
- release-coverage-map.psd1 + README updated to match.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tention, /sessions)
Adds three high-confidence, deterministic E2E cases (no mock/agent variance) plus one mapping fix, expanding genuine release-checklist coverage:
- §9 "WT_COM_CLSID is injected": read $env:WT_COM_CLSID back from a shell pane and assert a braced CLSID, proving WT injects protocol discovery into panes.
- §10 "Old log cleanup is safe": seed a sentinel in the running version's log dir + a stale other-version dir, restart the build, assert the running version's logs survive and the stale version dir is pruned wholesale.
- §4 "Slash command works": /sessions opens the session view (the command-menu path, complementing the existing button path).
- §10 "Early startup failures are logged": coverage-map override (the test "...would be logged" already exists and passes).
All three new cases validated live against the Store package.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ader ctor throws (Copilot review)
- Feature.AutofixPane: CardShown and the two other card-detection predicates (lines 30/102/271) hard-coded the English "Run command|Insert in Terminal" labels, so they'd mis-skip/fail on non-en-US machines. Added an exported Get-RecommendationCardRegex helper (EITHER button label, localized across all bundled locales via Get-WtaLocalizedTextRegex, en-US fallback) and routed all three through it. Verified it matches the en-US card line; the variance-skip path still works live.
- Wait-AgentReady: if the StreamReader ctor throws, $fs was left undisposed (file-handle leak). Wrapped the reader in a nested try/finally so $fs is always disposed (double-dispose after the reader closes it is a safe no-op).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tRegex (Copilot review)
Double-quoted YAML scalars were captured raw, so a value containing \" (e.g. setup.subtitle.*
"Your agent \"%{agent}\" …") kept the literal backslashes — the generated regex then looked for
backslashes absent from the rendered UI text, breaking locale-robust assertions for such keys.
Now the double-quoted branch unescapes \" \\ \n \t \r (\x -> x) before the value is regex-escaped.
Verified: setup.subtitle.copilot_missing no longer yields a regex containing \" (the escape is
resolved to a literal "), while the keys the tests actually use (no backslashes) are unchanged.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
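The unescape-then-escape order described above can be sketched as follows. `ConvertFrom-DoubleQuotedScalar` is a hypothetical name and this is not the body of `Get-WtaLocalizedTextRegex`; it only covers the escapes the commit lists (`\"`, `\\`, `\n`, `\t`, `\r`, and `\x` -> `x`).

```powershell
# Hypothetical sketch: resolve YAML double-quoted escapes before regex-escaping.
function ConvertFrom-DoubleQuotedScalar {
    param([Parameter(Mandatory)] [string] $Raw)
    $evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
        param($m)
        $c = $m.Groups[1].Value
        switch -CaseSensitive ($c) {
            'n'     { "`n" }
            't'     { "`t" }
            'r'     { "`r" }
            default { $c }     # \" -> ", \\ -> \, and any other \x -> x
        }
    }
    return [regex]::Replace($Raw, '\\(.)', $evaluator)
}

# Raw capture of a double-quoted scalar that still carries its backslashes.
$raw      = 'Your agent \"%{agent}\"'
$resolved = ConvertFrom-DoubleQuotedScalar -Raw $raw

# What the UI would render for that value (placeholder left unsubstituted here).
$rendered = 'Your agent "%{agent}"'
$matchesAfterFix  = $rendered -match [regex]::Escape($resolved)
$matchesBeforeFix = $rendered -match [regex]::Escape($raw)
```

Escaping the raw capture yields a pattern that demands literal backslashes, which is why the unescape has to happen first.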
…ll (Copilot review) Four BeforeAll blocks piped Wait-AgentReady to Out-Null, discarding its boolean — so an auth/fatal connect failure would proceed in a not-ready state and surface later as opaque card-polling failures. Assert | Should -BeTrue with a clear -Because in all four (AutofixPane card-render + AutofixPane WSL setup, AgentProposedCommand Insert + Run), so a failed/again-auth connect fails immediately and attributably. (The WSL one is inside the best-effort try/catch, so a readiness failure there is logged and degrades to a skip via the existing per-It $wslShell guards.) Verified live: the Insert BeforeAll assertion passes when copilot connects. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lure mark (Copilot review)
- Get-SessionViewRenderRegex: matched the FULL localized agents.footer_hint, but the TUI end-truncates that hint to the pane width (agents_view.rs render_footer_hint -> trunc), so the full line may never appear and Open/Test-SessionListShown could time out. Every bundled locale leads the hint with the invariant nav arrows "↑ ↓" (en "(↑ ↓ to navigate …)", zh "(↑ ↓ 导航 …)"), and being at the start they survive truncation, so match those (en-US footer words kept as an extra fallback). Verified live: the rendered footer matches; the slash-/sessions path is green.
- Feature.ShellIntegration failure-mark test: Wait-WtCommandFailure listened to the global vt_sequence stream, so an unrelated OSC 133;D mark could satisfy it. The event's `pane_id` equals the pane session_id (Get-ActivePane.session_id), whereas its `tab_id` is a GUID and Get-ActivePane/Get-WtTabs expose tab_id only as a numeric INDEX, so added a -PaneId filter to Wait-WtCommandFailure and scoped the assertion to the active pane. Verified live: passes scoped.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ish reconnect probe (Copilot review)
- Feature.AutofixPane Run-action: the card-detection predicate still hard-coded the English "Run command" label (missed in the earlier sweep). Routed it through Get-RecommendationCardRegex like the other card-detection sites so it's locale-robust.
- Feature.AgentRestart: removed the post-/restart `Test-Until … -match 'Ask anything|Copilot|Agent'` reconnect probe; it matched hard-coded English (not locale-robust) and was redundant with the Wait-AgentReady | Should -BeTrue gate immediately after, which is the deterministic reconnect-and-ready signal. Verified live: the restart case still passes (37s).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…opilot restart (Copilot review) The override regex 'connects and answers' for "Non-Copilot agents chat works" also matched the Copilot restart test name "(/restart reconnects and answers)" — "reconnects and answers" contains "connects and answers" — which could credit the checklist item from the wrong test in the report. Anchored on "non-Copilot agent.*connects and answers" so it uniquely matches the AgentMatrix case. Verified: matches the AgentMatrix name, does NOT match the Copilot restart name. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
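The substring collision is quick to reproduce. The two regexes are the ones quoted in the commit message; the full test names are assumptions built around the quoted fragments, since the complete names are not given here.

```powershell
# Regexes as quoted in the commit message.
$loose    = 'connects and answers'
$anchored = 'non-Copilot agent.*connects and answers'

# Assumed test names built around the quoted fragments (hypothetical wording).
$restartName = 'Copilot agent pane (/restart reconnects and answers)'
$matrixName  = 'a non-Copilot agent connects and answers'

$looseHitsRestart    = $restartName -match $loose      # "reconnects" contains "connects"
$anchoredHitsRestart = $restartName -match $anchored
$anchoredHitsMatrix  = $matrixName  -match $anchored
```

Requiring the `non-Copilot agent` prefix is what stops the restart test from crediting the wrong checklist item.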
DDKinger
approved these changes
Jun 25, 2026
What
Expands and hardens the ItE2E end-to-end suite, adds a release-report generator, and
fixes the test-harness flakiness that was producing spurious failures. Net result on the full
Feature run: 0 failed (down from 9), with honest skips only.
All changes are under `test/e2e/` (PowerShell harness + tests); no product code is touched.
Coverage added / strengthened
- `/model` now opens and asserts the real model picker (not just "menu renders").
- … `wta.exe` count).
- … `Send-AgentShiftEnter`), instead of substituting a plain Enter.
- `Feature.ShellIntegration.Tests.ps1`: §3 OSC 133 marks (success/failure) + non-integrated `cmd.exe` safety (deterministic).
- `Feature.AgentProposedCommand.Tests.ps1`: §2 agent-proposed command Insert/Run into the shell pane via the non-autofix chat path.
- `Feature.AgentMatrix.Tests.ps1`: §2/§3 Claude/Codex/Gemini chat + autofix through the ACP adapter, per-CLI auth-gated (runs only when the CLI is installed and authenticated, else skips with the reason recorded).
- … `Feature.SessionList.Tests.ps1`.
Release report generator
`test/e2e/New-ReleaseReport.ps1` (+ `release-coverage-map.psd1`) turns `doc/release-check-list.md` into a clean, human-facing report driven purely by test results:
- `[UT✓]`/`[E2E]`/`[MANUAL]` jargon stripped,
- `[x]` = verified by automation (UT or E2E),
- ⚠️ AUTOMATION FAILED = a test ran and failed,
- plain `[ ]` = verify manually,
- … `[x]`); merges multiple results files.
Latest run: 104 verified / 0 failed / 131 manual (of 235).
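The mark legend above amounts to a three-way mapping from a test outcome to a checklist mark. The sketch below is a hypothetical illustration, not code from `New-ReleaseReport.ps1`: `Get-ChecklistMark` is an invented name, and treating every outcome other than a pass or a failure (including unmapped items) as "verify manually" is an assumption in keeping with the conservative design described earlier.

```powershell
# Hypothetical mapping from a test outcome to the report's checklist mark.
function Get-ChecklistMark {
    param([AllowNull()] [AllowEmptyString()] [string] $Outcome)
    switch ($Outcome) {
        'Passed' { '[x]' }                 # verified by automation
        'Failed' { 'AUTOMATION FAILED' }   # a test ran and failed
        default  { '[ ]' }                 # unmapped or anything else: verify manually
    }
}

$passedMark   = Get-ChecklistMark -Outcome 'Passed'
$failedMark   = Get-ChecklistMark -Outcome 'Failed'
$unmappedMark = Get-ChecklistMark -Outcome $null
```

Defaulting to `[ ]` means a missing or unexpected outcome can never produce a false `[x]`.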
Stability fixes (the flaky-failure root causes)
- `Start-Terminal` now closes any leftover IT window (store + dev) before launching, so a prior test's crashed `BeforeAll` can't leave a window that the single-instance launch attaches to in a broken state (`new-tab` → `CreateTab E_FAIL 0x80004005`) or that collides on the shared per-brand COM CLSID.
- `Wait-AgentReady`: judges readiness by the user-visible connected input placeholder ("Ask anything, / for commands..", rendered only in `ConnectionState::Connected`), not by an internal session-registry artifact, and returns the instant it's observed. Fixes the "agent-pane readiness" timeout flake on initial connect and `/restart`/settings reconnect.
- … `wsl.exe` tab; Shift+Enter injection retried under load.
Verification
The 8 skips are honest/environment-gated (gemini unauthenticated, WSL has no in-distro shell integration, `wta sessions list` identity-gated, and autofix LLM-variance where the agent returns an explanation instead of a card).
Follow-ups (not in this PR)
… slow; the planned mock ACP agent, Form B (`doc/specs/mock-acp-agent.md`, `mock-acp-agent.exe` as a `custom:` agent), would make that surface deterministic.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>