Commit b54ded4

cameroncooke and claude committed
docs: Add device snapshot hang RCA and investigation notes
Capture the root-cause analysis for the physical-device snapshot hang triggered by `devicectl diagnose` interactive auth, along with two broader investigation logs covering device snapshot regressions and CLI/MCP snapshot parity. Add a pointer in AGENTS.md so future runs that hit the same symptom can jump straight to the RCA.

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent e6e4ca7 commit b54ded4

4 files changed

Lines changed: 331 additions & 0 deletions

AGENTS.md

Lines changed: 1 addition & 0 deletions
@@ -57,6 +57,7 @@ Use these sections under `## [Unreleased]`:
5757
- Pattern: `DEVICE_ID=... npm run test:snapshot 2>&1 | tee /tmp/snapshot-results.txt` then read `/tmp/snapshot-results.txt` with the native read tool.
5858
- If you need a summary, read the log file and grep/filter it — the full output is always preserved.
5959
- Snapshot test command: `DEVICE_ID=33689F72-9B74-5406-9842-1CC6A6A96A88 npm run test:snapshot`
60+
- If physical-device snapshot tests hang after the final test summary, the likely cause is Apple post-failure diagnostics invoking `devicectl diagnose`, which may prompt for a macOS password and wedge in automated runs; see `docs/dev/device-snapshot-password-hang-rca.md`.
6061

6162
## **CRITICAL** Tool Usage Rules **CRITICAL**
6263
- NEVER use sed/cat to read a file or a range of a file. Always use the native read tool.
Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
1+
# Investigation: Device snapshot regressions
2+
3+
## Summary
4+
The strongest current evidence points to an Apple tooling issue in the post-test device diagnostics path, not a deterministic regression in XcodeBuildMCP’s shared device workflow code. After a failing physical-device `xcodebuild test`, Xcode launches `devicectl diagnose`; when that runs on a TTY it prints `Password:` and blocks on interactive input, and when Xcode launches it from a background TTY process group the subprocess gets wedged. In non-TTY mode, the same path does not hang and instead exits with a concrete `CoreDeviceCLISupport.DiagnoseError` / `No provider was found` failure.
5+
6+
## Symptoms
7+
- CLI `device build-and-run` success case returned `isError === true` in the snapshot suite.
8+
- CLI `device test` intentional failure snapshot was truncated after discovery/build-stage output and did not include the expected failure footer.
9+
- MCP `device test` intentional failure timed out after 300s in the suite.
10+
11+
## Investigation Log
12+
13+
### Initial assessment
14+
**Hypothesis:** A refactor changed shared device workflow behavior, likely in long-running command/result streaming or failure parsing.
15+
**Findings:** Targeted device test still passes, which suggests device discovery/build bootstrapping still works. Failures are concentrated in `build-and-run` and full-suite `test` failure handling.
16+
**Evidence:** `XcodeBuildMCP/src/snapshot-tests/suites/device-suite.ts:13-211`
17+
**Conclusion:** Needed broader context and direct reproductions.
18+
19+
### Broad code-path review
20+
**Hypothesis:** Shared test/build code or transport handling regressed.
21+
**Findings:** The device suite and entrypoint are effectively unchanged from `main`; the meaningful snapshot-harness changes vs `main` are timeout increases and normalization/MCP envelope handling, not device workflow logic.
22+
**Evidence:**
23+
- `src/snapshot-tests/suites/device-suite.ts` is unchanged in substance versus `main`.
24+
- `src/snapshot-tests/__tests__/device.snapshot.test.ts` is unchanged versus `main`.
25+
- `src/snapshot-tests/harness.ts:8-36` increased CLI snapshot timeout from `120000` to `300000`.
26+
- `src/snapshot-tests/mcp-harness.ts:9-13,93-105` increased MCP timeout from `120000` to `300000` and now prefers `structuredEnvelope.didError`.
27+
- `src/snapshot-tests/normalize.ts` changes are normalization-only.
28+
**Conclusion:** The reported failures are not explained by a direct diff in the device suite itself.
29+
30+
### CLI `build-and-run` direct reproduction
31+
**Hypothesis:** CLI is falsely classifying a successful build-and-run as an error because of render-session error latching.
32+
**Findings:** Isolated `device build-and-run` really failed on the first run. The exit code was `1`, and the output contained a real `devicectl` install failure:
33+
- `Unable to Install “Calculator”`
34+
- `ApplicationVerificationFailed`
35+
- `No code signature found`
36+
A second immediate rerun succeeded with exit code `0`.
37+
**Evidence:**
38+
- Direct run exit code capture: `/tmp/build-run.rc` contained `1`.
39+
- Direct run output: `/tmp/build-run.out` contained the `devicectl` install failure text.
40+
- Immediate rerun: `/tmp/build-run-2.rc` contained `0` and `/tmp/build-run-2.out` showed a complete success transcript.
41+
- The failed build log still ended with successful signing and `** BUILD SUCCEEDED **`: `/Users/cameroncooke/Library/Developer/XcodeBuildMCP/logs/build_run_device_2026-04-16T19-59-14-443Z_pid38742.log`.
42+
**Conclusion:** This is not primarily a false CLI `isError` classification bug. The combined flow hit a genuine but flaky device-side install failure.
43+
44+
### Build artifact validation after failed `build-and-run`
45+
**Hypothesis:** The refactor produced an actually unsigned app artifact.
46+
**Findings:** The app at the resolved path was signed correctly, and a direct `device install` of that same app path succeeded.
47+
**Evidence:**
48+
- `device get-app-path` succeeded and pointed to `~/Library/Developer/XcodeBuildMCP/DerivedData/Build/Products/Debug-iphoneos/CalculatorApp.app`.
49+
- `xcrun codesign -dvvv` against that app showed a valid Apple Development signature, team identifier, `_CodeSignature`, and `embedded.mobileprovision`.
50+
- Direct install succeeded with exit code `0` and output `✅ App installed successfully.` in `/tmp/install-device.out`.
51+
**Conclusion:** The artifact itself was valid. The first `build-and-run` failure is best explained as transient `devicectl` / physical-device install flakiness, not a deterministic signing regression in the build step.
52+
53+
### Fresh DerivedData isolation
54+
**Hypothesis:** Shared default DerivedData corruption is causing the `build-and-run` failure deterministically.
55+
**Findings:** Running `device build-and-run` with a fresh temporary `derivedDataPath` succeeded on the first try.
56+
**Evidence:** `/tmp/fresh-build-run.rc` contained `0`; `/tmp/fresh-build-run.out` contained a full success transcript.
57+
**Conclusion:** There is no evidence of a deterministic combined-flow failure tied solely to the current implementation. The failure is transient/stateful.
58+
59+
### Direct device-test reproductions
60+
**Hypothesis:** `test-common.ts`, parser/finalization code, or CLI/MCP transport handling regressed for failing device tests.
61+
**Findings:** The failing full-device test works normally in isolation and in focused back-to-back runs.
62+
**Evidence:**
63+
- Direct CLI full failing test exited `1` and completed in about 10s with the full footer in `/tmp/device-test.out`.
64+
- Back-to-back direct CLI runs (targeted pass, then full fail) completed normally: `targeted rc=0 dur=8.7s`, `fullfail rc=1 dur=6.8s`.
65+
- Back-to-back snapshot MCP harness calls also completed normally: targeted `isError:false` in ~9.2s, full fail `isError:true` in ~6.1s.
66+
**Conclusion:** The underlying device-test code path is not deterministically broken. The suite-only failure requires a broader state/order interaction.
67+
68+
### Post-test diagnostics reverse engineering
69+
**Hypothesis:** The hang happens after test execution, in Apple’s diagnostics collection path rather than in XcodeBuildMCP parsing/rendering.
70+
**Findings:** That hypothesis is confirmed.
71+
**Evidence:**
72+
- Raw PTY-backed `xcodebuild test` reproduced the exact symptom and captured `Password:` immediately after the final XCTest summary in `/tmp/pty-xcodebuild-test.typescript`.
73+
- During that stall, the live child process was:
74+
- `/Library/Developer/PrivateFrameworks/CoreDevice.framework/Versions/A/Resources/bin/devicectl diagnose --devices 00008140-000278A438E3C01C --no-finder --archive-destination ... --timeout 600`
75+
- Process state showed `devicectl` was running in its own process group on the same TTY, but **not** the foreground TTY process group:
76+
- `pgid=15787`, `tpgid=15723`, `state=T`
77+
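- Because `pgid` differs from `tpgid`, `devicectl` was not in the terminal's foreground process group, and `state=T` means it was stopped. That is consistent with a job-control stop (`SIGTTIN`/`SIGTTOU`) after a background process group attempted terminal I/O.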
78+
- Running the same `xcodebuild test` command **without** a TTY did not hang. It completed and printed Xcode’s own diagnostic failure text:
79+
- `Failure collecting diagnostics from devices`
80+
- `No provider was found`
81+
- `CoreDeviceCLISupport.DiagnoseError error 0`
82+
- Running `devicectl diagnose` directly under a PTY, with no `xcodebuild` involved, reproduced the same interactive path in `/tmp/pty-devicectl-diagnose.typescript`:
83+
- provisioning/provider error text
84+
- Apple privacy notice
85+
- `Password:`
86+
- sending a blank line produced `Sorry, try again.` and another `Password:` prompt
87+
- Running `devicectl diagnose` directly **without** a TTY did not prompt; it failed fast and wrote a partial bundle.
88+
**Conclusion:** The bad path is in Apple’s `devicectl diagnose` TTY behavior. `xcodebuild` makes it worse by launching that interactive subprocess from a background terminal process group, which wedges the run.
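The process-state check above can be expressed as a small helper. This is a minimal sketch, not code from the repository: `PsRow`, `parsePsRow`, and `isStoppedBackgroundJob` are hypothetical names, and the input is assumed to be one data row from something like `ps -o pid,pgid,tpgid,state -p <pid>`.

```typescript
// Hypothetical shape for one row of `ps -o pid,pgid,tpgid,state` output.
interface PsRow {
  pid: number;
  pgid: number;
  tpgid: number;
  state: string;
}

// Parse a whitespace-separated ps row; returns undefined if it does not fit.
function parsePsRow(line: string): PsRow | undefined {
  const [pidText, pgidText, tpgidText, state] = line.trim().split(/\s+/);
  if (
    pidText === undefined ||
    pgidText === undefined ||
    tpgidText === undefined ||
    state === undefined
  ) {
    return undefined;
  }
  const pid = Number(pidText);
  const pgid = Number(pgidText);
  const tpgid = Number(tpgidText);
  if (isNaN(pid) || isNaN(pgid) || isNaN(tpgid)) {
    return undefined;
  }
  return { pid, pgid, tpgid, state };
}

// True when the process has a controlling terminal (tpgid > 0), is not in
// that terminal's foreground process group, and is in the stopped state (T).
function isStoppedBackgroundJob(row: PsRow): boolean {
  const inBackgroundGroup = row.tpgid > 0 && row.pgid !== row.tpgid;
  const stopped = row.state.charAt(0) === 'T';
  return inBackgroundGroup && stopped;
}
```

Fed the values recorded above (`pgid=15787`, `tpgid=15723`, `state=T`, with any PID), `isStoppedBackgroundJob` returns `true`. A foreground process, where `pgid` equals `tpgid`, returns `false`.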
89+
90+
## Root Cause
91+
The root cause of the hang is Apple’s post-test diagnostics flow for failing physical-device tests:
92+
93+
1. After the failing test summary, `xcodebuild test` launches `devicectl diagnose` to collect device diagnostics.
94+
2. On a TTY, `devicectl diagnose` enters an interactive authorization path and prints `Password:`.
95+
3. When launched by `xcodebuild`, that subprocess is in a background TTY process group, so terminal input cannot be handled normally and the subprocess gets wedged.
96+
4. In non-TTY mode, the same path does not block on a password prompt; it exits with the real underlying diagnostics failure (`No provider was found`, `CoreDeviceCLISupport.DiagnoseError error 0`).
97+
98+
This means the snapshot hang is not primarily caused by:
99+
- `src/utils/test-common.ts`
100+
- `src/utils/xcresult-test-failures.ts`
101+
- `src/utils/xcodebuild-event-parser.ts`
102+
- `src/utils/renderers/cli-text-renderer.ts`
103+
- `src/utils/command.ts`
104+
- `src/snapshot-tests/mcp-harness.ts`
105+
106+
The earlier transient `build-and-run` install failure is real, but it is a separate flaky physical-device issue, not the cause of the `Password:` hang.
107+
108+
## Eliminated hypotheses
109+
- **Deterministic CLI error-latch bug for `build-and-run`** — ruled out by the captured real install failure text and exit code `1`.
110+
- **Deterministic parser/renderer regression dropping the test footer** — ruled out by successful isolated CLI and MCP failing-test runs.
111+
- **Deterministic MCP transport deadlock** — ruled out by successful focused MCP harness pass→fail reproduction.
112+
- **Deterministic signing regression in the built app artifact** — ruled out by successful codesign inspection and direct `device install` of the same app path.
113+
114+
## Recommendations
115+
1. Treat the `Password:` hang as an Apple `devicectl diagnose` / Xcode physical-device diagnostics problem, not as evidence of a deterministic XcodeBuildMCP refactor regression.
116+
2. For XcodeBuildMCP’s automated/device-test paths, prefer **non-interactive process mode** and preserve full stderr/stdout so Xcode’s explicit diagnostics failure is surfaced instead of a hung terminal prompt.
117+
3. Add timeout diagnostics around failing physical-device test runs that explicitly note whether the process appears to be stuck in post-test diagnostics collection.
118+
4. Do not treat the snapshot timeout increase from `120000` to `300000` as a fix; a longer timeout just makes this Apple diagnostics wedge slower to fail.
119+
5. Separately from the hang, keep the earlier `build-and-run` flake in mind as a real but distinct physical-device reliability issue.
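Recommendation 3 could look roughly like the sketch below. It is not the project's implementation: `classifyTimedOutRun` and the result labels are hypothetical, and the summary patterns (`** TEST FAILED **` and XCTest's `Executed N tests, with M failures` line) are assumptions about what the captured transcript contains. The `Password:` prompt was only observed in PTY-backed captures, so this check is useful only when output is captured that way.

```typescript
type TimeoutGuess =
  | 'no-summary-seen'
  | 'password-prompt-after-summary'
  | 'stalled-after-summary';

// Assumed markers for the final test summary in xcodebuild output.
const SUMMARY_PATTERN = /\*\* TEST FAILED \*\*|Executed \d+ tests?, with \d+ failures?/;

// Offset just past the last match of `pattern` in `text`, or -1 if none.
function lastMatchEnd(text: string, pattern: RegExp): number {
  const re = new RegExp(pattern.source, 'g');
  let end = -1;
  let match: RegExpExecArray | null;
  while ((match = re.exec(text)) !== null) {
    end = match.index + match[0].length;
    if (match[0].length === 0) {
      re.lastIndex++; // guard against zero-length matches looping forever
    }
  }
  return end;
}

// Given the output captured before a timeout, guess where the run stalled.
function classifyTimedOutRun(output: string): TimeoutGuess {
  const summaryEnd = lastMatchEnd(output, SUMMARY_PATTERN);
  if (summaryEnd < 0) {
    return 'no-summary-seen';
  }
  const afterSummary = output.slice(summaryEnd);
  return /Password:/.test(afterSummary)
    ? 'password-prompt-after-summary'
    : 'stalled-after-summary';
}
```

A timeout handler could append the returned label to its error message, so a run that stalled after the summary is reported as a likely diagnostics-collection wedge rather than a generic timeout.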
120+
121+
## Preventive Measures
122+
- Keep physical-device snapshot tests minimal and isolated.
123+
- Avoid chaining many mutating device operations in one snapshot file.
124+
- Capture direct per-step logs/artifacts for device tests so transient `devicectl` failures are visible without rerunning the whole file.
125+
- Be careful about interpreting longer timeouts as fixes; here they mainly make the suite slower when the device gets wedged.
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
1+
# RCA: Physical-device snapshot test hang after failing test
2+
3+
## Summary
4+
When a physical-device `xcodebuild test` run fails, Xcode launches `devicectl diagnose` to collect post-failure diagnostics. On an interactive terminal, `devicectl diagnose` prompts for a local macOS password before starting diagnostics collection. When Xcode launches that same interactive flow from the test process, it can wedge and make device snapshot tests appear to hang.
5+
6+
## Symptoms
7+
- A physical-device snapshot test stalls after the final XCTest summary.
8+
- In a terminal/PTY run, a transient or persistent `Password:` line appears after the failing test summary.
9+
- MCP/snapshot runs may sit until timeout instead of surfacing the real diagnostics failure.
10+
11+
## Confirmed evidence
12+
- The hang happens after test execution, not during the app test run itself.
13+
- After a failing physical-device test, `xcodebuild` launches:
14+
```bash
/Library/Developer/PrivateFrameworks/CoreDevice.framework/Versions/A/Resources/bin/devicectl diagnose ...
```
17+
- Running `devicectl diagnose` directly on a terminal reproduces the password gate:
18+
- prints Apple privacy text
19+
- prompts with `Password:`
20+
- after local auth, proceeds with diagnostics collection and completes
21+
- Running the same `xcodebuild test` command without a TTY does not hang; instead it exits and prints the underlying diagnostics error:
22+
- `Failure collecting diagnostics from devices`
23+
- `No provider was found`
24+
- `CoreDeviceCLISupport.DiagnoseError error 0`
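The non-TTY failure text listed above can be detected in captured output so that it is reported explicitly. This is a minimal sketch: `findDiagnoseFailureMarkers` is a hypothetical helper, and the marker strings are the three reported in this document.

```typescript
// Marker strings taken from the non-TTY `xcodebuild test` output described above.
const DIAGNOSE_FAILURE_MARKERS = [
  'Failure collecting diagnostics from devices',
  'No provider was found',
  'CoreDeviceCLISupport.DiagnoseError',
] as const;

// Returns the markers present in the captured output, in declaration order.
function findDiagnoseFailureMarkers(output: string): string[] {
  return DIAGNOSE_FAILURE_MARKERS.filter(
    (marker) => output.indexOf(marker) !== -1,
  );
}
```

A non-empty result suggests the run reached post-failure diagnostics collection and failed there, which is a separate condition from the test failure itself.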
25+
26+
## Root cause
27+
This is an Apple tooling issue in the post-failure device diagnostics path:
28+
29+
1. A failing physical-device test triggers `devicectl diagnose`.
30+
2. `devicectl diagnose` can require interactive local macOS authentication.
31+
3. In the `xcodebuild`-launched context, that interactive auth path may not be able to complete cleanly, so the run wedges.
32+
33+
## Scope
34+
This explains hangs after failing physical-device tests. It does not explain unrelated build/install flake in other device flows.
35+
36+
## Practical guidance
37+
- If a physical-device snapshot test hangs after the final test summary, check for a `Password:` prompt.
38+
- Prefer non-interactive execution when automating this flow so the command fails fast instead of hanging.
39+
- Do not treat longer timeouts as a fix; they only make this Apple diagnostics wedge slower to fail.
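One way to approximate non-interactive execution from Node.js is sketched below. `nonInteractiveSpawnOptions` and `runNonInteractive` are hypothetical names, not the project's command executor. The sketch relies on documented `child_process.spawn` behavior: `stdio: ['ignore', ...]` gives the child a non-TTY stdin, and on non-Windows platforms `detached: true` makes the child the leader of a new process group and session. Whether that is enough to keep `devicectl diagnose` off its password path in every environment is an assumption.

```typescript
import { spawn, type SpawnOptions } from 'node:child_process';

// Options that avoid handing a terminal to the child process.
function nonInteractiveSpawnOptions(): SpawnOptions {
  return {
    stdio: ['ignore', 'pipe', 'pipe'], // no stdin, capture stdout/stderr
    detached: true, // new session and process group on non-Windows platforms
  };
}

interface RunResult {
  code: number | null;
  output: string;
  timedOut: boolean;
}

// Run a command without a TTY, keeping combined output and enforcing a timeout.
function runNonInteractive(
  command: string,
  args: string[],
  timeoutMs: number,
): Promise<RunResult> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, nonInteractiveSpawnOptions());
    let output = '';
    let timedOut = false;
    child.stdout?.on('data', (chunk: Buffer) => {
      output += chunk.toString();
    });
    child.stderr?.on('data', (chunk: Buffer) => {
      output += chunk.toString();
    });
    const timer = setTimeout(() => {
      timedOut = true;
      try {
        // A negative PID targets the child's whole process group.
        if (child.pid !== undefined) process.kill(-child.pid, 'SIGKILL');
      } catch {
        child.kill('SIGKILL');
      }
    }, timeoutMs);
    child.on('error', (error) => {
      clearTimeout(timer);
      reject(error);
    });
    child.on('close', (code) => {
      clearTimeout(timer);
      resolve({ code, output, timedOut });
    });
  });
}
```

One caveat: the investigation observed `devicectl` running in its own process group, so the group kill in the timeout handler may not reach it. Treat the timeout as a backstop, not as cleanup.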
40+
41+
## Useful manual repro
42+
```bash
/Library/Developer/PrivateFrameworks/CoreDevice.framework/Versions/A/Resources/bin/devicectl diagnose \
  --devices <PHYSICAL_DEVICE_UDID> \
  --no-finder \
  --archive-destination /tmp/devicectl-live-sample.zip \
  --timeout 600
```
49+
50+
If prompted and local auth succeeds, the command proceeds with diagnostics collection and writes a zip archive.
