Document residual unmodeled bypass paths; soften 'CI fails closed' claim

igerber · igerber · commit 4fb9b414de2c · 2026-05-15T06:35:09.000-04:00
R10 review flagged shell-script direct execution (`bash <script>`, `source <script>`, `./<script>`) and multi-line `python3 -c` bodies as unmodeled. After 10 rounds of progressive guard tightening (R0-R9), the trajectory is clear: each round adds a new shell-feature to model and the reviewer finds the next gap. Continuing further has diminishing return — static-shell-parsing has irreducible limits. Per discussion with author, the 16-test guard is accepted as 'reasonable defense against accidental regressions' rather than a complete adversarial parser. The dismissal's primary defense is the documented threat model (workflow comment block + dismissed_comment field on alert #14), not the test. This commit: - Soften the workflow comment block's "CI fails closed" line. Replace with explicit enumeration of what the test catches (accidental regressions of common forms) AND what it does NOT model (script execution, multi-line python -c, var expansion, eval, find -exec, xargs). Frame the test as belt-and-suspenders. - Expand TestWorkflowDoesNotExecutePRHeadCode docstring with the same SCOPE / NOT-MODELED enumeration. - Add TODO.md row tracking the residuals + reasoning for accepting them. Test invariants preserved (test_workflow_dismissal_comment_block_present still passes — the required strings remain present). 16 guard tests pass; YAML still parses. This is the final commit on PR #436. No more guard-coverage expansion. CodeQL #14 dismissal will be applied via `gh api PATCH` post-merge.
diff --git a/.github/workflows/ai_pr_review.yml b/.github/workflows/ai_pr_review.yml
@@ -86,8 +86,34 @@ jobs:
       #
       # Guard test: tests/test_openai_review.py
       #   ::TestWorkflowDoesNotExecutePRHeadCode
-      # CI fails closed if a forbidden execution pattern is
-      # introduced.
+      # Catches the common ACCIDENTAL-regression forms: pip / npm /
+      # yarn / cargo / make / maturin / poetry / pdm / uv / tox /
+      # setup.py install/run, python file execution against PR-head
+      # paths, bash -c / sh -c (including compound flags -lc/-ec/-exc)
+      # with python/cp inside, env-var prefixes (`VAR=1 python3 ...`),
+      # wrapper commands (`env`, `nohup`, `exec`, `time`, `command`),
+      # shell negation (`!`), backslash line continuations, single-line
+      # `python3 -c` payloads (literal-allowlisted, currently empty),
+      # subshell `(...)`, brace group `{ ...; }`, and write-overwrites
+      # of allowlisted /tmp paths between staging and execution
+      # (`>`, `>>`, `cp`, `mv`, `tee`, `ln`).
+      #
+      # NOT exhaustive against an adversarial maintainer. Known
+      # unmodeled paths (any of these CAN execute PR-head bytes
+      # without tripping CI; a maintainer reviewing this workflow
+      # MUST refuse changes that introduce them):
+      #   - bash <script> / sh <script> / ./<script> / source <script>
+      #     / . <script>  — direct shell-script execution
+      #   - multi-line `python3 -c` bodies (line-by-line shlex can't
+      #     reassemble across newlines; the workflow's existing 5
+      #     sanitizer bodies are therefore exempt by invisibility)
+      #   - variable expansion: `SCRIPT="$X"; python3 "$SCRIPT"`
+      #   - `eval`, `find -exec`, `xargs -I {}`
+      # The dismissal's primary defense is THIS comment block plus
+      # the `dismissed_comment` field on alert #14. The guard test
+      # is belt-and-suspenders for accidental regressions, not a
+      # complete adversarial parser.
+      # See TODO.md for the long-term tracking of this gap.
       # ─────────────────────────────────────────────────────────────────
       - name: Resolve PR number + metadata
         id: pr
diff --git a/TODO.md b/TODO.md
@@ -145,6 +145,7 @@ Deferred items from PR reviews that were not addressed before merge.
 | HonestDiD `test_m0_short_circuit` uses wall-clock `elapsed < 0.5s` as a proxy for "short-circuit path taken" instead of calling the full optimizer. Replace with a direct correctness signal (mock/spy the optimizer or check a state flag) so the test doesn't depend on CI timing. Not flaky today at 500ms, but load-bearing correctness on a timing proxy is brittle. | `tests/test_methodology_honest_did.py:246` | — | Low |
 | SyntheticDiD: rename internal `placebo_effects` variable to `variance_effects` (or `resampled_effects`). Misleading name across the placebo/bootstrap/jackknife dispatch paths — holds three different contents depending on variance method. Low-risk refactor; user-facing field rename should preserve `placebo_effects` as a deprecated alias for one release. | `synthetic_did.py`, `results.py` | follow-up | Medium |
 | AI review CI: pin workflow contract via test (uses `openai/codex-action@v1`, passes `prompt-file`, reads `steps.run_codex.outputs.final-message`, preserves diff-exclude paths and comment markers). Currently only the wrapper-tag and closing-tag-escape strings are asserted. | `tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml` | #416 | Low |
+| `TestWorkflowDoesNotExecutePRHeadCode` (CodeQL #14 dismissal guard) does not model: `bash <script>` / `sh <script>` / `./<script>` / `source <script>` / `. <script>` direct shell-script execution; multi-line `python3 -c` bodies (line-by-line shlex can't reassemble across newlines — the workflow's 5 sanitizer bodies are exempt by invisibility); shell-variable-expansion indirection (`SCRIPT="$X"; python3 "$SCRIPT"`); `eval`; `find -exec`; `xargs -I {}`. Each represents a path by which PR-head bytes COULD execute without the test failing. The guard catches accidental regressions of common forms (16 tests covering pip/npm/cargo/maturin/etc. installs, python file exec, bash -c indirection with compound flags, env-var prefixes, line continuations, subshells/brace groups, single-line python -c, write-overwrites of allowlisted /tmp paths). Closing the residuals would require multi-line shell parsing with command-substitution awareness + script-execution allowlists — significant work for diminishing return given the dismissal's primary defense is the documented threat model on the alert and in `.github/workflows/ai_pr_review.yml` comment block. | `tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml` | #436 | Low |
 
 ---
 
diff --git a/tests/test_openai_review.py b/tests/test_openai_review.py
@@ -2670,7 +2670,51 @@ class TestWorkflowDoesNotExecutePRHeadCode:
     the resolve-pr step in `.github/workflows/ai_pr_review.yml` for the
     full rationale and invalidation conditions. The dismissal accepts
     that the workflow CHECKS OUT PR-head content but is valid only
-    while the workflow does not EXECUTE that content."""
+    while the workflow does not EXECUTE that content.
+
+    SCOPE — what this test modelS:
+    - Common installer/runner commands (pip, npm, yarn, cargo, make,
+      maturin, poetry, pdm, uv, tox, setup.py, etc.) via word-boundary
+      regex with shell-tokenization
+    - Python file execution against PR-head paths (any first non-flag
+      positional after `python`/`python3`/`python2`)
+    - Allowlisted /tmp python execution paired with EXACT BASE_SHA
+      staging command (`git show "${BASE_SHA}":<src> > <tmp>`),
+      ordering check (staging before exec), and overwrite check (no
+      cp/mv/tee/ln/redirect to the path between staging and exec)
+    - bash/sh -c (and compound flags -lc/-ec/-exc) recursively
+      classified
+    - Subshell `(...)` and brace group `{...}` strip
+    - Env-var prefixes (`VAR=1 cmd ...`) and wrapper commands
+      (`env`, `nohup`, `exec`, `time`, `command`)
+    - Shell negation `!`
+    - Backslash line continuations folded before classification
+    - Single-line `python3 -c <body>` bodies via literal allowlist
+      (`ALLOWED_PYTHON_C_PAYLOADS`), currently empty
+    - Step-scoped invariants: Codex `sandbox: read-only`, resolve-pr
+      `head_sha = pr.data.head.sha` (API-pinned), open-PR checkout
+      pins to `head_repo_full_name` + `head_sha`, comment-trigger
+      `author_association` gating
+
+    SCOPE — what this test does NOT model (residuals tracked in
+    TODO.md, accepted as the cost of static-shell-parsing limits):
+    - bash <script> / sh <script> / ./<script> / source <script> /
+      . <script> direct shell-script execution
+    - Multi-line `python3 -c` bodies (line-by-line shlex can't
+      reassemble across newlines; the workflow's 5 sanitizer bodies
+      are exempt by invisibility)
+    - Variable expansion (`SCRIPT="$X"; python3 "$SCRIPT"`)
+    - `eval`, `find -exec`, `xargs -I {}`
+
+    The dismissal's PRIMARY defense is the human-readable comment
+    block above the resolve-pr step + the `dismissed_comment` field
+    on alert #14, NOT this test. The test catches accidental
+    regressions of common forms; it is not a complete adversarial
+    parser, and would require modeling more shell semantics than is
+    productive in a unit test to become one. See TODO.md for the
+    long-term tracking of unmodeled paths and PR #436's review
+    history (rounds R0–R10) for the rationale of where the line
+    was drawn."""
 
     # Word-boundary regexes (label, pattern). Using regex with `\b`
     # boundaries instead of substring matches catches command tokens