Skip to content

Commit 4fb9b41

Browse files
committed
Document residual unmodeled bypass paths; soften 'CI fails closed' claim
R10 review flagged shell-script direct execution (`bash <script>`, `source <script>`, `./<script>`) and multi-line `python3 -c` bodies as unmodeled. After 10 rounds of progressive guard tightening (R0-R9), the trajectory is clear: each round adds a new shell-feature to model and the reviewer finds the next gap. Continuing further has diminishing return — static-shell-parsing has irreducible limits. Per discussion with author, the 16-test guard is accepted as 'reasonable defense against accidental regressions' rather than a complete adversarial parser. The dismissal's primary defense is the documented threat model (workflow comment block + dismissed_comment field on alert #14), not the test. This commit: - Soften the workflow comment block's "CI fails closed" line. Replace with explicit enumeration of what the test catches (accidental regressions of common forms) AND what it does NOT model (script execution, multi-line python -c, var expansion, eval, find -exec, xargs). Frame the test as belt-and-suspenders. - Expand TestWorkflowDoesNotExecutePRHeadCode docstring with the same SCOPE / NOT-MODELED enumeration. - Add TODO.md row tracking the residuals + reasoning for accepting them. Test invariants preserved (test_workflow_dismissal_comment_block_present still passes — the required strings remain present). 16 guard tests pass; YAML still parses. This is the final commit on PR #436. No more guard-coverage expansion. CodeQL #14 dismissal will be applied via `gh api PATCH` post-merge.
1 parent 46fcb78 commit 4fb9b41

3 files changed

Lines changed: 74 additions & 3 deletions

File tree

.github/workflows/ai_pr_review.yml

Lines changed: 28 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -86,8 +86,34 @@ jobs:
8686
#
8787
# Guard test: tests/test_openai_review.py
8888
# ::TestWorkflowDoesNotExecutePRHeadCode
89-
# CI fails closed if a forbidden execution pattern is
90-
# introduced.
89+
# Catches the common ACCIDENTAL-regression forms: pip / npm /
90+
# yarn / cargo / make / maturin / poetry / pdm / uv / tox /
91+
# setup.py install/run, python file execution against PR-head
92+
# paths, bash -c / sh -c (including compound flags -lc/-ec/-exc)
93+
# with python/cp inside, env-var prefixes (`VAR=1 python3 ...`),
94+
# wrapper commands (`env`, `nohup`, `exec`, `time`, `command`),
95+
# shell negation (`!`), backslash line continuations, single-line
96+
# `python3 -c` payloads (literal-allowlisted, currently empty),
97+
# subshell `(...)`, brace group `{ ...; }`, and write-overwrites
98+
# of allowlisted /tmp paths between staging and execution
99+
# (`>`, `>>`, `cp`, `mv`, `tee`, `ln`).
100+
#
101+
# NOT exhaustive against an adversarial maintainer. Known
102+
# unmodeled paths (any of these CAN execute PR-head bytes
103+
# without tripping CI; a maintainer reviewing this workflow
104+
# MUST refuse changes that introduce them):
105+
# - bash <script> / sh <script> / ./<script> / source <script>
106+
# / . <script> — direct shell-script execution
107+
# - multi-line `python3 -c` bodies (line-by-line shlex can't
108+
# reassemble across newlines; the workflow's existing 5
109+
# sanitizer bodies are therefore exempt by invisibility)
110+
# - variable expansion: `SCRIPT="$X"; python3 "$SCRIPT"`
111+
# - `eval`, `find -exec`, `xargs -I {}`
112+
# The dismissal's primary defense is THIS comment block plus
113+
# the `dismissed_comment` field on alert #14. The guard test
114+
# is belt-and-suspenders for accidental regressions, not a
115+
# complete adversarial parser.
116+
# See TODO.md for the long-term tracking of this gap.
91117
# ─────────────────────────────────────────────────────────────────
92118
- name: Resolve PR number + metadata
93119
id: pr

TODO.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -145,6 +145,7 @@ Deferred items from PR reviews that were not addressed before merge.
145145
| HonestDiD `test_m0_short_circuit` uses wall-clock `elapsed < 0.5s` as a proxy for "short-circuit path taken" instead of calling the full optimizer. Replace with a direct correctness signal (mock/spy the optimizer or check a state flag) so the test doesn't depend on CI timing. Not flaky today at 500ms, but load-bearing correctness on a timing proxy is brittle. | `tests/test_methodology_honest_did.py:246` || Low |
146146
| SyntheticDiD: rename internal `placebo_effects` variable to `variance_effects` (or `resampled_effects`). Misleading name across the placebo/bootstrap/jackknife dispatch paths — holds three different contents depending on variance method. Low-risk refactor; user-facing field rename should preserve `placebo_effects` as a deprecated alias for one release. | `synthetic_did.py`, `results.py` | follow-up | Medium |
147147
| AI review CI: pin workflow contract via test (uses `openai/codex-action@v1`, passes `prompt-file`, reads `steps.run_codex.outputs.final-message`, preserves diff-exclude paths and comment markers). Currently only the wrapper-tag and closing-tag-escape strings are asserted. | `tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml` | #416 | Low |
148+
| `TestWorkflowDoesNotExecutePRHeadCode` (CodeQL #14 dismissal guard) does not model: `bash <script>` / `sh <script>` / `./<script>` / `source <script>` / `. <script>` direct shell-script execution; multi-line `python3 -c` bodies (line-by-line shlex can't reassemble across newlines — the workflow's 5 sanitizer bodies are exempt by invisibility); shell-variable-expansion indirection (`SCRIPT="$X"; python3 "$SCRIPT"`); `eval`; `find -exec`; `xargs -I {}`. Each represents a path by which PR-head bytes COULD execute without the test failing. The guard catches accidental regressions of common forms (16 tests covering pip/npm/cargo/maturin/etc. installs, python file exec, bash -c indirection with compound flags, env-var prefixes, line continuations, subshells/brace groups, single-line python -c, write-overwrites of allowlisted /tmp paths). Closing the residuals would require multi-line shell parsing with command-substitution awareness + script-execution allowlists — significant work for diminishing return given the dismissal's primary defense is the documented threat model on the alert and in `.github/workflows/ai_pr_review.yml` comment block. | `tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml` | #436 | Low |
148149

149150
---
150151

tests/test_openai_review.py

Lines changed: 45 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2670,7 +2670,51 @@ class TestWorkflowDoesNotExecutePRHeadCode:
26702670
the resolve-pr step in `.github/workflows/ai_pr_review.yml` for the
26712671
full rationale and invalidation conditions. The dismissal accepts
26722672
that the workflow CHECKS OUT PR-head content but is valid only
2673-
while the workflow does not EXECUTE that content."""
2673+
while the workflow does not EXECUTE that content.
2674+
2675+
SCOPE — what this test modelS:
2676+
- Common installer/runner commands (pip, npm, yarn, cargo, make,
2677+
maturin, poetry, pdm, uv, tox, setup.py, etc.) via word-boundary
2678+
regex with shell-tokenization
2679+
- Python file execution against PR-head paths (any first non-flag
2680+
positional after `python`/`python3`/`python2`)
2681+
- Allowlisted /tmp python execution paired with EXACT BASE_SHA
2682+
staging command (`git show "${BASE_SHA}":<src> > <tmp>`),
2683+
ordering check (staging before exec), and overwrite check (no
2684+
cp/mv/tee/ln/redirect to the path between staging and exec)
2685+
- bash/sh -c (and compound flags -lc/-ec/-exc) recursively
2686+
classified
2687+
- Subshell `(...)` and brace group `{...}` strip
2688+
- Env-var prefixes (`VAR=1 cmd ...`) and wrapper commands
2689+
(`env`, `nohup`, `exec`, `time`, `command`)
2690+
- Shell negation `!`
2691+
- Backslash line continuations folded before classification
2692+
- Single-line `python3 -c <body>` bodies via literal allowlist
2693+
(`ALLOWED_PYTHON_C_PAYLOADS`), currently empty
2694+
- Step-scoped invariants: Codex `sandbox: read-only`, resolve-pr
2695+
`head_sha = pr.data.head.sha` (API-pinned), open-PR checkout
2696+
pins to `head_repo_full_name` + `head_sha`, comment-trigger
2697+
`author_association` gating
2698+
2699+
SCOPE — what this test does NOT model (residuals tracked in
2700+
TODO.md, accepted as the cost of static-shell-parsing limits):
2701+
- bash <script> / sh <script> / ./<script> / source <script> /
2702+
. <script> direct shell-script execution
2703+
- Multi-line `python3 -c` bodies (line-by-line shlex can't
2704+
reassemble across newlines; the workflow's 5 sanitizer bodies
2705+
are exempt by invisibility)
2706+
- Variable expansion (`SCRIPT="$X"; python3 "$SCRIPT"`)
2707+
- `eval`, `find -exec`, `xargs -I {}`
2708+
2709+
The dismissal's PRIMARY defense is the human-readable comment
2710+
block above the resolve-pr step + the `dismissed_comment` field
2711+
on alert #14, NOT this test. The test catches accidental
2712+
regressions of common forms; it is not a complete adversarial
2713+
parser, and would require modeling more shell semantics than is
2714+
productive in a unit test to become one. See TODO.md for the
2715+
long-term tracking of unmodeled paths and PR #436's review
2716+
history (rounds R0–R10) for the rationale of where the line
2717+
was drawn."""
26742718

26752719
# Word-boundary regexes (label, pattern). Using regex with `\b`
26762720
# boundaries instead of substring matches catches command tokens

0 commit comments

Comments
 (0)