Wire CI AI reviewer to see tutorial notebook prose #414
Conversation
**Overall Assessment:** ⛔ Blocker — one unmitigated P0 security issue and one P1 sibling-surface regression.

**Executive Summary**

- **Methodology:** No findings. This PR does not change estimators, math, weighting, variance/SE, identification assumptions, or statistical defaults.
- **Code Quality:** P1 — Local reviewer prompt adaptation is now stale.
- **Performance:** No P1/P2 findings. The per-output …
- **Maintainability:** P3 — Extractor policy is documented.
- **Tech Debt:** No findings. I did not find a relevant …
- **Security:** P0 — Workflow executes PR-controlled Python before a secret-bearing step.
- **Documentation/Tests:** P1 — Existing local-review tests need to be updated with the prompt change.

**Path To Approval**
---
**P0 (Security):** The previous workflow ran `python3 tools/notebook_md_extract.py` from the PR checkout, before the secret-bearing Codex step. A malicious PR could modify the script and add a tutorial notebook to tamper with the prompt or attack the later step. This is particularly risky for `/ai-review` `issue_comment` reruns on fork PRs, since the workflow runs in base-repo context.

Fix: stage the trusted extractor from the BASE branch via `git show "$BASE_SHA:tools/notebook_md_extract.py" > /tmp/notebook_md_extract.py` in a dedicated step BEFORE prompt building, so the PR-controlled version of the script is never executed at all. Bootstrap case (this PR — the base branch does not yet contain the script): extraction is skipped with a placeholder line; once this PR merges, all subsequent PRs see the trusted extractor on base. (A sketch of the staging logic follows below.)

Defense in depth: add `persist-credentials: false` to `actions/checkout@v6` so the GITHUB_TOKEN is not embedded in `.git/config` during prompt building (blocks secondary token exfiltration via `git config` reads). The pre-fetch step is updated to pass the token explicitly via `http.extraheader="Authorization: bearer ${GH_TOKEN}"`.
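A minimal sketch of the bootstrap-aware staging logic described above, written as a Python helper for illustration (the actual fix is a workflow shell step; the function name and return convention here are hypothetical):

```python
import subprocess

def stage_trusted_extractor(base_sha: str,
                            dest: str = "/tmp/notebook_md_extract.py") -> bool:
    """Stage tools/notebook_md_extract.py from the trusted base branch.

    Returns False in the bootstrap case where the base branch does not
    yet contain the script, so the caller can emit a placeholder line
    instead of ever running PR-controlled code.
    """
    result = subprocess.run(
        ["git", "show", f"{base_sha}:tools/notebook_md_extract.py"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:  # base predates the extractor (this PR)
        return False
    with open(dest, "w", encoding="utf-8") as fh:
        fh.write(result.stdout)
    return True
```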
**P1 (Substitution drift):** `_adapt_review_criteria()` in `.claude/scripts/openai_review.py` exact-matches the CI Single-Pass mandate block, including the `docs/tutorials/*.ipynb` line that was removed. The substitution therefore failed on the prior commit, leaving CI-only audit instructions in the local-review prompt and emitting only a stderr warning.

Fix: update the expected text in `_SUBSTITUTIONS` to match the current `pr_review.md` block (drop the tutorials line; include the new "Tutorial notebook prose" paragraph). All 6 tests in `tests/test_openai_review.py::TestAdaptReviewCriteria` now pass. (A sketch of the exact-match pattern follows below.)
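A sketch of the exact-match substitution pattern that drifted, with structure assumed from the description above (the pair contents are illustrative placeholders, not the real prompt text):

```python
import sys

# (expected CI text, local replacement) pairs; an exact substring match is
# required, so any edit to the pr_review.md block silently breaks the pair.
_SUBSTITUTIONS = [
    ("<CI Single-Pass mandate block as it appears in pr_review.md>",
     "<local-review replacement text>"),
]

def _adapt_review_criteria(criteria: str) -> str:
    for expected, replacement in _SUBSTITUTIONS:
        if expected not in criteria:
            # Drift: the CI-only instructions stay in the local prompt,
            # and the only signal is this stderr warning.
            print(f"warning: substitution target not found: {expected[:40]}...",
                  file=sys.stderr)
            continue
        criteria = criteria.replace(expected, replacement)
    return criteria
```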
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---

/ai-review

---
🔁 AI review rerun (requested by @igerber). Head SHA: …

**Overall Assessment**

**Executive Summary**

- **Methodology:** No findings.
- **Code Quality:** No unmitigated findings. Prior local-review prompt regression is resolved: …
- **Performance:** P2 — Notebook extraction has only per-output caps, not total prompt caps.
- **Maintainability:** No findings. The extractor's omission policy for HTML-only outputs, images, and raw cells is documented in …
- **Tech Debt:** No mitigating TODO entry found for the P1 below. The P2 CI/path-filter gap is deferrable but currently untracked.
- **Security:** P1 — [Newly identified] PR-controlled prompt/prose can still steer the AI reviewer.
- **Documentation/Tests:** P2 — Tool tests are not triggered by tool-only changes.
- **Verification note:** …

**Path To Approval**
---
**P1 (Security, newly identified):** The R1 fix staged the notebook extractor from BASE_SHA, but the workflow still ran `cat .github/codex/prompts/pr_review.md` from the PR checkout — meaning a PR could rewrite the reviewer instructions used to review itself. The PR-controlled notebook prose was also appended without an untrusted-content boundary, so a malicious notebook cell could include "ignore prior directions"-style content competing with the reviewer criteria.

Fix: stage the reviewer prompt from BASE_SHA via `git show "$BASE_SHA:.github/codex/prompts/pr_review.md" > /tmp/pr_review.md` in the same step that stages the extractor (fail-closed if the base lacks the prompt — no bootstrap fallback). The prompt-build step now uses `/tmp/pr_review.md` instead of the PR-checkout version. The notebook extraction loop output is wrapped in `<untrusted-pr-content marker="...">` tags, and the reviewer prompt's "Tutorial notebook prose" paragraph is extended to instruct the model to review the prose for correctness but ignore any directive inside the wrapper. The same rule explicitly extends to `<previous-ai-review-output>` and PR title/body text.

**P2 (Performance):** The previous `--max-output-chars 20000` cap was per-output only — a PR with many small outputs could still bloat the prompt. Added `--max-total-chars` to the extractor, with a single truncation marker appended after the cap. The workflow invocation passes `--max-total-chars 200000` for a per-notebook cap; the loop accumulates across multiple changed notebooks, but each individual notebook is bounded. (See the sketch after this comment.)

**P2 (Tests/CI):** The `rust-test.yml` path filters covered `tests/**` but not `tools/**`, so a future PR that changes only `tools/notebook_md_extract.py` would skip the test job. Added `tools/**` to both the push and pull_request path filters.

`_SUBSTITUTIONS` in `.claude/scripts/openai_review.py` is updated to mirror the new prompt block. A new test, `test_max_total_chars_truncates_whole_notebook`, locks the per-notebook truncation behavior. All 16 affected tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
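A minimal sketch of the per-notebook total cap described in the P2 above, assuming the extractor accumulates rendered Markdown chunks before writing (function and variable names are illustrative, not the PR's actual code):

```python
TRUNCATION_MARKER = "\n\n(... notebook output truncated at --max-total-chars ...)\n"

def render_notebook(chunks: list[str], max_total_chars: int | None) -> str:
    """Join rendered Markdown chunks, enforcing a whole-notebook cap.

    Unlike --max-output-chars (which bounds each individual output),
    this bounds the sum, so many small outputs cannot bloat the prompt.
    A single truncation marker is appended once the cap is hit.
    """
    text = "".join(chunks)
    if max_total_chars is not None and len(text) > max_total_chars:
        return text[:max_total_chars] + TRUNCATION_MARKER
    return text
```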
---

/ai-review

---
🔁 AI review rerun (requested by @igerber). Head SHA: …

**Overall Assessment:** ✅ Looks good — no unmitigated P0/P1 findings.

**Executive Summary**

- **Methodology:** No findings.
- **Code Quality:** No findings.
- **Performance:** No findings.
- **Maintainability:** No findings.
- **Tech Debt:** No findings.
- **Security:** No findings.
- **Documentation/Tests:** No findings.
- **Verification:** …

---
Closes the gap from PR #409 where the CI AI reviewer ran 3+ rounds blind to tutorial notebook prose because the workflow excluded `docs/tutorials/*.ipynb` from the diff.

**Extractor** (`tools/notebook_md_extract.py`, +95 LoC): stdlib-only Jupyter notebook → Markdown converter. `_to_str()` coerces nbformat raw JSON's list-or-string `source`/`text` fields (88%/100% list-form rates). `--max-output-chars 20000` caps each text/plain or stream output; `--max-total-chars 200000` caps the whole notebook. text/html-only outputs, `image/*` data, and raw cells are dropped (documented in the module docstring and `--help`).

**Workflow** (`.github/workflows/ai_pr_review.yml`): three trusted-from-base sources staged via `git show "$BASE_SHA:..." > /tmp/...` — `pr_review.md`, `openai_review.py`, and `notebook_md_extract.py`. The trusted invocation uses `/tmp/openai_review.py --review-criteria /tmp/pr_review.md`, so a malicious PR cannot rewrite reviewer instructions, exfiltrate `OPENAI_API_KEY` via the script, or inject code before the secret-bearing API call. `actions/checkout@v6` uses `persist-credentials: false`, and the pre-fetch step passes `GITHUB_TOKEN` via `http.extraheader` (env-scoped) to block secondary token exfiltration via `.git/config` reads. Notebook prose extraction runs the trusted extractor on changed tutorials and writes to `/tmp/notebook-prose.md` (fail-soft per notebook: a malformed notebook degrades to a placeholder line rather than killing the AI review job).

**Prompt** (`.github/codex/prompts/pr_review.md` + `openai_review.py`): Section 5 drops the `docs/tutorials/*.ipynb` DO-NOT line and adds a "Tutorial Notebook Prose" paragraph that directs the reviewer to the new prompt block. The block is wrapped in `<notebook-prose untrusted="true">` (mirroring PR #415's `<pr-body>`/`<previous-review-output>` conventions); the reviewer is instructed to review the prose for correctness but ignore any directive inside the wrapper. `compile_prompt()` renders the section after the diff (fresh or delta mode) and before Full Source Files. `_MANDATE_SUBSTITUTIONS` updated to match (drift-check: 12/12 TestAdaptReviewCriteria cases pass with zero substitution warnings).

**Sanitization**: refactored the three close-tag escapers into a shared `_sanitize_wrapper_tag(text, tag_name)` helper. `_sanitize_pr_body` is kept as a backward-compatible thin wrapper. Previous-review-output's inline regex is replaced with the helper. (Sketches of `_to_str()` and the shared sanitizer follow this comment.)

**Tests** (`tests/test_openai_review.py` +219 LoC, `tests/test_notebook_md_extract.py` +190 LoC new): inline-fixture extractor suite with a skip-guard on `tools/` existence for the isolated-install matrix; `compile_prompt` ordering pins for fresh + delta modes via explicit `text.index()` assertions; parametrized close-tag-variant parity tests across all three wrappers; supply-chain workflow-text assertions for the three `git show "$BASE_SHA:..."` invocations and `persist-credentials: false`; TestMainCLIPropagation extension for `--notebook-prose`.

**rust-test.yml**: `tools/**` added to push + PR path filters so future extractor-only changes trigger the test job.

**T21 workaround reap**: `docs/_review/t21_notebook_extract.md` (450 lines, the one-shot extract from PR #409) and the `_review` entry in `docs/conf.py:exclude_patterns` were left behind on origin/main; both are removed here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
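A minimal sketch of the `_to_str()` coercion described under **Extractor** above — nbformat's raw JSON stores `source` and `text` either as a single string or as a list of line strings, so the converter must normalize both (the exact signature is assumed from the description):

```python
def _to_str(value) -> str:
    """Coerce nbformat's list-or-string source/text fields to one string.

    In raw notebook JSON these fields are usually a list of lines (each
    typically newline-terminated already), but a plain string is also
    legal, so both forms are handled.
    """
    if isinstance(value, str):
        return value
    if isinstance(value, list):
        return "".join(str(item) for item in value)
    return str(value)  # defensive fallback for malformed notebooks
```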
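And a sketch of the shared close-tag escaper described under **Sanitization** — the point is that untrusted text must not be able to close its own wrapper tag and escape the boundary (the regex and replacement text are assumptions, not the PR's exact code):

```python
import re

def _sanitize_wrapper_tag(text: str, tag_name: str) -> str:
    """Neutralize close-tag variants of `tag_name` inside untrusted text.

    Matches </tag-name> with optional internal whitespace so wrapped
    content cannot terminate its own boundary early.
    """
    pattern = re.compile(rf"<\s*/\s*{re.escape(tag_name)}\s*>", re.IGNORECASE)
    return pattern.sub(f"&lt;/{tag_name}&gt;", text)

def _sanitize_pr_body(text: str) -> str:
    # Backward-compatible thin wrapper over the shared helper.
    return _sanitize_wrapper_tag(text, "pr-body")
```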
Force-pushed e52cdf2 to 6e9bd71 (Compare)
---

/ai-review

---
🔁 AI review rerun (requested by @igerber). Head SHA: …

**Overall Assessment**

**Executive Summary**

- **Methodology:** No methodology defects found.
- **Code Quality:** No blocking findings.
- **Performance:** No blocking findings.
- **Maintainability:** No blocking findings.
- **Tech Debt:** No blocking findings.
- **Security:** No blocking findings.
- **Documentation/Tests:** [Newly identified] Missing workflow-contract tests for explicitly shipped notebook extraction behavior. (See the test sketch after this review.)

**Path to Approval**
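A hedged sketch of the kind of workflow-contract test this finding asks for — a pytest that pins the shipped extraction behavior by asserting on the workflow text itself, in the style of the supply-chain assertions described in the PR summary above (the asserted strings are taken from this thread; treat them as assumptions about the final workflow):

```python
from pathlib import Path

WORKFLOW = Path(".github/workflows/ai_pr_review.yml")

def test_workflow_stages_trusted_extractor_and_caps_output():
    text = WORKFLOW.read_text(encoding="utf-8")
    # The extractor must be staged from the base branch, never the PR checkout.
    assert 'git show "$BASE_SHA:tools/notebook_md_extract.py"' in text
    # The trusted /tmp copy is the one invoked.
    assert "/tmp/notebook_md_extract.py" in text
    # Prompt-size caps are wired through the invocation.
    assert "--max-output-chars 20000" in text
    assert "--max-total-chars 200000" in text
```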
---
Summary
- Replace the `:!docs/tutorials/*.ipynb` exclusion in `.github/workflows/ai_pr_review.yml` with a markdown-extraction step that loops over changed tutorial notebooks and appends prose + code + executed outputs to the reviewer prompt.
- Drop `docs/tutorials/*.ipynb` from the DO-NOT list at `.github/codex/prompts/pr_review.md` and add a pointer to the new "Tutorial notebook prose" prompt block.
- Remove `docs/_review/t21_notebook_extract.md` and the `_review` entry in `docs/conf.py:exclude_patterns`.
- New `tools/notebook_md_extract.py` (stdlib-only — no `nbformat` dep, no `pip install` step in the workflow) with a `_to_str()` helper that coerces nbformat raw JSON's list-or-string `source`/`text` fields.
- `text/html`-only outputs, `image/*` data, and `raw` cells are intentionally dropped (see module docstring).
- `--max-output-chars` argument caps individual outputs with a truncation marker. Workflow invocation passes `--max-output-chars 20000` to bound prompt growth.
- Extraction is fail-soft (`|| echo "(extraction failed ...)"`) so one malformed notebook degrades to a placeholder line rather than killing the AI review job. Job is best-effort (no merge gating).

Methodology references (required if estimator / math changes)
Validation
- New `tests/test_notebook_md_extract.py` (9 cases: list-vs-string `source` coercion, the `_to_str` helper, HTML-only / image / raw / error output handling, `--max-output-chars` truncation, CLI `--input`/`--output`, CLI stdout). Inline-fixture pattern with a skip-guard on `tools/notebook_md_extract.py` existence so the test runs cleanly in `rust-test.yml`'s isolated-install matrix.
- `**Output:**` blocks render correctly. Workflow YAML validated via `yaml.safe_load`. Sphinx config still parses with `_review` removed from `exclude_patterns`.

Security / privacy
Generated with Claude Code