Merge pull request #404 from igerber/update-codex-review

igerber · web-flow · commit 7b6805c62492 · 2026-05-09T13:51:49.000-04:00
Bump AI PR review to gpt-5.5 + add Single-Pass Completeness Mandate
diff --git a/.claude/commands/ai-review-local.md b/.claude/commands/ai-review-local.md
@@ -22,14 +22,14 @@ pre-PR use. Designed for iterative review/revision cycles before submitting a PR
   files (default: 200000). Changed source files are always included regardless of budget.
 - `--force-fresh`: Skip delta-diff mode, run a full fresh review even if previous state exists
 - `--full-registry`: Include the entire REGISTRY.md instead of selective sections
-- `--model <name>`: Override the OpenAI model (default: `gpt-5.4`)
-- `--timeout <seconds>`: HTTP request timeout (default: 300). Use 900 for reasoning models.
+- `--model <name>`: Override the OpenAI model (default: `gpt-5.5`)
+- `--timeout <seconds>`: HTTP request timeout. If omitted, defaults to 900 for reasoning models (gpt-5.4, gpt-5.5, *-pro, o1/o3/o4) and 300 otherwise.
 - `--dry-run`: Print the compiled prompt without calling the API
 
-**Reasoning models** (`gpt-5.4-pro`, `o3`, `o4-mini`, etc.): Reviews may take 10-15
+**Reasoning models** (`gpt-5.5`, `gpt-5.5-pro`, `o3`, `o4-mini`, etc.): Reviews may take 10-15
 minutes. For deep reviews with reasoning models, combine `--token-budget` with `--model`:
 ```
-/ai-review-local --model gpt-5.4-pro --token-budget 500000 --context deep
+/ai-review-local --model gpt-5.5-pro --token-budget 500000 --context deep
 ```
 
 ## Constraints
@@ -47,7 +47,7 @@ before any data is sent externally.
 ### Step 1: Parse Arguments
 
 Parse `$ARGUMENTS` for the optional flags listed above. All flags are optional —
-the default behavior (standard context, selective registry, gpt-5.4, live API call)
+the default behavior (standard context, selective registry, gpt-5.5, live API call)
 requires no arguments.
 
 ### Step 2: Validate Prerequisites
@@ -334,9 +334,15 @@ python3 .claude/scripts/openai_review.py \
 Note: `--force-fresh` is a skill-only flag — it controls whether delta diffs are
 generated in Step 4 and is NOT passed to the script.
 
-**Reasoning model handling:** If the model contains `-pro` or starts with `o1`/`o3`/`o4`
-(e.g., `gpt-5.4-pro`, `o3`, `o4-mini`):
-- Pass `--timeout 900` to the script (unless the user explicitly specified `--timeout`)
+**Reasoning model handling:** Resolve the effective model first — `effective_model` is
+the value of `--model` if the user provided one, otherwise the script default `gpt-5.5`.
+The `--model`, `--timeout`, and `--dry-run` flags pass through to the script when provided.
+
+If `effective_model` contains `-pro`, starts with `o1`/`o3`/`o4`, or starts with
+`gpt-5.4`/`gpt-5.5` (e.g., `gpt-5.5`, `gpt-5.5-pro`, `o3`, `o4-mini`):
+- The script's `_resolve_timeout()` already auto-selects 900s for these models when
+  `--timeout` is omitted, so no wrapper timeout pass-through is required. (Passing
+  `--timeout 900` explicitly remains harmless and is fine for backward compatibility.)
 - Run the Bash command with `run_in_background: true` (bypasses the 600s Bash tool timeout cap)
 - After the background command completes, continue to Step 6
 
@@ -466,7 +472,7 @@ runs `--force-fresh` or when a rebase invalidates the tracked commit.
 /ai-review-local --model gpt-4.1 --full-registry
 
 # Deep review with reasoning model (may take 10-15 minutes)
-/ai-review-local --model gpt-5.4-pro --token-budget 500000 --context deep
+/ai-review-local --model gpt-5.5-pro --token-budget 500000 --context deep
 
 # Limit token budget for faster/cheaper reviews
 /ai-review-local --token-budget 100000
@@ -501,7 +507,7 @@ runs `--force-fresh` or when a rebase invalidates the tracked commit.
 - **Data transmission**: In non-dry-run mode, this skill transmits the unified diff,
   changed-file metadata, full source file contents (in standard/deep mode),
   import-context files (in deep mode), selected methodology registry text, and
-  prior review context (if present) to OpenAI via the Chat Completions API.
+  prior review context (if present) to OpenAI via the Responses API.
   Use `--dry-run` to preview exactly what would be sent.
 - This skill pairs naturally with the iterative workflow:
   `/ai-review-local` -> address findings -> `/ai-review-local` -> `/submit-pr`
diff --git a/.claude/scripts/openai_review.py b/.claude/scripts/openai_review.py
@@ -853,6 +853,8 @@ def apply_token_budget(
 # Source: https://platform.openai.com/docs/pricing
 # MAINTENANCE: Update when OpenAI changes pricing.
 PRICING = {
+    "gpt-5.5": (5.00, 30.00),
+    "gpt-5.5-pro": (30.00, 180.00),
     "gpt-5.4": (2.50, 15.00),
     "gpt-5.4-pro": (30.00, 180.00),
     "gpt-4.1": (2.00, 8.00),
@@ -900,8 +902,8 @@ def estimate_cost(
         "code changes that have not yet been submitted as a pull request.",
     ),
     (
-        "Review ONLY the changes introduced by this PR (diff)",
-        "Review ONLY the changes shown in the diff below",
+        "Review the changes introduced by this PR (diff)",
+        "Review the code changes shown in the diff below",
     ),
     (
         "If the PR changes an estimator",
@@ -933,6 +935,62 @@ def estimate_cost(
         "Use the branch name only to understand which "
         "methods/papers are intended.",
     ),
+    # Replace the CI Single-Pass Completeness Mandate with a local-mode note.
+    # The CI Mandate instructs the reviewer to run shell greps, load sibling
+    # files, and sweep transitive paths — none of which the local raw API
+    # path can do. Leaving the CI wording in place would cause the model to
+    # claim audits it cannot perform.
+    (
+        """## Single-Pass Completeness Mandate (Initial Review Only)
+
+This is an INITIAL review. Treat this as the only chance to enumerate findings.
+Follow-up rounds are expensive — find ALL P0/P1/P2 issues in this pass.
+
+Before finalizing, confirm you have run each of these audits on the diff:
+
+1. **Sibling-surface mirror audit**: For every fix or change in a method, schema,
+   default-value path, or report block, identify the parallel surface in the same
+   codebase (BR ↔ DR, schema ↔ renderer, default ↔ precomputed, summary ↔ full)
+   and check whether the same change applies there. Flag the unmirrored side as P1.
+
+2. **Pattern-wide grep**: When you flag any anti-pattern or bug class, use `grep`
+   on `diff_diff/**.py` to identify sibling occurrences of the same pattern and
+   enumerate them in the SAME finding. Only LOAD a sibling file's full contents
+   if grep returns a hit and you need surrounding context to verify the issue.
+   Do not defer pattern-class findings to a follow-up round.
+
+3. **Reciprocal/symmetry check**: For dispatch code, validation, or guards in
+   one direction (A-on-B), explicitly enumerate the reciprocal direction (B-on-A)
+   and confirm coverage.
+
+4. **Transitive workflow deps**: For GH Actions workflow `paths:` or pytest
+   selection changes, sweep transitive auto-loaded files (conftest.py,
+   pyproject.toml, ancestor conftests) and confirm they are included.
+
+5. **Scope override (with carve-outs)**: The audits above explicitly authorize
+   loading files outside the diff to verify completeness. This overrides the
+   "minimum surrounding context" default in the Rules section below.
+
+   **DO NOT load these paths** (the workflow's diff-build deliberately excludes
+   them; they are noise or out-of-scope):
+   - `docs/tutorials/*.ipynb` (notebook outputs are large JSON blobs)
+   - `benchmarks/data/real/*.json`
+   - `benchmarks/data/real/*.csv`""",
+        """## Single-Pass Completeness Audit (Local Review)
+
+This is a local review running as a static-prompt API call. You do NOT have
+shell or file-loading access — only the prompt content below is available
+(diff + changed source files + first-level imports).
+
+Find ALL P0/P1/P2 issues within the loaded context. Audit sibling surfaces,
+parallel patterns, and reciprocal directions THAT ARE VISIBLE in the loaded
+files.
+
+Do NOT claim to have run shell greps, loaded sibling files outside the
+prompt, or audited paths not present here. If a relevant audit is impossible
+because the necessary context is not in the prompt, say so explicitly rather
+than asserting completeness.""",
+    ),
 ]
 
 
@@ -1095,15 +1153,30 @@ def compile_prompt(
 # ---------------------------------------------------------------------------
 
 ENDPOINT = "https://api.openai.com/v1/responses"
-DEFAULT_MODEL = "gpt-5.4"
-DEFAULT_TIMEOUT = 300  # seconds
+DEFAULT_MODEL = "gpt-5.5"
+DEFAULT_TIMEOUT = 300  # seconds — non-reasoning models
+REASONING_TIMEOUT = 900  # seconds — reasoning models can take 10-15 min
 DEFAULT_MAX_TOKENS = 16384
 REASONING_MAX_TOKENS = 32768
 
 
 def _is_reasoning_model(model: str) -> bool:
     """Return True for models that use internal chain-of-thought reasoning."""
-    return model.startswith(("o1", "o3", "o4")) or "-pro" in model
+    return model.startswith(("o1", "o3", "o4", "gpt-5.4", "gpt-5.5")) or "-pro" in model
+
+
+def _resolve_timeout(timeout: "int | None", model: str) -> int:
+    """Resolve the effective HTTP timeout for an API call.
+
+    If --timeout was explicitly provided, use it. Otherwise pick
+    REASONING_TIMEOUT (900s) for reasoning models and DEFAULT_TIMEOUT
+    (300s) for standard models. This prevents reasoning-model reviews
+    from hitting a too-short default when the wrapper command does not
+    pass --timeout.
+    """
+    if timeout is not None:
+        return timeout
+    return REASONING_TIMEOUT if _is_reasoning_model(model) else DEFAULT_TIMEOUT
 
 
 def estimate_tokens(text: str) -> int:
@@ -1134,13 +1207,19 @@ def call_openai(
     prompt: str,
     model: str,
     api_key: str,
-    timeout: int = DEFAULT_TIMEOUT,
+    timeout: "int | None" = None,
 ) -> "tuple[str, dict]":
     """Call the OpenAI Responses API.
 
+    If ``timeout`` is None, resolves to REASONING_TIMEOUT (900s) for
+    reasoning models and DEFAULT_TIMEOUT (300s) otherwise — same logic
+    as the CLI ``--timeout`` flag. This guards future direct callers
+    against the old 300s-everywhere default.
+
     Returns (content, usage) where usage is the API response's usage dict
     containing input_tokens and output_tokens.
     """
+    timeout = _resolve_timeout(timeout, model)
     reasoning = _is_reasoning_model(model)
     max_tokens = REASONING_MAX_TOKENS if reasoning else DEFAULT_MAX_TOKENS
 
@@ -1343,8 +1422,12 @@ def main() -> None:
     parser.add_argument(
         "--timeout",
         type=int,
-        default=DEFAULT_TIMEOUT,
-        help=f"HTTP request timeout in seconds (default: {DEFAULT_TIMEOUT})",
+        default=None,
+        help=(
+            f"HTTP request timeout in seconds. If omitted, defaults to "
+            f"{REASONING_TIMEOUT} for reasoning models and {DEFAULT_TIMEOUT} for "
+            f"standard models."
+        ),
     )
     parser.add_argument(
         "--delta-diff",
@@ -1374,6 +1457,10 @@ def main() -> None:
 
     args = parser.parse_args()
 
+    # Resolve --timeout: reasoning models default to REASONING_TIMEOUT (900s),
+    # standard models to DEFAULT_TIMEOUT (300s). Explicit --timeout overrides.
+    args.timeout = _resolve_timeout(args.timeout, args.model)
+
     # Post-parse validation
     if args.context != "minimal" and not args.repo_root:
         parser.error(
@@ -1614,13 +1701,7 @@ def main() -> None:
         sys.exit(0)
 
     # Call OpenAI API
-    if _is_reasoning_model(args.model) and args.timeout == DEFAULT_TIMEOUT:
-        print(
-            f"Note: {args.model} is a reasoning model. Consider --timeout 900 "
-            "for large reviews.",
-            file=sys.stderr,
-        )
-    print(f"Sending review to {args.model}...", file=sys.stderr)
+    print(f"Sending review to {args.model} (timeout: {args.timeout}s)...", file=sys.stderr)
     print(f"Estimated input tokens: ~{est_tokens:,}", file=sys.stderr)
     if cost_str:
         print(f"Estimated cost: {cost_str}", file=sys.stderr)
diff --git a/.github/codex/prompts/pr_review.md b/.github/codex/prompts/pr_review.md
@@ -54,6 +54,42 @@ When reviewing new features or code paths, specifically check:
    - Command to check: `grep -n "pattern" diff_diff/*.py`
    - Flag as P1 if only partial fixes were made
 
+## Single-Pass Completeness Mandate (Initial Review Only)
+
+This is an INITIAL review. Treat this as the only chance to enumerate findings.
+Follow-up rounds are expensive — find ALL P0/P1/P2 issues in this pass.
+
+Before finalizing, confirm you have run each of these audits on the diff:
+
+1. **Sibling-surface mirror audit**: For every fix or change in a method, schema,
+   default-value path, or report block, identify the parallel surface in the same
+   codebase (BR ↔ DR, schema ↔ renderer, default ↔ precomputed, summary ↔ full)
+   and check whether the same change applies there. Flag the unmirrored side as P1.
+
+2. **Pattern-wide grep**: When you flag any anti-pattern or bug class, use `grep`
+   on `diff_diff/**.py` to identify sibling occurrences of the same pattern and
+   enumerate them in the SAME finding. Only LOAD a sibling file's full contents
+   if grep returns a hit and you need surrounding context to verify the issue.
+   Do not defer pattern-class findings to a follow-up round.
+
+3. **Reciprocal/symmetry check**: For dispatch code, validation, or guards in
+   one direction (A-on-B), explicitly enumerate the reciprocal direction (B-on-A)
+   and confirm coverage.
+
+4. **Transitive workflow deps**: For GH Actions workflow `paths:` or pytest
+   selection changes, sweep transitive auto-loaded files (conftest.py,
+   pyproject.toml, ancestor conftests) and confirm they are included.
+
+5. **Scope override (with carve-outs)**: The audits above explicitly authorize
+   loading files outside the diff to verify completeness. This overrides the
+   "minimum surrounding context" default in the Rules section below.
+
+   **DO NOT load these paths** (the workflow's diff-build deliberately excludes
+   them; they are noise or out-of-scope):
+   - `docs/tutorials/*.ipynb` (notebook outputs are large JSON blobs)
+   - `benchmarks/data/real/*.json`
+   - `benchmarks/data/real/*.csv`
+
 ## Deferred Work Acceptance
 
 This project tracks deferred technical debt in `TODO.md` under "Tech Debt from Code Reviews."
@@ -68,7 +104,9 @@ This project tracks deferred technical debt in `TODO.md` under "Tech Debt from C
   and incorrect statistical output are not.
 
 Rules:
-- Review ONLY the changes introduced by this PR (diff) and the minimum surrounding context needed.
+- Review the changes introduced by this PR (diff). The Single-Pass Completeness
+  Mandate above authorizes broader audits (sibling surfaces, pattern-wide greps,
+  reciprocal checks, transitive deps) — do those upfront rather than deferring.
 - Provide a single Markdown report with:
   - Overall assessment (see Assessment Criteria below)
   - Executive summary (3–6 bullets)
@@ -116,7 +154,11 @@ When this is a re-review (the PR has prior AI review comments):
 - New P1+ findings on unchanged code MAY be raised but must be marked "[Newly identified]"
   to distinguish from moving goalposts. Limit these to clear, concrete issues — not
   speculative concerns or stylistic preferences.
-- New code added since the last review IS in scope for new findings.
+- New code added since the last review IS in scope for new findings — apply the
+  Single-Pass Completeness Mandate's audits (sibling surfaces, pattern-wide greps,
+  reciprocal checks) to that new code in this re-review pass. For UNCHANGED code,
+  the existing [Newly identified] convention from the bullet above still applies:
+  new P1+ findings MAY be raised but must be marked "[Newly identified]".
 - If all previous P1+ findings are resolved, the assessment should be ✅ even if new
   P2/P3 items are noticed.
 
diff --git a/.github/workflows/ai_pr_review.yml b/.github/workflows/ai_pr_review.yml
@@ -165,7 +165,7 @@ jobs:
           sandbox: read-only
           safety-strategy: drop-sudo
           # Recommended by OpenAI for review quality/consistency:
-          model: gpt-5.4
+          model: gpt-5.5
           effort: xhigh
 
       - name: Post PR comment (new on /ai-review, update on opened)
diff --git a/tests/test_openai_review.py b/tests/test_openai_review.py