Skip to content

Commit 7b6805c

Browse files
authored
Merge pull request #404 from igerber/update-codex-review
Bump AI PR review to gpt-5.5 + add Single-Pass Completeness Mandate
2 parents fe80295 + dacba16 commit 7b6805c

5 files changed

Lines changed: 282 additions & 33 deletions

File tree

.claude/commands/ai-review-local.md

Lines changed: 16 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -22,14 +22,14 @@ pre-PR use. Designed for iterative review/revision cycles before submitting a PR
2222
files (default: 200000). Changed source files are always included regardless of budget.
2323
- `--force-fresh`: Skip delta-diff mode, run a full fresh review even if previous state exists
2424
- `--full-registry`: Include the entire REGISTRY.md instead of selective sections
25-
- `--model <name>`: Override the OpenAI model (default: `gpt-5.4`)
26-
- `--timeout <seconds>`: HTTP request timeout (default: 300). Use 900 for reasoning models.
25+
- `--model <name>`: Override the OpenAI model (default: `gpt-5.5`)
26+
- `--timeout <seconds>`: HTTP request timeout. If omitted, defaults to 900 for reasoning models (gpt-5.4, gpt-5.5, *-pro, o1/o3/o4) and 300 otherwise.
2727
- `--dry-run`: Print the compiled prompt without calling the API
2828

29-
**Reasoning models** (`gpt-5.4-pro`, `o3`, `o4-mini`, etc.): Reviews may take 10-15
29+
**Reasoning models** (`gpt-5.5`, `gpt-5.5-pro`, `o3`, `o4-mini`, etc.): Reviews may take 10-15
3030
minutes. For deep reviews with reasoning models, combine `--token-budget` with `--model`:
3131
```
32-
/ai-review-local --model gpt-5.4-pro --token-budget 500000 --context deep
32+
/ai-review-local --model gpt-5.5-pro --token-budget 500000 --context deep
3333
```
3434

3535
## Constraints
@@ -47,7 +47,7 @@ before any data is sent externally.
4747
### Step 1: Parse Arguments
4848

4949
Parse `$ARGUMENTS` for the optional flags listed above. All flags are optional —
50-
the default behavior (standard context, selective registry, gpt-5.4, live API call)
50+
the default behavior (standard context, selective registry, gpt-5.5, live API call)
5151
requires no arguments.
5252

5353
### Step 2: Validate Prerequisites
@@ -334,9 +334,15 @@ python3 .claude/scripts/openai_review.py \
334334
Note: `--force-fresh` is a skill-only flag — it controls whether delta diffs are
335335
generated in Step 4 and is NOT passed to the script.
336336

337-
**Reasoning model handling:** If the model contains `-pro` or starts with `o1`/`o3`/`o4`
338-
(e.g., `gpt-5.4-pro`, `o3`, `o4-mini`):
339-
- Pass `--timeout 900` to the script (unless the user explicitly specified `--timeout`)
337+
**Reasoning model handling:** Resolve the effective model first — `effective_model` is
338+
the value of `--model` if the user provided one, otherwise the script default `gpt-5.5`.
339+
The `--model`, `--timeout`, and `--dry-run` flags pass through to the script when provided.
340+
341+
If `effective_model` contains `-pro`, starts with `o1`/`o3`/`o4`, or starts with
342+
`gpt-5.4`/`gpt-5.5` (e.g., `gpt-5.5`, `gpt-5.5-pro`, `o3`, `o4-mini`):
343+
- The script's `_resolve_timeout()` already auto-selects 900s for these models when
344+
`--timeout` is omitted, so no wrapper timeout pass-through is required. (Passing
345+
`--timeout 900` explicitly remains harmless and is fine for backward compatibility.)
340346
- Run the Bash command with `run_in_background: true` (bypasses the 600s Bash tool timeout cap)
341347
- After the background command completes, continue to Step 6
342348

@@ -466,7 +472,7 @@ runs `--force-fresh` or when a rebase invalidates the tracked commit.
466472
/ai-review-local --model gpt-4.1 --full-registry
467473

468474
# Deep review with reasoning model (may take 10-15 minutes)
469-
/ai-review-local --model gpt-5.4-pro --token-budget 500000 --context deep
475+
/ai-review-local --model gpt-5.5-pro --token-budget 500000 --context deep
470476

471477
# Limit token budget for faster/cheaper reviews
472478
/ai-review-local --token-budget 100000
@@ -501,7 +507,7 @@ runs `--force-fresh` or when a rebase invalidates the tracked commit.
501507
- **Data transmission**: In non-dry-run mode, this skill transmits the unified diff,
502508
changed-file metadata, full source file contents (in standard/deep mode),
503509
import-context files (in deep mode), selected methodology registry text, and
504-
prior review context (if present) to OpenAI via the Chat Completions API.
510+
prior review context (if present) to OpenAI via the Responses API.
505511
Use `--dry-run` to preview exactly what would be sent.
506512
- This skill pairs naturally with the iterative workflow:
507513
`/ai-review-local` -> address findings -> `/ai-review-local` -> `/submit-pr`

.claude/scripts/openai_review.py

Lines changed: 96 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -853,6 +853,8 @@ def apply_token_budget(
853853
# Source: https://platform.openai.com/docs/pricing
854854
# MAINTENANCE: Update when OpenAI changes pricing.
855855
PRICING = {
856+
"gpt-5.5": (5.00, 30.00),
857+
"gpt-5.5-pro": (30.00, 180.00),
856858
"gpt-5.4": (2.50, 15.00),
857859
"gpt-5.4-pro": (30.00, 180.00),
858860
"gpt-4.1": (2.00, 8.00),
@@ -900,8 +902,8 @@ def estimate_cost(
900902
"code changes that have not yet been submitted as a pull request.",
901903
),
902904
(
903-
"Review ONLY the changes introduced by this PR (diff)",
904-
"Review ONLY the changes shown in the diff below",
905+
"Review the changes introduced by this PR (diff)",
906+
"Review the code changes shown in the diff below",
905907
),
906908
(
907909
"If the PR changes an estimator",
@@ -933,6 +935,62 @@ def estimate_cost(
933935
"Use the branch name only to understand which "
934936
"methods/papers are intended.",
935937
),
938+
# Replace the CI Single-Pass Completeness Mandate with a local-mode note.
939+
# The CI Mandate instructs the reviewer to run shell greps, load sibling
940+
# files, and sweep transitive paths — none of which the local raw API
941+
# path can do. Leaving the CI wording in place would cause the model to
942+
# claim audits it cannot perform.
943+
(
944+
"""## Single-Pass Completeness Mandate (Initial Review Only)
945+
946+
This is an INITIAL review. Treat this as the only chance to enumerate findings.
947+
Follow-up rounds are expensive — find ALL P0/P1/P2 issues in this pass.
948+
949+
Before finalizing, confirm you have run each of these audits on the diff:
950+
951+
1. **Sibling-surface mirror audit**: For every fix or change in a method, schema,
952+
default-value path, or report block, identify the parallel surface in the same
953+
codebase (BR ↔ DR, schema ↔ renderer, default ↔ precomputed, summary ↔ full)
954+
and check whether the same change applies there. Flag the unmirrored side as P1.
955+
956+
2. **Pattern-wide grep**: When you flag any anti-pattern or bug class, use `grep`
957+
on `diff_diff/**.py` to identify sibling occurrences of the same pattern and
958+
enumerate them in the SAME finding. Only LOAD a sibling file's full contents
959+
if grep returns a hit and you need surrounding context to verify the issue.
960+
Do not defer pattern-class findings to a follow-up round.
961+
962+
3. **Reciprocal/symmetry check**: For dispatch code, validation, or guards in
963+
one direction (A-on-B), explicitly enumerate the reciprocal direction (B-on-A)
964+
and confirm coverage.
965+
966+
4. **Transitive workflow deps**: For GH Actions workflow `paths:` or pytest
967+
selection changes, sweep transitive auto-loaded files (conftest.py,
968+
pyproject.toml, ancestor conftests) and confirm they are included.
969+
970+
5. **Scope override (with carve-outs)**: The audits above explicitly authorize
971+
loading files outside the diff to verify completeness. This overrides the
972+
"minimum surrounding context" default in the Rules section below.
973+
974+
**DO NOT load these paths** (the workflow's diff-build deliberately excludes
975+
them; they are noise or out-of-scope):
976+
- `docs/tutorials/*.ipynb` (notebook outputs are large JSON blobs)
977+
- `benchmarks/data/real/*.json`
978+
- `benchmarks/data/real/*.csv`""",
979+
"""## Single-Pass Completeness Audit (Local Review)
980+
981+
This is a local review running as a static-prompt API call. You do NOT have
982+
shell or file-loading access — only the prompt content below is available
983+
(diff + changed source files + first-level imports).
984+
985+
Find ALL P0/P1/P2 issues within the loaded context. Audit sibling surfaces,
986+
parallel patterns, and reciprocal directions THAT ARE VISIBLE in the loaded
987+
files.
988+
989+
Do NOT claim to have run shell greps, loaded sibling files outside the
990+
prompt, or audited paths not present here. If a relevant audit is impossible
991+
because the necessary context is not in the prompt, say so explicitly rather
992+
than asserting completeness.""",
993+
),
936994
]
937995

938996

@@ -1095,15 +1153,30 @@ def compile_prompt(
10951153
# ---------------------------------------------------------------------------
10961154

10971155
ENDPOINT = "https://api.openai.com/v1/responses"
1098-
DEFAULT_MODEL = "gpt-5.4"
1099-
DEFAULT_TIMEOUT = 300 # seconds
1156+
DEFAULT_MODEL = "gpt-5.5"
1157+
DEFAULT_TIMEOUT = 300 # seconds — non-reasoning models
1158+
REASONING_TIMEOUT = 900 # seconds — reasoning models can take 10-15 min
11001159
DEFAULT_MAX_TOKENS = 16384
11011160
REASONING_MAX_TOKENS = 32768
11021161

11031162

11041163
def _is_reasoning_model(model: str) -> bool:
11051164
"""Return True for models that use internal chain-of-thought reasoning."""
1106-
return model.startswith(("o1", "o3", "o4")) or "-pro" in model
1165+
return model.startswith(("o1", "o3", "o4", "gpt-5.4", "gpt-5.5")) or "-pro" in model
1166+
1167+
1168+
def _resolve_timeout(timeout: "int | None", model: str) -> int:
1169+
"""Resolve the effective HTTP timeout for an API call.
1170+
1171+
If --timeout was explicitly provided, use it. Otherwise pick
1172+
REASONING_TIMEOUT (900s) for reasoning models and DEFAULT_TIMEOUT
1173+
(300s) for standard models. This prevents reasoning-model reviews
1174+
from hitting a too-short default when the wrapper command does not
1175+
pass --timeout.
1176+
"""
1177+
if timeout is not None:
1178+
return timeout
1179+
return REASONING_TIMEOUT if _is_reasoning_model(model) else DEFAULT_TIMEOUT
11071180

11081181

11091182
def estimate_tokens(text: str) -> int:
@@ -1134,13 +1207,19 @@ def call_openai(
11341207
prompt: str,
11351208
model: str,
11361209
api_key: str,
1137-
timeout: int = DEFAULT_TIMEOUT,
1210+
timeout: "int | None" = None,
11381211
) -> "tuple[str, dict]":
11391212
"""Call the OpenAI Responses API.
11401213
1214+
If ``timeout`` is None, resolves to REASONING_TIMEOUT (900s) for
1215+
reasoning models and DEFAULT_TIMEOUT (300s) otherwise — same logic
1216+
as the CLI ``--timeout`` flag. This guards future direct callers
1217+
against the old 300s-everywhere default.
1218+
11411219
Returns (content, usage) where usage is the API response's usage dict
11421220
containing input_tokens and output_tokens.
11431221
"""
1222+
timeout = _resolve_timeout(timeout, model)
11441223
reasoning = _is_reasoning_model(model)
11451224
max_tokens = REASONING_MAX_TOKENS if reasoning else DEFAULT_MAX_TOKENS
11461225

@@ -1343,8 +1422,12 @@ def main() -> None:
13431422
parser.add_argument(
13441423
"--timeout",
13451424
type=int,
1346-
default=DEFAULT_TIMEOUT,
1347-
help=f"HTTP request timeout in seconds (default: {DEFAULT_TIMEOUT})",
1425+
default=None,
1426+
help=(
1427+
f"HTTP request timeout in seconds. If omitted, defaults to "
1428+
f"{REASONING_TIMEOUT} for reasoning models and {DEFAULT_TIMEOUT} for "
1429+
f"standard models."
1430+
),
13481431
)
13491432
parser.add_argument(
13501433
"--delta-diff",
@@ -1374,6 +1457,10 @@ def main() -> None:
13741457

13751458
args = parser.parse_args()
13761459

1460+
# Resolve --timeout: reasoning models default to REASONING_TIMEOUT (900s),
1461+
# standard models to DEFAULT_TIMEOUT (300s). Explicit --timeout overrides.
1462+
args.timeout = _resolve_timeout(args.timeout, args.model)
1463+
13771464
# Post-parse validation
13781465
if args.context != "minimal" and not args.repo_root:
13791466
parser.error(
@@ -1614,13 +1701,7 @@ def main() -> None:
16141701
sys.exit(0)
16151702

16161703
# Call OpenAI API
1617-
if _is_reasoning_model(args.model) and args.timeout == DEFAULT_TIMEOUT:
1618-
print(
1619-
f"Note: {args.model} is a reasoning model. Consider --timeout 900 "
1620-
"for large reviews.",
1621-
file=sys.stderr,
1622-
)
1623-
print(f"Sending review to {args.model}...", file=sys.stderr)
1704+
print(f"Sending review to {args.model} (timeout: {args.timeout}s)...", file=sys.stderr)
16241705
print(f"Estimated input tokens: ~{est_tokens:,}", file=sys.stderr)
16251706
if cost_str:
16261707
print(f"Estimated cost: {cost_str}", file=sys.stderr)

.github/codex/prompts/pr_review.md

Lines changed: 44 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -54,6 +54,42 @@ When reviewing new features or code paths, specifically check:
5454
- Command to check: `grep -n "pattern" diff_diff/*.py`
5555
- Flag as P1 if only partial fixes were made
5656

57+
## Single-Pass Completeness Mandate (Initial Review Only)
58+
59+
This is an INITIAL review. Treat this as the only chance to enumerate findings.
60+
Follow-up rounds are expensive — find ALL P0/P1/P2 issues in this pass.
61+
62+
Before finalizing, confirm you have run each of these audits on the diff:
63+
64+
1. **Sibling-surface mirror audit**: For every fix or change in a method, schema,
65+
default-value path, or report block, identify the parallel surface in the same
66+
codebase (BR ↔ DR, schema ↔ renderer, default ↔ precomputed, summary ↔ full)
67+
and check whether the same change applies there. Flag the unmirrored side as P1.
68+
69+
2. **Pattern-wide grep**: When you flag any anti-pattern or bug class, use `grep`
70+
on `diff_diff/**.py` to identify sibling occurrences of the same pattern and
71+
enumerate them in the SAME finding. Only LOAD a sibling file's full contents
72+
if grep returns a hit and you need surrounding context to verify the issue.
73+
Do not defer pattern-class findings to a follow-up round.
74+
75+
3. **Reciprocal/symmetry check**: For dispatch code, validation, or guards in
76+
one direction (A-on-B), explicitly enumerate the reciprocal direction (B-on-A)
77+
and confirm coverage.
78+
79+
4. **Transitive workflow deps**: For GH Actions workflow `paths:` or pytest
80+
selection changes, sweep transitive auto-loaded files (conftest.py,
81+
pyproject.toml, ancestor conftests) and confirm they are included.
82+
83+
5. **Scope override (with carve-outs)**: The audits above explicitly authorize
84+
loading files outside the diff to verify completeness. This overrides the
85+
"minimum surrounding context" default in the Rules section below.
86+
87+
**DO NOT load these paths** (the workflow's diff-build deliberately excludes
88+
them; they are noise or out-of-scope):
89+
- `docs/tutorials/*.ipynb` (notebook outputs are large JSON blobs)
90+
- `benchmarks/data/real/*.json`
91+
- `benchmarks/data/real/*.csv`
92+
5793
## Deferred Work Acceptance
5894

5995
This project tracks deferred technical debt in `TODO.md` under "Tech Debt from Code Reviews."
@@ -68,7 +104,9 @@ This project tracks deferred technical debt in `TODO.md` under "Tech Debt from C
68104
and incorrect statistical output are not.
69105

70106
Rules:
71-
- Review ONLY the changes introduced by this PR (diff) and the minimum surrounding context needed.
107+
- Review the changes introduced by this PR (diff). The Single-Pass Completeness
108+
Mandate above authorizes broader audits (sibling surfaces, pattern-wide greps,
109+
reciprocal checks, transitive deps) — do those upfront rather than deferring.
72110
- Provide a single Markdown report with:
73111
- Overall assessment (see Assessment Criteria below)
74112
- Executive summary (3–6 bullets)
@@ -116,7 +154,11 @@ When this is a re-review (the PR has prior AI review comments):
116154
- New P1+ findings on unchanged code MAY be raised but must be marked "[Newly identified]"
117155
to distinguish from moving goalposts. Limit these to clear, concrete issues — not
118156
speculative concerns or stylistic preferences.
119-
- New code added since the last review IS in scope for new findings.
157+
- New code added since the last review IS in scope for new findings — apply the
158+
Single-Pass Completeness Mandate's audits (sibling surfaces, pattern-wide greps,
159+
reciprocal checks) to that new code in this re-review pass. For UNCHANGED code,
160+
the existing [Newly identified] convention from the bullet above still applies:
161+
new P1+ findings MAY be raised but must be marked "[Newly identified]".
120162
- If all previous P1+ findings are resolved, the assessment should be ✅ even if new
121163
P2/P3 items are noticed.
122164

.github/workflows/ai_pr_review.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -165,7 +165,7 @@ jobs:
165165
sandbox: read-only
166166
safety-strategy: drop-sudo
167167
# Recommended by OpenAI for review quality/consistency:
168-
model: gpt-5.4
168+
model: gpt-5.5
169169
effort: xhigh
170170

171171
- name: Post PR comment (new on /ai-review, update on opened)

0 commit comments

Comments
 (0)