Skip to content

Commit 4e848e2

Browse files
igerberclaude
andcommitted
Add content secret-regex scan + non-dry-run gate test + fix stale wording
R3 review on PR #421 (4th round on the codex-surface security finding): P1: Reviewer wanted the preflight to also catch secrets stored under innocuous filenames (notes.txt with API keys, etc.), not just the filename-pattern matches. Implements the recommended fix: - SECRET_CONTENT_PATTERN: same canonical regex used by the api backend's pre-upload scan in skill-doc Step 3b (AKIA*, ghp_*, sk-*, gho_*, api_key=, secret_key=, password=, token=, bearer *, PRIVATE_KEY). - _scan_sensitive_content(): walks the repo, applies the regex to text-suffix files (_SCAN_CONTENT_SUFFIXES) under 1MB, returns matched paths. - _preflight_codex_secrets(): combines filename + content scans into the single check used by the codex gate. False-positive control: _SCAN_SKIP_CONTENT_PREFIXES skips tests/, test/, __tests__/, .github/, docs/, examples/, fixtures/ from the CONTENT scan only — those locations legitimately contain literal pattern matches as test fixtures, regex definitions, or documented examples (NOT real secrets). Filename scan still applies there, so a real .env in tests/ is still caught. Smoke-tested in this repo: before prefix skip, 4 hits (1 real + 3 false positives in test file + 2 workflow files); after prefix skip, 1 hit (real only). P3: Two doc-consistency drifts fixed: - Removed stale "gitignore-aware" / "git ls-files" wording from test class docstring, --allow-secrets help text, and skill doc bullet (we deliberately do NOT respect .gitignore so we catch gitignored .env files). P3: Added TestMainCodexSecretGate (4 tests) — drives main() WITHOUT --dry-run so the secret gate actually runs: - aborts on .env (filename match) → exit 1, codex NOT called - aborts on AKIA in notes.txt (content match) → exit 1, codex NOT called - --allow-secrets converts ABORT → WARNING, codex IS called - clean repo → no warning, codex called normally Skill doc updated to describe both scan layers and the prefix-skip behavior with rationale. Tests: 230 pass (16 new across content scan + prefix skip + main() gate). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 20e6aab commit 4e848e2

3 files changed

Lines changed: 373 additions & 16 deletions

File tree

.claude/commands/ai-review-local.md

Lines changed: 29 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -38,24 +38,42 @@ Notes:
3838
chooses what to load on its own); the script warns if you pass them.
3939
- **Surface area + abort-by-default**: under the codex backend, Codex can read
4040
any file under the repo root via `--cd`, not just the staged diff. Before
41-
invoking codex, the script runs a recursive sensitive-file scan for patterns
42-
like `.env`, `.env.local`, `id_rsa`, `*.pem`, `*.key`,
43-
`secrets.{yml,yaml,json}`, `.netrc`, `.npmrc`, `.pypirc`. Common safe
44-
variants (`.env.example`, `.env.sample`, `.env.template`) are excluded.
45-
Heavy directories (`.venv`, `node_modules`, `__pycache__`, etc.) are
46-
skipped to avoid vendored test fixtures. The scan does NOT respect
47-
`.gitignore` — gitignored `.env` files are exactly what we want to catch.
48-
If any matches exist, codex is **NOT invoked** and the script exits 1
49-
unless you pass `--allow-secrets` to acknowledge the surface. This is
50-
enforcement, not just a warning.
41+
invoking codex, the script runs a recursive preflight scan with TWO checks:
42+
43+
1. **Filename patterns**: `.env`, `.env.local`, `id_rsa`, `*.pem`, `*.key`,
44+
`secrets.{yml,yaml,json}`, `.netrc`, `.npmrc`, `.pypirc`, etc. Safe
45+
template variants (`.env.example`, `.env.sample`, `.env.template`)
46+
excluded.
47+
2. **Content secret-regex**: same canonical pattern used for the api
48+
backend's pre-upload scan (Step 3b) — catches AWS keys (`AKIA…`),
49+
GitHub tokens (`ghp_…`, `gho_…`), OpenAI keys (`sk-…`), `api_key=`,
50+
`secret_key=`, `password=`, `token=`, bearer tokens, and
51+
`PRIVATE_KEY` strings — applied recursively to repo files. Catches
52+
secrets stored under innocuous filenames like `notes.txt`. Limited
53+
to common text suffixes (`.py`, `.js`, `.yml`, `.json`, `.env`, `.md`,
54+
etc.); files >1MB skipped (likely binaries/generated assets). Common
55+
false-positive directories (`tests/`, `.github/`, `docs/`, `examples/`,
56+
`fixtures/`) are skipped for content scan only — those locations
57+
legitimately contain literal pattern matches as test fixtures, regex
58+
definitions, or doc examples. Filename scan still applies to those
59+
dirs (so a real `.env` checked into `tests/` would still be caught).
60+
61+
Heavy dirs (`.venv`, `node_modules`, `__pycache__`, `.claude`, etc.) are
62+
skipped to avoid vendored test fixtures. Both scans include gitignored
63+
files — gitignored `.env` files are exactly what we want to catch.
64+
65+
If either scan finds matches, codex is **NOT invoked** and the script
66+
exits 1 unless you pass `--allow-secrets` to acknowledge the surface.
67+
This is enforcement, not just a warning.
5168

5269
## Arguments
5370

5471
`$ARGUMENTS` may contain optional flags:
5572
- `--backend {auto,codex,api}`: Reviewer backend (default: `auto`). See above.
5673
- `--allow-secrets`: Codex backend only. By default, the script aborts before
5774
invoking codex if it detects sensitive files anywhere under the repo root
58-
(gitignore-aware recursive scan). Pass this flag to acknowledge and proceed.
75+
(recursive filename + content-regex scan including gitignored files). Pass
76+
this flag to acknowledge and proceed.
5977
- `--context {minimal,standard,deep}`: Context depth (default: `standard`).
6078
*Api backend only.*
6179
- `minimal`: Diff only (original behavior)

.claude/scripts/openai_review.py

Lines changed: 115 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1198,12 +1198,70 @@ def _resolve_timeout(timeout: "int | None", model: str) -> int:
11981198
# examples). Excluded from sensitive-file matches.
11991199
SENSITIVE_FILE_SAFE_SUFFIXES = (".example", ".sample", ".template", ".dist")
12001200

1201-
# Directories to skip during fallback os.walk scan (when not in a git repo).
1201+
# Directories to skip during the recursive scan. The skip set covers heavy
1202+
# vendored/generated dirs that often contain test fixtures matching the
1203+
# sensitive-filename or content patterns; including them would bury real
1204+
# matches in noise without adding signal.
12021205
_SCAN_SKIP_DIRS = frozenset({
12031206
".git", ".venv", "venv", ".tox", ".eggs", ".pytest_cache", ".mypy_cache",
12041207
"node_modules", "__pycache__", "dist", "build", "target",
1208+
".claude", # local review artifacts (.claude/reviews/) + tooling
12051209
})
12061210

1211+
# Path prefixes (repo-relative, forward-slash) skipped by the CONTENT scan
1212+
# only — the filename scan still applies. These directories commonly contain
1213+
# literal pattern matches as test fixtures, regex definitions, or documented
1214+
# examples (NOT real secrets), so blanket-skipping them keeps the false-
1215+
# positive rate manageable. Real secrets in these locations would also be
1216+
# committed to source control, which is a separate problem the user should
1217+
# notice via code review long before our preflight matters.
1218+
_SCAN_SKIP_CONTENT_PREFIXES = (
1219+
"tests/", "test/", "__tests__/",
1220+
".github/",
1221+
"docs/",
1222+
"examples/", "example/",
1223+
"fixtures/",
1224+
)
1225+
1226+
# Canonical secret-content regex — mirrors the patterns in
1227+
# .claude/commands/ai-review-local.md Step 3b (the pre-upload diff scan) so
1228+
# the codex preflight uses the same definition of "secret content" as the
1229+
# api-backend scan. Detects:
1230+
# - AWS access key IDs (AKIA prefix)
1231+
# - GitHub tokens (ghp_, gho_)
1232+
# - OpenAI API keys (sk-)
1233+
# - Common assignment patterns: api_key=, secret_key=, password=, token=
1234+
# - Bearer tokens
1235+
# - PRIVATE_KEY identifiers
1236+
SECRET_CONTENT_PATTERN = re.compile(
1237+
r"AKIA[A-Z0-9]{16}"
1238+
r"|ghp_[A-Za-z0-9]{36}"
1239+
r"|sk-[A-Za-z0-9]{48}"
1240+
r"|gho_[A-Za-z0-9]{36}"
1241+
r"|[Aa][Pp][Ii][_-]?[Kk][Ee][Yy][\t ]*[=:]"
1242+
r"|[Ss][Ee][Cc][Rr][Ee][Tt][_-]?[Kk][Ee][Yy][\t ]*[=:]"
1243+
r"|[Pp][Aa][Ss][Ss][Ww][Oo][Rr][Dd][\t ]*[=:]"
1244+
r"|[Pp][Rr][Ii][Vv][Aa][Tt][Ee][_-]?[Kk][Ee][Yy]"
1245+
r"|[Bb][Ee][Aa][Rr][Ee][Rr][\t ]+[A-Za-z0-9_-]+"
1246+
r"|[Tt][Oo][Kk][Ee][Nn][\t ]*[=:]"
1247+
)
1248+
1249+
# File-size cap for content scan (skip files >1MB — typically binaries or
1250+
# generated assets, not human-authored code where secrets would be).
1251+
_SCAN_MAX_FILE_BYTES = 1_000_000
1252+
1253+
# Suffixes worth content-scanning. Skip binary/asset/generated formats where
1254+
# false positives are common and real secrets are not how-people-store-them.
1255+
_SCAN_CONTENT_SUFFIXES = (
1256+
".py", ".js", ".ts", ".jsx", ".tsx", ".rs", ".go", ".rb", ".java",
1257+
".sh", ".bash", ".zsh", ".fish",
1258+
".yml", ".yaml", ".json", ".toml", ".ini", ".cfg", ".conf", ".config",
1259+
".env", ".envrc",
1260+
".txt", ".md", ".rst",
1261+
".sql", ".graphql",
1262+
".html", ".xml",
1263+
)
1264+
12071265

12081266
def _list_files_for_scan(repo_root: str) -> "list[str]":
12091267
"""Return repo-relative paths of files to scan for sensitive patterns.
@@ -1249,6 +1307,57 @@ def _scan_sensitive_files(repo_root: str) -> "list[str]":
12491307
return sorted(set(found))
12501308

12511309

1310+
def _scan_sensitive_content(repo_root: str) -> "list[str]":
1311+
"""Recursively scan repo file contents for the canonical secret-content
1312+
regex (mirrors `.claude/commands/ai-review-local.md` Step 3b's pattern).
1313+
1314+
Catches secrets stored under innocuous filenames — e.g. an API key in
1315+
`notes.txt` or `config.yml` — that the basename scan in
1316+
`_scan_sensitive_files()` would miss.
1317+
1318+
Scope limits to keep runtime + false-positive count manageable:
1319+
- File suffixes in `_SCAN_CONTENT_SUFFIXES` only (skip binaries / assets)
1320+
- Files > `_SCAN_MAX_FILE_BYTES` are skipped (likely binaries / generated)
1321+
- Same skip-dir set as `_scan_sensitive_files()`
1322+
- Same gitignored-files-included posture (we want to catch `.env`)
1323+
1324+
Returns repo-relative paths of files containing at least one match.
1325+
"""
1326+
found: "list[str]" = []
1327+
for rel_path in _list_files_for_scan(repo_root):
1328+
if not rel_path.endswith(_SCAN_CONTENT_SUFFIXES):
1329+
continue
1330+
# Skip content scan for path prefixes that commonly hold literal
1331+
# pattern matches as fixtures / regex definitions / examples (not
1332+
# real secrets). Filename scan still applies to these dirs.
1333+
normalized = rel_path.replace(os.sep, "/")
1334+
if any(normalized.startswith(p) for p in _SCAN_SKIP_CONTENT_PREFIXES):
1335+
continue
1336+
full_path = os.path.join(repo_root, rel_path)
1337+
try:
1338+
if os.path.getsize(full_path) > _SCAN_MAX_FILE_BYTES:
1339+
continue
1340+
except OSError:
1341+
continue
1342+
try:
1343+
with open(full_path, "r", encoding="utf-8", errors="ignore") as f:
1344+
content = f.read()
1345+
except (OSError, UnicodeDecodeError):
1346+
continue
1347+
if SECRET_CONTENT_PATTERN.search(content):
1348+
found.append(rel_path)
1349+
return sorted(set(found))
1350+
1351+
1352+
def _preflight_codex_secrets(repo_root: str) -> "list[str]":
1353+
"""Combined preflight: returns unique repo-relative paths flagged by
1354+
EITHER the filename scan OR the content scan. Empty list = clean repo,
1355+
safe to invoke codex without `--allow-secrets`."""
1356+
return sorted(set(
1357+
_scan_sensitive_files(repo_root) + _scan_sensitive_content(repo_root)
1358+
))
1359+
1360+
12521361
def _print_sensitive_warning(
12531362
repo_root: str, found: "list[str]", abort: bool
12541363
) -> None:
@@ -1264,7 +1373,8 @@ def _print_sensitive_warning(
12641373
print(f" --cd {repo_root}", file=sys.stderr)
12651374
print(
12661375
f"Detected {len(found)} potentially sensitive file(s) "
1267-
"(recursive scan, includes gitignored files):",
1376+
"(recursive scan: filename patterns + content secret-regex; "
1377+
"includes gitignored files):",
12681378
file=sys.stderr,
12691379
)
12701380
for f in found[:20]:
@@ -1725,7 +1835,8 @@ def main() -> None:
17251835
help=(
17261836
"Codex backend only. By default, the script aborts before "
17271837
"invoking codex if it detects potentially sensitive files in the "
1728-
"repo (.env, *.pem, id_rsa, secrets.*, etc.; gitignore-aware). "
1838+
"repo (.env, *.pem, id_rsa, secrets.*, etc., plus content "
1839+
"secret-regex; recursive scan including gitignored files). "
17291840
"Pass this flag to acknowledge the surface and proceed anyway. "
17301841
"Codex CAN read those files inside its agentic loop (under the "
17311842
"read-only sandbox)."
@@ -2069,7 +2180,7 @@ def main() -> None:
20692180

20702181
if backend == "codex":
20712182
codex_repo_root = args.repo_root or os.getcwd()
2072-
sensitive = _scan_sensitive_files(codex_repo_root)
2183+
sensitive = _preflight_codex_secrets(codex_repo_root)
20732184
if sensitive:
20742185
_print_sensitive_warning(
20752186
codex_repo_root, sensitive, abort=not args.allow_secrets

0 commit comments

Comments
 (0)