Add content secret-regex scan + non-dry-run gate test + fix stale wording

igerber · claude · igerber · commit 4e848e2359ed · 2026-05-12T21:21:31.000-04:00
R3 review on PR #421 (4th round on the codex-surface security finding):

P1: Reviewer wanted the preflight to also catch secrets stored under
innocuous filenames (notes.txt with API keys, etc.), not just the
filename-pattern matches. Implements the recommended fix:

- SECRET_CONTENT_PATTERN: same canonical regex used by the api backend's
  pre-upload scan in skill-doc Step 3b (AKIA*, ghp_*, sk-*, gho_*,
  api_key=, secret_key=, password=, token=, bearer *, PRIVATE_KEY).
- _scan_sensitive_content(): walks the repo, applies the regex to
  text-suffix files (_SCAN_CONTENT_SUFFIXES) of at most 1MB, returns
  matched paths.
- _preflight_codex_secrets(): combines filename + content scans into the
  single check used by the codex gate.

False-positive control: _SCAN_SKIP_CONTENT_PREFIXES skips tests/, test/,
__tests__/, .github/, docs/, examples/, example/, fixtures/ from the
CONTENT scan only. Those locations legitimately contain literal pattern
matches as test fixtures, regex definitions, or documented examples (NOT
real secrets). Filename scan still applies there, so a real .env in
tests/ is still caught.

Smoke-tested in this repo: before prefix skip, 4 hits (1 real + 3 false
positives: 1 test file + 2 workflow files); after prefix skip, 1 hit
(real only).

P3: Two doc-consistency drifts fixed:

- Removed stale "gitignore-aware" / "git ls-files" wording from test
  class docstring, --allow-secrets help text, and skill doc bullet (we
  deliberately do NOT respect .gitignore so we catch gitignored .env
  files).

P3: Added TestMainCodexSecretGate (4 tests), which drives main() WITHOUT
--dry-run so the secret gate actually runs:

- aborts on .env (filename match) → exit 1, codex NOT called
- aborts on AKIA in notes.txt (content match) → exit 1, codex NOT called
- --allow-secrets converts ABORT → WARNING, codex IS called
- clean repo → no warning, codex called normally

Skill doc updated to describe both scan layers and the prefix-skip
behavior with rationale.
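
For illustration only (this is not the code in this commit), a minimal sketch of how the filename layer and the content layer interact with the prefix skip. It assumes a reduced subset of the regex, a filename layer that only knows `.env`, and an in-memory mapping of paths to contents instead of a real directory walk. The names `SECRET_RE`, `SKIP_CONTENT_PREFIXES`, and `flag_paths` are hypothetical stand-ins for `SECRET_CONTENT_PATTERN`, `_SCAN_SKIP_CONTENT_PREFIXES`, and `_preflight_codex_secrets()`.

```python
import re

# Hypothetical, reduced stand-in for SECRET_CONTENT_PATTERN (subset only).
SECRET_RE = re.compile(
    r"AKIA[A-Z0-9]{16}"
    r"|ghp_[A-Za-z0-9]{36}"
    r"|[Pp][Aa][Ss][Ss][Ww][Oo][Rr][Dd][\t ]*[=:]"
)

# Hypothetical subset of _SCAN_SKIP_CONTENT_PREFIXES (repo-relative paths).
SKIP_CONTENT_PREFIXES = ("tests/", ".github/", "docs/")


def flag_paths(files: "dict[str, str]") -> "list[str]":
    """Return sorted paths flagged by the filename layer OR the content layer.

    `files` maps repo-relative, forward-slash paths to text content
    (an assumption made so the sketch runs without touching disk).
    """
    flagged = set()
    for path, content in files.items():
        basename = path.rsplit("/", 1)[-1]
        # Filename layer: applies everywhere, including skipped prefixes.
        if basename == ".env":
            flagged.add(path)
            continue
        # Content layer: skipped under prefixes that usually hold fixtures.
        if path.startswith(SKIP_CONTENT_PREFIXES):
            continue
        if SECRET_RE.search(content):
            flagged.add(path)
    return sorted(flagged)


demo = {
    "notes.txt": "key: AKIAABCDEFGHIJKLMNOP",              # content match
    "tests/test_scan.py": 'sample = "AKIAABCDEFGHIJKLMNOP"',  # skipped prefix
    "tests/.env": "DEBUG=1",                                # filename match
    "README.md": "nothing sensitive here",                  # clean
}
flagged = flag_paths(demo)
print(flagged)
```

Here `flagged` is `["notes.txt", "tests/.env"]`: the fixture under `tests/` is ignored by the content layer, while the `.env` under `tests/` is still caught by the filename layer.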

Tests: 230 pass (16 new across content scan + prefix skip + main() gate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/.claude/commands/ai-review-local.md b/.claude/commands/ai-review-local.md
@@ -38,24 +38,42 @@ Notes:
   chooses what to load on its own); the script warns if you pass them.
 - **Surface area + abort-by-default**: under the codex backend, Codex can read
   any file under the repo root via `--cd`, not just the staged diff. Before
-  invoking codex, the script runs a recursive sensitive-file scan for patterns
-  like `.env`, `.env.local`, `id_rsa`, `*.pem`, `*.key`,
-  `secrets.{yml,yaml,json}`, `.netrc`, `.npmrc`, `.pypirc`. Common safe
-  variants (`.env.example`, `.env.sample`, `.env.template`) are excluded.
-  Heavy directories (`.venv`, `node_modules`, `__pycache__`, etc.) are
-  skipped to avoid vendored test fixtures. The scan does NOT respect
-  `.gitignore` — gitignored `.env` files are exactly what we want to catch.
-  If any matches exist, codex is **NOT invoked** and the script exits 1
-  unless you pass `--allow-secrets` to acknowledge the surface. This is
-  enforcement, not just a warning.
+  invoking codex, the script runs a recursive preflight scan with TWO checks:
+
+  1. **Filename patterns**: `.env`, `.env.local`, `id_rsa`, `*.pem`, `*.key`,
+     `secrets.{yml,yaml,json}`, `.netrc`, `.npmrc`, `.pypirc`, etc. Safe
+     template variants (`.env.example`, `.env.sample`, `.env.template`)
+     excluded.
+  2. **Content secret-regex**: same canonical pattern used for the api
+     backend's pre-upload scan (Step 3b) — catches AWS keys (`AKIA…`),
+     GitHub tokens (`ghp_…`, `gho_…`), OpenAI keys (`sk-…`), `api_key=`,
+     `secret_key=`, `password=`, `token=`, bearer tokens, and
+     `PRIVATE_KEY` strings — applied recursively to repo files. Catches
+     secrets stored under innocuous filenames like `notes.txt`. Limited
+     to common text suffixes (`.py`, `.js`, `.yml`, `.json`, `.env`, `.md`,
+     etc.); files >1MB skipped (likely binaries/generated assets). Common
+     false-positive directories (`tests/`, `.github/`, `docs/`, `examples/`,
+     `fixtures/`) are skipped for content scan only — those locations
+     legitimately contain literal pattern matches as test fixtures, regex
+     definitions, or doc examples. Filename scan still applies to those
+     dirs (so a real `.env` checked into `tests/` would still be caught).
+
+  Heavy dirs (`.venv`, `node_modules`, `__pycache__`, `.claude`, etc.) are
+  skipped to avoid vendored test fixtures. Both scans include gitignored
+  files — gitignored `.env` files are exactly what we want to catch.
+
+  If either scan finds matches, codex is **NOT invoked** and the script
+  exits 1 unless you pass `--allow-secrets` to acknowledge the surface.
+  This is enforcement, not just a warning.
 
 ## Arguments
 
 `$ARGUMENTS` may contain optional flags:
 - `--backend {auto,codex,api}`: Reviewer backend (default: `auto`). See above.
 - `--allow-secrets`: Codex backend only. By default, the script aborts before
   invoking codex if it detects sensitive files anywhere under the repo root
-  (gitignore-aware recursive scan). Pass this flag to acknowledge and proceed.
+  (recursive filename + content-regex scan including gitignored files). Pass
+  this flag to acknowledge and proceed.
 - `--context {minimal,standard,deep}`: Context depth (default: `standard`).
   *Api backend only.*
   - `minimal`: Diff only (original behavior)
diff --git a/.claude/scripts/openai_review.py b/.claude/scripts/openai_review.py
@@ -1198,12 +1198,70 @@ def _resolve_timeout(timeout: "int | None", model: str) -> int:
 # examples). Excluded from sensitive-file matches.
 SENSITIVE_FILE_SAFE_SUFFIXES = (".example", ".sample", ".template", ".dist")
 
-# Directories to skip during fallback os.walk scan (when not in a git repo).
+# Directories to skip during the recursive scan. The skip set covers heavy
+# vendored/generated dirs that often contain test fixtures matching the
+# sensitive-filename or content patterns; including them would bury real
+# matches in noise without adding signal.
 _SCAN_SKIP_DIRS = frozenset({
     ".git", ".venv", "venv", ".tox", ".eggs", ".pytest_cache", ".mypy_cache",
     "node_modules", "__pycache__", "dist", "build", "target",
+    ".claude",  # local review artifacts (.claude/reviews/) + tooling
 })
 
+# Path prefixes (repo-relative, forward-slash) skipped by the CONTENT scan
+# only — the filename scan still applies. These directories commonly contain
+# literal pattern matches as test fixtures, regex definitions, or documented
+# examples (NOT real secrets), so blanket-skipping them keeps the false-
+# positive rate manageable. Real secrets in these locations would also be
+# committed to source control, which is a separate problem the user should
+# notice via code review long before our preflight matters.
+_SCAN_SKIP_CONTENT_PREFIXES = (
+    "tests/", "test/", "__tests__/",
+    ".github/",
+    "docs/",
+    "examples/", "example/",
+    "fixtures/",
+)
+
+# Canonical secret-content regex — mirrors the patterns in
+# .claude/commands/ai-review-local.md Step 3b (the pre-upload diff scan) so
+# the codex preflight uses the same definition of "secret content" as the
+# api-backend scan. Detects:
+#   - AWS access key IDs (AKIA prefix)
+#   - GitHub tokens (ghp_, gho_)
+#   - OpenAI API keys (sk-)
+#   - Common assignment patterns: api_key=, secret_key=, password=, token=
+#   - Bearer tokens
+#   - PRIVATE_KEY identifiers
+SECRET_CONTENT_PATTERN = re.compile(
+    r"AKIA[A-Z0-9]{16}"
+    r"|ghp_[A-Za-z0-9]{36}"
+    r"|sk-[A-Za-z0-9]{48}"
+    r"|gho_[A-Za-z0-9]{36}"
+    r"|[Aa][Pp][Ii][_-]?[Kk][Ee][Yy][\t ]*[=:]"
+    r"|[Ss][Ee][Cc][Rr][Ee][Tt][_-]?[Kk][Ee][Yy][\t ]*[=:]"
+    r"|[Pp][Aa][Ss][Ss][Ww][Oo][Rr][Dd][\t ]*[=:]"
+    r"|[Pp][Rr][Ii][Vv][Aa][Tt][Ee][_-]?[Kk][Ee][Yy]"
+    r"|[Bb][Ee][Aa][Rr][Ee][Rr][\t ]+[A-Za-z0-9_-]+"
+    r"|[Tt][Oo][Kk][Ee][Nn][\t ]*[=:]"
+)
+
+# File-size cap for content scan (skip files >1MB — typically binaries or
+# generated assets, not human-authored code where secrets would be).
+_SCAN_MAX_FILE_BYTES = 1_000_000
+
+# Suffixes worth content-scanning. Skip binary/asset/generated formats where
+# false positives are common and real secrets are rarely stored.
+_SCAN_CONTENT_SUFFIXES = (
+    ".py", ".js", ".ts", ".jsx", ".tsx", ".rs", ".go", ".rb", ".java",
+    ".sh", ".bash", ".zsh", ".fish",
+    ".yml", ".yaml", ".json", ".toml", ".ini", ".cfg", ".conf", ".config",
+    ".env", ".envrc",
+    ".txt", ".md", ".rst",
+    ".sql", ".graphql",
+    ".html", ".xml",
+)
+
 
 def _list_files_for_scan(repo_root: str) -> "list[str]":
     """Return repo-relative paths of files to scan for sensitive patterns.
@@ -1249,6 +1307,57 @@ def _scan_sensitive_files(repo_root: str) -> "list[str]":
     return sorted(set(found))
 
 
+def _scan_sensitive_content(repo_root: str) -> "list[str]":
+    """Recursively scan repo file contents for the canonical secret-content
+    regex (mirrors `.claude/commands/ai-review-local.md` Step 3b's pattern).
+
+    Catches secrets stored under innocuous filenames — e.g. an API key in
+    `notes.txt` or `config.yml` — that the basename scan in
+    `_scan_sensitive_files()` would miss.
+
+    Scope limits to keep runtime + false-positive count manageable:
+      - File suffixes in `_SCAN_CONTENT_SUFFIXES` only (skip binaries / assets)
+      - Files > `_SCAN_MAX_FILE_BYTES` are skipped (likely binaries / generated)
+      - Same skip-dir set as `_scan_sensitive_files()`
+      - Same gitignored-files-included posture (we want to catch `.env`)
+
+    Returns repo-relative paths of files containing at least one match.
+    """
+    found: "list[str]" = []
+    for rel_path in _list_files_for_scan(repo_root):
+        if not rel_path.endswith(_SCAN_CONTENT_SUFFIXES):
+            continue
+        # Skip content scan for path prefixes that commonly hold literal
+        # pattern matches as fixtures / regex definitions / examples (not
+        # real secrets). Filename scan still applies to these dirs.
+        normalized = rel_path.replace(os.sep, "/")
+        if any(normalized.startswith(p) for p in _SCAN_SKIP_CONTENT_PREFIXES):
+            continue
+        full_path = os.path.join(repo_root, rel_path)
+        try:
+            if os.path.getsize(full_path) > _SCAN_MAX_FILE_BYTES:
+                continue
+        except OSError:
+            continue
+        try:
+            with open(full_path, "r", encoding="utf-8", errors="ignore") as f:
+                content = f.read()
+        except (OSError, UnicodeDecodeError):
+            continue
+        if SECRET_CONTENT_PATTERN.search(content):
+            found.append(rel_path)
+    return sorted(set(found))
+
+
+def _preflight_codex_secrets(repo_root: str) -> "list[str]":
+    """Combined preflight: returns unique repo-relative paths flagged by
+    EITHER the filename scan OR the content scan. Empty list = clean repo,
+    safe to invoke codex without `--allow-secrets`."""
+    return sorted(set(
+        _scan_sensitive_files(repo_root) + _scan_sensitive_content(repo_root)
+    ))
+
+
 def _print_sensitive_warning(
     repo_root: str, found: "list[str]", abort: bool
 ) -> None:
@@ -1264,7 +1373,8 @@ def _print_sensitive_warning(
     print(f"  --cd {repo_root}", file=sys.stderr)
     print(
         f"Detected {len(found)} potentially sensitive file(s) "
-        "(recursive scan, includes gitignored files):",
+        "(recursive scan: filename patterns + content secret-regex; "
+        "includes gitignored files):",
         file=sys.stderr,
     )
     for f in found[:20]:
@@ -1725,7 +1835,8 @@ def main() -> None:
         help=(
             "Codex backend only. By default, the script aborts before "
             "invoking codex if it detects potentially sensitive files in the "
-            "repo (.env, *.pem, id_rsa, secrets.*, etc.; gitignore-aware). "
+            "repo (.env, *.pem, id_rsa, secrets.*, etc., plus content "
+            "secret-regex; recursive scan including gitignored files). "
             "Pass this flag to acknowledge the surface and proceed anyway. "
             "Codex CAN read those files inside its agentic loop (under the "
             "read-only sandbox)."
@@ -2069,7 +2180,7 @@ def main() -> None:
 
     if backend == "codex":
         codex_repo_root = args.repo_root or os.getcwd()
-        sensitive = _scan_sensitive_files(codex_repo_root)
+        sensitive = _preflight_codex_secrets(codex_repo_root)
         if sensitive:
             _print_sensitive_warning(
                 codex_repo_root, sensitive, abort=not args.allow_secrets
diff --git a/tests/test_openai_review.py b/tests/test_openai_review.py