ai-semantic-source-scanner (`ai-codescan`)

AI-driven SAST pipeline for JavaScript / TypeScript / Python / Java / Kotlin / Go / Ruby / PHP / C# / Bash / YAML / HTML codebases. Combines deterministic static analysis (AST, SCIP, CodeQL, Semgrep, optional Joern) with LLM agents that nominate bug candidates, write per-finding reports, and validate exploits in a hardened Docker sandbox.

Five-stage pipeline (prep, nominate, analyze, validate, report) with human-in-the-loop gates after the nominate, analyze, and validate stages. Swappable LLM provider (claude, gemini, or codex).

prep ─→ nominate ─→ Gate 1 ─→ analyze ─→ Gate 2 ─→ validate ─→ Gate 3 ─→ report

What it does

prep — snapshots the target, runs ts-morph + parse5 + tree-sitter for AST, builds a SCIP cross-reference index, runs CodeQL (and optionally Semgrep / Joern), ingests every flow into a per-target DuckDB.
nominate — wide-pass LLM proposes candidate bugs in three streams (CodeQL-traced, AI-discovered, model-extension proposals) with a one-line plain-English summary per item.
gate-1 / -2 / -3 — open the queue file in $EDITOR, mark y/n: per item, or --yes to accept all.
analyze — per-finding sub-agent reads a minimal flow slice, walks evidence, writes findings/<id>.md.
validate — per-finding PoC writer + sandboxed Docker runner; flips status to verified / rejected / poc_inconclusive.
report — renders verified findings as bug-bounty-ready markdown with CWE→severity inference.
visualize — Graphviz DOT/SVG/PNG of flows.
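
The stage and gate ordering listed above can be sketched in a few lines. This is only an illustration of the control flow, assuming a declined gate stops the run; `STEPS` and `run_pipeline` are hypothetical names, not the project's actual API.

```python
from typing import Callable

# Order taken from this README's stage list and quick start.
# Names are illustrative, not the project's real identifiers.
STEPS = [
    "prep", "nominate", "gate-1",
    "analyze", "gate-2",
    "validate", "gate-3",
    "report",
]


def run_pipeline(approve: Callable[[str], bool]) -> list[str]:
    """Walk the steps in order and stop at the first gate that is declined."""
    completed: list[str] = []
    for step in STEPS:
        if step.startswith("gate-") and not approve(step):
            break  # a human declined: later stages never run
        completed.append(step)
    return completed


# Equivalent of passing --yes at every gate: everything runs.
print(run_pipeline(lambda gate: True))
# Declining gate-2 stops the run after analyze.
print(run_pipeline(lambda gate: gate != "gate-2"))
```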

Requirements

Required	Why
`uv`	Python 3.13 venv + dep manager
`node` ≥ 22 + `pnpm`	AST worker (ts-morph + parse5 + tree-sitter)
`git`	Snapshots use `git worktree` for git targets

Optional	What you lose without it
`codeql` CLI 2.25+	The default static engine. Without it, `--engine llm-heavy` works as a fallback
`scip-typescript`	SCIP cross-reference symbol IDs (`npm i -g @sourcegraph/scip-typescript`)
`docker`	`validate` PoC sandbox. Falls back to local execution with `--no-sandbox`
`dot` (graphviz)	`visualize --fmt svg`
`claude` / `gemini` / `codex`	Pick whichever LLM CLI you want; at least one must be on PATH
Joern	Adds a third engine to `--engine hybrid` (deferred — see Install)
Semgrep	Auto-installed via `uv sync` (no separate install)

Install

The installer is interactive. It will:

Verify required tools are present.
Set up the Python venv (uv venv, uv sync --all-groups).
Install Node worker deps (pnpm install inside ai_codescan/ast/node_worker/).
Report which optional integrations are present / missing.
Prompt explicitly about installing Joern (~1.5 GB) — defaults to "no".
Optionally copy bundled Claude Code skills into ~/.claude/skills/.
Run the test suite to verify.

git clone git@github.com:pwnpanda/ai-semantic-source-scanner.git
cd ai-semantic-source-scanner
bash scripts/install.sh

Non-interactive install (CI, dotfile bootstrap):

AICS_NONINTERACTIVE=1 AICS_INSTALL_JOERN=no bash scripts/install.sh
# or to force-install Joern without prompting:
AICS_NONINTERACTIVE=1 AICS_INSTALL_JOERN=yes bash scripts/install.sh

After install, the binary is available as uv run ai-codescan (or activate the venv with source .venv/bin/activate and use ai-codescan directly).

Quick start

# Clone a sample target (substitute your own repository for a real scan)
git clone --depth 1 https://github.com/expressjs/express /tmp/express

# End-to-end Phase 1: prep + nominate + gate-1 with auto-accept
uv run ai-codescan run /tmp/express --target-bug-class injection --yes

# Continue through Phase 2: deep analysis → validate → report
uv run ai-codescan analyze
uv run ai-codescan gate-2 --yes
uv run ai-codescan validate
uv run ai-codescan gate-3 --yes
uv run ai-codescan report --report-dir ./report

Python target example

The same workflow drives Python projects; the language is auto-detected from pyproject.toml / setup.py / setup.cfg / requirements.txt. Running prep with --engine hybrid on a Flask/FastAPI/Django service routes it through CodeQL Python (python-security-extended.qls), Semgrep (--config=auto), Joern's pysrc2cpg frontend (when Joern is installed), and tree-sitter-python AST extraction. Optional SCIP indexing via scip-python (npm i -g @sourcegraph/scip-python) adds cross-file symbol resolution.

uv run ai-codescan prep ./tests/fixtures/tiny-flask --engine hybrid
uv run ai-codescan run ./tests/fixtures/tiny-flask --target-bug-class injection --yes
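
As a rough illustration of the manifest-based auto-detection described above, the check might look like the sketch below. The manifest names come from this README; the function name and the "any one manifest is enough" rule are assumptions, and the real stack_detect.py also detects frameworks and package managers.

```python
import tempfile
from pathlib import Path

# Manifest files this README lists as Python markers.
PYTHON_MANIFESTS = ("pyproject.toml", "setup.py", "setup.cfg", "requirements.txt")


def looks_like_python_project(root: Path) -> bool:
    """Return True if any known Python manifest sits directly under root."""
    return any((root / name).is_file() for name in PYTHON_MANIFESTS)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    print(looks_like_python_project(root))  # False: no manifest yet
    (root / "requirements.txt").write_text("flask\n")
    print(looks_like_python_project(root))  # True: manifest present
```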

Reports land in ./report/YYYY-MM-DD--<severity>--<vuln-class>--<component>.md.
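
The naming pattern above can be reproduced with a small helper. This is a sketch only: the slug rule (lowercase, non-alphanumerics collapsed to single hyphens) and the function names are assumptions, since the README states only the overall pattern.

```python
import re
from datetime import date


def slug(text: str) -> str:
    """Lowercase and collapse anything non-alphanumeric into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def report_filename(day: date, severity: str, vuln_class: str, component: str) -> str:
    """Build YYYY-MM-DD--<severity>--<vuln-class>--<component>.md."""
    parts = [day.isoformat(), slug(severity), slug(vuln_class), slug(component)]
    return "--".join(parts) + ".md"


print(report_filename(date(2026, 5, 10), "High", "SQL Injection", "routes/users.ts"))
# 2026-05-10--high--sql-injection--routes-users-ts.md
```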

Common commands

# Discovery + filtering
uv run ai-codescan list-bug-classes
uv run ai-codescan prep <target> --target-bug-class xss,sqli,idor

# Pick a different LLM provider per stage
uv run ai-codescan run <target> --llm-provider gemini --llm-model gemini-2.5-pro --yes
uv run ai-codescan analyze        --llm-provider codex  --llm-model o3
uv run ai-codescan validate       --llm-provider claude --llm-model opus

# Inspect the project DB
uv run ai-codescan query "SELECT cwe, COUNT(*) FROM flows GROUP BY cwe"
uv run ai-codescan flows --from <symbol-id>
uv run ai-codescan view  --file /path/in/snapshot.ts
uv run ai-codescan entrypoints

# Engine alternatives
uv run ai-codescan prep <target> --engine codeql       # default
uv run ai-codescan prep <target> --engine llm-heavy    # LLM walks flows; no CodeQL needed
uv run ai-codescan prep <target> --engine hybrid       # CodeQL + Semgrep + Joern (when on PATH), deduped

# Layer 5 — stored / async second-order taint
uv run ai-codescan taint-schema --run                  # populate storage_locations + writes
uv run ai-codescan taint-schema --show
uv run ai-codescan taint-schema --edit                 # hand-edit annotations

# Visualization
uv run ai-codescan visualize --fmt svg --cwe CWE-89 --out flows.svg
uv run ai-codescan visualize --fmt png --limit 50 --out flows.png

# Cache management
uv run ai-codescan cache list
uv run ai-codescan cache rm <repo-id>
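
To show what the `query` example above returns, here is a toy reproduction of the same aggregation. It makes two loud assumptions: the standard-library sqlite3 module stands in for the project's DuckDB so the snippet runs without extra dependencies, and the `flows` table is reduced to an `id` and a `cwe` column (the real schema is not described in this README). An ORDER BY is added only to make the output order deterministic.

```python
import sqlite3


def cwe_counts(conn: sqlite3.Connection) -> list[tuple[str, int]]:
    """Count flows per CWE, mirroring the README's example query."""
    sql = "SELECT cwe, COUNT(*) FROM flows GROUP BY cwe ORDER BY cwe"
    return conn.execute(sql).fetchall()


# Hypothetical, minimal stand-in for the per-target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flows (id INTEGER PRIMARY KEY, cwe TEXT)")
conn.executemany(
    "INSERT INTO flows (cwe) VALUES (?)",
    [("CWE-89",), ("CWE-79",), ("CWE-89",)],
)

print(cwe_counts(conn))  # [('CWE-79', 1), ('CWE-89', 2)]
```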

Configuration

Flag	Default	Notes
`--engine {codeql,llm-heavy,hybrid}`	`codeql`	Pick the static analysis backbone
`--target-bug-class <list>`	all	Comma-separated; e.g. `injection,xss,idor` or `@injection`
`--llm-provider {claude,gemini,codex}`	`claude`	LLM CLI to drive each phase
`--llm-model <name>`	provider default	e.g. `opus`, `gemini-2.5-pro`, `o3`
`--temperature <float>`	`0.0`	Determinism by default
`--report-dir <path>`	`./report/`	Where final reports land
`--cache-dir <path>`	`~/.ai_codescan/repos/<id>/`	Per-target snapshot + DuckDB
`--commit <sha>`	`HEAD`	Pin git snapshot to a commit
`--yes`	off	Skip interactive HITL gates
`--cost-cap <usd>`	none	Abort phase when LLM spend exceeds this
`--no-sandbox`	off	`validate` runs PoC locally instead of in Docker
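
The `--cost-cap` behaviour in the table above (abort the phase once LLM spend exceeds the cap) can be pictured as a small ledger. This is a hedged sketch: `CostLedger` and `CostCapExceeded` are hypothetical names, and the actual cost ledger in runs/state.py may track spend differently.

```python
from typing import Optional


class CostCapExceeded(RuntimeError):
    """Raised when accumulated LLM spend passes the configured cap."""


class CostLedger:
    def __init__(self, cap_usd: Optional[float] = None) -> None:
        self.cap_usd = cap_usd  # None means no cap, the documented default
        self.spent_usd = 0.0

    def record(self, usd: float) -> None:
        """Add one call's cost and abort if the running total exceeds the cap."""
        self.spent_usd += usd
        if self.cap_usd is not None and self.spent_usd > self.cap_usd:
            raise CostCapExceeded(
                f"spent ${self.spent_usd:.2f}, cap is ${self.cap_usd:.2f}"
            )


ledger = CostLedger(cap_usd=1.00)
ledger.record(0.50)  # under the cap, phase continues
try:
    ledger.record(0.75)  # total 1.25 exceeds 1.00
except CostCapExceeded as exc:
    print("aborting phase:", exc)
```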

LLM provider/model + temperature are persisted to runs/<run_id>/run.json for auditability.
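
A minimal sketch of that persistence step follows. The runs/<run_id>/run.json location comes from this README; the JSON key names and the `write_run_metadata` helper are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_run_metadata(runs_dir: Path, run_id: str, provider: str,
                       model: str, temperature: float) -> Path:
    """Persist the LLM settings for one run to <runs_dir>/<run_id>/run.json."""
    run_dir = runs_dir / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / "run.json"
    # Key names are assumptions; the README only names the three values.
    payload = {
        "llm_provider": provider,
        "llm_model": model,
        "temperature": temperature,
    }
    path.write_text(json.dumps(payload, indent=2))
    return path


with tempfile.TemporaryDirectory() as tmp:
    written = write_run_metadata(Path(tmp) / "runs", "demo-run", "claude", "opus", 0.0)
    print(json.loads(written.read_text()))
```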

Joern (deferred install)

--engine hybrid is wired to use Joern in addition to CodeQL and Semgrep. The installer asks before downloading because of the size:

Download: ~1.5 GB (JVM + Joern distribution)
Disk: ~2 GB unpacked
Adds: broader cross-file taint coverage on JS/TS, Java, Python, Go, Kotlin
Without it: hybrid still runs CodeQL + Semgrep and dedupes; you only lose the third opinion
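
The deduplication that hybrid mode performs across engines can be pictured roughly as follows. This is a sketch under stated assumptions: the `Flow` fields and the choice of (CWE, source, sink) as the dedupe key are hypothetical, since the README does not say how findings from different engines are matched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    engine: str  # e.g. "codeql", "semgrep", "joern"
    cwe: str
    source: str  # "file:line" of the taint source
    sink: str    # "file:line" of the sink


def dedupe(flows: list[Flow]) -> dict[tuple[str, str, str], set[str]]:
    """Merge flows that share CWE, source and sink; remember which engines agreed."""
    merged: dict[tuple[str, str, str], set[str]] = {}
    for flow in flows:
        merged.setdefault((flow.cwe, flow.source, flow.sink), set()).add(flow.engine)
    return merged


flows = [
    Flow("codeql", "CWE-89", "app.py:10", "db.py:42"),
    Flow("semgrep", "CWE-89", "app.py:10", "db.py:42"),  # same finding, second opinion
    Flow("semgrep", "CWE-79", "views.py:7", "views.py:19"),
]
for key, engines in dedupe(flows).items():
    print(key, sorted(engines))
```

Keeping the set of agreeing engines per finding shows what is lost without Joern: the merged findings stay the same, with one fewer source of corroboration.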

Install later if you skipped initially:

AICS_INSTALL_JOERN=yes bash scripts/install.sh

Project layout

ai_codescan/                         # Python package
├── snapshot.py                      # Read-only snapshot via git worktree or cp
├── stack_detect.py                  # Detect projects + frameworks + package manager
├── ast/                             # Node worker + Python wrapper
├── index/                           # SCIP indexer + DuckDB schema/ingestion
├── engines/                         # codeql, semgrep, joern (stub), hybrid, llm_heavy
├── ingest/sarif.py                  # Parse SARIF into flows
├── findings/                        # Finding model + queue
├── analyzer.py / nominator.py / validator.py
├── report.py / visualize.py / sandbox.py
├── runs/state.py                    # Cost ledger + run state
├── llm.py                           # Swappable provider abstraction
├── taxonomy/bug_classes.yaml        # 35 bug classes with CWE/CodeQL mappings
└── skills/                          # Bundled Claude Code skills
    ├── wide_nominator/
    ├── deep_analyzer/
    ├── validator/
    └── llm_heavy/

docs/superpowers/
├── specs/                           # Phase 1 + 2 design specs
└── plans/                           # Per-sub-plan implementation plans

tests/                               # 158 tests, ~2 min full suite
TRADEOFFS.md                         # Autonomous decisions + open follow-ups

Development

make check          # ruff + ty + pytest
make lint           # ruff check
make format         # ruff format
make typecheck      # ty check --error-on-warning
make test           # pytest

Quality gates: zero ruff warnings, zero ty errors, all tests green. ~158 tests, ~2 min.

Pointers

Tradeoffs and open questions: TRADEOFFS.md
Phase 1 design: docs/superpowers/specs/2026-05-08-ai-codescan-phase1-design.md
Phase 2 design: docs/superpowers/specs/2026-05-09-ai-codescan-phase2-design.md
Per-sub-plan implementation plans: docs/superpowers/plans/
Milestone tags: git tag -l lists 14 (phase-1a through phase-3).

Status

Phases 1, 2, and 3 are complete and tagged.

JavaScript / TypeScript and Python are fully implemented across every layer of the pipeline: stack detection, CodeQL, Semgrep, Joern, AST extraction, SCIP indexing, storage-taint regexes, framework-aware entrypoint detection, fixtures, and end-to-end smoke tests.

Java, Go, Ruby, PHP, and C# have reached MVP+:

Java: javasrc2cpg + Spring/JAX-RS/Kafka entrypoints + tiny-spring SQLi fixture.
Go: Joern's gosrc2cpg + Gin/Echo/Chi/Fiber/stdlib http.HandleFunc entrypoints + tiny-gin SQLi fixture.
Ruby: tree-sitter-ruby AST + Rails routes / Sinatra DSL / Sidekiq workers + tiny-sinatra SQLi fixture. Joern's rubysrc2cpg is wired but beta and produces thin CPGs on DSL-heavy code.
PHP: tree-sitter-php AST + Laravel/Symfony/Slim/WordPress entrypoints + tiny-slim SQLi fixture. CodeQL doesn't support PHP, so coverage relies on Semgrep plus Joern's php2cpg when php is on PATH.
C#: CodeQL --build-mode=none + Joern csharpsrc2cpg + tree-sitter-c-sharp AST + ASP.NET Core attribute routing / minimal-API app.MapGet / Azure Functions entrypoints + tiny-aspnet SQLi fixture.

Bare-source detection covers snippet repos and CTF challenges with no manifest. Joern install remains opt-in. See TRADEOFFS.md for the full list of autonomous decisions.

Claude Sessions

Session	Summary	Date
`python-language-support`	Added full-parity Python language support: stack_detect (pyproject/setup/requirements + framework + pkg-mgr detection), CodeQL Python query suite, Joern pysrc2cpg + Python source/sink patterns, tree-sitter-python AST worker, Python idiom regexes in storage_taint, tiny-flask fixture, end-to-end smoke test.	2026-05-09
`js-python-elevation`	Elevated JS/TS and Python from MVP to fully implemented: language-aware views.py (Python `#` comments), broader JS entrypoints (NestJS decorators, Next.js Pages/App Router, Remix loaders/actions), Python entrypoints (Flask/FastAPI/Django/Starlette + Celery/argparse), tightened Joern JS XSS receiver filter, optional scip-python integration, tiny-fastapi CWE-22 fixture, hybrid-mode integration test, README Python quickstart.	2026-05-09
`java-language-support`	Added Java MVP+ language support: stack_detect (Maven pom.xml + Gradle Groovy/KTS, framework detection for Spring Boot/Quarkus/Micronaut/Dropwizard/Helidon/Javalin/Vertx, multi-module skip, target/ skip), CodeQL Java (`--language=java-kotlin`, `--build-mode=none`, java-security-extended.qls), Joern's `javasrc2cpg` with annotated-parameter source detection (`@RequestParam`/`@RequestBody`/`@PathVariable`), tree-sitter-java AST worker emitting class/method/annotation symbols+xrefs, Java JDBC/Spring/Kafka idioms in storage_taint, Spring/JAX-RS/Kafka/Scheduled entrypoints, tiny-spring CWE-89 fixture.	2026-05-09
`go-language-support`	Added Go MVP+ language support: stack_detect (go.mod / go.work; framework detection for Gin/Echo/Chi/Fiber/Beego/Iris/gorilla/httprouter/fasthttp; vendor/ skip), CodeQL Go (`--language=go`, go-security-extended.qls), Joern's `gosrc2cpg` (`--language GOLANG`) with call-result source detection (`c.Query`, `r.URL.Query`, etc.), tree-sitter-go AST worker emitting function/method/struct/interface symbols and call xrefs, Go database/sql + go-redis + Spring-Kafka-equivalent idioms in storage_taint, stdlib `http.HandleFunc` + Gin/Echo/Chi/Fiber router-method entrypoints, tiny-gin CWE-89 fixture, narrowed Joern co-location fallback to JS-only (Go/Java/Python now rely solely on the data-flow engine).	2026-05-09
`ruby-language-support`	Added Ruby MVP+ language support: stack_detect (Gemfile / *.gemspec / config/application.rb; framework detection for Rails/Sinatra/Hanami/Grape/Roda/Padrino/Sidekiq/GoodJob/SolidQueue/Resque), CodeQL Ruby (`--language=ruby`, ruby-security-extended.qls; no toolchain on host), Joern's `rubysrc2cpg` (`--language RUBYSRC`; documented as beta — wired but produces thin CPGs on DSL-heavy code), tree-sitter-ruby AST worker emitting class/module/method/singleton-method symbols and call xrefs, ActiveRecord/Mysql2/PG/SQLite3 idioms in storage_taint, Rails routes DSL / Sinatra/Grape route DSLs / Sidekiq/ActiveJob worker includes / ARGV+OptionParser CLI entrypoints, tiny-sinatra CWE-89 fixture, fixed prep skip-parts to compare relative-to-base parts so `/tmp/` prefixes don't trip the Rails-`tmp/` skip.	2026-05-09
`php-and-bare-source`	Added PHP MVP+ language support and bare-source language detection. PHP: stack_detect (composer.json / wp-config.php / Drupal core marker; framework detection for Laravel/Symfony/CodeIgniter/Yii/Slim/CakePHP/WordPress/Drupal), Joern's `php2cpg` (`--language PHP`; needs `php` on PATH), tree-sitter-php@0.22.8 AST worker emitting function/method/class/interface/trait/enum symbols + call/attribute xrefs, PDO/mysqli/wpdb/Doctrine/Eloquent idioms in storage_taint, Laravel `Route::` / Symfony `#[Route]` / Slim `$app->` / WordPress `add_action`/`add_filter` / WP-CLI / Symfony Console entrypoints, tiny-slim CWE-89 fixture (CodeQL skipped — PHP not officially supported). Bare-source: stack_detect emits a Project per supported language (Python/Java/Go/Ruby/PHP) when no manifest is found, so snippet repos and CTF challenges are still indexed. 289 pass (+18).	2026-05-10
`csharp-language-support`	Added C#/.NET MVP+ language support: stack_detect (.csproj / .sln / Directory.Build.props / global.json; framework detection for ASP.NET Core MVC + minimal-APIs / Blazor WASM+Server / SignalR / Worker services / Azure Functions / gRPC / MassTransit / MediatR / Hangfire / Quartz / Orleans; sln-as-wrapper rule emits one project per .csproj when both coexist; bin/obj/.vs skip), CodeQL C# (`--language=csharp`, `--build-mode=none` GA in CLI 2.18.4, csharp-security-extended.qls), Joern's `csharpsrc2cpg` (`--language CSHARPSRC`), tree-sitter-c-sharp@0.21.3 AST worker emitting class/interface/record/struct/enum/method/constructor symbols + invocation/object-creation/attribute xrefs (so `[HttpGet]`/`[Route]`/`[Function]` surface to entrypoint detection), ADO.NET (SqlCommand) + Dapper + EF Core ExecuteSqlRaw idioms in storage_taint, ASP.NET Core attribute routing (`[Route]`, `[HttpGet]`, `[ApiController]`) + minimal-API `app.MapGet/Post/...` + Azure Functions `[Function]`/`[FunctionName]` entrypoints, tiny-aspnet minimal-API CWE-89 fixture. 301 pass (+12).	2026-05-10
`pipeline-quality-and-3-langs`	Quality lifts + three new languages. Real CodeQL integration tests for Java/Go/Ruby/C# (auto-fetch the language pack via `codeql pack download`, gate on host toolchain). Kotlin: tree-sitter-kotlin AST + Ktor `routing { get(...) {...} }` entrypoints + Spring-on-Kotlin annotations + Exposed ORM idioms in storage_taint + tiny-ktor SQLi fixture (reuses CodeQL `java-kotlin` extractor). Bash: tree-sitter-bash + bare-source ProjectKind.BASH + getopts/eval/source CLI markers + tiny-bash CWE-78 fixture (Semgrep covers; no CodeQL/Joern). YAML: ProjectKind.YAML + GitHub Actions / k8s / docker-compose / Helm detection + tree-sitter-yaml emitting yaml_key symbols + on:/run:/${{ github.event.* }} xrefs + tiny-actions CWE-78 fixture. schema.taint.yml LLM-suggested seeds: extended resolver prompt with per-language examples + bundled docs/schema.taint.yml.example + new `taint-schema --init` CLI. 324 pass (+23).	2026-05-10
`engines-polish-pass`	Pipeline-gap polish in three independent lifts. CodeQL pack auto-fetch hardening: new `engines.codeql.ensure_query_pack(language)` with bounded timeout + retry, called automatically before `codeql database analyze` so production runs work on fresh hosts (production move from test-only `_ensure_codeql_pack`); 5 unit tests cover retry/timeout/no-codeql/unknown-language paths. SCIP Java + SCIP Go: new `_build_scip_java`/`_build_scip_go` wrappers in `index/scip.py` with graceful skip when CLI is missing; `prep._scip_language_for_project` dispatches JAVA-kind (covers Kotlin via semanticdb) and GO-kind projects. Joern Kotlin: `kotlin → KOTLIN` mapping in `engines.joern`, Kotlin-heavy file-count heuristic in cli.py routes to `kotlin2cpg`, dedicated kotlin case in joern_queries.sc with Spring-on-Kotlin + Ktor request sources. Joern integration tests for tiny-vuln/tiny-flask/tiny-spring/tiny-gin/tiny-sinatra/tiny-slim/tiny-aspnet (gated on joern + per-language toolchain). Validator multi-language PoC support: `_PROFILES` table dispatches PoC scripts by file extension to `node:22-alpine`, `openjdk:21-slim`, `golang:1.22-alpine`, `ruby:3.3-alpine`, `php:8.3-cli-alpine`, `bash:5` (or `python:3.13-slim` legacy); TypeScript via `npx tsx`; `.cs` deferred (file-based `dotnet run` needs .NET 10+); `--no-sandbox` rejects non-Python with a clear `SandboxUnavailableError`. 343 unit pass (+19) plus 7 new gated Joern integration tests.	2026-05-10
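
The multi-language PoC dispatch mentioned in the last session entry can be sketched as a lookup from file extension to sandbox image. The image tags are the ones quoted in that entry; the extension keys, the `POC_IMAGES` name, and the error type are assumptions for illustration and do not reproduce the project's `_PROFILES` table.

```python
from pathlib import PurePath

# Image tags as listed in the session log; extension keys are assumed.
POC_IMAGES = {
    ".py": "python:3.13-slim",
    ".js": "node:22-alpine",
    ".ts": "node:22-alpine",  # the log says TypeScript runs via `npx tsx`
    ".java": "openjdk:21-slim",
    ".go": "golang:1.22-alpine",
    ".rb": "ruby:3.3-alpine",
    ".php": "php:8.3-cli-alpine",
    ".sh": "bash:5",
}


def image_for_poc(script: str) -> str:
    """Pick the sandbox image for a PoC script based on its file extension."""
    ext = PurePath(script).suffix.lower()
    try:
        return POC_IMAGES[ext]
    except KeyError:
        # .cs lands here: the log marks C# PoCs as deferred.
        raise ValueError(f"no sandbox profile for {ext!r}") from None


print(image_for_poc("poc_sqli.go"))  # golang:1.22-alpine
```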
