Merged
1 change: 1 addition & 0 deletions COMPATIBILITY.md
@@ -113,6 +113,7 @@ Taint rules (`mode: taint`):
- **`patterns:` AND-blocks inside source/sink/sanitizer entries** — foxguard extracts all `pattern:` and `pattern-either:` sub-items as expressible node-shape matchers. Sub-items that are constraint-only operators (`pattern-inside:`, `pattern-not-inside:`, `pattern-not:`, `pattern-not-regex:`, `focus-metavariable:`, `metavariable-regex:`, `metavariable-comparison:`, `metavariable-pattern:`, `metavariable-analysis:`, `metavariable-type:`) are **dropped with a per-key warning**. This makes the compiled matcher **slightly broader** than the original Semgrep rule (the AND narrowing is lost), so foxguard may report findings that Semgrep would suppress. This broadening is intentional — it is better to over-report than to silently drop the rule. A `patterns:` block that produces no expressible matcher (only constraint-only sub-items) is warn-skipped without aborting sibling entries. If all source or sink entries are warn-skipped and none survive, the whole rule is skipped.
- **Parameter-as-source shape (`focus-metavariable` + a function-signature `pattern-inside`/`pattern`)** — a `pattern-sources` `patterns:` block of the form "a metavariable `$X` that is a parameter of an enclosing function" (i.e. a `focus-metavariable: $X` or bare `pattern: $X` together with a function-definition context whose parameter list contains `$X`) is recognised and compiled to an **any-function-parameter** taint source. Every parameter of every function/method in the file is seeded as tainted (matching Semgrep's any-parameter semantics for this shape). Supported for Python, JavaScript/TypeScript, Go, Java, C, Kotlin, Ruby, and PHP; C# carries the source inertly (no parameter-scope seeding). The recognition is bounded: the seed metavariable must genuinely appear inside the first parameter list of a function-definition pattern in the same block, so an unrelated focus metavariable (e.g. `focus-metavariable: $X` over `pattern: get_input($X)`) is *not* treated as a parameter source and falls through to the normal graceful-degradation extraction.
- **Focus-on-call-argument sink shape (`focus-metavariable` / bare `pattern: $X` + a call-context `pattern-inside`/`pattern`)** — the sink-side analog of the parameter-as-source shape. A `pattern-sinks`/`pattern-sanitizers` `patterns:` block of the form "the focused metavariable `$X` is an argument of a named call" (i.e. a `focus-metavariable: $X` or bare `pattern: $X` together with a call-context pattern such as `pattern-inside: $POOL.query($X, ...)`, `pattern: assert($X, ...)`, or `pattern: $DC.$METHOD($X, ...)`) is recognised and compiled to the existing `Call`/`MethodName` sink matcher for the call's callee. A concrete callee (`assert`, `redirect_to`, `YAML.load`) compiles to one `Call`; a `$RECV.$METH(...)` callee whose method is pinned by an anchored-alternation `metavariable-regex` (`^(query|execute)$`, `\b(include|require)\b`) compiles to one `MethodName` (or `Call`, for a `$FUNC(...)` callee) **per listed name**. These reuse the existing taint-gated call sinks, which only fire when a tracked-tainted value reaches the call's arguments — so the compiled sink is bounded to the concrete callee/method name AND tainted data, never an over-broad bare-node sink. The recognition is bounded: the focused metavariable must appear in the call's argument list (or the call must have a wildcard `...`/metavariable argument list), and a call whose callee/method is a free metavariable with **no** pinning `metavariable-regex` produces no matcher (we never invent a name) — such a block falls through to the normal graceful-degradation extraction. Supported for all taint languages (Python, JavaScript/TypeScript, Go, Java, C, Kotlin, Ruby, PHP); non-call contexts (binops, dict/object literals, subscript/property assignments) are not recognised by this shape.
+- **Regex-bounded bare-metavariable callee sink shape (bare `pattern: $F(...)` / `$OBJ.$M(...)` + a pinning `metavariable-regex`)** — a `pattern-sinks`/`pattern-sanitizers` `patterns:` block that pairs a *bare-metavariable callee* call pattern with a `metavariable-regex` constraining that callee/method metavariable, e.g. `pattern: $EXEC(...)` + `metavariable-regex: { metavariable: $EXEC, regex: ^(system|exec|IO.popen)$ }`, or `pattern: $WRITER.$WRITE(...)` + `metavariable-regex: { metavariable: $WRITE, regex: ^(writerow|writerows|writeheader)$ }`. Without the regex, a bare-metavariable callee would match *every* call (universal → false-positive-unsafe) and is refused; **with** the `metavariable-regex` the match is bounded to callees/methods whose name matches, so the block is compiled to a name-constrained matcher that **enforces the regex at match time**: a bare `$F(...)` callee → a `CallRegex` matcher (the regex is tested against the *full callee text*, so dotted alternatives such as `IO.popen` match), and a `$OBJ.$M(...)` method callee → a `MethodNameRegex` matcher (the regex is tested against the *final method name*, any receiver). Any regex form is accepted (anchored alternations, fuzzy patterns such as `(?i)(.*password.*)`, and PCRE lookaround via `fancy-regex`). Like the other call sinks, these only fire when a tracked-tainted value reaches the call's arguments, so the compiled sink is bounded to a name-matching callee/method AND tainted data. The refusal is preserved: a bare-metavariable callee with **no** pinning `metavariable-regex` still compiles to nothing (the sink role empties and the rule is skipped) — foxguard never invents a callee name. Sink/sanitizer only (a call argument is a data-flow destination, not a taint origin); the regex matchers are matched by the shared call-sink resolver, so all taint languages benefit.
- Supported `pattern:` shapes:
  - bare identifier (`request`) — a source-only shape compiled to a parameter-name match
  - dotted attribute chain (`request.data`, `request.json`) — nested chains flatten to `leftmost root + outermost field` (matches the engine's one-level attribute propagation)
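
The "one matcher per listed name" expansion described for anchored-alternation `metavariable-regex` values can be sketched roughly as below. This is an illustration only, not foxguard's implementation: `expand_anchored_alternation` and `is_plain_name` are hypothetical helpers, and the sketch assumes that only `^(a|b|c)$` and `\b(a|b|c)\b` forms with plain identifier alternatives are expandable. Anything else (for example a dotted alternative such as `IO.popen`, or a fuzzy pattern) yields `None`, which is the case a regex-enforcing matcher would have to handle at match time.

```rust
/// Hypothetical helper (not part of foxguard): true if `name` is a plain
/// identifier with no regex metacharacters.
fn is_plain_name(name: &str) -> bool {
    !name.is_empty() && name.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Hypothetical helper (not part of foxguard): expand an anchored alternation
/// such as `^(query|execute)$` into the literal names it lists. Returns `None`
/// when the regex is not a plain list of names, so no name is ever invented.
fn expand_anchored_alternation(regex: &str) -> Option<Vec<String>> {
    // Accept either `^( ... )$` or `\b( ... )\b` around the alternation.
    let inner = regex
        .strip_prefix("^(")
        .and_then(|rest| rest.strip_suffix(")$"))
        .or_else(|| {
            regex
                .strip_prefix(r"\b(")
                .and_then(|rest| rest.strip_suffix(r")\b"))
        })?;

    let names: Vec<String> = inner.split('|').map(str::to_string).collect();
    if names.iter().all(|n| is_plain_name(n)) {
        Some(names)
    } else {
        None
    }
}

fn main() {
    // One name per alternative: each would become its own name-pinned sink.
    println!("{:?}", expand_anchored_alternation("^(query|execute)$"));
    println!("{:?}", expand_anchored_alternation(r"\b(include|require)\b"));
    // Not a plain name list: no expansion.
    println!("{:?}", expand_anchored_alternation("(?i)(.*password.*)"));
}
```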
44 changes: 22 additions & 22 deletions docs/parity/registry-coverage.md
@@ -12,33 +12,33 @@ Measures how well foxguard's existing Semgrep-compat YAML loader (`src/rules/sem
| Rule files scanned | 2070 |
| Files with YAML parse errors | 0 |
| Total rules | 2144 |
-| Rules loaded OK | 2051 (95.7%) |
-| Rules skipped | 93 (4.3%) |
+| Rules loaded OK | 2059 (96.0%) |
+| Rules skipped | 85 (4.0%) |

-**Headline load rate: 95.7%** (2051 / 2144 rules).
+**Headline load rate: 96.0%** (2059 / 2144 rules).

## Skip-reason histogram

Sorted by frequency. The reason names the operator/key that blocks the rule today.

| Skip reason | Rules | % of skipped | % of all rules |
|---|---:|---:|---:|
-| `mode: taint (unsupported shape)` | 80 | 86.0% | 3.7% |
-| `generic mode (languages: [generic])` | 5 | 5.4% | 0.2% |
-| `taint: pattern-propagators` | 4 | 4.3% | 0.2% |
-| `mode: taint (unsupported language: apex)` | 3 | 3.2% | 0.1% |
-| `mode: taint (unsupported language: swift)` | 1 | 1.1% | 0.0% |
+| `mode: taint (unsupported shape)` | 72 | 84.7% | 3.4% |
+| `generic mode (languages: [generic])` | 5 | 5.9% | 0.2% |
+| `taint: pattern-propagators` | 4 | 4.7% | 0.2% |
+| `mode: taint (unsupported language: apex)` | 3 | 3.5% | 0.1% |
+| `mode: taint (unsupported language: swift)` | 1 | 1.2% | 0.0% |
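
The percentages in the updated rows follow directly from the counts. A small sketch that recomputes them is shown below; the helper name `pct` is made up for this example, and the counts are copied from the updated tables (2144 total rules, 85 skipped).

```rust
/// Format `part / whole` as a percentage with one decimal place.
fn pct(part: u32, whole: u32) -> String {
    format!("{:.1}%", 100.0 * f64::from(part) / f64::from(whole))
}

fn main() {
    let total_rules: u32 = 2144;
    // (skip reason, rules) rows from the updated histogram.
    let skipped = [
        ("mode: taint (unsupported shape)", 72),
        ("generic mode (languages: [generic])", 5),
        ("taint: pattern-propagators", 4),
        ("mode: taint (unsupported language: apex)", 3),
        ("mode: taint (unsupported language: swift)", 1),
    ];

    let skipped_total: u32 = skipped.iter().map(|(_, n)| n).sum();
    let loaded = total_rules - skipped_total;
    println!("loaded:  {} ({})", loaded, pct(loaded, total_rules));
    println!("skipped: {} ({})", skipped_total, pct(skipped_total, total_rules));

    // Per-reason share of skipped rules and of all rules.
    for (reason, n) in skipped {
        println!(
            "{reason}: {n} | {} of skipped | {} of all",
            pct(n, skipped_total),
            pct(n, total_rules)
        );
    }
}
```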

## Priority order — operator/feature backlog

Matcher capabilities (implementable in `semgrep_compat.rs` / `semgrep_taint.rs`) ranked by how many registry rules each would unlock. These are independent of adding new language grammars. Build top-down.

| Rank | Capability to add | Rules unlocked |
|---:|---|---:|
-| 1 | `mode: taint (unsupported shape)` | 80 |
+| 1 | `mode: taint (unsupported shape)` | 72 |
| 2 | `taint: pattern-propagators` | 4 |

-Operator/feature gaps account for **84 rules** (3.9% of all rules). Closing the top of this list is the highest-leverage parity work that does not require a new parser.
+Operator/feature gaps account for **76 rules** (3.5% of all rules). Closing the top of this list is the highest-leverage parity work that does not require a new parser.

## Priority order — missing language grammars

@@ -58,16 +58,16 @@ Language is the rule's first declared language (js/ts/jsx/tsx collapsed to `java

| Language | Total | Loaded | Skipped | Load rate |
|---|---:|---:|---:|---:|
-| python | 423 | 406 | 17 | 96.0% |
+| python | 423 | 409 | 14 | 96.7% |
| hcl | 359 | 359 | 0 | 100.0% |
-| javascript | 243 | 229 | 14 | 94.2% |
+| javascript | 243 | 230 | 13 | 94.7% |
| regex | 237 | 237 | 0 | 100.0% |
-| java | 131 | 115 | 16 | 87.8% |
+| java | 131 | 116 | 15 | 88.5% |
| generic | 103 | 98 | 5 | 95.1% |
| yaml | 100 | 100 | 0 | 100.0% |
-| go | 97 | 86 | 11 | 88.7% |
-| ruby | 92 | 83 | 9 | 90.2% |
-| php | 63 | 53 | 10 | 84.1% |
+| go | 97 | 87 | 10 | 89.7% |
+| ruby | 92 | 84 | 8 | 91.3% |
+| php | 63 | 54 | 9 | 85.7% |
| solidity | 50 | 49 | 1 | 98.0% |
| csharp | 48 | 42 | 6 | 87.5% |
| dockerfile | 39 | 39 | 0 | 100.0% |
@@ -92,12 +92,12 @@ Language is the rule's first declared language (js/ts/jsx/tsx collapsed to `java
- **apex**: `mode: taint (unsupported language: apex)` (3)
- **csharp**: `mode: taint (unsupported shape)` (5), `taint: pattern-propagators` (1)
- **generic**: `generic mode (languages: [generic])` (5)
-- **go**: `mode: taint (unsupported shape)` (11)
-- **java**: `mode: taint (unsupported shape)` (13), `taint: pattern-propagators` (3)
-- **javascript**: `mode: taint (unsupported shape)` (14)
-- **php**: `mode: taint (unsupported shape)` (10)
-- **python**: `mode: taint (unsupported shape)` (17)
-- **ruby**: `mode: taint (unsupported shape)` (9)
+- **go**: `mode: taint (unsupported shape)` (10)
+- **java**: `mode: taint (unsupported shape)` (12), `taint: pattern-propagators` (3)
+- **javascript**: `mode: taint (unsupported shape)` (13)
+- **php**: `mode: taint (unsupported shape)` (9)
+- **python**: `mode: taint (unsupported shape)` (14)
+- **ruby**: `mode: taint (unsupported shape)` (8)
- **solidity**: `mode: taint (unsupported shape)` (1)
- **swift**: `mode: taint (unsupported language: swift)` (1)

24 changes: 24 additions & 0 deletions src/rules/csharp_taint.rs
@@ -704,6 +704,8 @@ fn classify_source_expr(node: Node<'_>, source: &str, spec: &TaintSpec) -> Optio
}
}
NodeMatcher::MethodName { .. }
+| NodeMatcher::CallRegex { .. }
+| NodeMatcher::MethodNameRegex { .. }
| NodeMatcher::ReceiverCall { .. }
| NodeMatcher::MemberAssign { .. } => {
// Sink-only matchers, never a source.
@@ -775,6 +777,28 @@ fn matcher_matches_call(matcher: &NodeMatcher, node: Node<'_>, source: &str) ->
             }
             false
         }
+        NodeMatcher::CallRegex { regex, .. } => {
+            // `$F(...)` + metavariable-regex on `$F`: callee text matches regex.
+            if node.kind() == "invocation_expression" {
+                if let Some(func) = node.child_by_field_name("function") {
+                    let resolved = resolve_callee(func, source);
+                    return regex.is_match(resolved);
+                }
+            }
+            false
+        }
+        NodeMatcher::MethodNameRegex { regex, .. } => {
+            // `$OBJ.$M(...)` + metavariable-regex on `$M`: final method name
+            // matches regex, any receiver.
+            if node.kind() == "invocation_expression" {
+                if let Some(func) = node.child_by_field_name("function") {
+                    if let Some(method_name) = final_name_segment(func, source) {
+                        return regex.is_match(method_name);
+                    }
+                }
+            }
+            false
+        }
         NodeMatcher::Attribute { root, field, .. } => {
             // Match a member-assignment sink: e.g. `psi.Arguments = tainted`
             // arrives as the LHS of an assignment_expression, not a call.
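
The difference between the two new arms is only in which text the regex sees: the full callee text for `CallRegex`, and the final name segment for `MethodNameRegex`. A minimal sketch of that distinction follows. It works on plain strings rather than tree-sitter nodes, `final_segment`, `Scope`, and `sink_matches` are hypothetical names for this example, and an exact-name closure stands in for the compiled regex so the block needs no external crate.

```rust
/// Hypothetical stand-in for `final_name_segment`, operating on callee text
/// instead of a syntax node: returns the text after the last `.`.
fn final_segment(callee: &str) -> &str {
    callee.rsplit('.').next().unwrap_or(callee)
}

/// Which part of the callee a name-constrained sink tests.
enum Scope {
    /// `CallRegex`-like: test the full callee text (e.g. `IO.popen`).
    FullCallee,
    /// `MethodNameRegex`-like: test only the final method name, any receiver.
    FinalMethodName,
}

/// `name_matches` stands in for `regex.is_match` in this sketch.
fn sink_matches(scope: Scope, callee: &str, name_matches: impl Fn(&str) -> bool) -> bool {
    match scope {
        Scope::FullCallee => name_matches(callee),
        Scope::FinalMethodName => name_matches(final_segment(callee)),
    }
}

fn main() {
    // Exact-name stand-ins for `^(system|exec|IO.popen)$` and
    // `^(writerow|writerows|writeheader)$` (an assumption of this sketch).
    let exec_names = |text: &str| ["system", "exec", "IO.popen"].contains(&text);
    let write_names = |text: &str| ["writerow", "writerows", "writeheader"].contains(&text);

    // A dotted alternative only matches when the full callee text is tested.
    println!("{}", sink_matches(Scope::FullCallee, "IO.popen", exec_names));
    println!("{}", sink_matches(Scope::FinalMethodName, "IO.popen", exec_names));

    // A method-name constraint ignores the receiver.
    println!("{}", sink_matches(Scope::FinalMethodName, "csv_writer.writerow", write_names));
    println!("{}", sink_matches(Scope::FullCallee, "csv_writer.writerow", write_names));
}
```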
2 changes: 2 additions & 0 deletions src/rules/go_taint.rs
@@ -1261,6 +1261,8 @@ fn match_source(
// Seeded at function entry, not matched on expressions.
}
NodeMatcher::MethodName { .. }
+| NodeMatcher::CallRegex { .. }
+| NodeMatcher::MethodNameRegex { .. }
| NodeMatcher::ReceiverCall { .. }
| NodeMatcher::MemberAssign { .. }
| NodeMatcher::BinopFormat { .. }
2 changes: 2 additions & 0 deletions src/rules/javascript_taint.rs
@@ -1926,6 +1926,8 @@ fn match_source(
// Seeded at function entry, not matched on expressions.
}
NodeMatcher::MethodName { .. }
+| NodeMatcher::CallRegex { .. }
+| NodeMatcher::MethodNameRegex { .. }
| NodeMatcher::ReceiverCall { .. }
| NodeMatcher::MemberAssign { .. }
| NodeMatcher::BinopFormat { .. }
2 changes: 2 additions & 0 deletions src/rules/php_taint.rs
@@ -905,6 +905,8 @@ fn match_source(node: Node<'_>, source: &str, spec: &TaintSpec) -> Option<String
}
}
NodeMatcher::MethodName { .. }
+| NodeMatcher::CallRegex { .. }
+| NodeMatcher::MethodNameRegex { .. }
| NodeMatcher::ReceiverCall { .. }
| NodeMatcher::MemberAssign { .. }
| NodeMatcher::BinopFormat { .. }
2 changes: 2 additions & 0 deletions src/rules/python_taint.rs
@@ -1433,6 +1433,8 @@ fn match_source(
// expressions.
}
NodeMatcher::MethodName { .. }
+| NodeMatcher::CallRegex { .. }
+| NodeMatcher::MethodNameRegex { .. }
| NodeMatcher::ReceiverCall { .. }
| NodeMatcher::MemberAssign { .. }
| NodeMatcher::BinopFormat { .. }
2 changes: 2 additions & 0 deletions src/rules/ruby_taint.rs
@@ -779,6 +779,8 @@ fn match_source(node: Node<'_>, source: &str, spec: &TaintSpec) -> Option<String
}
}
NodeMatcher::MethodName { .. }
+| NodeMatcher::CallRegex { .. }
+| NodeMatcher::MethodNameRegex { .. }
| NodeMatcher::ReceiverCall { .. }
| NodeMatcher::MemberAssign { .. }
| NodeMatcher::BinopFormat { .. }
12 changes: 11 additions & 1 deletion src/rules/semgrep_compat.rs
@@ -36,6 +36,16 @@ pub enum CompiledRegex {
}

impl CompiledRegex {
+    /// Returns the original (normalised) regex source string of whichever
+    /// backend compiled it. Used for stable fingerprinting/dedup of compiled
+    /// matchers that embed a regex.
+    pub fn as_str(&self) -> &str {
+        match self {
+            CompiledRegex::Fast(re) => re.as_str(),
+            CompiledRegex::Fancy(re) => re.as_str(),
+        }
+    }
+
     /// Returns `true` if the pattern matches anywhere in `text`.
     ///
     /// For the fancy-regex backend, a matcher error (such as exceeding the
@@ -2370,7 +2380,7 @@ fn build_either_matchers(
     Ok(matchers)
 }

-fn compile_regex(pattern: &str) -> Result<CompiledRegex, String> {
+pub(crate) fn compile_regex(pattern: &str) -> Result<CompiledRegex, String> {
     // `\Z` is a Python/PCRE end-of-string anchor meaning "end of string before
     // optional trailing newline". The Rust `regex` crate uses `$` with the
     // `MULTILINE` flag off for the same semantics (match at absolute end).
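
The body of `compile_regex` is cut off by the hunk, so the rewrite that the comment describes is not visible here. One way such a `\Z` to `$` normalisation could look is sketched below. `normalize_end_anchor` is a hypothetical name, not a function from this PR, and the sketch makes a simplifying assumption: it rewrites every unescaped `\Z`, including one inside a character class.

```rust
/// Hypothetical sketch (not the PR's code): rewrite the `\Z` anchor to `$`,
/// leaving every other escape sequence, including `\\` followed by a literal
/// `Z`, untouched.
fn normalize_end_anchor(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len());
    let mut chars = pattern.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        // `c` starts an escape sequence; inspect the escaped character.
        match chars.next() {
            Some('Z') => out.push('$'),
            Some(escaped) => {
                out.push('\\');
                out.push(escaped);
            }
            // Dangling backslash at the end: keep it as-is.
            None => out.push('\\'),
        }
    }
    out
}

fn main() {
    println!("{}", normalize_end_anchor(r"^foo\Z"));
    println!("{}", normalize_end_anchor(r"\d+\Z"));
    // An escaped backslash followed by `Z` is a literal, not an anchor.
    println!("{}", normalize_end_anchor(r"a\\Z"));
}
```

Scanning escape pairs, rather than calling `str::replace`, keeps a literal backslash followed by `Z` from being misread as the anchor.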