diff --git a/COMPATIBILITY.md b/COMPATIBILITY.md index 07c1d1b9..3c9791b8 100644 --- a/COMPATIBILITY.md +++ b/COMPATIBILITY.md @@ -111,6 +111,7 @@ Taint rules (`mode: taint`): - `mode: taint` for Python, JavaScript/TypeScript, Go, Java, C, and Kotlin; taint rules targeting other languages are skipped with a warning - `pattern-sources`, `pattern-sinks`, `pattern-sanitizers` — each entry may be a single `pattern:` string, a `pattern-either:` list (nested `pattern-either:` is supported and flattens recursively), or a `patterns:` AND-block (see below) - **`patterns:` AND-blocks inside source/sink/sanitizer entries** — foxguard extracts all `pattern:` and `pattern-either:` sub-items as expressible node-shape matchers. Sub-items that are constraint-only operators (`pattern-inside:`, `pattern-not-inside:`, `pattern-not:`, `pattern-not-regex:`, `focus-metavariable:`, `metavariable-regex:`, `metavariable-comparison:`, `metavariable-pattern:`, `metavariable-analysis:`, `metavariable-type:`) are **dropped with a per-key warning**. This makes the compiled matcher **slightly broader** than the original Semgrep rule (the AND narrowing is lost), so foxguard may report findings that Semgrep would suppress. This broadening is intentional — it is better to over-report than to silently drop the rule. A `patterns:` block that produces no expressible matcher (only constraint-only sub-items) is warn-skipped without aborting sibling entries. If all source or sink entries are warn-skipped and none survive, the whole rule is skipped. +- **Parameter-as-source shape (`focus-metavariable` + a function-signature `pattern-inside`/`pattern`)** — a `pattern-sources` `patterns:` block of the form "a metavariable `$X` that is a parameter of an enclosing function" (i.e. a `focus-metavariable: $X` or bare `pattern: $X` together with a function-definition context whose parameter list contains `$X`) is recognised and compiled to an **any-function-parameter** taint source. 
Every parameter of every function/method in the file is seeded as tainted (matching Semgrep's any-parameter semantics for this shape). Supported for Python, JavaScript/TypeScript, Go, Java, C, Kotlin, Ruby, and PHP; C# carries the source inertly (no parameter-scope seeding). The recognition is bounded: the seed metavariable must genuinely appear inside the first parameter list of a function-definition pattern in the same block, so an unrelated focus metavariable (e.g. `focus-metavariable: $X` over `pattern: get_input($X)`) is *not* treated as a parameter source and falls through to the normal graceful-degradation extraction. - Supported `pattern:` shapes: - bare identifier (`request`) — a source-only shape compiled to a parameter-name match - dotted attribute chain (`request.data`, `request.json`) — nested chains flatten to `leftmost root + outermost field` (matches the engine's one-level attribute propagation) diff --git a/docs/parity/registry-coverage.md b/docs/parity/registry-coverage.md index 88416a11..52902b90 100644 --- a/docs/parity/registry-coverage.md +++ b/docs/parity/registry-coverage.md @@ -12,10 +12,10 @@ Measures how well foxguard's existing Semgrep-compat YAML loader (`src/rules/sem | Rule files scanned | 2070 | | Files with YAML parse errors | 0 | | Total rules | 2144 | -| Rules loaded OK | 2004 (93.5%) | -| Rules skipped | 140 (6.5%) | +| Rules loaded OK | 2028 (94.6%) | +| Rules skipped | 116 (5.4%) | -**Headline load rate: 93.5%** (2004 / 2144 rules). +**Headline load rate: 94.6%** (2028 / 2144 rules). ## Skip-reason histogram @@ -23,11 +23,11 @@ Sorted by frequency. 
The reason names the operator/key that blocks the rule toda | Skip reason | Rules | % of skipped | % of all rules | |---|---:|---:|---:| -| `mode: taint (unsupported shape)` | 126 | 90.0% | 5.9% | -| `generic mode (languages: [generic])` | 5 | 3.6% | 0.2% | -| `taint: pattern-propagators` | 5 | 3.6% | 0.2% | -| `mode: taint (unsupported language: apex)` | 3 | 2.1% | 0.1% | -| `mode: taint (unsupported language: swift)` | 1 | 0.7% | 0.0% | +| `mode: taint (unsupported shape)` | 102 | 87.9% | 4.8% | +| `generic mode (languages: [generic])` | 5 | 4.3% | 0.2% | +| `taint: pattern-propagators` | 5 | 4.3% | 0.2% | +| `mode: taint (unsupported language: apex)` | 3 | 2.6% | 0.1% | +| `mode: taint (unsupported language: swift)` | 1 | 0.9% | 0.0% | ## Priority order — operator/feature backlog @@ -35,10 +35,10 @@ Matcher capabilities (implementable in `semgrep_compat.rs` / `semgrep_taint.rs`) | Rank | Capability to add | Rules unlocked | |---:|---|---:| -| 1 | `mode: taint (unsupported shape)` | 126 | +| 1 | `mode: taint (unsupported shape)` | 102 | | 2 | `taint: pattern-propagators` | 5 | -Operator/feature gaps account for **131 rules** (6.1% of all rules). Closing the top of this list is the highest-leverage parity work that does not require a new parser. +Operator/feature gaps account for **107 rules** (5.0% of all rules). Closing the top of this list is the highest-leverage parity work that does not require a new parser. 
## Priority order — missing language grammars @@ -58,18 +58,18 @@ Language is the rule's first declared language (js/ts/jsx/tsx collapsed to `java | Language | Total | Loaded | Skipped | Load rate | |---|---:|---:|---:|---:| -| python | 423 | 401 | 22 | 94.8% | +| python | 423 | 402 | 21 | 95.0% | | hcl | 359 | 359 | 0 | 100.0% | -| javascript | 243 | 208 | 35 | 85.6% | +| javascript | 243 | 221 | 22 | 90.9% | | regex | 237 | 237 | 0 | 100.0% | -| java | 131 | 111 | 20 | 84.7% | +| java | 131 | 115 | 16 | 87.8% | | generic | 103 | 98 | 5 | 95.1% | | yaml | 100 | 100 | 0 | 100.0% | -| go | 97 | 84 | 13 | 86.6% | +| go | 97 | 86 | 11 | 88.7% | | ruby | 92 | 76 | 16 | 82.6% | | php | 63 | 49 | 14 | 77.8% | | solidity | 50 | 49 | 1 | 98.0% | -| csharp | 48 | 38 | 10 | 79.2% | +| csharp | 48 | 42 | 6 | 87.5% | | dockerfile | 39 | 39 | 0 | 100.0% | | ocaml | 34 | 34 | 0 | 100.0% | | scala | 23 | 23 | 0 | 100.0% | @@ -90,13 +90,13 @@ Language is the rule's first declared language (js/ts/jsx/tsx collapsed to `java ## Top skip reasons per language - **apex**: `mode: taint (unsupported language: apex)` (3) -- **csharp**: `mode: taint (unsupported shape)` (9), `taint: pattern-propagators` (1) +- **csharp**: `mode: taint (unsupported shape)` (5), `taint: pattern-propagators` (1) - **generic**: `generic mode (languages: [generic])` (5) -- **go**: `mode: taint (unsupported shape)` (13) -- **java**: `mode: taint (unsupported shape)` (17), `taint: pattern-propagators` (3) -- **javascript**: `mode: taint (unsupported shape)` (35) +- **go**: `mode: taint (unsupported shape)` (11) +- **java**: `mode: taint (unsupported shape)` (13), `taint: pattern-propagators` (3) +- **javascript**: `mode: taint (unsupported shape)` (22) - **php**: `mode: taint (unsupported shape)` (14) -- **python**: `mode: taint (unsupported shape)` (22) +- **python**: `mode: taint (unsupported shape)` (21) - **ruby**: `mode: taint (unsupported shape)` (15), `taint: pattern-propagators` (1) - **solidity**: `mode: 
taint (unsupported shape)` (1)
 - **swift**: `mode: taint (unsupported language: swift)` (1)
diff --git a/src/rules/c_taint.rs b/src/rules/c_taint.rs
index e1952d65..9e2fd423 100644
--- a/src/rules/c_taint.rs
+++ b/src/rules/c_taint.rs
@@ -527,7 +527,9 @@ fn collect_param_sources(
         if let Some(name) = param_name {
             for matcher in &spec.sources {
                 if let NodeMatcher::ParamName { names, description } = matcher {
-                    if names.iter().any(|n| n == &name) {
+                    if names.iter().any(|n| n == &name)
+                        || crate::rules::taint_engine::param_names_are_wildcard(names)
+                    {
                         out.push(TaintSource {
                             var_name: Some(name.clone()),
                             description: description.clone(),
diff --git a/src/rules/go_taint.rs b/src/rules/go_taint.rs
index ef3685dc..cbc994ce 100644
--- a/src/rules/go_taint.rs
+++ b/src/rules/go_taint.rs
@@ -542,7 +542,9 @@ fn seed_param_sources(params: Node<'_>, source: &str, spec: &TaintSpec, state: &
         let param_name = node_text(inner, source);
         for matcher in &spec.sources {
             if let NodeMatcher::ParamName { names, description } = matcher {
-                if names.iter().any(|n| n == param_name) {
+                if names.iter().any(|n| n == param_name)
+                    || crate::rules::taint_engine::param_names_are_wildcard(names)
+                {
                     let line = inner.start_position().row + 1;
                     state.taint(param_name.to_string(), description.clone(), line);
                     break;
diff --git a/src/rules/java_taint.rs b/src/rules/java_taint.rs
index 48ef01c3..a97f4e7d 100644
--- a/src/rules/java_taint.rs
+++ b/src/rules/java_taint.rs
@@ -253,11 +253,17 @@ fn collect_param_sources(
 ) {
     let mut annotation_names: Vec<&str> = Vec::new();
     let mut bare_names: Vec<&str> = Vec::new();
+    // The `ANY_PARAM_WILDCARD` sentinel (`"$"`) is the any-parameter wildcard
+    // compiled from a Semgrep
+    // `pattern-inside: function(...,$ARG,...) + focus-metavariable: $ARG`
+    // source block: seed *every* parameter of the enclosing scope.
+    let mut wildcard = false;
     for matcher in &spec.sources {
         if let NodeMatcher::ParamName { names, ..
} = matcher { for name in names { if let Some(rest) = name.strip_prefix('@') { annotation_names.push(rest); + } else if name == crate::rules::taint_engine::ANY_PARAM_WILDCARD { + wildcard = true; } else { bare_names.push(name.as_str()); } @@ -283,7 +289,7 @@ fn collect_param_sources( hops: 0, }, ); - } else if bare_names.contains(&name) { + } else if bare_names.contains(&name) || wildcard { state.taint( name.to_string(), TaintInfo { diff --git a/src/rules/javascript_taint.rs b/src/rules/javascript_taint.rs index 926c5e3f..dde28ebc 100644 --- a/src/rules/javascript_taint.rs +++ b/src/rules/javascript_taint.rs @@ -1125,7 +1125,9 @@ impl<'a> TaintLanguageAdapter> for JsTaintAdapter { let line = single.start_position().row + 1; for matcher in &ctx.spec.sources { if let NodeMatcher::ParamName { names, description } = matcher { - if names.iter().any(|n| n == name) { + if names.iter().any(|n| n == name) + || crate::rules::taint_engine::param_names_are_wildcard(names) + { state.taint(name.to_string(), description.clone(), line); break; } @@ -1202,7 +1204,9 @@ fn seed_param_sources(params: Node<'_>, source: &str, spec: &TaintSpec, state: & for matcher in &spec.sources { if let NodeMatcher::ParamName { names, description } = matcher { - if names.iter().any(|n| n == param_name) { + if names.iter().any(|n| n == param_name) + || crate::rules::taint_engine::param_names_are_wildcard(names) + { let line = child.start_position().row + 1; state.taint(param_name.to_string(), description.clone(), line); break; diff --git a/src/rules/kotlin_taint.rs b/src/rules/kotlin_taint.rs index 85aa4410..7d7df7aa 100644 --- a/src/rules/kotlin_taint.rs +++ b/src/rules/kotlin_taint.rs @@ -447,11 +447,17 @@ fn collect_param_sources( // Collect annotation strings and bare names from the spec once. 
     let mut annotation_names: Vec<&str> = Vec::new();
     let mut bare_names: Vec<&str> = Vec::new();
+    // The `ANY_PARAM_WILDCARD` sentinel (`"$"`) is the any-parameter wildcard:
+    // seed every parameter of the function (compiled from a Semgrep
+    // `pattern-inside: fun(...,$ARG,...) + focus-metavariable: $ARG` block).
+    let mut wildcard = false;
     for matcher in &spec.sources {
         if let NodeMatcher::ParamName { names, .. } = matcher {
             for name in names {
                 if let Some(rest) = name.strip_prefix('@') {
                     annotation_names.push(rest);
+                } else if name == crate::rules::taint_engine::ANY_PARAM_WILDCARD {
+                    wildcard = true;
                 } else {
                     bare_names.push(name.as_str());
                 }
@@ -494,7 +500,7 @@ fn collect_param_sources(
                     description: format!("@{} parameter '{}'", ann, name),
                     line: ch.start_position().row + 1,
                 });
-            } else if bare_names.contains(&name) {
+            } else if bare_names.contains(&name) || wildcard {
                 out.push(TaintSource {
                     var_name: Some(name.to_string()),
                     description: format!("parameter '{}'", name),
diff --git a/src/rules/php_taint.rs b/src/rules/php_taint.rs
index fce38fa2..daf37e9f 100644
--- a/src/rules/php_taint.rs
+++ b/src/rules/php_taint.rs
@@ -392,7 +392,9 @@ fn seed_param_sources(params: Node<'_>, source: &str, spec: &TaintSpec, state: &
         let name = node_text(v, source);
         for matcher in &spec.sources {
             if let NodeMatcher::ParamName { names, description } = matcher {
-                if names.iter().any(|n| n == name) {
+                if names.iter().any(|n| n == name)
+                    || crate::rules::taint_engine::param_names_are_wildcard(names)
+                {
                     let line = v.start_position().row + 1;
                     state.taint(name.to_string(), description.clone(), line);
                     break;
diff --git a/src/rules/python_taint.rs b/src/rules/python_taint.rs
index d1e17ea9..6c1084cf 100644
--- a/src/rules/python_taint.rs
+++ b/src/rules/python_taint.rs
@@ -465,7 +465,9 @@ fn seed_param_sources(params: Node<'_>, source: &str, spec: &TaintSpec, state: &
         for matcher in &spec.sources {
             if let NodeMatcher::ParamName { names, description } = matcher {
-                if names.iter().any(|n| n ==
param_name) { + if names.iter().any(|n| n == param_name) + || crate::rules::taint_engine::param_names_are_wildcard(names) + { let line = child.start_position().row + 1; state.taint(param_name.to_string(), description.clone(), line); break; diff --git a/src/rules/ruby_taint.rs b/src/rules/ruby_taint.rs index 8d816f4b..eac421f4 100644 --- a/src/rules/ruby_taint.rs +++ b/src/rules/ruby_taint.rs @@ -367,7 +367,9 @@ fn seed_param_sources(params: Node<'_>, source: &str, spec: &TaintSpec, state: & let param_name = node_text(child, source); for matcher in &spec.sources { if let NodeMatcher::ParamName { names, description } = matcher { - if names.iter().any(|n| n == param_name) { + if names.iter().any(|n| n == param_name) + || crate::rules::taint_engine::param_names_are_wildcard(names) + { let line = child.start_position().row + 1; state.taint(param_name.to_string(), description.clone(), line); break; diff --git a/src/rules/semgrep_taint.rs b/src/rules/semgrep_taint.rs index 63f46123..6d2e639a 100644 --- a/src/rules/semgrep_taint.rs +++ b/src/rules/semgrep_taint.rs @@ -1753,6 +1753,33 @@ fn compile_entry( } } Some("patterns") => { + // ── Parameter-as-source shape (focus-metavariable + a function- + // signature `pattern-inside`/`pattern`) ───────────────────── + // + // The dominant rejected taint-source shape is "a parameter of the + // enclosing handler/function is user-controlled", written as + // + // patterns: + // - pattern-inside: | + // function ... (..., $ARG, ...) { ... } + // - focus-metavariable: $ARG + // + // or the AWS-Lambda variant `pattern: $EVENT` + a + // `pattern-either:` of `pattern-inside` handler signatures binding + // `$EVENT` as a parameter. None of the generic node shapes express + // this, so the block compiles to nothing and the rule is rejected. 
+ // + // When the block is the parameter-as-source shape we compile it to + // a single any-parameter wildcard source + // (`ParamName { names: [ANY_PARAM_WILDCARD] }`); each engine seeds + // every function parameter as tainted (see + // `taint_engine::param_names_are_wildcard`). Only meaningful as a + // SOURCE — a parameter is a taint origin, not a destination. + if let MatcherRole::Source = role { + if try_compile_param_source_block(v, out) { + return; + } + } // `patterns:` is a Semgrep AND-block: all sub-items must hold // simultaneously. foxguard's taint engine cannot express all AND // semantics (no nested scope / contextual constraints), so we @@ -1882,6 +1909,171 @@ fn compile_patterns_block( } } +/// Try to recognise the "parameter-as-source" shape in a `patterns:` source +/// block and, if found, push a single any-parameter wildcard source matcher. +/// +/// Returns `true` (and pushes one matcher) when the block both: +/// 1. names a seed metavariable `$X` — either as a `focus-metavariable: $X` +/// sub-item, or as a bare `pattern: $X` sub-item; and +/// 2. contains a function-signature context (a `pattern:` or +/// `pattern-inside:` whose text declares a function/method whose +/// *parameter list* contains that same `$X`). +/// +/// Discipline: we require the seed metavariable to appear *inside a parameter +/// list* of a function-definition pattern in the SAME block, so we never seed +/// "all parameters" off an unrelated metavariable. The compiled source is the +/// [`ANY_PARAM_WILDCARD`] sentinel — engines seed every function parameter as +/// tainted, matching Semgrep's any-parameter semantics for this shape. +/// +/// Returns `false` (and pushes nothing) for any other block shape, leaving the +/// caller to fall through to the normal graceful-degradation extraction. 
+fn try_compile_param_source_block(v: &YamlValue, out: &mut Vec<GenericMatcher>) -> bool {
+    let Some(items) = v.as_sequence() else {
+        return false;
+    };
+
+    // Collect, across the whole block (recursing into `pattern-either` and
+    // nested `patterns`), the set of focus/bare-pattern seed metavariables and
+    // the set of function-signature pattern texts.
+    let mut seeds: Vec<String> = Vec::new();
+    let mut signature_texts: Vec<String> = Vec::new();
+    collect_param_source_parts(items, &mut seeds, &mut signature_texts);
+
+    if seeds.is_empty() || signature_texts.is_empty() {
+        return false;
+    }
+
+    // The seed metavariable must appear as a parameter of at least one
+    // function-signature context in the block.
+    let matched = seeds.iter().any(|seed| {
+        signature_texts
+            .iter()
+            .any(|sig| signature_has_param(sig, seed))
+    });
+    if !matched {
+        return false;
+    }
+
+    out.push(GenericMatcher::ParamName {
+        names: vec![crate::rules::taint_engine::ANY_PARAM_WILDCARD.to_string()],
+        description: "untrusted function parameter".to_string(),
+    });
+    true
+}
+
+/// Walk a `patterns:` block (and nested `pattern-either:` / `patterns:` lists)
+/// collecting seed metavariables (`focus-metavariable: $X` and bare
+/// `pattern: $X`) and the text of function-signature contexts (`pattern:` /
+/// `pattern-inside:` whose value is a function definition with a parameter list).
+fn collect_param_source_parts(
+    items: &[YamlValue],
+    seeds: &mut Vec<String>,
+    signature_texts: &mut Vec<String>,
+) {
+    for item in items {
+        let Some(map) = item.as_mapping() else {
+            continue;
+        };
+        for (k, val) in map {
+            match k.as_str() {
+                Some("focus-metavariable") => {
+                    if let Some(s) = val.as_str() {
+                        let mv = s.trim();
+                        if is_metavariable(mv) {
+                            seeds.push(mv.to_string());
+                        }
+                    }
+                }
+                Some("pattern") => {
+                    if let Some(s) = val.as_str() {
+                        let t = s.trim();
+                        if is_metavariable(t) {
+                            seeds.push(t.to_string());
+                        } else if is_function_definition_pattern(t) {
+                            signature_texts.push(t.to_string());
+                        }
+                    }
+                }
+                Some("pattern-inside") => {
+                    if let Some(s) = val.as_str() {
+                        if is_function_definition_pattern(s) {
+                            signature_texts.push(s.to_string());
+                        }
+                    }
+                }
+                Some("pattern-either") => {
+                    if let Some(seq) = val.as_sequence() {
+                        collect_param_source_parts(seq, seeds, signature_texts);
+                    }
+                }
+                Some("patterns") => {
+                    if let Some(seq) = val.as_sequence() {
+                        collect_param_source_parts(seq, seeds, signature_texts);
+                    }
+                }
+                _ => {}
+            }
+        }
+    }
+}
+
+/// True when `pat` looks like a function/method definition pattern: it declares
+/// a function with a parameter list. We accept the common cross-language
+/// keywords plus the assignment-to-function form (`exports.handler = function
+/// (...)`, `$F = function (...)`), requiring a `(` ... `)` parameter list.
+fn is_function_definition_pattern(pat: &str) -> bool {
+    let p = pat.trim();
+    if !(p.contains('(') && p.contains(')')) {
+        return false;
+    }
+    // A definition keyword anywhere in the (possibly multi-line) pattern.
+    p.contains("function") // JS/TS/PHP
+        || p.contains("func ") // Go
+        || p.starts_with("def ")
+        || p.contains("\ndef ") // Python/Ruby/Scala
+        || p.contains("fun ") // Kotlin
+        || p.contains("=>") // arrow / lambda
+        // A Java/C-style typed method signature: `$T $M(...) { ...
}` — a + // metavariable or identifier return type followed by a name and a + // parameter list and a brace body. + || (p.contains('{') && p.contains('$')) +} + +/// True when the function-signature pattern `sig` lists `seed` (a metavariable +/// like `$ARG`) inside its FIRST parameter list `( ... )`. This bounds the +/// any-parameter seed to a metavariable that is genuinely a parameter, so we +/// never seed off an unrelated metavariable elsewhere in the pattern. +fn signature_has_param(sig: &str, seed: &str) -> bool { + let Some(open) = sig.find('(') else { + return false; + }; + // Find the matching close paren for this first '(' (balanced). + let bytes = sig.as_bytes(); + let mut depth = 0i32; + let mut close = None; + for (i, &b) in bytes.iter().enumerate().skip(open) { + match b { + b'(' => depth += 1, + b')' => { + depth -= 1; + if depth == 0 { + close = Some(i); + break; + } + } + _ => {} + } + } + let Some(close) = close else { + return false; + }; + let params = &sig[open + 1..close]; + // Token-boundary match so `$ARG` does not match `$ARGUMENT`. + params + .split(|c: char| !is_ident_char(c) && c != '$') + .any(|tok| tok == seed) +} + /// Compile a Bash-specific pattern (shell command or command substitution) /// into a `Call` matcher keyed by the command name. /// @@ -6253,4 +6445,327 @@ object Ctrl { findings ); } + + // ── Parameter-as-source shape: focus-metavariable + function-signature + // pattern-inside inside a taint pattern-sources block ──────────────── + + /// JS `lang/detect-child-process` shape: a `patterns:` source block with a + /// `pattern-inside: function ...(...,$FUNC,...)` context plus + /// `focus-metavariable: $FUNC`. Compiles to the any-parameter wildcard + /// source. A function parameter flowing to `exec(...)` must fire. 
+ #[test] + fn js_param_source_focus_inside_fires() { + use crate::engine::parser::parse_file; + + let rule = compiled( + r#" +id: js-child-process-param +mode: taint +languages: [javascript] +severity: ERROR +message: "Function argument reaches child_process.exec" +pattern-sources: + - patterns: + - pattern-inside: | + function ... (...,$FUNC,...) { + ... + } + - focus-metavariable: $FUNC +pattern-sinks: + - pattern: exec($CMD,...) +"#, + ); + // The source block must compile to the any-parameter wildcard. + assert!( + matches!( + rule.spec.sources.as_slice(), + [GenericMatcher::ParamName { names, .. }] + if names == &[crate::rules::taint_engine::ANY_PARAM_WILDCARD.to_string()] + ), + "expected wildcard ParamName source, got {:?}", + rule.spec.sources + ); + + let src = r#" +function run(name, cmd) { + exec(cmd); +} +"#; + let tree = parse_file(src, Language::JavaScript).expect("js fixture should parse"); + let findings = rule.check(src, &tree); + assert_eq!( + findings.len(), + 1, + "a function parameter reaching exec() must fire, got {:?}", + findings + ); + } + + /// JS near-miss: the value reaching `exec(...)` is NOT a function parameter + /// (it is a module-level constant), so the wildcard-param source must not + /// taint it and the rule must not fire. + #[test] + fn js_param_source_non_param_does_not_fire() { + use crate::engine::parser::parse_file; + + let rule = compiled( + r#" +id: js-child-process-param-safe +mode: taint +languages: [javascript] +severity: ERROR +message: "Function argument reaches child_process.exec" +pattern-sources: + - patterns: + - pattern-inside: | + function ... (...,$FUNC,...) { + ... + } + - focus-metavariable: $FUNC +pattern-sinks: + - pattern: exec($CMD,...) +"#, + ); + + // A literal local, not a parameter, reaching exec(). No parameter flows + // anywhere, so the any-parameter seed taints nothing relevant. 
+ let src = r#" +function run() { + const cmd = "ls -la"; + exec(cmd); +} +"#; + let tree = parse_file(src, Language::JavaScript).expect("js fixture should parse"); + let findings = rule.check(src, &tree); + assert!( + findings.is_empty(), + "a non-parameter constant must not fire, got {:?}", + findings + ); + } + + /// Python AWS-Lambda shape: `pattern: $EVENT` plus a `pattern-either:` of + /// `pattern-inside:` handler signatures binding `$EVENT` as a parameter. + /// A handler parameter flowing to `subprocess.call(...)` must fire. + #[test] + fn python_param_source_bare_pattern_inside_fires() { + use crate::engine::parser::parse_file; + + let rule = compiled( + r#" +id: py-lambda-param +mode: taint +languages: [python] +severity: ERROR +message: "Lambda event reaches subprocess" +pattern-sources: + - patterns: + - pattern: $EVENT + - pattern-inside: | + def $HANDLER($EVENT, $CONTEXT): + ... +pattern-sinks: + - pattern: subprocess.call($X) +"#, + ); + assert!( + matches!( + rule.spec.sources.as_slice(), + [GenericMatcher::ParamName { names, .. }] + if names == &[crate::rules::taint_engine::ANY_PARAM_WILDCARD.to_string()] + ), + "expected wildcard ParamName source, got {:?}", + rule.spec.sources + ); + + let src = r#" +def handler(event, context): + cmd = event + subprocess.call(cmd) +"#; + let tree = parse_file(src, Language::Python).expect("python fixture should parse"); + let findings = rule.check(src, &tree); + assert_eq!( + findings.len(), + 1, + "handler param reaching subprocess.call must fire, got {:?}", + findings + ); + } + + /// Python near-miss: same rule, but the value reaching the sink is a + /// hardcoded constant unrelated to any parameter — must not fire. 
+ #[test] + fn python_param_source_constant_does_not_fire() { + use crate::engine::parser::parse_file; + + let rule = compiled( + r#" +id: py-lambda-param-safe +mode: taint +languages: [python] +severity: ERROR +message: "Lambda event reaches subprocess" +pattern-sources: + - patterns: + - pattern: $EVENT + - pattern-inside: | + def $HANDLER($EVENT, $CONTEXT): + ... +pattern-sinks: + - pattern: subprocess.call($X) +"#, + ); + + let src = r#" +def handler(event, context): + cmd = "echo hello" + subprocess.call(cmd) +"#; + let tree = parse_file(src, Language::Python).expect("python fixture should parse"); + let findings = rule.check(src, &tree); + assert!( + findings.is_empty(), + "a hardcoded constant must not fire, got {:?}", + findings + ); + } + + /// A source `patterns:` block that names a focus metavariable which is NOT + /// a parameter of any function-signature context must NOT compile to the + /// any-parameter wildcard (guards against over-broad seeding). + #[test] + fn non_param_focus_block_is_not_treated_as_param_source() { + let v: YamlValue = serde_yaml_ng::from_str( + r#" +id: not-a-param-source +mode: taint +languages: [python] +severity: ERROR +message: m +pattern-sources: + - patterns: + - pattern: get_input($X) + - focus-metavariable: $X +pattern-sinks: + - pattern: eval($Y) +"#, + ) + .unwrap(); + match parse_taint_rule(&v) { + TaintRuleParse::Compiled(r) => { + // The `get_input($X)` pattern is a Call source (expressible), so + // the block compiles via graceful degradation — NOT to the + // any-parameter wildcard. + assert!( + !r.spec.sources.iter().any(|m| matches!( + m, + GenericMatcher::ParamName { names, .. 
} + if names.contains(&crate::rules::taint_engine::ANY_PARAM_WILDCARD.to_string()) + )), + "a focus on a call metavar must not become an any-parameter source: {:?}", + r.spec.sources + ); + } + other => panic!( + "expected compiled rule, got skip/nottaint: {:?}", + matches!(other, TaintRuleParse::Skip(_)) + ), + } + } + + /// Java AWS-Lambda shape: `focus-metavariable: $EVENT` + a typed + /// handler-signature `pattern`. A handler parameter flowing to a SQL string + /// concat sink must fire; the wildcard seeds the typed parameter. + #[test] + fn java_param_source_focus_typed_signature_fires() { + use crate::engine::parser::parse_file; + + let rule = compiled( + r#" +id: java-lambda-param +mode: taint +languages: [java] +severity: ERROR +message: "Handler param reaches SQL string" +pattern-sources: + - patterns: + - focus-metavariable: $EVENT + - pattern: | + $RT $HANDLER($TYPE $EVENT, Context $CTX) { + ... + } +pattern-sinks: + - pattern: stmt.executeQuery($Q) +"#, + ); + assert!( + matches!( + rule.spec.sources.as_slice(), + [GenericMatcher::ParamName { names, .. }] + if names == &[crate::rules::taint_engine::ANY_PARAM_WILDCARD.to_string()] + ), + "expected wildcard ParamName source, got {:?}", + rule.spec.sources + ); + + let src = r#" +class H { + String handle(String event, Context ctx) { + String q = event; + return stmt.executeQuery(q); + } +} +"#; + let tree = parse_file(src, Language::Java).expect("java fixture should parse"); + let findings = rule.check(src, &tree); + assert_eq!( + findings.len(), + 1, + "handler param reaching executeQuery must fire, got {:?}", + findings + ); + } + + /// Java near-miss: a hardcoded literal (not a parameter) reaching the sink + /// must not fire even though the wildcard seeds parameters. 
+ #[test] + fn java_param_source_literal_does_not_fire() { + use crate::engine::parser::parse_file; + + let rule = compiled( + r#" +id: java-lambda-param-safe +mode: taint +languages: [java] +severity: ERROR +message: "Handler param reaches SQL string" +pattern-sources: + - patterns: + - focus-metavariable: $EVENT + - pattern: | + $RT $HANDLER($TYPE $EVENT, Context $CTX) { + ... + } +pattern-sinks: + - pattern: stmt.executeQuery($Q) +"#, + ); + + let src = r#" +class H { + String handle(String event, Context ctx) { + String q = "SELECT 1"; + return stmt.executeQuery(q); + } +} +"#; + let tree = parse_file(src, Language::Java).expect("java fixture should parse"); + let findings = rule.check(src, &tree); + assert!( + findings.is_empty(), + "a hardcoded SQL literal must not fire, got {:?}", + findings + ); + } } diff --git a/src/rules/taint_engine.rs b/src/rules/taint_engine.rs index 31037023..1cf5144c 100644 --- a/src/rules/taint_engine.rs +++ b/src/rules/taint_engine.rs @@ -621,6 +621,40 @@ pub(super) fn node_text<'a>(node: Node<'_>, source: &'a str) -> &'a str { &source[node.byte_range()] } +/// Sentinel name used by the bridge to compile an "any function parameter" +/// taint source. Chosen to be a string no real identifier (including a PHP +/// `$`-variable like `$_GET`) can equal, so the use-site matchers never fire +/// on it — only `seed_params` interprets it (via [`param_names_are_wildcard`]). +pub const ANY_PARAM_WILDCARD: &str = "$"; + +/// True when a `ParamName` matcher's name list designates the +/// "any function parameter" wildcard — i.e. it contains the +/// [`ANY_PARAM_WILDCARD`] sentinel. +/// +/// This is the seed-time semantics for the Semgrep taint source shape +/// +/// ```yaml +/// pattern-sources: +/// - patterns: +/// - pattern-inside: | +/// function ... (..., $ARG, ...) { ... } +/// - focus-metavariable: $ARG +/// ``` +/// +/// which means "every parameter of the enclosing function is a taint source". 
+/// The bridge ([`semgrep_taint`]) compiles such a block to
+/// `ParamName { names: ["$"], .. }` (the [`ANY_PARAM_WILDCARD`] sentinel);
+/// each engine's `seed_params` calls this helper so the sentinel seeds *all*
+/// parameters of the function being analyzed, rather than only a
+/// literally-named one.
+///
+/// Discipline: the wildcard fires ONLY at parameter-seeding time. Use-site
+/// matchers (`match_source`) compare against the literal name `$`, which
+/// no real identifier equals, so the wildcard never broadens an
+/// expression-position match; only function parameters become sources.
+pub(super) fn param_names_are_wildcard(names: &[String]) -> bool {
+    names.iter().any(|n| n == ANY_PARAM_WILDCARD)
+}
+
 pub(super) fn build_batched_taint_groups(rules: &[BatchedRule<'_>]) -> Vec {
     let mut groups: Vec> = Vec::new();
     for (i, r) in rules.iter().enumerate() {
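
Review note: the seeding rule this patch adds to each language engine can be modelled in isolation. The following is a minimal, self-contained sketch under stated assumptions, not the engine code. `seed_params` and its flat `&[&str]` parameter list are hypothetical stand-ins for the per-language `seed_param_sources` / `collect_param_sources` functions; only the `"$"` sentinel and the equality check are taken from the diff.

```rust
use std::collections::HashSet;

/// Sentinel mirroring `ANY_PARAM_WILDCARD` from the diff.
const ANY_PARAM_WILDCARD: &str = "$";

/// Same check as `param_names_are_wildcard` in the diff: exact equality
/// with the sentinel, not a `$`-prefix test.
fn param_names_are_wildcard(names: &[String]) -> bool {
    names.iter().any(|n| n == ANY_PARAM_WILDCARD)
}

/// Hypothetical, simplified stand-in for an engine's parameter-seeding step.
/// Returns the parameter names that start out tainted for one function.
fn seed_params(params: &[&str], source_names: &[String]) -> HashSet<String> {
    let wildcard = param_names_are_wildcard(source_names);
    params
        .iter()
        .copied()
        // A parameter is seeded if the wildcard is present or if it is
        // named literally in the source matcher.
        .filter(|p| wildcard || source_names.iter().any(|n| n.as_str() == *p))
        .map(|p| p.to_string())
        .collect()
}
```

Under these assumptions, `seed_params(&["name", "cmd"], &["$".to_string()])` seeds both parameters, while a literal name list such as `["event"]` seeds only `event`. Because the check is exact equality with `"$"`, a name like `$PARAM` is treated as an ordinary literal rather than as the wildcard.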