
feat(grammars): apex/html/xml/dart/clojure search grammars (→ 93.4%) #531

Merged: peaktwilight merged 1 commit into main from feat/grammars-apex-html-xml-dart-clojure on Jun 18, 2026
@peaktwilight peaktwilight commented Jun 18, 2026

Collaborator

New tree-sitter grammars: apex, html, xml, dart, clojure → load rate 92.4% → 93.4%

Adds five search-mode grammars via the established 6-anchor pattern (Language enum + Display, parser arm, semgrep_compat map_language, scanner detect_language, registry_coverage, Cargo.toml, generic_mode/gen_rules_ts exhaustive matches). All on tree-sitter 0.25 — no downgrades or shims.
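The anchor pattern relies on Rust's exhaustive `match`: once a variant is added to the enum, any `match` without a wildcard arm stops compiling until it handles the new variant. A minimal sketch of the first anchor, assuming a simplified `Language` enum that lists only the five new variants (the real enum in `src/lib.rs` has more, and its exact derives are not shown in this PR):

```rust
use std::fmt;

/// Simplified stand-in for the crate's `Language` enum (assumption:
/// only the five variants added in this PR are listed here).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Apex,
    Clojure,
    Html,
    Xml,
    Dart,
}

impl fmt::Display for Language {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // No wildcard arm: a new variant without a Display arm is a
        // compile error, which is what keeps the anchors in sync.
        let name = match self {
            Language::Apex => "apex",
            Language::Clojure => "clojure",
            Language::Html => "html",
            Language::Xml => "xml",
            Language::Dart => "dart",
        };
        f.write_str(name)
    }
}
```

With this in place, `Language::Apex.to_string()` yields the lowercase name `apex`, matching the lowercase `Display` arms described in the review walkthrough.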

| Lang | Crate | Note |
| --- | --- | --- |
| html | tree-sitter-html@0.23 | LANGUAGE const |
| xml | tree-sitter-xml@0.7 | LANGUAGE_XML const |
| dart | tree-sitter-dart@0.2 | LANGUAGE const |
| clojure | tree-sitter-clojure-orchard@0.2.5 | tree-sitter-language 0.1 |
| apex | tree-sitter-sfapex@3.0.0 | the legacy tree-sitter-apex@1.0.0 was rejected (links tree-sitter ~0.20, ABI-incompatible); sfapex uses the modern apex::LANGUAGE const |

Results (independently re-measured)

| | before | after |
| --- | --- | --- |
| Load rate | 92.4% (1982) | 93.4% (2002) |

+20 rules. The search-mode unsupported language buckets for all five → 0. Only mode: taint (unsupported language: apex) ×3 remains (no apex taint engine — out of scope).

Verification (re-run on the branch)

  • registry_coverage → 93.4% ✓
  • both dogfood scans exit 0 · cargo test 842+ lib, 0 failed · clippy -D warnings clean · fmt --check clean · baseline untouched ✓
  • Per-grammar tests: each parses a real fixture error-free + a pattern:/pattern-regex: search rule LOADS and matches via parse_semgrep_str.
  • One small engine touch: taught the AST pattern unwrapper about clojure-orchard's source top node so Clojure pattern: rules descend correctly (distinctive node name, full suite green).
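The engine touch in the last bullet can be pictured as follows. This is a sketch under assumptions, not the project's unwrapper: it uses a toy `Node` type instead of tree-sitter nodes, and `unwrap_pattern_root`, `WRAPPER_KINDS`, and the `program` entry are hypothetical names. Only the `source` root kind comes from the PR text.

```rust
/// Toy AST node; the real engine operates on tree-sitter nodes instead.
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub kind: String,
    pub children: Vec<Node>,
}

impl Node {
    pub fn new(kind: &str, children: Vec<Node>) -> Self {
        Node { kind: kind.to_string(), children }
    }
}

/// Root kinds that only wrap a parsed pattern. `source` is the
/// clojure-orchard root named in the PR; `program` is illustrative.
const WRAPPER_KINDS: &[&str] = &["program", "source"];

/// Descend through single-child wrapper nodes and return the first
/// node that carries the pattern's own structure.
pub fn unwrap_pattern_root(mut node: &Node) -> &Node {
    while WRAPPER_KINDS.contains(&node.kind.as_str()) && node.children.len() == 1 {
        node = &node.children[0];
    }
    node
}
```

If the root kind were missing from the wrapper list, the matcher would compare target code against the wrapper node rather than the form inside it, which would explain why Clojure `pattern:` rules needed this change to descend correctly.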

Summary by CodeRabbit

  • New Features

    • Added support for Apex, Clojure, HTML, XML, and Dart language analysis.
    • Enhanced language detection to recognize additional file types.
  • Improvements

    • Increased rule registry coverage to 93.4% (from 92.4%), with 2,002 rules successfully loaded.

Wire five new search-mode tree-sitter grammars into the Semgrep-compat
loader so registry rules previously skipped as `unsupported language`
now load:

- apex   (tree-sitter-sfapex 3.0.0, apex::LANGUAGE)
- clojure (tree-sitter-clojure-orchard 0.2.5)
- html   (tree-sitter-html 0.23)
- xml    (tree-sitter-xml 0.7, LANGUAGE_XML)
- dart   (tree-sitter-dart 0.2)

All six anchors wired per language (Language enum + Display, parser
match arm, map_language, detect_language + comment markers,
registry_coverage language_supported, generic/regex ALL_LANGUAGES
fan-out, gen_rules_ts exhaustive matches, Cargo dep).

Also teach the AST pattern unwrapper about clojure-orchard's `source`
top-level node so `pattern:` rules match Clojure forms.

Registry load rate: 92.4% (1982) -> 93.4% (2002). search-mode
apex/clojure/html/xml/dart unsupported-language buckets all hit 0;
remaining apex skips (3) are taint-mode only.

Per-grammar tests: parser fixture parse + a pattern/pattern-regex
search rule that loads and matches.

coderabbitai Bot commented Jun 18, 2026

Walkthrough

Adds end-to-end support for five new languages—Apex, Clojure, HTML, XML, and Dart—by introducing the corresponding tree-sitter crate dependencies, Language enum variants, parser bindings, file-extension detection, Semgrep-compat rule routing, generic/regex-mode fan-out, registry-coverage classification, TypeScript codegen metadata, and updated parity documentation.

Changes

Language Expansion: Apex, Clojure, HTML, XML, Dart

| Layer | File(s) | Summary |
| --- | --- | --- |
| Language enum variants and crate dependencies | Cargo.toml, src/lib.rs | Adds tree-sitter-html, tree-sitter-xml, tree-sitter-dart, tree-sitter-clojure-orchard, and tree-sitter-sfapex as dependencies, and adds Apex, Clojure, Html, Xml, Dart variants to the Language enum with lowercase Display arms. |
| Parser bindings and scanner extension/comment detection | src/engine/parser.rs, src/engine/scanner.rs | Wires new variants into parse_source_for_path via tree-sitter language objects; maps cls/trigger → Apex and dart → Dart in detect_language; registers // and /* comment markers for Apex and Dart; adds parser round-trip unit tests for all five languages. |
| Semgrep-compat language mapping and generic/regex-mode fan-out | src/rules/semgrep_compat.rs, src/rules/generic_mode.rs | Extends map_language to accept Semgrep language strings for all five languages (including Clojure aliases and HTML aliases); adds all five to REGEX_MODE_ALL_LANGUAGES and to ALL_LANGUAGES for generic-mode fan-out; adds per-language integration tests covering pattern and regex rule matching. |
| Registry-coverage classifier and TypeScript codegen | src/bin/registry_coverage.rs, src/bin/gen_rules_ts.rs | Extends language_supported to classify the five languages as supported; updates Apex/COBOL classifier tests; adds slug, display_name, and array_name mappings for all five languages in the TypeScript rule-group codegen. |
| Parity documentation update | docs/parity/registry-coverage.md | Regenerates registry-coverage stats reflecting the 93.4% load rate, updated skip-reason histogram, shortened missing-grammar gap list, and revised per-language rows for apex, clojure, html, xml, and dart. |
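The scanner and Semgrep-compat rows describe two lookups: file extension to language, and Semgrep `languages:` string to language. A rough sketch of both, assuming simplified signatures (the real `detect_language` and `map_language` are not shown in this thread). The `cls`, `trigger`, and `dart` extensions come from the walkthrough; `clj`, `html`, and `xml` are inferred from the test fixture names, and the aliases the walkthrough mentions are left out because they are not listed.

```rust
use std::path::Path;

/// Simplified stand-in for the crate's `Language` enum (five new variants only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Apex,
    Clojure,
    Html,
    Xml,
    Dart,
}

/// Extension-based detection (assumed signature).
pub fn detect_language(path: &Path) -> Option<Language> {
    let ext = path.extension()?.to_str()?;
    match ext {
        "cls" | "trigger" => Some(Language::Apex),
        "dart" => Some(Language::Dart),
        "clj" => Some(Language::Clojure),
        "html" => Some(Language::Html),
        "xml" => Some(Language::Xml),
        _ => None, // unknown extension: file is not scanned by these grammars
    }
}

/// Map a Semgrep `languages:` entry to a variant (assumed signature;
/// aliases omitted because the PR does not enumerate them).
pub fn map_language(name: &str) -> Option<Language> {
    match name {
        "apex" => Some(Language::Apex),
        "clojure" => Some(Language::Clojure),
        "html" => Some(Language::Html),
        "xml" => Some(Language::Xml),
        "dart" => Some(Language::Dart),
        _ => None, // the loader would report this as an unsupported language
    }
}
```

Returning `None` from `map_language` corresponds to the `unsupported language` skip reason that these grammars remove for search-mode rules.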

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • 0sec-labs/foxguard#509: Extends the same language-support plumbing (Language enum, parse_source_for_path, detect_language, semgrep_compat, generic_mode, registry_coverage) for Solidity using the identical pattern.
  • 0sec-labs/foxguard#510: Applies the same full-stack language integration pattern to add YAML support across the same set of files.
  • 0sec-labs/foxguard#512: Extends the same tree-sitter grammar pipeline and registry_coverage supported-language predicate for a different set of languages using the same structural approach.

Poem

🐇 Five new tongues have joined the warren today,
Apex, Clojure, HTML, XML, Dart in the fray!
The tree-sitters parse every leaf and node,
Semgrep rules now walk a wider road.
93.4%—the coverage climbs up high,
Hop hop hooray as more languages fly! 🌿

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding five new search-mode tree-sitter grammars (apex, html, xml, dart, clojure) and the resulting improvement to registry load rate (93.4%). |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/parity/registry-coverage.md`:
- Around line 49-53: The section heading in the registry-coverage.md file is
labeled as "missing language grammars" but the table entries (generic mode,
mode: taint with unsupported languages) represent mode and language support gaps
rather than strictly grammar gaps. Retitle or reword the section heading to
accurately reflect that it covers mode and language support gaps, not just
grammar gaps, so the content aligns with the heading and prevents
misrepresentation during prioritization.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 5105f49a-66e7-46d2-a3f3-a01eeb961772

📥 Commits

Reviewing files that changed from the base of the PR and between 38ed1f2 and df74e19.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • Cargo.toml
  • docs/parity/registry-coverage.md
  • src/bin/gen_rules_ts.rs
  • src/bin/registry_coverage.rs
  • src/engine/parser.rs
  • src/engine/scanner.rs
  • src/lib.rs
  • src/rules/generic_mode.rs
  • src/rules/semgrep_compat.rs

Comment on lines +49 to +53
| 1 | `generic mode (languages: [generic])` | 7 |
| 2 | `mode: taint (unsupported language: apex)` | 3 |
| 3 | `mode: taint (unsupported language: swift)` | 1 |

-Missing-grammar gaps account for **31 rules** (1.4% of all rules).
+Missing-grammar gaps account for **11 rules** (0.5% of all rules).


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Section labeling is now misleading for these entries.

This table is under “missing language grammars,” but the listed reasons are generic-mode and taint-mode support gaps, not strictly grammar gaps. Please retitle/reword this section so prioritization isn’t skewed.

Suggested doc tweak
-## Priority order — missing language grammars
+## Priority order — remaining language-capability gaps

-Missing-grammar gaps account for **11 rules** (0.5% of all rules).
+These language-capability gaps account for **11 rules** (0.5% of all rules).

@peaktwilight peaktwilight merged commit db5f71e into main Jun 18, 2026
19 checks passed
@peaktwilight peaktwilight deleted the feat/grammars-apex-html-xml-dart-clojure branch June 18, 2026 16:54
Comment thread src/engine/parser.rs
fn parses_apex_without_errors() {
let source =
"public class Hello {\n public void greet() {\n String x = 'hi';\n }\n}\n";
let tree = parse_path(source, Language::Apex, Path::new("Hello.cls"))


foxguard · MEDIUM · rs/no-unwrap-in-lib (CWE-248)

.expect() can panic at runtime — use proper error handling with ? or match

Comment thread src/engine/parser.rs
#[test]
fn parses_clojure_without_errors() {
let source = "(defn greet [name]\n (println \"hello\" name))\n";
let tree = parse_path(source, Language::Clojure, Path::new("core.clj"))


foxguard · MEDIUM · rs/no-unwrap-in-lib (CWE-248)

.expect() can panic at runtime — use proper error handling with ? or match

Comment thread src/engine/parser.rs
#[test]
fn parses_html_without_errors() {
let source = "<html>\n <body>\n <a href=\"/x\">link</a>\n </body>\n</html>\n";
let tree = parse_path(source, Language::Html, Path::new("index.html"))


foxguard · MEDIUM · rs/no-unwrap-in-lib (CWE-248)

.expect() can panic at runtime — use proper error handling with ? or match

Comment thread src/engine/parser.rs
#[test]
fn parses_xml_without_errors() {
let source = "<?xml version=\"1.0\"?>\n<root>\n <child id=\"1\">text</child>\n</root>\n";
let tree = parse_path(source, Language::Xml, Path::new("data.xml"))


foxguard · MEDIUM · rs/no-unwrap-in-lib (CWE-248)

.expect() can panic at runtime — use proper error handling with ? or match

Comment thread src/engine/parser.rs
#[test]
fn parses_dart_without_errors() {
let source = "void main() {\n print('hello');\n}\n";
let tree = parse_path(source, Language::Dart, Path::new("main.dart"))


foxguard · MEDIUM · rs/no-unwrap-in-lib (CWE-248)

.expect() can panic at runtime — use proper error handling with ? or match

severity: WARNING
languages: [apex]
"#;
let rules = parse_semgrep_str(yaml, "apex.yml")


foxguard · MEDIUM · rs/no-unwrap-in-lib (CWE-248)

.expect() can panic at runtime — use proper error handling with ? or match

assert_eq!(rules.len(), 1, "expected one rule for apex language");

let source = "public class A {\n void f() {\n System.debug('x');\n }\n}\n";
let tree = parse_path(source, Language::Apex, Path::new("A.cls")).expect("Apex must parse");


foxguard · MEDIUM · rs/no-unwrap-in-lib (CWE-248)

.expect() can panic at runtime — use proper error handling with ? or match

severity: WARNING
languages: [clojure]
"#;
let rules = parse_semgrep_str(yaml, "clojure.yml")


foxguard · MEDIUM · rs/no-unwrap-in-lib (CWE-248)

.expect() can panic at runtime — use proper error handling with ? or match

assert_eq!(rules.len(), 1, "expected one rule for clojure language");

let source = "(defn f [x]\n (eval x))\n";
let tree = parse_path(source, Language::Clojure, Path::new("core.clj"))


foxguard · MEDIUM · rs/no-unwrap-in-lib (CWE-248)

.expect() can panic at runtime — use proper error handling with ? or match

severity: WARNING
languages: [html]
"#;
let rules = parse_semgrep_str(yaml, "html.yml").expect("html language rule must load");


foxguard · MEDIUM · rs/no-unwrap-in-lib (CWE-248)

.expect() can panic at runtime — use proper error handling with ? or match

let source =
"<html>\n <body>\n <button onclick=\"go()\">x</button>\n </body>\n</html>\n";
let tree =
parse_path(source, Language::Html, Path::new("index.html")).expect("HTML must parse");


foxguard · MEDIUM · rs/no-unwrap-in-lib (CWE-248)

.expect() can panic at runtime — use proper error handling with ? or match

severity: WARNING
languages: [xml]
"#;
let rules = parse_semgrep_str(yaml, "xml.yml").expect("xml language rule must load");


foxguard · MEDIUM · rs/no-unwrap-in-lib (CWE-248)

.expect() can panic at runtime — use proper error handling with ? or match

let source =
"<?xml version=\"1.0\"?>\n<!DOCTYPE root>\n<root>\n <child>text</child>\n</root>\n";
let tree =
parse_path(source, Language::Xml, Path::new("data.xml")).expect("XML must parse");


foxguard · MEDIUM · rs/no-unwrap-in-lib (CWE-248)

.expect() can panic at runtime — use proper error handling with ? or match

severity: WARNING
languages: [dart]
"#;
let rules = parse_semgrep_str(yaml, "dart.yml").expect("dart language rule must load");


foxguard · MEDIUM · rs/no-unwrap-in-lib (CWE-248)

.expect() can panic at runtime — use proper error handling with ? or match


let source = "void main() {\n print('hello');\n}\n";
let tree =
parse_path(source, Language::Dart, Path::new("main.dart")).expect("Dart must parse");


foxguard · MEDIUM · rs/no-unwrap-in-lib (CWE-248)

.expect() can panic at runtime — use proper error handling with ? or match
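The repeated `rs/no-unwrap-in-lib` finding above flags `.expect()` calls inside test functions. Rust test functions may return `Result`, so one way to address the finding is to propagate errors with `?`. A sketch under assumptions: `parse_fixture` and `ParseError` are hypothetical stand-ins for the crate's `parse_path` and its error type, neither of which is fully shown in this thread, and the parenthesis counter is a toy rather than a real parser.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error type for the toy parser below.
#[derive(Debug)]
pub struct ParseError(String);

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "parse error: {}", self.0)
    }
}

impl Error for ParseError {}

/// Toy stand-in for `parse_path`: counts balanced top-level forms.
/// It ignores parentheses inside strings, which a real grammar would handle.
pub fn parse_fixture(source: &str) -> Result<usize, ParseError> {
    let mut depth: usize = 0;
    let mut forms = 0;
    for ch in source.chars() {
        match ch {
            '(' => depth += 1,
            ')' => {
                depth = depth
                    .checked_sub(1)
                    .ok_or_else(|| ParseError("unexpected ')'".into()))?;
                if depth == 0 {
                    forms += 1;
                }
            }
            _ => {}
        }
    }
    if depth == 0 {
        Ok(forms)
    } else {
        Err(ParseError("unclosed '('".into()))
    }
}

/// Test-shaped function: a parse failure is returned through `?`
/// rather than triggering a panic via `.expect()`.
pub fn parses_clojure_without_errors() -> Result<(), Box<dyn Error>> {
    let source = "(defn greet [name]\n  (println \"hello\" name))\n";
    let forms = parse_fixture(source)?;
    assert_eq!(forms, 1);
    Ok(())
}
```

In a real test module the last function would carry `#[test]`; the harness reports a returned `Err` as a failure, so the test still fails on a bad parse without an explicit panic.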
