[Idea]: v1.2.1 Roadmap: Improving detection from 37% to 80%+ — help wanted #29
nagasatish007
started this conversation in
Ideas
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
Your Idea
Context
We published honest red team benchmark results for v1.2.0 (BENCHMARKS.md):
The high precision (85.7%) means we rarely block legitimate inputs. The low recall means we're missing attacks — specifically because our regex pattern library doesn't cover enough attack variants.
What we're adding in v1.2.1
We analyzed every missed probe and identified 8 new pattern categories:
Design constraint
All patterns use conjunction matching — they require two signals to co-occur (e.g., identity assignment AND unrestricted behavior language). This prevents false positives on legitimate messages like "please ignore my previous email" or "can you explain how prompt injection works?"
How you can help
npx ts-node benchmarks/run-real-benchmark.tsand share results.Timeline
v1.2.1 targeting release within 2 weeks. Pattern PRs welcome — each merged pattern gets you listed as a contributor.
Category
New Feature
What problem does this solve?
TealTiger v1.2.0 catches only 37% of adversarial probes (Garak benchmark) and has 40% recall on PINT. Analysis of the 44 missed probes reveals entire attack categories with zero coverage: persona jailbreaks, emotional manipulation, authority impersonation, data extraction requests, and several encoding schemes. This means agents protected by TealTiger are still vulnerable to well-known attack patterns.
Potential Impact
Anyone using TealTiger to protect AI agents in production. After this improvement:
Contributors can help by submitting attack patterns that bypass current detection, or by proposing regex patterns for specific attack classes.
Implementation Thoughts
Adding 8 new pattern categories to PromptInjectionGuardrail using conjunction matching (require 2+ signals to co-occur, preventing false positives):
All regex-based, deterministic, under 5ms evaluation. No ML inference added.
Examples or References
Beta Was this translation helpful? Give feedback.
All reactions