docs: audit all product surfaces and reposition for market #33
Changes from all commits
bf54a5e
9943f86
518285f
d61bb6b
6eeb02e
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -10,7 +10,11 @@ | |||||||||||||||||||||
| [](https://www.python.org/downloads/) | ||||||||||||||||||||||
| [](https://github.com/astral-sh/ruff) | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| Deterministic input sanitization for untrusted text. Zero dependencies. Legitimate Unicode preserved by design. | ||||||||||||||||||||||
| Deterministic input sanitization for untrusted text — invisible characters, homoglyphs, and encoding tricks, handled before your code sees them. Zero dependencies, no ML. Legitimate Unicode preserved by design. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ``` | ||||||||||||||||||||||
| pip install navi-sanitize | ||||||||||||||||||||||
| ``` | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| **[Documentation](https://project-navi.github.io/navi-sanitize/)** · [Getting Started](https://project-navi.github.io/navi-sanitize/getting-started/quickstart/) · [API Reference](https://project-navi.github.io/navi-sanitize/reference/api/) · [Threat Model](https://project-navi.github.io/navi-sanitize/explanation/threat-model/) | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
|
|
@@ -34,19 +38,21 @@ Opt-in utilities for deeper analysis: `decode_evasion()` peels nested URL/HTML/h | |||||||||||||||||||||
|
|
||||||||||||||||||||||
| ## Why This Matters | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| Untrusted text contains invisible attacks: homoglyph substitution, zero-width characters, null bytes, fullwidth encoding, template/prompt injection delimiters. These bypass validation, poison templates, and fool humans. | ||||||||||||||||||||||
| Untrusted text contains invisible attacks: homoglyph substitution, zero-width characters, null bytes, fullwidth encoding, template/prompt injection delimiters. These bypass validation, poison templates, and fool humans. Framework validators handle format and type — they don't handle Unicode deception. That's what this library is for. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| navi-sanitize fixes the text before it reaches your application. It doesn't detect attacks — it removes them. Implements the NFKC + zero-width + control character pipeline recommended by the [OWASP LLM Prompt Injection Prevention Cheat Sheet](https://cheatsheetseries.owasp.org/cheatsheets/LLM_Prompt_Injection_Prevention_Cheat_Sheet.html). | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| navi-sanitize fixes the text before it reaches your application. It doesn't detect attacks — it removes them. | ||||||||||||||||||||||
| **LLM prompt pipelines** — Character-level attacks bypass LLM guardrails at [64–67% success rates](https://arxiv.org/html/2504.11168v1). Invisible Unicode encodes instructions tokenizers read but humans can't see. Homoglyphs bypass keyword filters. Sanitize before the model sees it. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| **LLM prompt pipelines** — User input flows into system prompts, RAG context, and tool calls. Invisible Unicode (tag block characters, bidi overrides) encodes instructions that tokenizers read but humans can't see. Homoglyphs bypass keyword filters. navi-sanitize strips these vectors before text reaches the model, and the pluggable escaper lets you add vendor-specific prompt escaping on top. | ||||||||||||||||||||||
| **Web applications** — A single `clean(user_input, escaper=jinja2_escaper)` call handles homoglyph-disguised SSTI payloads like `{{ cоnfig }}` (Cyrillic `о`) that naive escaping misses. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| **Web applications** — Jinja2 SSTI, path traversal, and fullwidth encoding bypasses are well-known but tedious to cover manually. A single `clean(user_input, escaper=jinja2_escaper)` call handles homoglyph-disguised payloads like `{{ cоnfig }}` (Cyrillic `о`) that naive escaping misses. | ||||||||||||||||||||||
| **Identity and anti-phishing** — `pаypal.com` (Cyrillic `а`) renders identically to `paypal.com`. The only maintained Python homoglyph replacement library — both [confusable_homoglyphs](https://github.com/vhf/confusable_homoglyphs) and [homoglyphs](https://github.com/life4/homoglyphs) are archived. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| **Identity and anti-phishing** — `pаypal.com` (Cyrillic `а`) renders identically to `paypal.com` in most fonts. Homoglyph replacement normalizes display names, URLs, and email addresses to catch spoofing that visual inspection misses. | ||||||||||||||||||||||
| **Log analysis** — Bidi overrides and zero-width chars hide IOCs from analysts. Sanitize on ingest so search matches reality. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| **Log analysis and SIEM** — Attackers embed bidi overrides and zero-width characters in log entries to hide indicators of compromise from analysts and pattern-matching tools. Sanitizing log data on ingest ensures what you search is what's actually there. | ||||||||||||||||||||||
| **Config ingestion** — Null bytes truncate C-extension processing, zero-width chars break key matching. `walk(parsed_config)` sanitizes every string in a nested structure in one call. | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| **Config and data ingestion** — YAML, TOML, and JSON parsed from untrusted sources can carry null bytes that truncate C-extension processing, zero-width characters that break key matching, and homoglyphs that create near-duplicate keys. `walk(parsed_config)` sanitizes every string in a nested structure in one call. | ||||||||||||||||||||||
| These aren't theoretical risks — [CVE-2024-43093](https://nvd.nist.gov/vuln/detail/CVE-2024-43093) was an actively exploited Android zero-day using the exact fullwidth character bypass this pipeline prevents. | ||||||||||||||||||||||
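As a rough illustration of the NFKC + zero-width + control character pass described above, here is a minimal standard-library sketch. It is not navi-sanitize's implementation: `sketch_normalize` and the `ZERO_WIDTH` set are hypothetical names chosen for this example, and the set covers only a handful of code points. It also shows why a separate homoglyph stage is needed, since NFKC leaves Cyrillic lookalikes untouched.

```python
import unicodedata

# Assumption: a small illustrative subset of invisible code points,
# not the library's actual list.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def sketch_normalize(text: str) -> str:
    """Hypothetical stand-in for an NFKC + zero-width + control-character pass."""
    # NFKC folds compatibility forms, e.g. fullwidth Latin letters to ASCII.
    text = unicodedata.normalize("NFKC", text)
    kept = []
    for ch in text:
        if ch in ZERO_WIDTH:
            continue  # drop invisible spaces and joiners
        if unicodedata.category(ch) == "Cc" and ch not in "\t\n\r":
            continue  # drop null bytes and other control characters
        kept.append(ch)
    return "".join(kept)


fullwidth = "\uff43\uff4f\uff4e\uff46\uff49\uff47"  # fullwidth "config"
hidden = "con\u200bfig\x00"  # zero-width space plus a trailing null byte

print(sketch_normalize(fullwidth) == "config")  # True
print(sketch_normalize(hidden) == "config")     # True
# NFKC does not map homoglyphs: Cyrillic "а" (U+0430) survives this pass.
print(sketch_normalize("p\u0430ypal.com") == "paypal.com")  # False
```

The last line is the case the curated homoglyph pairs are meant to cover; normalization forms alone do not treat cross-script lookalikes as equivalent.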
|
|
||||||||||||||||||||||
| ## How It Compares | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
|
|
@@ -59,10 +65,12 @@ navi-sanitize is the only library that combines invisible character stripping, h | |||||||||||||||||||||
| | **Homoglyphs** | Replaces 66 curated pairs | Transliterates all non-ASCII | Detects only (no replace) | No | No | | ||||||||||||||||||||||
| | **NFKC** | Yes | No | No | NFC (NFKC optional) | No | | ||||||||||||||||||||||
| | **Null bytes** | Yes | No | No | No | No | | ||||||||||||||||||||||
| | **Preserves Unicode** | Yes (CJK, Arabic, emoji intact) | No (destroys all non-ASCII) | Yes | Yes | Yes | | ||||||||||||||||||||||
| | **Preserves Unicode** | Yes (CJK, Arabic, emoji¹ intact) | No (destroys all non-ASCII) | Yes | Yes | Yes | | ||||||||||||||||||||||
| | **Pluggable escaper** | Yes | No | No | No | N/A (HTML-specific) | | ||||||||||||||||||||||
| | **Dependencies** | Zero | Zero | Zero | wcwidth | C ext / Rust ext | | ||||||||||||||||||||||
|
|
||||||||||||||||||||||
| ¹ ZWJ (U+200D) is stripped as a zero-width character, which decomposes ZWJ emoji sequences (e.g. family emoji) into individual emoji. Single emoji are unaffected. Bidi formatting marks (U+061C, U+200E/F, etc.) used in Arabic/Hebrew are also stripped — correct rendering may require re-adding directional marks downstream. | ||||||||||||||||||||||
|
Comment on lines +68 to +72
|
||||||||||||||||||||||
| | **Preserves Unicode** | Yes (CJK, Arabic, emoji¹ intact) | No (destroys all non-ASCII) | Yes | Yes | Yes | | |
| | **Pluggable escaper** | Yes | No | No | No | N/A (HTML-specific) | | |
| | **Dependencies** | Zero | Zero | Zero | wcwidth | C ext / Rust ext | | |
| ¹ ZWJ (U+200D) is stripped as a zero-width character, which decomposes ZWJ emoji sequences (e.g. family emoji) into individual emoji. Single emoji are unaffected. Bidi formatting marks (U+061C, U+200E/F, etc.) used in Arabic/Hebrew are also stripped — correct rendering may require re-adding directional marks downstream. | |
| | **Preserves Unicode** | Mostly (CJK and Arabic letters preserved; emoji¹ preserved except ZWJ sequences) | No (destroys all non-ASCII) | Yes | Yes | Yes | | |
| | **Pluggable escaper** | Yes | No | No | No | N/A (HTML-specific) | | |
| | **Dependencies** | Zero | Zero | Zero | wcwidth | C ext / Rust ext | | |
| ¹ ZWJ (U+200D) is stripped as a zero-width character, which decomposes ZWJ emoji sequences (e.g. family emoji) into individual emoji. Single emoji are unaffected. Bidi formatting marks (U+061C, U+200E, U+200F, etc.) used in Arabic/Hebrew are also stripped — correct rendering may require re-adding directional marks downstream. |
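The footnote's two caveats can be reproduced with plain Python strings. This sketch only demonstrates what removing these code points does to a string; it does not call navi-sanitize, and `BIDI_MARKS` lists just the three marks named in the footnote.

```python
ZWJ = "\u200d"

# Family emoji: man + ZWJ + woman + ZWJ + girl, drawn as one glyph where supported.
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"
stripped = family.replace(ZWJ, "")

print(len(family))    # 5 code points: three emoji joined by two ZWJ
print(len(stripped))  # 3 code points: three separate emoji

# Only the bidi marks named in the footnote; the "etc." is not expanded here.
BIDI_MARKS = {"\u061c", "\u200e", "\u200f"}

rtl = "\u200f\u05e9\u05dc\u05d5\u05dd"  # RIGHT-TO-LEFT MARK + Hebrew "shalom"
letters_only = "".join(ch for ch in rtl if ch not in BIDI_MARKS)

print(len(rtl))           # 5
print(len(letters_only))  # 4: the letters remain, the directional mark is gone
```

In both cases the visible characters survive and only the invisible joiner or mark is lost, which is why the footnote describes rendering changes rather than content loss.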
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -23,7 +23,9 @@ navi-sanitize is the only library that combines invisible character stripping, h | |||||
|
|
||||||
| **Why they're different:** Transliteration destroys content. Unidecode turns Chinese characters into pinyin, Cyrillic sentences into romanized gibberish, and Arabic into Latin approximations. It's designed for slug generation, not security. | ||||||
|
|
||||||
| navi-sanitize normalizes only the 66 highest-risk Latin lookalikes and leaves legitimate Unicode intact. CJK, Arabic, emoji, and non-confusable Cyrillic pass through unchanged. | ||||||
| navi-sanitize normalizes only the 66 highest-risk Latin lookalikes and leaves legitimate Unicode intact. CJK, Arabic, emoji,¹ and non-confusable Cyrillic pass through unchanged. | ||||||
|
||||||
| navi-sanitize normalizes only the 66 highest-risk Latin lookalikes and leaves legitimate Unicode intact. CJK, Arabic, emoji,¹ and non-confusable Cyrillic pass through unchanged. | |
| navi-sanitize normalizes only the 66 highest-risk Latin lookalikes and leaves legitimate Unicode intact. CJK, Arabic, single emoji,¹ and non-confusable Cyrillic pass through unchanged. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,25 +1,25 @@ | ||
| # Performance | ||
|
🔵 LOW: Benchmark updates: hardware, Python version, and numbers clarified
Confidence: 96%
Benchmarks now specify hardware details (AMD Ryzen 9 9950X, Python 3.13), clarify expected variance, and report more realistic ops/sec. This transparency improves reproducibility.
Suggestion: Continue to benchmark and document performance in context for user clarity.
— Modernizing benchmark methodology is a step forward. Numbers are harder to fudge when they're precise.
||
|
|
||
| Benchmarks measured on Python 3.12, single thread. Run via `uv run pytest tests/test_benchmark.py -v`. | ||
| Benchmarks measured on Python 3.13, single thread, AMD Ryzen 9 9950X. Run via `uv run pytest tests/test_benchmark.py -v`. Numbers are representative --- expect ±20% on different hardware; CI runners are typically 2--3x slower. | ||
|
|
||
| ## Benchmark Results | ||
|
|
||
| ### `clean()` --- Per-String Cost | ||
|
|
||
| | Scenario | Mean | Ops/sec | Description | | ||
| |----------|------|---------|-------------| | ||
| | Short, clean text (no-op) | 2.8 us | 358K | ~38 chars, no stages fire | | ||
| | Short, hostile (all stages) | 67 us | 15K | ~27 chars with homoglyphs, null bytes, zero-width, template syntax | | ||
| | 13KB clean text | 810 us | 1.2K | Large clean input throughput | | ||
| | 10KB hostile text | 449 us | 2.2K | Large hostile input with repeated attack patterns | | ||
| | 100KB hostile payload | 5.7 ms | 176 | Stress test payload | | ||
| | Short, clean text (no-op) | 1.1 µs | 905K | ~38 chars, no stages fire | | ||
| | Short, hostile (all stages) | 21 µs | 48K | ~27 chars with homoglyphs, null bytes, zero-width, template syntax | | ||
| | 13KB clean text | 292 µs | 3.4K | Large clean input throughput | | ||
| | 10KB hostile text | 305 µs | 3.3K | Large hostile input with repeated attack patterns | | ||
| | 100KB hostile payload | 3.5 ms | 286 | Stress test payload | | ||
|
|
||
| ### `walk()` --- Recursive Structure Cost | ||
|
|
||
| | Scenario | Mean | Ops/sec | Description | | ||
| |----------|------|---------|-------------| | ||
| | 100-item nested dict, clean | 537 us | 1.9K | `deepcopy` + traversal overhead, no stages fire | | ||
| | 100-item nested dict, hostile | 6.9 ms | 144 | `deepcopy` + full pipeline on every string | | ||
| | 100-item nested dict, clean | 311 µs | 3.2K | Iterative copy + traversal overhead, no stages fire | | ||
| | 100-item nested dict, hostile | 2.5 ms | 408 | Iterative copy + full pipeline on every string | | ||
|
|
||
| ## When to Use `clean()` vs `walk()` | ||
|
|
||
|
|
@@ -31,7 +31,7 @@ Benchmarks measured on Python 3.12, single thread. Run via `uv run pytest tests/ | |
| | Nested config from untrusted source | `walk()` | | ||
| | Hot path, single known string | `clean()` | | ||
|
|
||
| `walk()` adds `deepcopy` overhead to ensure the original data is never modified. If you're already working with a copy or don't need immutability, you can call `clean()` on individual strings for better performance. | ||
| `walk()` adds iterative copy overhead to ensure the original data is never modified. If you're already working with a copy or don't need immutability, you can call `clean()` on individual strings for better performance. | ||
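To make "iterative copy" concrete, here is one way a non-recursive, non-mutating traversal could look. This is a sketch under assumptions, not the library's `walk()`: `walk_sketch` and its `fn` parameter are hypothetical, only `dict` and `list` containers are handled, dictionary keys are left as they are, and cyclic structures are not supported.

```python
from typing import Any, Callable


def walk_sketch(data: Any, fn: Callable[[str], str]) -> Any:
    """Copy nested dicts/lists, applying fn to every string value, without recursion."""
    if isinstance(data, str):
        return fn(data)
    if not isinstance(data, (dict, list)):
        return data

    root: Any = {} if isinstance(data, dict) else []
    stack = [(data, root)]  # pairs of (source container, destination container)
    while stack:
        src, dst = stack.pop()
        items = src.items() if isinstance(src, dict) else enumerate(src)
        for key, value in items:
            if isinstance(value, dict):
                child: Any = {}
                stack.append((value, child))  # filled in on a later iteration
            elif isinstance(value, list):
                child = []
                stack.append((value, child))
            elif isinstance(value, str):
                child = fn(value)
            else:
                child = value
            if isinstance(dst, dict):
                dst[key] = child
            else:
                dst.append(child)  # enumerate() keeps list order intact
    return root


original = {"name": "a\u200bb", "tags": ["x\u200by", 1], "nested": {"k": "\u200bz"}}
cleaned = walk_sketch(original, lambda s: s.replace("\u200b", ""))

print(cleaned == {"name": "ab", "tags": ["xy", 1], "nested": {"k": "z"}})  # True
print(original["name"] == "a\u200bb")  # True: the input is not modified
```

An explicit stack avoids Python's recursion limit on deeply nested input, and building new containers as it goes means the copy and the per-string work happen in a single pass.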
|
|
||
| ## Performance Characteristics by Stage | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -5,7 +5,9 @@ navi-sanitize is a deterministic text sanitization library. It transforms untrus | |||||||||||||
| ## Design Philosophy | ||||||||||||||
|
|
||||||||||||||
| 1. **Deterministic** --- same input, same output, every time. No ML models, no heuristics, no confidence scores. | ||||||||||||||
| 2. **Legitimate Unicode preserved** --- CJK, Arabic, Hebrew, emoji, and non-confusable text pass through unchanged. A string that passes through unmodified was already clean. | ||||||||||||||
| 2. **Legitimate Unicode preserved** --- CJK, Arabic, Hebrew, emoji,¹ and non-confusable text pass through unchanged. A string that passes through unmodified was already clean. | ||||||||||||||
|
|
||||||||||||||
| ¹ ZWJ (U+200D) is stripped as a zero-width character, decomposing ZWJ emoji sequences into individual emoji. Bidi formatting marks (U+061C, U+200E/F, etc.) are also stripped — see [Stripping Arabic Letter Mark](#stripping-arabic-letter-mark-and-mongolian-fvs) below. | ||||||||||||||
|
Comment on lines +8 to +10
|
||||||||||||||
| 2. **Legitimate Unicode preserved** --- CJK, Arabic, Hebrew, emoji,¹ and non-confusable text pass through unchanged. A string that passes through unmodified was already clean. | |
| ¹ ZWJ (U+200D) is stripped as a zero-width character, decomposing ZWJ emoji sequences into individual emoji. Bidi formatting marks (U+061C, U+200E/F, etc.) are also stripped — see [Stripping Arabic Letter Mark](#stripping-arabic-letter-mark-and-mongolian-fvs) below. | |
| 2. **Legitimate Unicode preserved** --- CJK, Arabic and Hebrew text (except stripped bidi formatting marks), emoji,¹ and non-confusable text pass through unchanged. A string that passes through unmodified was already clean. | |
| ¹ ZWJ (U+200D) is stripped as a zero-width character, decomposing ZWJ emoji sequences into individual emoji. Bidi formatting marks (U+061C, U+200E, U+200F, etc.) are also stripped — see [Stripping Arabic Letter Mark](#stripping-arabic-letter-mark-and-mongolian-fvs) below. |
🔵 LOW: Documentation repositioned: install command and purpose statement now immediately visible
Confidence: 98%
Moving the install command and improving the opening description aligns the README with common OSS best practices and makes onboarding easier for newcomers.
Suggestion: No action required; this is a positive example for future documentation.
— README onboarding sequence now matches what most people actually want to find. ...no arguments here.