Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 10 additions & 7 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -43,15 +43,18 @@ pre-commit run --all-files

## CI

GitHub Actions runs on push to `main` and all PRs. Five parallel jobs:
GitHub Actions runs on push to `main` and all PRs. Four parallel required checks, then downstream gates:

- **lint** — `ruff check` + `ruff format --check`
- **typecheck** — `mypy --strict src/navi_sanitize/`
- **test** — pytest across Python 3.12 + 3.13, `--benchmark-disable`
- **test** — pytest across Python 3.12 + 3.13, `--benchmark-disable` (matrix → aggregator)
- **security** — `pip-audit` dependency vulnerability scan
- **build** — gates on all four above; builds wheel, smoke-tests public API, uploads artifact
- **quality-gate** — gates on all four above (org ruleset required check)
- **build** — gates on all four required checks; builds wheel, smoke-tests imports (`clean`, `walk`, `jinja2_escaper`, `path_escaper`), uploads artifact

Additional security workflows: Semgrep SAST, CodeQL (`python` + `actions`), OpenSSF Scorecard.
Additional security workflows: Semgrep SAST, CodeQL (`python` + `actions` via org GHAS), OpenSSF Scorecard.

**Fuzz testing** (`.github/workflows/fuzz.yml`) — Atheris fuzzing of `fuzz_clean` and `fuzz_walk` targets, runs on push/PR and weekly schedule (Wednesday 03:00 UTC). Uploads crash artifacts on failure.

Benchmarks run via manual dispatch only (`.github/workflows/benchmark.yml`).

Expand All @@ -68,17 +71,17 @@ Eight exports: `clean(text, *, escaper=None) -> str`, `walk(data, *, escaper=Non
Six stages in strict order — reordering breaks security:

1. **Null byte removal** — strip `\x00` (prevents C-extension truncation)
2. **Invisible character stripping** — single compiled regex covering zero-width chars, format/control chars, variation selectors, Unicode Tag block (`U+E0000`-`U+E007F`), and bidi overrides
2. **Invisible character stripping** — single compiled regex covering 492 chars across 9 categories: zero-width, format/control, variation selectors, variation selector supplement, Mongolian FVS, Unicode Tag block (`U+E0000`-`U+E007F`), bidirectional controls, C0 controls, and C1 controls
3. **NFKC normalization** — collapses fullwidth ASCII and compatibility forms
4. **Homoglyph replacement** — NFD decomposition then character-by-character scan against 66-pair map in `_homoglyphs.py`
5. **Re-NFKC** (conditional) — re-normalize after homoglyph replacement to ensure idempotency
6. **Escaper** (optional) — pluggable `Callable[[str], str]` runs last

Each stage returns `(cleaned_string, count_or_flag)` — either an `int` count of removals/replacements or a `bool` changed flag. Stages have no side effects — the orchestrator logs.
Stages 1–5 each return `(cleaned_string, count)` where `count` is an `int` for removals, replacements, or normalization changes. Stage 6 (escaper) is a `Callable[[str], str]` that returns a bare `str`. Stages have no side effects — the orchestrator logs.

### Data files

- `_homoglyphs.py` — 66 pairs: Cyrillic, Greek, Armenian, Cherokee, and typographic lookalikes
- `_homoglyphs.py` — 66 pairs: Cyrillic, Greek, Armenian, Cherokee, Cyrillic Extended, Latin Extended, and typographic lookalikes
- `_invisible.py` — zero-width, format/control (soft hyphen, thin/hair space, line/paragraph separators, etc.), variation selectors, variation selector supplement, Mongolian FVS, Unicode Tag block, bidirectional controls, C0 controls, and C1 controls

### Escapers (`escapers/`)
Expand Down
52 changes: 27 additions & 25 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,11 @@
[![Python 3.12+](https://img.shields.io/badge/python-3.12%2B-blue.svg)](https://www.python.org/downloads/)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 LOW: Documentation repositioned: install command and purpose statement now immediately visible

Confidence: 98%

pip install navi-sanitize (now at top of README), hero line specifies differentiators

Moving the install command and improving the opening description aligns the README with common OSS best practices and makes onboarding easier for newcomers.

Suggestion: No action required; this is a positive example for future documentation.

— README onboarding sequence now matches what most people actually want to find. ...no arguments here.


Deterministic input sanitization for untrusted text. Zero dependencies. Legitimate Unicode preserved by design.
Deterministic input sanitization for untrusted text — invisible characters, homoglyphs, and encoding tricks, handled before your code sees them. Zero dependencies, no ML. Legitimate Unicode preserved by design.

```
pip install navi-sanitize
```

**[Documentation](https://project-navi.github.io/navi-sanitize/)** · [Getting Started](https://project-navi.github.io/navi-sanitize/getting-started/quickstart/) · [API Reference](https://project-navi.github.io/navi-sanitize/reference/api/) · [Threat Model](https://project-navi.github.io/navi-sanitize/explanation/threat-model/)

Expand All @@ -34,19 +38,21 @@ Opt-in utilities for deeper analysis: `decode_evasion()` peels nested URL/HTML/h

## Why This Matters

Untrusted text contains invisible attacks: homoglyph substitution, zero-width characters, null bytes, fullwidth encoding, template/prompt injection delimiters. These bypass validation, poison templates, and fool humans.
Untrusted text contains invisible attacks: homoglyph substitution, zero-width characters, null bytes, fullwidth encoding, template/prompt injection delimiters. These bypass validation, poison templates, and fool humans. Framework validators handle format and type — they don't handle Unicode deception. That's what this library is for.

navi-sanitize fixes the text before it reaches your application. It doesn't detect attacks — it removes them. Implements the NFKC + zero-width + control character pipeline recommended by the [OWASP LLM Prompt Injection Prevention Cheat Sheet](https://cheatsheetseries.owasp.org/cheatsheets/LLM_Prompt_Injection_Prevention_Cheat_Sheet.html).

navi-sanitize fixes the text before it reaches your application. It doesn't detect attacks — it removes them.
**LLM prompt pipelines** — Character-level attacks bypass LLM guardrails at [64–67% success rates](https://arxiv.org/html/2504.11168v1). Invisible Unicode encodes instructions tokenizers read but humans can't see. Homoglyphs bypass keyword filters. Sanitize before the model sees it.

**LLM prompt pipelines** — User input flows into system prompts, RAG context, and tool calls. Invisible Unicode (tag block characters, bidi overrides) encodes instructions that tokenizers read but humans can't see. Homoglyphs bypass keyword filters. navi-sanitize strips these vectors before text reaches the model, and the pluggable escaper lets you add vendor-specific prompt escaping on top.
**Web applications** — A single `clean(user_input, escaper=jinja2_escaper)` call handles homoglyph-disguised SSTI payloads like `{{ cоnfig }}` (Cyrillic `о`) that naive escaping misses.

**Web applications** — Jinja2 SSTI, path traversal, and fullwidth encoding bypasses are well-known but tedious to cover manually. A single `clean(user_input, escaper=jinja2_escaper)` call handles homoglyph-disguised payloads like `{{ cоnfig }}` (Cyrillic `о`) that naive escaping misses.
**Identity and anti-phishing** — `pаypal.com` (Cyrillic `а`) renders identically to `paypal.com`. The only maintained Python homoglyph replacement library — both [confusable_homoglyphs](https://github.com/vhf/confusable_homoglyphs) and [homoglyphs](https://github.com/life4/homoglyphs) are archived.

**Identity and anti-phishing** — `pаypal.com` (Cyrillic `а`) renders identically to `paypal.com` in most fonts. Homoglyph replacement normalizes display names, URLs, and email addresses to catch spoofing that visual inspection misses.
**Log analysis** — Bidi overrides and zero-width chars hide IOCs from analysts. Sanitize on ingest so search matches reality.

**Log analysis and SIEM** — Attackers embed bidi overrides and zero-width characters in log entries to hide indicators of compromise from analysts and pattern-matching tools. Sanitizing log data on ingest ensures what you search is what's actually there.
**Config ingestion** — Null bytes truncate C-extension processing, zero-width chars break key matching. `walk(parsed_config)` sanitizes every string in a nested structure in one call.

**Config and data ingestion**YAML, TOML, and JSON parsed from untrusted sources can carry null bytes that truncate C-extension processing, zero-width characters that break key matching, and homoglyphs that create near-duplicate keys. `walk(parsed_config)` sanitizes every string in a nested structure in one call.
These aren't theoretical risks[CVE-2024-43093](https://nvd.nist.gov/vuln/detail/CVE-2024-43093) was an actively exploited Android zero-day using the exact fullwidth character bypass this pipeline prevents.

## How It Compares

Expand All @@ -59,10 +65,12 @@ navi-sanitize is the only library that combines invisible character stripping, h
| **Homoglyphs** | Replaces 66 curated pairs | Transliterates all non-ASCII | Detects only (no replace) | No | No |
| **NFKC** | Yes | No | No | NFC (NFKC optional) | No |
| **Null bytes** | Yes | No | No | No | No |
| **Preserves Unicode** | Yes (CJK, Arabic, emoji intact) | No (destroys all non-ASCII) | Yes | Yes | Yes |
| **Preserves Unicode** | Yes (CJK, Arabic, emoji¹ intact) | No (destroys all non-ASCII) | Yes | Yes | Yes |
| **Pluggable escaper** | Yes | No | No | No | N/A (HTML-specific) |
| **Dependencies** | Zero | Zero | Zero | wcwidth | C ext / Rust ext |

¹ ZWJ (U+200D) is stripped as a zero-width character, which decomposes ZWJ emoji sequences (e.g. family emoji) into individual emoji. Single emoji are unaffected. Bidi formatting marks (U+061C, U+200E/F, etc.) used in Arabic/Hebrew are also stripped — correct rendering may require re-adding directional marks downstream.
Comment on lines +68 to +72
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comparison table says “Preserves Unicode … Arabic … intact” but the new footnote states bidi formatting marks used in Arabic/Hebrew are stripped, which can affect rendering. Also, “U+200E/F” is ambiguous notation; consider spelling out both codepoints (U+200E, U+200F) or using a range (U+200E–U+200F). Updating the table wording and/or footnote marker placement would make the claim consistent and unambiguous.

Suggested change
| **Preserves Unicode** | Yes (CJK, Arabic, emoji¹ intact) | No (destroys all non-ASCII) | Yes | Yes | Yes |
| **Pluggable escaper** | Yes | No | No | No | N/A (HTML-specific) |
| **Dependencies** | Zero | Zero | Zero | wcwidth | C ext / Rust ext |
¹ ZWJ (U+200D) is stripped as a zero-width character, which decomposes ZWJ emoji sequences (e.g. family emoji) into individual emoji. Single emoji are unaffected. Bidi formatting marks (U+061C, U+200E/F, etc.) used in Arabic/Hebrew are also stripped — correct rendering may require re-adding directional marks downstream.
| **Preserves Unicode** | Mostly (CJK and Arabic letters preserved; emoji¹ preserved except ZWJ sequences) | No (destroys all non-ASCII) | Yes | Yes | Yes |
| **Pluggable escaper** | Yes | No | No | No | N/A (HTML-specific) |
| **Dependencies** | Zero | Zero | Zero | wcwidth | C ext / Rust ext |
¹ ZWJ (U+200D) is stripped as a zero-width character, which decomposes ZWJ emoji sequences (e.g. family emoji) into individual emoji. Single emoji are unaffected. Bidi formatting marks (U+061C, U+200E, U+200F, etc.) used in Arabic/Hebrew are also stripped — correct rendering may require re-adding directional marks downstream.

Copilot uses AI. Check for mistakes.

**Key differences:**

- **Unidecode / anyascii** transliterate *all* non-ASCII to Latin. They turn `"` into `"Zhong"` and Cyrillic sentences into gibberish. navi-sanitize normalizes only the 66 highest-risk lookalikes and leaves legitimate Unicode intact.
Expand All @@ -78,9 +86,9 @@ Every string passes through stages in order. Each stage returns clean output and
| Stage | What it does |
|-------|-------------|
| Null bytes | Strip `\x00` |
| Invisibles | Strip zero-width, Unicode Tag block, bidi controls |
| Invisibles | Strip zero-width, format/control, variation selectors, Unicode Tag block, bidi, C0/C1 |
| NFKC | Normalize fullwidth ASCII to standard ASCII |
| Homoglyphs | Replace Cyrillic/Greek lookalikes with Latin equivalents |
| Homoglyphs | Replace Cyrillic/Greek/Armenian/Cherokee/typographic lookalikes with Latin equivalents |
| Re-NFKC | Re-normalize after homoglyph replacement (ensures idempotency) |
| **Escaper** | Pluggable — you choose what to escape for |

Expand Down Expand Up @@ -145,12 +153,6 @@ template.render(**safe_context)

See [examples/](examples/) for runnable scripts covering LLM pipelines, FastAPI/Pydantic, and log sanitization.

## Install

```
pip install navi-sanitize
```

## Walk untrusted data structures

```python
Expand Down Expand Up @@ -215,17 +217,17 @@ clean("pаypal.com")

## Performance

Measured on Python 3.12, single thread. `clean()` is the per-string cost; `walk()` includes the iterative copy pass.
Measured on Python 3.13, single thread, AMD Ryzen 9 9950X. `clean()` is the per-string cost; `walk()` includes the iterative copy pass. Numbers are representative — expect ±20% on different hardware; CI runners are typically 2–3x slower.

| Scenario | Mean | Ops/sec |
|----------|------|---------|
| `clean()` — short, clean text (no-op) | 2.8 us | 358K |
| `clean()` — short, hostile (all stages fire) | 67 us | 15K |
| `clean()` — 13KB clean text | 810 us | 1.2K |
| `clean()` — 10KB hostile text | 449 us | 2.2K |
| `clean()` — 100KB hostile payload | 5.7 ms | 176 |
| `walk()` — 100-item nested dict, clean | 537 us | 1.9K |
| `walk()` — 100-item nested dict, hostile | 6.9 ms | 144 |
| `clean()` — short, clean text (no-op) | 1.1 µs | 905K |
| `clean()` — short, hostile (all stages fire) | 21 µs | 48K |
| `clean()` — 13KB clean text | 292 µs | 3.4K |
| `clean()` — 10KB hostile text | 305 µs | 3.3K |
| `clean()` — 100KB hostile payload | 3.5 ms | 286 |
| `walk()` — 100-item nested dict, clean | 311 µs | 3.2K |
| `walk()` — 100-item nested dict, hostile | 2.5 ms | 408 |

## License

Expand Down
6 changes: 4 additions & 2 deletions docs/explanation/comparison.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,9 @@ navi-sanitize is the only library that combines invisible character stripping, h

**Why they're different:** Transliteration destroys content. Unidecode turns Chinese characters into pinyin, Cyrillic sentences into romanized gibberish, and Arabic into Latin approximations. It's designed for slug generation, not security.

navi-sanitize normalizes only the 66 highest-risk Latin lookalikes and leaves legitimate Unicode intact. CJK, Arabic, emoji, and non-confusable Cyrillic pass through unchanged.
navi-sanitize normalizes only the 66 highest-risk Latin lookalikes and leaves legitimate Unicode intact. CJK, Arabic, emoji,¹ and non-confusable Cyrillic pass through unchanged.
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sentence claims “emoji … pass through unchanged” but the footnote clarifies ZWJ sequences are decomposed (and the library also strips some bidi/formatting marks). Consider tweaking the wording to avoid “unchanged” for emoji in general, or make the footnote marker placement reflect exactly what’s affected (e.g., “emoji sequences”). This will reduce confusion for readers comparing tools.

Suggested change
navi-sanitize normalizes only the 66 highest-risk Latin lookalikes and leaves legitimate Unicode intact. CJK, Arabic, emoji,¹ and non-confusable Cyrillic pass through unchanged.
navi-sanitize normalizes only the 66 highest-risk Latin lookalikes and leaves legitimate Unicode intact. CJK, Arabic, single emoji,¹ and non-confusable Cyrillic pass through unchanged.

Copilot uses AI. Check for mistakes.

¹ ZWJ (U+200D) is stripped as a zero-width character, which decomposes ZWJ emoji sequences (e.g. family emoji) into individual emoji. Single emoji are unaffected.

| Input | navi-sanitize | Unidecode |
|-------|--------------|-----------|
Expand Down Expand Up @@ -118,7 +120,7 @@ class UserInput(BaseModel):
The Unicode Consortium's `Confusables.txt` contains thousands of pairs across many scripts. navi-sanitize uses a curated 66-pair subset focused on:

1. **Highest visual similarity** --- characters that are pixel-identical in common fonts
2. **Most commonly weaponized** --- Cyrillic/Greek-to-Latin pairs used in phishing and filter bypass
2. **Most commonly weaponized** --- Cyrillic/Greek/Armenian/Cherokee/typographic-to-Latin pairs used in phishing and filter bypass
3. **Typographic normalization** --- smart quotes, em/en dashes, minus signs

This preserves legitimate Unicode while covering the attack surface that matters in practice.
Expand Down
18 changes: 9 additions & 9 deletions docs/explanation/performance.md
Original file line number Diff line number Diff line change
@@ -1,25 +1,25 @@
# Performance
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 LOW: Benchmark updates: hardware, Python version, and numbers clarified

Confidence: 96%

Benchmarks measured on Python 3.13, single thread, AMD Ryzen 9 9950X. ... expect ±20% on different hardware; CI runners are typically 2-3x slower.

Benchmarks now specify hardware details (AMD Ryzen 9 9950X, Python 3.13), clarify expected variance, and report more realistic ops/sec. This transparency improves reproducibility.

Suggestion: Continue to benchmark and document performance in context for user clarity.

— Modernizing benchmark methodology is a step forward. Numbers are harder to fudge when they're precise.


Benchmarks measured on Python 3.12, single thread. Run via `uv run pytest tests/test_benchmark.py -v`.
Benchmarks measured on Python 3.13, single thread, AMD Ryzen 9 9950X. Run via `uv run pytest tests/test_benchmark.py -v`. Numbers are representative --- expect ±20% on different hardware; CI runners are typically 2--3x slower.

## Benchmark Results

### `clean()` --- Per-String Cost

| Scenario | Mean | Ops/sec | Description |
|----------|------|---------|-------------|
| Short, clean text (no-op) | 2.8 us | 358K | ~38 chars, no stages fire |
| Short, hostile (all stages) | 67 us | 15K | ~27 chars with homoglyphs, null bytes, zero-width, template syntax |
| 13KB clean text | 810 us | 1.2K | Large clean input throughput |
| 10KB hostile text | 449 us | 2.2K | Large hostile input with repeated attack patterns |
| 100KB hostile payload | 5.7 ms | 176 | Stress test payload |
| Short, clean text (no-op) | 1.1 µs | 905K | ~38 chars, no stages fire |
| Short, hostile (all stages) | 21 µs | 48K | ~27 chars with homoglyphs, null bytes, zero-width, template syntax |
| 13KB clean text | 292 µs | 3.4K | Large clean input throughput |
| 10KB hostile text | 305 µs | 3.3K | Large hostile input with repeated attack patterns |
| 100KB hostile payload | 3.5 ms | 286 | Stress test payload |

### `walk()` --- Recursive Structure Cost

| Scenario | Mean | Ops/sec | Description |
|----------|------|---------|-------------|
| 100-item nested dict, clean | 537 us | 1.9K | `deepcopy` + traversal overhead, no stages fire |
| 100-item nested dict, hostile | 6.9 ms | 144 | `deepcopy` + full pipeline on every string |
| 100-item nested dict, clean | 311 µs | 3.2K | Iterative copy + traversal overhead, no stages fire |
| 100-item nested dict, hostile | 2.5 ms | 408 | Iterative copy + full pipeline on every string |

## When to Use `clean()` vs `walk()`

Expand All @@ -31,7 +31,7 @@ Benchmarks measured on Python 3.12, single thread. Run via `uv run pytest tests/
| Nested config from untrusted source | `walk()` |
| Hot path, single known string | `clean()` |

`walk()` adds `deepcopy` overhead to ensure the original data is never modified. If you're already working with a copy or don't need immutability, you can call `clean()` on individual strings for better performance.
`walk()` adds iterative copy overhead to ensure the original data is never modified. If you're already working with a copy or don't need immutability, you can call `clean()` on individual strings for better performance.

## Performance Characteristics by Stage

Expand Down
4 changes: 2 additions & 2 deletions docs/explanation/pipeline-architecture.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Pipeline Architecture

Every string passed to `clean()` flows through six stages in strict order. Each stage is a deterministic function that returns the cleaned string and a change indicator (a count for stages that strip or replace, a boolean for normalization). The pipeline orchestrator logs warnings when stages modify input.
Every string passed to `clean()` flows through six stages in strict order. Each of the five universal stages (1--5) is a deterministic function that returns the cleaned string and a change indicator (a count of affected codepoints). The escaper (stage 6, if provided) is a plain `str -> str` function. The pipeline orchestrator logs warnings when stages modify input.

## Data Flow

Expand All @@ -24,7 +24,7 @@ Input string
┌─────────────────────┐
│ 4. Homoglyph Replace│ Cyrillic/Greek → Latin
│ 4. Homoglyph Replace│ Cyrillic/Greek/Armenian/Cherokee/typographic → Latin
└─────────┬───────────┘
Expand Down
6 changes: 4 additions & 2 deletions docs/explanation/threat-model.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,9 @@ navi-sanitize is a deterministic text sanitization library. It transforms untrus
## Design Philosophy

1. **Deterministic** --- same input, same output, every time. No ML models, no heuristics, no confidence scores.
2. **Legitimate Unicode preserved** --- CJK, Arabic, Hebrew, emoji, and non-confusable text pass through unchanged. A string that passes through unmodified was already clean.
2. **Legitimate Unicode preserved** --- CJK, Arabic, Hebrew, emoji,¹ and non-confusable text pass through unchanged. A string that passes through unmodified was already clean.

¹ ZWJ (U+200D) is stripped as a zero-width character, decomposing ZWJ emoji sequences into individual emoji. Bidi formatting marks (U+061C, U+200E/F, etc.) are also stripped — see [Stripping Arabic Letter Mark](#stripping-arabic-letter-mark-and-mongolian-fvs) below.
Comment on lines +8 to +10
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section says Arabic/Hebrew “pass through unchanged” but the footnote notes bidi formatting marks (used in Arabic/Hebrew) are stripped. Also, “U+200E/F” is ambiguous. Consider adjusting the main sentence (or adding footnote markers on Arabic/Hebrew too) and spelling out the exact bidi codepoints (e.g., U+200E, U+200F) so the “preserved” claim is accurate.

Suggested change
2. **Legitimate Unicode preserved** --- CJK, Arabic, Hebrew, emoji,¹ and non-confusable text pass through unchanged. A string that passes through unmodified was already clean.
¹ ZWJ (U+200D) is stripped as a zero-width character, decomposing ZWJ emoji sequences into individual emoji. Bidi formatting marks (U+061C, U+200E/F, etc.) are also stripped — see [Stripping Arabic Letter Mark](#stripping-arabic-letter-mark-and-mongolian-fvs) below.
2. **Legitimate Unicode preserved** --- CJK, Arabic and Hebrew text (except stripped bidi formatting marks), emoji,¹ and non-confusable text pass through unchanged. A string that passes through unmodified was already clean.
¹ ZWJ (U+200D) is stripped as a zero-width character, decomposing ZWJ emoji sequences into individual emoji. Bidi formatting marks (U+061C, U+200E, U+200F, etc.) are also stripped — see [Stripping Arabic Letter Mark](#stripping-arabic-letter-mark-and-mongolian-fvs) below.

Copilot uses AI. Check for mistakes.
3. **Always returns output** --- never throws on bad input (except `TypeError` for non-strings). Attackers can't cause denial of service by crafting inputs that error.
4. **Pluggable** --- the universal pipeline handles common vectors; escapers handle context-specific threats.

Expand Down Expand Up @@ -117,7 +119,7 @@ re-inserts them from a trusted source.

Characters like ᴀᴅᴍɪɴ (Latin Small Capitals, U+1D00--U+1D22) and ɑ (Latin Small Letter Alpha,
U+0251) are visually similar to standard Latin letters but are classified as Latin script by
Unicode. The homoglyph map targets cross-script confusables (Cyrillic, Greek, Armenian, Cherokee)
Unicode. The homoglyph map targets cross-script confusables (Cyrillic, Greek, Armenian, Cherokee, Latin Extended, and typographic)
where the script mismatch is the attack signal. Latin-to-Latin visual similarity is a different
threat model better served by `detect_scripts()` and `is_mixed_script()` --- or by application-level
character allowlisting for high-security contexts like username registration.
Expand Down
2 changes: 1 addition & 1 deletion docs/getting-started/quickstart.md
Original file line number Diff line number Diff line change
Expand Up @@ -114,7 +114,7 @@ clean("pаypal.com")
Warnings include counts for traceability:
- `"Removed 2 null byte(s) from value"`
- `"Stripped 3 invisible character(s) from value"`
- `"Normalized fullwidth character(s) in value"`
- `"Normalized 1 fullwidth/compatibility character(s) in value"`
- `"Replaced 1 homoglyph(s) in value"`

To capture warnings programmatically:
Expand Down
Loading
Loading