Skip to content

docs: audit all product surfaces and reposition for market#33

Merged
Navi Bot (project-navi-bot) merged 5 commits into
mainfrom
docs/audit-and-positioning
Apr 5, 2026
Merged

docs: audit all product surfaces and reposition for market#33
Navi Bot (project-navi-bot) merged 5 commits into
mainfrom
docs/audit-and-positioning

Conversation

@Fieldnote-Echo
Copy link
Copy Markdown
Member

Summary

  • Full five-agent documentation audit across all product surfaces (README, CLAUDE.md, docs site, pipeline code)
  • Fixed 1 HIGH, 8 MEDIUM, 9 LOW accuracy issues across README, CLAUDE.md, and 5 docs files
  • Repositioned README and docs/index.md based on four-agent market research
  • Pipeline code: _normalize_nfkc() returns count instead of bool, type: ignore replaced with cast(T, ...)

What changed

Accuracy fixes:

  • Pipeline table: homoglyph scope, invisible char categories, stage return types
  • CI section: job count, fuzz workflow, smoke test scope
  • Benchmark numbers: rerun on local hardware (AMD Ryzen 9 9950X, Python 3.13)
  • ZWJ emoji decomposition caveat added to comparison table
  • NFKC warning now includes count per project convention

Positioning:

  • Hero line: mechanism specificity + "no ML" differentiator
  • Install command: moved to first screen (was line 150)
  • "Why This Matters": compressed to scannable bullets, LLM leads
  • OWASP LLM Prompt Injection Prevention Cheat Sheet alignment
  • "Only maintained homoglyph library" (both alternatives archived)
  • CVE-2024-43093 proof point (actively exploited fullwidth bypass)

Code:

  • _normalize_nfkc()tuple[str, int] (count of changed codepoints)
  • type: ignore[return-value]cast(T, ...) with explanatory comments
  • Zero type: ignore remaining in src/

Test plan

  • 405 tests pass (-W error, warnings-as-errors)
  • mypy strict clean (0 issues, 0 type: ignore in src/)
  • ruff check + format clean
  • All pre-commit hooks pass (gitleaks, trailing whitespace, ruff, mypy)
  • CI passes on GitHub
  • Docs site renders correctly (check after merge)

Five-agent documentation audit found and fixed:
- README pipeline table understated homoglyph scope (missing Armenian/Cherokee/typographic)
- README "emoji intact" claim lacked ZWJ decomposition caveat
- CLAUDE.md invisible char description listed 5 of 9 categories (omitted 304 of 492 chars)
- CLAUDE.md stage return type claim incorrect for escaper (str, not tuple)
- CLAUDE.md CI section said "five parallel jobs" (actually 7, only 4 parallel)
- CLAUDE.md omitted fuzz workflow entirely
- Benchmark numbers stale (from CI runner); rerun on local hardware
- type: ignore annotations replaced with typed cast()
- NFKC warning lacked count, violating project's own convention

Pipeline code change:
- _normalize_nfkc() returns tuple[str, int] (count) instead of tuple[str, bool]
- Warning message now includes count per project convention
- Both type: ignore[return-value] replaced with cast(T, ...) + explanatory comments

Positioning changes (informed by market research):
- Hero line: added mechanism specificity and "no ML" differentiator
- Install command moved to first screen (was buried at line 150)
- "Why This Matters" compressed to scannable bullets, LLM leads
- Added OWASP LLM Prompt Injection Prevention Cheat Sheet alignment
- Added "only maintained homoglyph library" positioning (both alternatives archived)
- Added CVE-2024-43093 proof point (actively exploited fullwidth bypass)
- docs/index.md: mirrored positioning, added OWASP and competitive bullets

All surfaces updated: README, CLAUDE.md, docs/index.md, pipeline-architecture,
comparison, threat-model, performance, quickstart.

405 tests pass. mypy strict clean. ruff clean.
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!


def _normalize_nfkc(s: str) -> tuple[str, bool]:
"""NFKC normalize. Returns (cleaned, changed)."""
def _normalize_nfkc(s: str) -> tuple[str, int]:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MEDIUM: _normalize_nfkc() may miscount changed codepoints when string-level normalization changes more than one character

Confidence: 95%

count = sum(1 for c in s if unicodedata.normalize("NFKC", c) != c)
return normalized, max(count, 1)

The revised _normalize_nfkc() function attempts to count the number of codepoints changed by per-character normalization. However, performing unicodedata.normalize("NFKC", c) for each character and comparing it to c does not always accurately count the number of changed codepoints. Some NFKC normalizations operate on substrings (e.g., character pairs), so the per-codepoint approach may undercount or misrepresent true changes.

Suggestion: Consider a more robust diffing mechanism for counting NFKC changes, such as using difflib to compare original and normalized strings, or explicitly document that this is an approximate count. Alternatively, clarify in both docstrings and warning logs that the count may be inexact.

— Counting NFKC-changed codepoints is tricky-per-char normalization can be misleading when composite sequences normalize as one. Document the approximation or consider stronger diff logic.

Comment thread README.md
@@ -10,7 +10,11 @@
[![Python 3.12+](https://img.shields.io/badge/python-3.12%2B-blue.svg)](https://www.python.org/downloads/)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 LOW: Documentation repositioned: install command and purpose statement now immediately visible

Confidence: 98%

pip install navi-sanitize (now at top of README), hero line specifies differentiators

Moving the install command and improving the opening description aligns the README with common OSS best practices and makes onboarding easier for newcomers.

Suggestion: No action required; this is a positive example for future documentation.

— README onboarding sequence now matches what most people actually want to find. ...no arguments here.

@@ -1,25 +1,25 @@
# Performance
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 LOW: Benchmark updates: hardware, Python version, and numbers clarified

Confidence: 96%

Benchmarks measured on Python 3.13, single thread, AMD Ryzen 9 9950X. ... expect ±20% on different hardware; CI runners are typically 2-3x slower.

Benchmarks now specify hardware details (AMD Ryzen 9 9950X, Python 3.13), clarify expected variance, and report more realistic ops/sec. This transparency improves reproducibility.

Suggestion: Continue to benchmark and document performance in context for user clarity.

— Modernizing benchmark methodology is a step forward. Numbers are harder to fudge when they're precise.

github-actions[bot]
github-actions Bot previously approved these changes Apr 5, 2026
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grippy approves — PASS (95/100)

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 5, 2026

✅ Grippy Review — PASS

Score: 100/100 | Findings: 3

Delta: 3 new


Commit: 6eeb02e

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bf54a5ea56

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "Codex (@codex) review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "Codex (@codex) address that feedback".

Comment thread src/navi_sanitize/_pipeline.py Outdated
Comment on lines +49 to +50
count = sum(1 for c in s if unicodedata.normalize("NFKC", c) != c)
return normalized, max(count, 1) # at least 1 if string changed
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Compute NFKC change counts from whole-string diffs

The new count logic under-reports normalization changes when NFKC effects depend on neighboring code points (for example, combining-mark sequences like "e\u0301e\u0301"), because it checks each character in isolation and then falls back to 1 via max(count, 1). In these cases the string changes but multiple characters are normalized, so the warning count is inaccurate; this breaks the new “count for traceability” behavior and can skew any log-based metrics built on that count.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR performs a broad documentation audit/repositioning across the README + docs site and updates the core sanitization pipeline’s NFKC stage to return a change count (and removes remaining type: ignore usage in src/).

Changes:

  • Update _normalize_nfkc() to return (normalized_text, change_count) and log NFKC normalization with a count.
  • Replace type: ignore[return-value] with cast(T, ...) in walk() return paths.
  • Rework README + docs landing pages and supporting docs for accuracy and market positioning (pipeline table, benchmarks, caveats, etc.).

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/navi_sanitize/_pipeline.py NFKC stage now returns an int count; logging updated; typing uses cast instead of type: ignore.
README.md Repositioning, install moved up, pipeline table + benchmarks + additional caveats.
docs/index.md Landing page refresh + new feature bullets.
docs/getting-started/quickstart.md Logging section updated to reflect NFKC count-based warning.
docs/explanation/threat-model.md Add caveats around ZWJ/bidi stripping and homoglyph scope text updates.
docs/explanation/pipeline-architecture.md Update stage semantics/diagram text to reflect count-based indicators.
docs/explanation/performance.md Benchmark environment and results updated; walk() description updated.
docs/explanation/comparison.md Comparison positioning updates + ZWJ footnote + homoglyph scope update.
CLAUDE.md CI/pipeline documentation updates (but stage return-type text needs reconciliation).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/navi_sanitize/_pipeline.py Outdated
Comment on lines +44 to +50
def _normalize_nfkc(s: str) -> tuple[str, int]:
"""NFKC normalize. Returns (cleaned, count of changed codepoints)."""
normalized = unicodedata.normalize("NFKC", s)
return normalized, normalized != s
if normalized == s:
return s, 0
count = sum(1 for c in s if unicodedata.normalize("NFKC", c) != c)
return normalized, max(count, 1) # at least 1 if string changed
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_normalize_nfkc() claims to return a “count of changed codepoints”, but the current counting approach is only per-codepoint NFKC differences and then forces a minimum of 1 when the string changes due to cross-codepoint composition. This makes the count potentially misleading (e.g., a sequence that changes only when normalized as a whole will log 1 even though no individual codepoint differs in isolation). Consider either (a) redefining/documenting the count as an “approximate/non-zero change indicator” or (b) computing a count that better matches the wording (and log message).

Copilot uses AI. Check for mistakes.
Comment thread src/navi_sanitize/_pipeline.py Outdated
Comment on lines +44 to +50
def _normalize_nfkc(s: str) -> tuple[str, int]:
"""NFKC normalize. Returns (cleaned, count of changed codepoints)."""
normalized = unicodedata.normalize("NFKC", s)
return normalized, normalized != s
if normalized == s:
return s, 0
count = sum(1 for c in s if unicodedata.normalize("NFKC", c) != c)
return normalized, max(count, 1) # at least 1 if string changed
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_normalize_nfkc() now does an additional per-codepoint unicodedata.normalize() pass to compute a count when the string changes. Since _normalize_nfkc() is also used for the stage-5 “Re-NFKC” pass where the count isn’t logged/used, consider splitting into two helpers (normalize-only vs normalize+count) or adding an option to skip counting to avoid unnecessary O(n) work in that path.

Copilot uses AI. Check for mistakes.
Comment on lines +107 to +109
text, nfkc_count = _normalize_nfkc(text)
if nfkc_count:
logger.warning("Normalized %d fullwidth/compatibility character(s) in value", nfkc_count)
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR changes stage 3’s warning to include an NFKC normalization count, but the current test suite doesn’t appear to assert the count/value for this warning (unlike invisibles/homoglyph counts). Adding a focused test that checks the warning contains the expected count for a representative NFKC case would help prevent regressions in the new counting behavior.

Copilot uses AI. Check for mistakes.
Comment thread docs/getting-started/quickstart.md Outdated
- `"Removed 2 null byte(s) from value"`
- `"Stripped 3 invisible character(s) from value"`
- `"Normalized fullwidth character(s) in value"`
- `"Normalized {N} fullwidth/compatibility character(s) in value"`
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section lists example warning strings as literal messages, but the NFKC entry is shown as "Normalized {N} ..." rather than an example with an actual number. Since the real log message includes a concrete integer, consider either using a concrete example (e.g. "Normalized 1 ...") or explicitly stating that {N} is a placeholder to avoid confusion/copy-paste mismatch.

Suggested change
- `"Normalized {N} fullwidth/compatibility character(s) in value"`
- `"Normalized 1 fullwidth/compatibility character(s) in value"`

Copilot uses AI. Check for mistakes.
Comment thread CLAUDE.md Outdated
6. **Escaper** (optional) — pluggable `Callable[[str], str]` runs last

Each stage returns `(cleaned_string, count_or_flag)` — either an `int` count of removals/replacements or a `bool` changed flag. Stages have no side effects — the orchestrator logs.
Stages 1–5 each return `(cleaned_string, count_or_flag)` — either an `int` count of removals/replacements or a `bool` changed flag. Stage 6 (escaper) is a `Callable[[str], str]` that returns a bare `str`. Stages have no side effects — the orchestrator logs.
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still documents that stages 1–5 may return a bool “changed flag”, but _pipeline._normalize_nfkc() now returns an int count (and the other stages return counts as well). Please update this description to match the current implementation so it doesn’t mislead contributors about stage return types.

Suggested change
Stages 1–5 each return `(cleaned_string, count_or_flag)` — either an `int` count of removals/replacements or a `bool` changed flag. Stage 6 (escaper) is a `Callable[[str], str]` that returns a bare `str`. Stages have no side effects — the orchestrator logs.
Stages 1–5 each return `(cleaned_string, count)` where `count` is an `int` for removals, replacements, or normalization changes. Stage 6 (escaper) is a `Callable[[str], str]` that returns a bare `str`. Stages have no side effects — the orchestrator logs.

Copilot uses AI. Check for mistakes.
# Pipeline Architecture

Every string passed to `clean()` flows through six stages in strict order. Each stage is a deterministic function that returns the cleaned string and a change indicator (a count for stages that strip or replace, a boolean for normalization). The pipeline orchestrator logs warnings when stages modify input.
Every string passed to `clean()` flows through six stages in strict order. Each of the five universal stages (1--5) is a deterministic function that returns the cleaned string and a change indicator (a count of removals or replacements). The escaper (stage 6, if provided) is a plain `str -> str` function. The pipeline orchestrator logs warnings when stages modify input.
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intro says universal stages return “a count of removals or replacements”, but stage 3 (NFKC) is normalization (not removal/replacement) and the implementation now reports a count of normalized/changed codepoints. Consider tweaking this wording to include normalization (or more generally: “a count of affected codepoints/changes”) so it matches the actual stage semantics.

Suggested change
Every string passed to `clean()` flows through six stages in strict order. Each of the five universal stages (1--5) is a deterministic function that returns the cleaned string and a change indicator (a count of removals or replacements). The escaper (stage 6, if provided) is a plain `str -> str` function. The pipeline orchestrator logs warnings when stages modify input.
Every string passed to `clean()` flows through six stages in strict order. Each of the five universal stages (1--5) is a deterministic function that returns the cleaned string and a change indicator (a count of affected codepoints). The escaper (stage 6, if provided) is a plain `str -> str` function. The pipeline orchestrator logs warnings when stages modify input.

Copilot uses AI. Check for mistakes.
Comment thread docs/index.md Outdated
- **Only maintained option** --- both [confusable_homoglyphs](https://github.com/vhf/confusable_homoglyphs) and [homoglyphs](https://github.com/life4/homoglyphs) are archived; navi-sanitize is the only maintained Python library covering homoglyph replacement
- **Deterministic** --- same input always produces the same output; no probabilistic models, no heuristics
- **Zero dependencies** --- Python 3.12+ stdlib only
- **Zero dependencies** --- Python 3.12+ stdlib only; zero supply chain risk
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

“Zero dependencies … zero supply chain risk” is an absolute claim that isn’t strictly accurate (there’s still supply-chain risk in the package itself and its distribution channel; “zero third‑party dependency risk” would be closer). Consider rephrasing to avoid overstating the security property.

Suggested change
- **Zero dependencies** --- Python 3.12+ stdlib only; zero supply chain risk
- **Zero dependencies** --- Python 3.12+ stdlib only; no third-party dependency risk

Copilot uses AI. Check for mistakes.
NFKC count logic:
- Replace per-character normalization with whole-string positional diff
  (fixes combining-mark sequences like e+accent that compose as pairs)
- Add count=False fast path for stage-5 re-NFKC where count is discarded
- Add 2 tests asserting NFKC warning includes correct count

Doc fixes from Copilot suggestions:
- CLAUDE.md: stage return types now all described as int count (no bool)
- pipeline-architecture.md: "affected codepoints" instead of "removals or replacements"
- quickstart.md: concrete example "Normalized 1 ..." instead of "{N}" placeholder
- docs/index.md: "no third-party dependency risk" instead of absolute "zero supply chain risk"
github-actions[bot]
github-actions Bot previously approved these changes Apr 5, 2026
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grippy approves — PASS (100/100)

Greek U+03A5 + combining tilde triggers the count=False fast path:
stage 4 replaces U+03A5 -> Y, stage 5 re-NFKC composes Y+tilde -> U+1EF8.
github-actions[bot]
github-actions Bot previously approved these changes Apr 5, 2026
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grippy approves — PASS (100/100)

Stage 5 re-NFKC never changes the string because stage 3 NFKC already
handles compatibility decomposition and stage 4 NFC handles canonical
composition. The count=False optimization was unreachable dead code —
_normalize_nfkc always returns (s, 0) at the early equality check in
the stage 5 path, so the counting diff never runs there.

Removed: count parameter, max(n, 1) guard (n >= 1 is guaranteed when
normalized != s since equal-length unequal strings differ in at least
one position).
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (3)

src/navi_sanitize/_pipeline.py:66

  • _normalize_nfkc()’s “affected codepoints” count is computed via positional comparison + length delta, which can significantly overcount when normalization changes length early in the string (e.g., ligatures like U+FB00 → "ff" will report 2 affected codepoints even though only 1 input codepoint changed). Consider either (a) switching to an edit-script based diff (e.g., SequenceMatcher opcodes over codepoints) to produce a more meaningful count, or (b) documenting/renaming this value as an approximate upper bound rather than a codepoint count.
    acute) compose to a single precomposed codepoint but each constituent
    appears unchanged when normalized in isolation.
    """
    normalized = unicodedata.normalize("NFKC", s)
    if normalized == s:
        return s, 0
    min_len = min(len(s), len(normalized))
    n = sum(1 for i in range(min_len) if s[i] != normalized[i])
    n += abs(len(s) - len(normalized))
    return normalized, n


def _replace_homoglyphs(s: str) -> tuple[str, int]:
    """Replace homoglyphs with Latin equivalents. Returns (cleaned, count).

    Decomposes to NFD first so that combining marks cannot hide

src/navi_sanitize/_pipeline.py:125

  • The Stage 3 warning message presents nfkc_count as an exact number of “fullwidth/compatibility character(s)”, but the implementation is explicitly approximate and may overcount for certain compatibility folds. Either adjust the log wording to indicate the count is approximate/upper-bound, or tighten the counting logic so the warning’s number matches what users will reasonably interpret as “characters affected”.
    text, glyph_count = _replace_homoglyphs(text)
    if glyph_count:
        logger.warning("Replaced %d homoglyph(s) in value", glyph_count)
        # Stage 5: Re-NFKC — homoglyph replacement can produce Latin chars that

src/navi_sanitize/_pipeline.py:168

  • walk() is typed as walk[T](data: T) -> T, but for str subclasses it returns a plain str (clean() always returns str). Using cast(T, ...) silences mypy but makes the public type signature unsound for callers relying on subclass preservation. Consider adding overloads (str -> str, dict/list -> same container type) or constraining the TypeVar so the signature reflects the actual runtime behavior.

    Uses a single iterative pass (no recursion, no deepcopy) so hostile
    nesting depth cannot cause stack overflow. Logs a warning when nesting
    exceeds *max_depth* but continues sanitizing. Cyclic references are
    handled via identity tracking — each container is copied and sanitized
    exactly once.

    Only dict and list contents are traversed; tuples, sets, and other
    types pass through by reference.
    """
    if max_depth < 0:
        raise ValueError("max_depth must be >= 0")

    # Scalars and strings — no traversal needed
    if isinstance(data, str):
        # clean() returns str; T is str (or a subclass whose extra semantics
        # sanitization intentionally discards).  Mypy can't prove str <: T.
        return cast(T, clean(data, escaper=escaper))
    if not isinstance(data, (dict, list)):
        return data

    # Single-pass iterative copy-and-sanitize (boltons remap pattern).
    # Each stack entry pairs (original, its_new_copy, depth). Children
    # are pushed onto the stack instead of recursing. The `seen` dict

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread README.md
Comment on lines +68 to +72
| **Preserves Unicode** | Yes (CJK, Arabic, emoji¹ intact) | No (destroys all non-ASCII) | Yes | Yes | Yes |
| **Pluggable escaper** | Yes | No | No | No | N/A (HTML-specific) |
| **Dependencies** | Zero | Zero | Zero | wcwidth | C ext / Rust ext |

¹ ZWJ (U+200D) is stripped as a zero-width character, which decomposes ZWJ emoji sequences (e.g. family emoji) into individual emoji. Single emoji are unaffected. Bidi formatting marks (U+061C, U+200E/F, etc.) used in Arabic/Hebrew are also stripped — correct rendering may require re-adding directional marks downstream.
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comparison table says “Preserves Unicode … Arabic … intact” but the new footnote states bidi formatting marks used in Arabic/Hebrew are stripped, which can affect rendering. Also, “U+200E/F” is ambiguous notation; consider spelling out both codepoints (U+200E, U+200F) or using a range (U+200E–U+200F). Updating the table wording and/or footnote marker placement would make the claim consistent and unambiguous.

Suggested change
| **Preserves Unicode** | Yes (CJK, Arabic, emoji¹ intact) | No (destroys all non-ASCII) | Yes | Yes | Yes |
| **Pluggable escaper** | Yes | No | No | No | N/A (HTML-specific) |
| **Dependencies** | Zero | Zero | Zero | wcwidth | C ext / Rust ext |
¹ ZWJ (U+200D) is stripped as a zero-width character, which decomposes ZWJ emoji sequences (e.g. family emoji) into individual emoji. Single emoji are unaffected. Bidi formatting marks (U+061C, U+200E/F, etc.) used in Arabic/Hebrew are also stripped — correct rendering may require re-adding directional marks downstream.
| **Preserves Unicode** | Mostly (CJK and Arabic letters preserved; emoji¹ preserved except ZWJ sequences) | No (destroys all non-ASCII) | Yes | Yes | Yes |
| **Pluggable escaper** | Yes | No | No | No | N/A (HTML-specific) |
| **Dependencies** | Zero | Zero | Zero | wcwidth | C ext / Rust ext |
¹ ZWJ (U+200D) is stripped as a zero-width character, which decomposes ZWJ emoji sequences (e.g. family emoji) into individual emoji. Single emoji are unaffected. Bidi formatting marks (U+061C, U+200E, U+200F, etc.) used in Arabic/Hebrew are also stripped — correct rendering may require re-adding directional marks downstream.

Copilot uses AI. Check for mistakes.
Comment on lines +8 to +10
2. **Legitimate Unicode preserved** --- CJK, Arabic, Hebrew, emoji,¹ and non-confusable text pass through unchanged. A string that passes through unmodified was already clean.

¹ ZWJ (U+200D) is stripped as a zero-width character, decomposing ZWJ emoji sequences into individual emoji. Bidi formatting marks (U+061C, U+200E/F, etc.) are also stripped — see [Stripping Arabic Letter Mark](#stripping-arabic-letter-mark-and-mongolian-fvs) below.
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section says Arabic/Hebrew “pass through unchanged” but the footnote notes bidi formatting marks (used in Arabic/Hebrew) are stripped. Also, “U+200E/F” is ambiguous. Consider adjusting the main sentence (or adding footnote markers on Arabic/Hebrew too) and spelling out the exact bidi codepoints (e.g., U+200E, U+200F) so the “preserved” claim is accurate.

Suggested change
2. **Legitimate Unicode preserved** --- CJK, Arabic, Hebrew, emoji,¹ and non-confusable text pass through unchanged. A string that passes through unmodified was already clean.
¹ ZWJ (U+200D) is stripped as a zero-width character, decomposing ZWJ emoji sequences into individual emoji. Bidi formatting marks (U+061C, U+200E/F, etc.) are also stripped — see [Stripping Arabic Letter Mark](#stripping-arabic-letter-mark-and-mongolian-fvs) below.
2. **Legitimate Unicode preserved** --- CJK, Arabic and Hebrew text (except stripped bidi formatting marks), emoji,¹ and non-confusable text pass through unchanged. A string that passes through unmodified was already clean.
¹ ZWJ (U+200D) is stripped as a zero-width character, decomposing ZWJ emoji sequences into individual emoji. Bidi formatting marks (U+061C, U+200E, U+200F, etc.) are also stripped — see [Stripping Arabic Letter Mark](#stripping-arabic-letter-mark-and-mongolian-fvs) below.

Copilot uses AI. Check for mistakes.
**Why they're different:** Transliteration destroys content. Unidecode turns Chinese characters into pinyin, Cyrillic sentences into romanized gibberish, and Arabic into Latin approximations. It's designed for slug generation, not security.

navi-sanitize normalizes only the 66 highest-risk Latin lookalikes and leaves legitimate Unicode intact. CJK, Arabic, emoji, and non-confusable Cyrillic pass through unchanged.
navi-sanitize normalizes only the 66 highest-risk Latin lookalikes and leaves legitimate Unicode intact. CJK, Arabic, emoji,¹ and non-confusable Cyrillic pass through unchanged.
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sentence claims “emoji … pass through unchanged” but the footnote clarifies ZWJ sequences are decomposed (and the library also strips some bidi/formatting marks). Consider tweaking the wording to avoid “unchanged” for emoji in general, or make the footnote marker placement reflect exactly what’s affected (e.g., “emoji sequences”). This will reduce confusion for readers comparing tools.

Suggested change
navi-sanitize normalizes only the 66 highest-risk Latin lookalikes and leaves legitimate Unicode intact. CJK, Arabic, emoji,¹ and non-confusable Cyrillic pass through unchanged.
navi-sanitize normalizes only the 66 highest-risk Latin lookalikes and leaves legitimate Unicode intact. CJK, Arabic, single emoji,¹ and non-confusable Cyrillic pass through unchanged.

Copilot uses AI. Check for mistakes.
@@ -40,10 +41,23 @@ def _strip_invisible(s: str) -> tuple[str, int]:
return s, 0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 LOW: NFKC normalization now returns count of codepoint changes

Confidence: 98%

def _normalize_nfkc(s: str) -&gt; tuple[str, int]:
    # ...
    n = sum(1 for i in range(min_len) if s[i] != normalized[i])
    n += abs(len(s) - len(normalized))

The function _normalize_nfkc() was updated to return a count of affected codepoints rather than a boolean for change detection. This provides granular visibility into how many characters were actually altered by NFKC normalization, which improves logging and makes warnings traceable to the magnitude of normalization.

Suggestion: No action necessary. This change improves both accuracy and diagnostic capability.

— Granular counts for normalization changes add precision and better logs. ...acceptable.

@@ -143,7 +157,9 @@ def walk[T](data: T, *, escaper: Escaper | None = None, max_depth: int = 128) ->

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 LOW: Use of cast(T, ...) eliminates type: ignore

Confidence: 98%

return cast(T, clean(data, escaper=escaper))

Removed type: ignore[return-value] in favor of explicit cast(T, ...) from the typing module in the walk function. This removes type ignores from the codebase and clarifies the intent for the type-checker.

Suggestion: No further action needed. Prefer explicit cast over broad ignores for clarity and mypy cleanliness.

— Explicit cast makes intent clear. The type checker will stop complaining. Progress.

@@ -216,4 +232,6 @@ def _resolve(v: object, depth: int) -> object:
for item in orig_l:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 LOW: Root result of walk uses cast(T, result) for mypy correctness

Confidence: 95%

return cast(T, result)

The final return in walk uses cast(T, result) post-structure-copy, replacing an old type: ignore. This matches the per-item typing discipline of the function.

Suggestion: No follow-up required. The cast gives both safety and silence for type checkers.

— Type: ignore is extinct in this codebase. ...which is how it should be.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grippy approves — PASS (100/100)

@project-navi-bot Navi Bot (project-navi-bot) merged commit c846736 into main Apr 5, 2026
18 checks passed
@project-navi-bot Navi Bot (project-navi-bot) deleted the docs/audit-and-positioning branch April 5, 2026 18:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants