Extend privacy-guard masking from tool results to free-text user prompts (RFC + validation plan) #361

@iret77

Problem

The privacy-guard plugin currently masks PII only in structured tool result payloads before they reach the LLM, then restores the original values after the turn. Free-text user prompts are not processed. If a user writes

"What should we pay Anna Schmidt (32, lives at Bahnhofstr. 5, 60311 Frankfurt) given her current salary of €72,000?"

every one of those identifiers reaches the LLM verbatim. This is a privacy hole in a feature that is otherwise a core value prop.

What we can reuse (restore is already solved)

The hard parts of the existing implementation are generic and reusable for prompts:

  • «TYPE_N»-token format and per-turn TokenizeMap
  • Restore pass on the LLM response
  • System-prompt directive that teaches the LLM about tokens (with the Token-Storm fallback)
  • Three-tier allowlist hierarchy
  • Detector registry with confidence-based span deduplication and word-boundary extension
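To make the reusable pieces concrete, here is a minimal sketch of a per-turn token map with a mask pass and a restore pass. The class shape, the `mask` and `restore` helpers, and the `PERSON` / `SALARY` type names are assumptions for illustration only; the issue does not show the plugin's actual implementation.

```python
import re
from dataclasses import dataclass, field

# Matches tokens of the «TYPE_N» form, e.g. «PERSON_1».
TOKEN_RE = re.compile(r"«[A-Z_]+_\d+»")


@dataclass
class TokenizeMap:
    """Hypothetical per-turn map between original values and «TYPE_N» tokens."""
    by_value: dict = field(default_factory=dict)  # (type, value) -> token
    by_token: dict = field(default_factory=dict)  # token -> original value
    counters: dict = field(default_factory=dict)  # type -> last N issued

    def token_for(self, pii_type: str, value: str) -> str:
        key = (pii_type, value)
        if key not in self.by_value:
            n = self.counters.get(pii_type, 0) + 1
            self.counters[pii_type] = n
            token = f"«{pii_type}_{n}»"
            self.by_value[key] = token
            self.by_token[token] = value
        # The same value always maps to the same token within a turn.
        return self.by_value[key]


def mask(text, spans, tmap):
    """Replace non-overlapping (start, end, pii_type) spans with tokens."""
    out, cursor = [], 0
    for start, end, pii_type in sorted(spans):
        out.append(text[cursor:start])
        out.append(tmap.token_for(pii_type, text[start:end]))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def restore(text, tmap):
    """Swap known tokens back to their originals; unknown tokens stay as-is."""
    return TOKEN_RE.sub(lambda m: tmap.by_token.get(m.group(0), m.group(0)), text)
```

Because `token_for` is keyed on the (type, value) pair, a name that appears twice in a prompt receives the same token both times, which is what lets the LLM keep track of who is who.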

The new work is exclusively detecting PII spans in unstructured text. That is fundamentally harder than schema-driven JSON masking because:

  • no schema → spans must be detected probabilistically;
  • coreference matters → the masked prompt must keep references to the same entity consistent, or the LLM cannot reason over it;
  • the failure mode is asymmetric: a missed span = privacy leak, over-masking = token-soup prompt the LLM can't answer.

Approach landscape (ranked by integration value for omadia)

Detection ≠ substitution. Substitution is mostly solved by the existing token map; detection is the actual gap. Ranking criterion: how worthwhile each approach is to pursue for omadia.

| # | Approach (layer) | Pro | Con | Why this rank |
|---|---|---|---|---|
| 1 | Hybrid detector ensemble on the prompt: dedicated PII transformer (GLiNER-PII / Piiranha) added to existing Regex/Presidio/LLM detectors, confidence-reconciled (Detection) | Closes the free-text recall gap; uses existing detector registry + dedup; on-prem | More latency/engineering; locale tuning + threshold calibration needed | Best quality-per-effort: fits omadia's architecture, directly addresses the gap, no new trust boundary |
| 2 | Local-LLM pass for contextual / quasi-identifiers (Detection); already present via Ollama detector | Only layer that catches context-only PII ("my daughter who broke her leg last summer"); semantic | Latency, nondeterminism, hallucination/miss; not guarantee-able | The only thing catching context-only PII that NER/regex structurally cannot; must stay as complement, not base |
| 3 | Consistent realistic surrogates (Faker-style, LangChain PresidioReversibleAnonymizer pattern) instead of opaque tokens (Substitution) | Preserves fluency + coreference → better LLM answer; defuses "token storm" | Surrogate can collide with real text; restore needs exact uniqueness | Substitution-layer upgrade with high leverage on response quality, but a refinement, not a closing of the gap |
| 4 | Opaque placeholder tokens + turn map (current) extended to prompts (Substitution) | Already built, proven, leaks nothing; near-zero cost to extend | Does not solve detection; degrades on heavy free text (reasoning/coreference) | Necessary baseline, but on its own does not deliver the feature value |
| 5 | Format-Preserving Encryption / crypto tokenization for format-bound IDs (Substitution) | Stateless reversibility via key, preserves format/referential integrity | Only meaningful for numbers/IDs, not names/sentences; key management | Narrow scope: elegant for IDs, inapplicable to the bulk of free-text PII |
| 6 | Commercial cloud DLP (Google SDP, AWS Comprehend, Azure): detection + reidentify (Buy/Detection) | Mature, multilingual detection; native reversible crypto tokens | Ships the very PII you're protecting to a third party → new trust boundary, conflicts with data-residency mode | Strong capability, but the trust boundary breaks the privacy story |
| 7 | Commercial LLM privacy vault/gateway (Skyflow, Private AI, Protecto) (Buy) | Turnkey: detect + tokenize + detokenize, context-preserving, compliance tooling | This is omadia's own differentiator | Adopting = outsourcing the core feature; useful only as competitive reference |
| 8 | Local Differential Privacy text sanitization (paradigm) | Formal guarantees, detection-free, defends against inference attacks | Not exactly reversible by design (incompatible with the restore goal); degrades utility; recent work shows LLM reconstruction | Listed for completeness, structurally disqualified by the exact-restore requirement |

Recommended path: Approach #1

Add a dedicated PII transformer as a new detector in the existing registry; orchestrate it together with the current regex/Presidio/Ollama detectors via the existing confidence-reconciliation/dedup logic, and apply the full ensemble to the user prompt before it reaches the LLM. Substitution stays on the current opaque-token + turn-map mechanism (approach #4 above); revisiting it for realistic surrogates (#3) is a separate follow-up issue.
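One way to picture the confidence reconciliation step is a greedy pass over all spans reported by the ensemble. The sketch below assumes each detector is a callable that returns spans with a confidence score; `Span`, `reconcile`, and `run_ensemble` are hypothetical names, and the existing omadia dedup logic may resolve overlaps differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int      # inclusive character offset
    end: int        # exclusive character offset
    pii_type: str
    score: float    # detector confidence in [0, 1]
    detector: str


def reconcile(spans):
    """Greedy dedup: take spans by descending confidence, skip any overlap."""
    kept = []
    ranked = sorted(spans, key=lambda s: (-s.score, -(s.end - s.start), s.start))
    for s in ranked:
        # Keep s only if it is disjoint from every span already kept.
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)


def run_ensemble(text, detectors):
    """Run every detector on the prompt and reconcile the combined spans."""
    spans = [span for detect in detectors for span in detect(text)]
    return reconcile(spans)
```

A design point worth checking against the real implementation: dropping a lower-confidence span that extends past the kept one can leave characters unmasked. A privacy-first variant could merge overlapping spans into their union instead of discarding the loser.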

Scope decision: 6 Western-EU Latin-script languages

Target locales for the initial implementation: EN, DE, FR, ES, IT, NL. These are the languages the candidate model and dataset already cover, which keeps both the model choice and the eval set lean.

Out of scope for this round (and therefore a re-test trigger when market expansion brings them in):

  • CJK (中文 / 日本語 / 한국어) — no word boundaries, breaks the word-boundary-extension trick
  • MENA / RTL (Arabic, Hebrew)
  • Cyrillic, Indic, Turkish, other low-resource locales
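The CJK exclusion can be illustrated with a toy version of word-boundary extension. The function below is an assumption about how such an extension might work (grow a detected span outward while the neighbouring character is alphanumeric); it is not taken from the plugin.

```python
from typing import Tuple


def extend_to_word_boundaries(text: str, start: int, end: int) -> Tuple[int, int]:
    """Grow [start, end) until it is flanked by non-alphanumeric characters."""
    while start > 0 and text[start - 1].isalnum():
        start -= 1
    while end < len(text) and text[end].isalnum():
        end += 1
    return start, end


# Latin script: a partial hit inside "Anna Schmidt" snaps to the full words.
latin = extend_to_word_boundaries("Email Anna Schmidt today", 7, 15)

# Chinese: there are no spaces, and ideographs count as alphanumeric,
# so a correct 3-character name hit grows to swallow the whole sentence.
cjk = extend_to_word_boundaries("我叫王小明住在北京", 2, 5)
```

Here `latin` is `(6, 18)`, exactly "Anna Schmidt", while `cjk` is `(0, 9)`, the entire string. Under this assumption, unsegmented scripts would need a tokenizer-based boundary rule instead.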

Honest cross-impact: internationally, no single fixed transformer covers all locales — a future re-test for non-Latin scripts will lean more heavily on a multilingual GLiNER variant + the LLM-pass (#2). The architecture is the same; the model mix changes.

Decision gate: lean validation plan

Before integrating anything, run a standalone validation that answers one question: does the candidate detector ensemble reach the per-language quality bar required to justify shipping? Pass/fail thresholds are committed before the run, otherwise it's a vibe check.

Configurations compared

  • C0 (control): current omadia detectors — regex + Presidio + Ollama
  • C1 (candidate): C0 + Piiranha-v1 (iiiorg/piiranha-v1-detect-personal-information) added to the detector registry
  • Ablations: each detector solo, to expose marginal contribution

Alternative / second measurement point: GLiNER-multi (zero-shot, tunable label set) and a DeBERTa fine-tuned on ai4privacy as a ceiling check.
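For the C1 configuration, Piiranha could be wrapped as one more detector roughly as follows. This is a sketch under assumptions: it uses the Hugging Face `transformers` token-classification pipeline with `aggregation_strategy="simple"`, the output dict shape and the `make_detector` adapter are hypothetical, and the `min_score` default of 0.5 is a placeholder that the threshold calibration would replace.

```python
def load_piiranha():
    """Load the candidate model. Imported lazily; downloads weights on first use."""
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model="iiiorg/piiranha-v1-detect-personal-information",
        aggregation_strategy="simple",  # merge sub-word pieces into entity spans
    )


def make_detector(ner, min_score=0.5, name="piiranha"):
    """Adapt a token-classification callable to a plain span-dict detector."""

    def detect(text):
        return [
            {
                "start": ent["start"],
                "end": ent["end"],
                "pii_type": ent["entity_group"],
                "score": float(ent["score"]),
                "detector": name,
            }
            for ent in ner(text)
            if ent["score"] >= min_score  # placeholder threshold, to be calibrated
        ]

    return detect
```

Usage would be `detect = make_detector(load_piiranha())` followed by `detect(prompt)`. Keeping the adapter separate from model loading means the harness can run the same code path against a stub for the ablation runs.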

Eval set (~750–1000 items)

  • Backbone, near-free: balanced slice from ai4privacy pii-masking-300k, ~100–150 items × 6 languages (~600–900), already human-validated
  • Hard slice, hand-built: ~20–30/language focusing on context-only PII, per-locale ID formats (Steuer-ID, codice fiscale, NINO, NIE, BSN), ambiguous tokens, multi-part Spanish surnames. EN/DE in-house; FR/ES/IT/NL via LLM generation + native-speaker spot check
  • Negatives: 20–30 % PII-free prompts per language → measures over-masking directly

Critical methodological caveat

Piiranha is trained on ai4privacy-style data, so evaluating Piiranha on ai4privacy is partially in-distribution and will inflate numbers. The honest go/no-go signal is the out-of-distribution hand slice, not the ai4privacy slice. The ai4privacy backbone provides breadth and per-language coverage; the hand slice tells us whether the model generalizes to real omadia prompts.

Scoring

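  • Instance-level matching via nervaluate (Exact-Match): a PII instance counts as masked only if the detected span covers it fully; any uncovered identifying character = leak. Note that nervaluate's exact schema requires identical boundaries, so a detection that over-covers an instance scores as a miss there even though nothing leaks; such cases are over-masking rather than leaks. Stricter than standard NER F1, but the right lens for the privacy goal.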
  • Per language × per severity tier:
| Tier | Entity types | Recall gate | Leak tolerance |
|---|---|---|---|
| Critical | Financial (IBAN, salary), health, ID numbers (Steuer-ID, SSN-equiv) | ≥ 0.98 | ~0 |
| High | Name, address, DOB, email, phone | ≥ 0.95 | low |
| Medium | Age, job, employer, other quasi-IDs | ≥ 0.85 | tolerable |
  • Guardrails: precision ≥ 0.85 (over-masking); added p95 latency ≤ budget (proposed 150–300 ms; final value tunable based on UX measurement)
  • Decision rule: C1 must beat C0 on leak-rate (Critical + High tiers) in every one of the 6 languages; the weakest language gates the verdict.
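The coverage rule, the tier gates, and the decision rule above can be sketched in a few functions. Everything here is illustrative: the data layout (one `(tier, gold_span, predicted_spans)` tuple per PII instance, grouped by language) is an assumption, and "beat" is read as a strictly lower leak rate, which the issue does not spell out.

```python
from collections import defaultdict

# Recall gates from the tier table above.
RECALL_GATES = {"critical": 0.98, "high": 0.95, "medium": 0.85}


def is_fully_covered(gold, predicted):
    """True if every character of gold=(start, end) lies inside some predicted span."""
    g_start, g_end = gold
    pos = g_start  # first character not yet known to be covered
    for p_start, p_end in sorted(predicted):
        if p_start > pos:
            break  # gap at pos: at least one character leaks
        pos = max(pos, p_end)
        if pos >= g_end:
            return True
    return pos >= g_end


def recall_by_tier(instances):
    """instances: iterable of (tier, gold_span, predicted_spans)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for tier, gold, predicted in instances:
        totals[tier] += 1
        hits[tier] += is_fully_covered(gold, predicted)
    return {tier: hits[tier] / totals[tier] for tier in totals}


def passes_gates(recall):
    """Check each observed tier against its recall gate."""
    return all(recall[tier] >= RECALL_GATES[tier] for tier in recall)


def leak_rate(instances, tiers=("critical", "high")):
    """Share of Critical + High instances with at least one uncovered character."""
    relevant = [(g, p) for t, g, p in instances if t in tiers]
    if not relevant:
        return 0.0
    return sum(not is_fully_covered(g, p) for g, p in relevant) / len(relevant)


def c1_beats_c0(c0_by_lang, c1_by_lang):
    """Decision rule: C1 needs a strictly lower leak rate in every language."""
    return all(
        leak_rate(c1_by_lang[lang]) < leak_rate(c0_by_lang[lang])
        for lang in c0_by_lang
    )
```

One consequence of the strict reading: in a language where C0 already leaks nothing, C1 cannot "beat" it, so the rule may need a tie-handling clause before the thresholds are committed.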

Effort

~2–4 days total:

  • Harness (detector calls → nervaluate → per-language/per-type aggregation + latency): ~1 day
  • Eval set (mostly the hand slice): ~1–2 days
  • Analysis + categorized miss list: ~0.5 day

Deliberately out of scope (to keep the validation lean)

  • No wiring into the plugin pipeline yet — detectors run standalone against the text
  • No mask → LLM → restore round-trip (restore is already solved); optional Phase 2
  • No LLM-response utility measurement (that belongs to the substitution layer)
  • No commercial cloud DLP path (would breach the on-prem trust boundary)
  • No DP-style text sanitization (incompatible with exact restore)

Open questions / risks

  • Per-locale recall variance: Piiranha's reported overall accuracy hides per-language variance, especially on lower-frequency entity types and quasi-identifiers. The test must surface this. If, for example, NL recall on Critical-tier IDs falls below the gate, the options are a feature-flagged rollout or an LLM-pass-only fallback for that language.
  • Latency budget on the prompt path: Piiranha is 280M parameters; on CPU this is non-trivial added latency for every user prompt. The p95 budget needs an actual UX number, not a guess.
  • In-distribution inflation: as noted above, Piiranha's scores on the ai4privacy slice will be inflated; the hand slice is the real signal.
  • Re-test trigger: any new market that requires non-Latin script support invalidates the model choice and reopens the ranking.
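For the latency risk, the harness could measure p95 along these lines. This is a rough sketch: `percentile95`, `p95_ms`, and `added_p95_ms` are hypothetical helpers, and "added p95 latency" is read here as the difference between the candidate's and the control's p95, which is only one possible definition.

```python
import statistics
import time


def percentile95(samples):
    """95th percentile of a list of measurements (needs at least 2 samples)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples, n=100)[94]


def p95_ms(detect, prompts, repeats=3):
    """Time detect(prompt) over all prompts and return the p95 in milliseconds."""
    samples = []
    for _ in range(repeats):
        for prompt in prompts:
            started = time.perf_counter()
            detect(prompt)
            samples.append((time.perf_counter() - started) * 1000.0)
    return percentile95(samples)


def added_p95_ms(control, candidate, prompts, repeats=3):
    """p95 of the candidate ensemble minus p95 of the control ensemble."""
    return p95_ms(candidate, prompts, repeats) - p95_ms(control, prompts, repeats)
```

Measurements should be taken on hardware comparable to production (CPU-only if that is the deployment target) and after a warm-up call, since the first inference typically includes model loading.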

Next step

Build the harness + eval set, run the validation, post results back into this issue, then decide go/no-go.
