
Commit 03439b9

DilawarShafiq and claude committed
fix(tests): share NLP engine across recognizer tests to prevent CI OOM
Each test was creating a new AnalyzerEngine with nlp_engine=None which loaded en_core_web_lg (~700MB) once per test. 53 tests × 5s = 265s+ and accumulated memory caused GitHub Actions runner OOM kills.

Fix: module-level shared NLP engine loaded once per session.

Result: test_hipaa_recognizers.py 30s → 7.72s, no per-test spaCy load.

Also: PHI_REDACTOR_SPACY_MODEL env var wired into detection engine and test helper so CI can use en_core_web_sm without code changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
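The model override mentioned in the commit message can be sketched in isolation. This is a minimal illustration, not the project's actual helper: the function names `resolve_spacy_model` and `build_nlp_configuration` are hypothetical, and only the environment variable name, the default model, and the configuration dict shape are taken from the diff.

```python
import os

# Default model name, as used in the diff when no override is set.
DEFAULT_SPACY_MODEL = "en_core_web_lg"


def resolve_spacy_model(environ=None):
    """Return the spaCy model name, honoring PHI_REDACTOR_SPACY_MODEL if set.

    `environ` is a hypothetical injection point so the lookup can be
    exercised without mutating the real process environment.
    """
    env = os.environ if environ is None else environ
    return env.get("PHI_REDACTOR_SPACY_MODEL", DEFAULT_SPACY_MODEL)


def build_nlp_configuration(environ=None):
    """Build a dict in the shape the diff passes to NlpEngineProvider."""
    return {
        "nlp_engine_name": "spacy",
        "models": [
            {"lang_code": "en", "model_name": resolve_spacy_model(environ)}
        ],
    }
```

Under this sketch, a CI job would export PHI_REDACTOR_SPACY_MODEL=en_core_web_sm before running pytest, and the smaller model name would flow into the configuration without a code change.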
1 parent c9e1f0f commit 03439b9

1 file changed

Lines changed: 16 additions & 2 deletions

tests/unit/test_hipaa_recognizers.py

@@ -12,8 +12,21 @@
 
 from __future__ import annotations
 
+import os
+
 import pytest
 from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
+from presidio_analyzer.nlp_engine import NlpEngineProvider
+
+# Shared NLP engine — loaded once per test session to avoid OOM from repeated
+# spaCy model loads. Uses the same model override as the detection engine.
+_SPACY_MODEL = os.environ.get("PHI_REDACTOR_SPACY_MODEL", "en_core_web_lg")
+_SHARED_NLP_ENGINE = NlpEngineProvider(
+    nlp_configuration={
+        "nlp_engine_name": "spacy",
+        "models": [{"lang_code": "en", "model_name": _SPACY_MODEL}],
+    }
+).create_engine()
 
 from phi_redactor.detection.recognizers.account import AccountRecognizer
 from phi_redactor.detection.recognizers.biometric import BiometricRecognizer
@@ -32,7 +45,8 @@
 def _build_analyzer(*recognizers) -> AnalyzerEngine:
     """Create an AnalyzerEngine with only the given recognizer(s) loaded.
 
-    Uses a simple regex NLP engine (no spaCy required) to keep tests fast.
+    Reuses the module-level shared NLP engine to avoid loading spaCy once per
+    test (which causes OOM in CI when the full test suite runs).
     """
     registry = RecognizerRegistry()
     # Do NOT load predefined recognizers -- we want isolation
@@ -42,7 +56,7 @@ def _build_analyzer(*recognizers) -> AnalyzerEngine:
     return AnalyzerEngine(
         registry=registry,
         supported_languages=["en"],
-        nlp_engine=None,
+        nlp_engine=_SHARED_NLP_ENGINE,
     )
 
 
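The load-once idea behind this change can be demonstrated without spaCy or Presidio installed. The sketch below is a lazy variant using `functools.lru_cache`, which is not what the commit does (the commit builds the engine eagerly at module import). `FakeNlpEngine`, `get_shared_engine`, `build_analyzer_stub`, and `LOAD_COUNT` are all hypothetical stand-ins introduced only for illustration.

```python
from functools import lru_cache

# Counts how many times the (fake) expensive model load actually runs.
LOAD_COUNT = 0


class FakeNlpEngine:
    """Hypothetical stand-in for a loaded spaCy-backed NLP engine."""

    def __init__(self, model_name):
        self.model_name = model_name


@lru_cache(maxsize=None)
def get_shared_engine(model_name="en_core_web_lg"):
    """Create the engine on the first call; later calls reuse the cached one."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return FakeNlpEngine(model_name)


def build_analyzer_stub(*recognizers):
    """Mimic _build_analyzer: fresh recognizer list, shared engine."""
    return {
        "recognizers": list(recognizers),
        "nlp_engine": get_shared_engine(),
    }


first = build_analyzer_stub("account")
second = build_analyzer_stub("biometric")
# Both stubs hold the same engine object, and the loader ran only once.
print(first["nlp_engine"] is second["nlp_engine"], LOAD_COUNT)
```

Each stub gets its own recognizer list, so test isolation is preserved, while the costly engine is created a single time. The trade-off against the eager module-level approach in the diff is that the lazy form defers the load until a test first needs it.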
