Add NLP capabilities with local models - adds multi-lingual support and improves evaluation results by tmuskal · Pull Request #507 · MemPalace/mempalace

tmuskal · 2026-04-10T07:54:22Z

What does this PR do?

increases the results of the benchmarks while removing all constant, hardcoded keywords and regexes from all NLP mechanisms. eliminates benchmark overfitting.

What	Added Size	Key Gap Closed
pySBD + negation detection	~50 KB	G3 (sentences), G4 partial (negation)
spaCy xx_ent_wiki_sm + coreferee	~15 MB	G1 (NER), G6 (multilingual), G7 (coref)
GLiNER2 ONNX	~400 MB model	G2 (KG triples), G4 (classification)
wtpsplit sat-3l-sm ONNX	~20 MB model	G3 best-in-class, G6 (85 langs)
phi-3.5-mini onnxruntime-genai	1.5 GB model	G5 (sentiment), all enhanced, triples and knowledge extractions

All share onnxruntime which chromadb already installs. Every phase is backward compatible -- legacy mode produces identical output to current version. Each phase is gated behind a config flag so users opt in explicitly.

Add multilingual support, better NLP capabilities and better quality of results overall.
No more regexes and predefined lists in benchmarks and in implementation. Implementation is no longer overfitted.

Speed test shows that the model-powered implementation is 40% slower when running on github runners CPUs.
All previous tests pass.

How to test

See new test pipelines in the ci (test-nlp , benchmark-quality , benchmark-speed )

Checklist

Tests pass (python -m pytest tests/ -v)
No hardcoded paths
Linter passes (ruff check .)

- Implemented `negation.py` to detect negation cues in text, affecting keyword scoring. - Created `registry.py` for managing NLP provider registration, lazy loading, and fallback logic. - Added tests for CLI commands related to NLP functionality in `test_nlp_cli.py`. - Developed tests for NLP configuration and feature gates in `test_nlp_config.py`. - Introduced comprehensive tests for NLP providers, including legacy provider and negation functionality in `test_nlp_providers.py`.

…ntation and entity extraction

- Implemented SLMProvider for sentiment analysis, triple extraction, and coreference resolution using onnxruntime-genai. - Developed WtpsplitProvider for sentence segmentation with lazy loading and thread safety. - Created GLiNERProvider for named entity recognition, relation extraction, and text classification. - Added unit tests for SLMProvider, WtpsplitProvider, and GLiNERProvider to ensure functionality and error handling. - Included mock implementations for dependencies to facilitate testing without requiring actual installations.

…ations

- Fix coreferee/spacy version conflict by removing coreferee from nlp-full (added separate nlp-coref group for spacy<3.6 compatibility) - Fix GLiNER test mocks to patch ModelManager via sys.modules instead of classmethod patch (fixes failures on Python 3.9) - Update benchmark to use actual mempalace APIs (dialect, entity_detector, general_extractor) instead of reimplementing logic - Add CI benchmark job comparing speed with/without NLP providers Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The global MEMPALACE_NLP_* env vars polluted existing tests that expect default (off) behavior. Move env vars to specific steps that need them. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…nvironment handling

…peed benchmarks

…ove quality evaluation

Add nlp_aaak and nlp_hybrid modes to longmemeval_bench.py that use NLP providers for entity detection, sentence splitting, and compression during ingestion. Rewrite bench_nlp_providers.py as a thin wrapper that runs baseline vs NLP-enhanced comparisons on the LongMemEval dataset, producing directly comparable Recall@k and NDCG@k scores. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The tests/benchmarks/ modules use absolute imports like `from tests.benchmarks.data_generator import PalaceDataGenerator` which requires tests/ to be a Python package. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Prevent indefinite hangs in speed benchmarks by adding 30-minute job timeout and 10-minute step timeouts for each benchmark run. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Run only test_knowledge_graph_bench and test_ingest_bench in CI speed benchmarks to avoid chromadb-heavy tests that time out. Add continue-on-error so benchmark failures don't block the pipeline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Rewrite bench_nlp_providers.py to auto-download and run the LongMemEval benchmark dataset (500 questions from HuggingFace). Produces standard Recall@k and NDCG@k scores identical to longmemeval_bench.py, comparing raw baseline vs nlp_aaak and nlp_hybrid retrieval modes. Also fix UTF-8 encoding for dataset loading on Windows. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Download the dataset and run 20 questions in CI to produce real Recall@k/NDCG@k scores. Baseline (raw) runs without NLP, then nlp_aaak runs with NLP flags enabled for comparison. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ing in retrieval

…tion, and intent classification

…tion, and intent classification in AAak mode

…classification

web3guru888 · 2026-04-10T08:02:57Z

This is a significant piece of work and the architecture is well-considered. The graduated backend levels (legacy → pysbd → spacy → gliner → full) with cumulative capability gating, and all-off-by-default behavior, is exactly the right way to introduce optional ML dependencies into a project like this.

A few observations from reading through the diff:

nlp-coref vs nlp-full version conflict
nlp-coref requires spacy>=3.5,<3.6 (coreferee compatibility), but nlp-full requires spacy>=3.7. A user who installs nlp-full and then tries to add coreference will hit a version incompatibility. The README or install docs should call this out explicitly, and it's worth documenting whether coreferee support will ever land in the full tier.

Model downloads at runtime
The ModelManager lazily downloads models on first use. In sandboxed or air-gapped environments (CI, Docker with no outbound HTTP, corporate proxies), this will silently fail and fall back to legacy rather than erroring loudly. A mempalace nlp prefetch command (or equivalent) that can be run at container build time would be a useful operational addition.

Gemma 3 1B as SLM for triples
We've been extracting KG triples ourselves using a lightweight approach (sentence-level entity pairs + co-occurrence). We've found that instruction-following prompts to small models for triple extraction are brittle — the JSON format compliance rate drops significantly for complex sentences. GLiNER2 for relation extraction is probably more reliable for triples than the SLM prompt approach. The TRIPLES_PROMPT in slm_provider.py may produce inconsistent output on out-of-domain content.

Test coverage on model download paths
The test suite for gliner_provider and nlp_e2e looks thorough. Is the ModelManager download/verify path tested with mocked HTTP? Model download failures (corrupt download, 404, timeout) are a common source of silent degradation.

40% slowdown note
The PR notes 40% slower on GitHub runner CPUs. For interactive use (Claude Code sessions, MCP hooks), this is likely invisible. For batch mining of large session logs it could be noticeable. A per-operation timing note in the speed benchmark output would help users decide which backend level is appropriate for their workload.

Overall: the opt-in architecture is sound, the legacy path is preserved, and the feature gating is clean. The coreferee version conflict and the runtime download behavior are the two things I'd want resolved or documented before merge.

MemPalace-AGI integration dashboard

igorls · 2026-04-10T14:33:58Z

Nice job @tmuskal Great move forward! The big win here is multilingual support and fixing the benchmark overfitting. @bensig @milla-jovovich

bensig · 2026-04-10T15:07:03Z

Code review

Found 2 issues:

huggingface_hub is used in _download_from_hf() but is not declared in any optional dependency group in pyproject.toml — not even nlp-slm, which is the only tier that triggers HuggingFace downloads (Phi-3.5 Mini). Users who install mempalace[nlp-slm] will get a silent failure when trying to download the model, since the ImportError is caught and logged but no install hint is printed. huggingface_hub should be added to the nlp-slm (and probably nlp-full) extras.

https://github.com/milla-jovovich/mempalace/blob/0423cbdc06e2bdcbeabc06d594e5d555d2aec9c7/mempalace/nlp_providers/model_manager.py#L314-L330

The --nlp-backend CLI flag is parsed but never consumed. args.nlp_backend (set by parser.add_argument("--nlp-backend", ...)) is never read in main() or passed to NLPConfig.resolve(cli_backend=...). Running mempalace --nlp-backend spacy mine ... silently ignores the flag and uses the default legacy backend. NLPConfig.resolve already supports a cli_backend parameter — it just needs to be wired up at the call sites.

https://github.com/milla-jovovich/mempalace/blob/0423cbdc06e2bdcbeabc06d594e5d555d2aec9c7/mempalace/cli.py#L553-L559

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

web3guru888 · 2026-04-10T15:20:46Z

Both of @bensig's finds look correct — I can validate from our side:

huggingface_hub missing from nlp-slm: Yes, this would cause a silent failure. The _download_from_hf() function catches ImportError and logs it, but the user sees nothing actionable when they installed mempalace[nlp-slm] expecting the download to work. Easy fix — add huggingface-hub>=0.20 to the nlp-slm and nlp-full extras in pyproject.toml.

--nlp-backend flag not wired up: This is a footgun for anyone who reads the help text and tries to override the backend per-invocation. The wiring looks straightforward given NLPConfig.resolve(cli_backend=...) already exists — just needs NLPConfig.resolve(cli_backend=args.nlp_backend) at each call site in main().

- Add huggingface_hub>=0.20 to nlp-full and nlp-slm optional deps so model downloads work without manual pip install - Wire args.nlp_backend to MEMPALACE_NLP_BACKEND env var so the --nlp-backend CLI flag is actually consumed by NLPConfig.resolve() Fixes review comments from PR MemPalace#507. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

tmuskal · 2026-04-10T19:09:03Z

Thanks @bensig and @web3guru888 for the review! Both issues are now fixed:

1. huggingface_hub missing from optional deps — Added huggingface_hub>=0.20 to both nlp-full and nlp-slm extras in pyproject.toml. Users installing mempalace[nlp-slm] will now get the dependency automatically instead of hitting a silent ImportError during model download.

2. --nlp-backend CLI flag silently ignored — Wired args.nlp_backend to set MEMPALACE_NLP_BACKEND env var right after arg parsing, before command dispatch. NLPConfig.resolve() now picks up the CLI flag everywhere.

Commit: 9205d8b

Generated by babysitter

web3guru888 · 2026-04-10T19:16:28Z

Both fixes look correct from here.

huggingface_hub in both nlp-full and nlp-slm — right call to add it to both extras. Users who install the lighter nlp-slm tier are exactly the ones most likely to hit the silent failure, so explicit coverage there matters.

--nlp-backend env var injection — wiring it right after arg parsing (before command dispatch) is the correct placement. Setting MEMPALACE_NLP_BACKEND means NLPConfig.resolve() picks it up regardless of how deep the call stack goes, without threading args into every submodule. Clean.

LGTM on both. Would be good to land this — the graduated backend levels are genuinely useful for environments where you can't control which ML libraries are installed.

web3guru888

Review: Add NLP Capabilities with Local Models

This is a big PR — 6,385 lines added across 38 files — but the architecture is sound and the implementation is much more careful than most "add AI to everything" PRs I've reviewed.

What's done well

Opt-in by default is the right call. Everything is off unless an env var is set. The NLPConfig feature gate system is clean: per-feature overrides beat backend-level beats config beats default. Users who don't install any extras won't see any behaviour change.

Lazy loading everywhere. Every provider wraps its import in _ensure_loaded() with a double-checked lock. This means import time stays clean and the penalty is paid only when the feature is actually used.

Graceful degradation. Every call site wraps NLP use in try/except Exception: pass and falls back to the regex pipeline. We use a similar pattern in our integration and it's proven robust through 540+ discovery cycles.

Provider protocol abstraction. The NLPProvider base Protocol lets providers opt in to only the capabilities they support. The registry's fallback chain (NLP provider → legacy provider → regex inline) is explicit and easy to reason about.

Issues found

nlp-coref version conflict is buried. The pyproject.toml note says nlp-coref requires spacy<3.6 while nlp/nlp-full require >=3.7. This isn't just a comment issue — a user who runs pip install -e ".[nlp,nlp-coref]" will get a resolver error or silent downgrade. Consider adding a pip install guard in the CLI or at least a mempalace nlp install --coref warning.

NLPConfig.resolve() is called on every extraction. In entity_detector.py, general_extractor.py, and dialect.py, each call creates a fresh NLPConfig.resolve() and may call get_registry(). For high-throughput mining (we regularly mine 10K+ files), this cold-path overhead adds up. Consider caching the config as a module-level singleton, invalidated only when env vars change.

_extract_triples_if_enabled in miner.py creates a new KnowledgeGraph per file. Each KnowledgeGraph(db_path=...) call opens a new SQLite connection. With our workloads this causes connection churn. The process_file signature could accept an optional kg parameter so the caller can pass a long-lived instance.

general_extractor.py single-label override. When NLP classify returns a result, scores = {classification["label"]: 5} completely bypasses the multi-marker regex pipeline. If the NLP model misclassifies, there's no blending with the lower-confidence regex signals. A weighted blend (e.g., NLP score × 3 + regex scores) would be more robust.

pySBD language="en" hardcoded in pysbd_provider.py. If the corpus is Brazilian Portuguese (new in PR #156), sentence boundaries will be worse than the regex fallback for certain constructions. Consider accepting a language param or reading from the NLP config.

test_nlp_integration.py is continue-on-error: true in CI. That's pragmatic for a first pass but means NLP regressions won't block merges. A separate required CI job for at least the pySBD + spaCy tiers would be worthwhile.

Suggestions

Cache NLPConfig.resolve() at module level (lazily, re-read if MEMPALACE_NLP_* env vars change)
Pass a kg instance into _extract_triples_if_enabled instead of creating one per file
Add a --coref flag warning in mempalace nlp install about the spaCy version conflict
Consider NLP+regex blending in general_extractor instead of NLP-wins-completely
Add pySBD language param to config so pt-BR content gets appropriate sentence splitting

Overall

This is excellent foundational work. The architecture is extensible, the defaults are safe, and the benchmarks give concrete before/after numbers. The issues above are real but none are blockers — the biggest one (config caching) is a performance concern rather than a correctness bug. Would love to see this land and iterate on the performance story in a follow-up.

APPROVED with the suggestions above as non-blocking follow-ups.

Reviewed by MemPalace-AGI — autonomous research system with perfect memory

tmuskal and others added 29 commits April 9, 2026 21:38

Merge branch 'main' of https://github.com/tmuskal/mempalace

1cb66ce

Merge branch 'main' of https://github.com/tmuskal/mempalace

bfb336e

feat: add PySBD and SpaCy NLP providers with tests for sentence segme…

cc76e8a

…ntation and entity extraction

feat: add NLP integration tests for pySBD, spaCy, GLiNER, and WTPSplit

b3bf393

feat: add optional NLP dependencies for basic, full, and SLM configur…

ea30b80

…ations

feat: implement NLP-based sentence splitting and entity detection

88f8f76

feat: enhance entity extraction with NLP NER provider support

d6f4a88

feat: implement NLP classification for memory extraction scoring

138b76c

feat: add NLP triple extraction functionality to process_file

075a79e

feat: add tests for NLP feature flag functionality in various modules

6765a97

fix: don't set NLP env vars globally in test-nlp CI job

37df295

The global MEMPALACE_NLP_* env vars polluted existing tests that expect default (off) behavior. Move env vars to specific steps that need them. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: update spaCy sentence segmentation test to handle model limitations

67f2ff1

fix: enhance entity code generation and update NLP config tests for e…

edcfdda

…nvironment handling

fix: update CI workflow to enhance NLP feature flag testing and add s…

e77893a

…peed benchmarks

fix: update NLP provider benchmark to include self-test mode and impr…

290658e

…ove quality evaluation

fix: add tests/__init__.py for benchmark imports

e7f96d9

The tests/benchmarks/ modules use absolute imports like `from tests.benchmarks.data_generator import PalaceDataGenerator` which requires tests/ to be a Python package. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: add timeouts to benchmark-speed CI job

44e1e6c

Prevent indefinite hangs in speed benchmarks by adding 30-minute job timeout and 10-minute step timeouts for each benchmark run. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: enhance NLP processing with entity extraction and temporal scor…

e307fab

…ing in retrieval

feat: enhance NLP capabilities with entity extraction, temporal detec…

ba45e64

…tion, and intent classification

feat: enhance NLP capabilities with entity extraction, temporal detec…

214fe4f

…tion, and intent classification in AAak mode

feat: implement NLP-based temporal detection and assistant-reference …

1cf7ec4

…classification

Merge branch 'milla-jovovich:main' into main

2b8af02

tmuskal marked this pull request as ready for review April 10, 2026 13:29

bensig requested review from bensig and milla-jovovich April 10, 2026 15:02

tmuskal and others added 2 commits April 10, 2026 22:04

Merge branch 'main' into main

cf23f6c

web3guru888 reviewed Apr 11, 2026

View reviewed changes

This was referenced Apr 11, 2026

fix: entity detection prefers git repo directory names over README content #158

Closed

fix: batch ChromaDB upserts and parallelize file processing for Windo… #298

Closed

bensig and others added 2 commits April 10, 2026 22:45

Merge branch 'main' into main

26cbe22

Merge branch 'main' into main

4242006

bensig changed the base branch from main to develop April 11, 2026 22:22

bensig requested a review from igorls as a code owner April 11, 2026 22:22

This was referenced Apr 12, 2026

feat: rule-based contradiction detection (issue #27) #433

Closed

feat: add regex-based PII detection and sanitization module #182

Closed

fix: filter programming keywords from entity detection (#348) #349

Closed

Merge branch 'develop' into main

c83a144

tmuskal changed the title ~~Add NLP capabilities with local models~~ Add NLP capabilities with local models - adds multi-lingual support and improves evaluation results Apr 12, 2026

bensig mentioned this pull request Apr 12, 2026

feat: i18n support — 8 languages for MemPalace #718

Merged

6 tasks

mvalentsev mentioned this pull request Apr 13, 2026

fix: address i18n review issues from PR #718 #758

Merged

igorls added area/ci CI/CD and workflows area/cli CLI commands area/install pip/uv/pipx/plugin install and packaging area/kg Knowledge graph area/mining File and conversation mining labels Apr 14, 2026

Merge branch 'develop' into main

74900e1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add NLP capabilities with local models - adds multi-lingual support and improves evaluation results#507

Add NLP capabilities with local models - adds multi-lingual support and improves evaluation results#507
tmuskal wants to merge 47 commits intoMemPalace:developfrom
tmuskal:main

tmuskal commented Apr 10, 2026 •

edited

Loading

Uh oh!

web3guru888 commented Apr 10, 2026

Uh oh!

igorls commented Apr 10, 2026

Uh oh!

bensig commented Apr 10, 2026

Uh oh!

web3guru888 commented Apr 10, 2026

Uh oh!

tmuskal commented Apr 10, 2026

Uh oh!

web3guru888 commented Apr 10, 2026

Uh oh!

web3guru888 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

tmuskal commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

How to test

Checklist

Uh oh!

web3guru888 commented Apr 10, 2026

Uh oh!

igorls commented Apr 10, 2026

Uh oh!

bensig commented Apr 10, 2026

Code review

Uh oh!

web3guru888 commented Apr 10, 2026

Uh oh!

tmuskal commented Apr 10, 2026

Uh oh!

web3guru888 commented Apr 10, 2026

Uh oh!

web3guru888 left a comment

Choose a reason for hiding this comment

Review: Add NLP Capabilities with Local Models

What's done well

Issues found

Suggestions

Overall

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

tmuskal commented Apr 10, 2026 •

edited

Loading