What problem does this solve?
MemPalace currently only works with English. Room classification uses English keyword lists, memory extraction uses English regex patterns, entity detection uses the English-only `\b[A-Z][a-z]+\b` pattern, and spellcheck corrupts CJK text. Non-English users (Chinese, French, Japanese, etc.) get incorrect room assignments, no memory type detection, and broken spellcheck.
The MemChinesePalace fork (https://github.com/Chandler-Sun/MemChinesePalace) shows community demand exists — but it's a full rewrite that drops many features (onboarding, entity registry, palace graph, etc.).
What's the proposed solution?
Use embedding-based semantic classification instead of per-language keyword/regex patterns. The multilingual embedding model (`paraphrase-multilingual-MiniLM-L12-v2`, ~120MB, CPU-only) understands 50+ languages natively.
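The mechanism can be sketched with toy vectors standing in for real model embeddings (`classify_room` and the example rooms here are illustrative, not the actual MemPalace code):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def classify_room(text_vec, room_desc_vecs):
    # room_desc_vecs maps each room name to the embedding of its description;
    # the text is assigned to the room whose description it is closest to.
    return max(room_desc_vecs, key=lambda room: cosine(text_vec, room_desc_vecs[room]))

# With real embeddings, both the text and the room descriptions would be
# encoded by the same multilingual model, so any input language works.
rooms = {"work": [1.0, 0.0], "personal": [0.0, 1.0]}
print(classify_room([0.9, 0.1], rooms))  # → work
```

Memory type classification works the same way, just with memory type descriptions in place of room descriptions.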
Concrete changes:
- Room classification: replace `TOPIC_KEYWORDS` matching with cosine similarity against room description embeddings. Any language works with zero configuration.
- Memory extraction: replace per-language regex markers with cosine similarity against memory type description embeddings (decision, preference, milestone, problem, emotional). Each paragraph classified independently to preserve multi-type detection.
- Entity detection: add Chinese name patterns (百家姓 surnames, simplified+traditional). This stays rule-based because NER is fundamentally pattern matching, not classification.
- Embedding model: configurable via `config.json` / env var, default multilingual. `sentence-transformers` is an optional dependency (`pip install mempalace[multilingual]`). Core functionality works without it (English regex fallback).
- Spellcheck: auto-skips non-English text via Unicode detection.
- All 7 ChromaDB consumer modules (searcher, miner, convo_miner, mcp_server, layers, palace_graph, cli) use a centralized `get_embedding_function()` from `config.py`.
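The Chinese name detection could look roughly like the following sketch; the surname set here is a tiny illustrative subset, not the full 百家姓 list the proposal describes:

```python
import re

# Tiny illustrative subset of common surnames (simplified + traditional pairs);
# the real pattern would cover the full 百家姓 list.
SURNAMES = "王李张張刘劉陈陳杨楊黄黃"
# A surname followed by a one- or two-character given name.
CHINESE_NAME = re.compile(rf"[{SURNAMES}][\u4e00-\u9fff]{{1,2}}")

def find_chinese_names(text):
    return CHINESE_NAME.findall(text)

print(find_chinese_names("我和王小明去开会"))  # → ['王小明']
```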
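The spellcheck skip could be a simple Unicode heuristic along these lines (`should_skip_spellcheck` and the 50% threshold are assumptions for illustration):

```python
def should_skip_spellcheck(text, threshold=0.5):
    # Skip spellcheck when most letters fall outside the Latin blocks
    # (CJK, kana, Hangul, ...), since an English spellchecker would corrupt them.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    # Code points above Latin Extended-B (U+024F) are treated as non-Latin.
    non_latin = sum(1 for ch in letters if ord(ch) > 0x024F)
    return non_latin / len(letters) >= threshold
```

A ratio threshold rather than an any-match rule keeps spellcheck active for mostly-English text that quotes a few foreign words.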
Benchmark results — 173 test cases across 8 languages (zh-Hans, zh-Hant, en, fr, es, de, ja, ko):
- Base: 122/122 (100%)
- Extended: 51/51 (100%)
- 219 unit tests, 0 regressions
One new optional dependency: `sentence-transformers>=2.0` (install via `pip install mempalace[multilingual]`). Without it, the system falls back to English regex — existing behavior unchanged. No external API calls. Everything runs locally. No GPU needed.
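The optional-dependency fallback could be wired roughly like this (a sketch, not the fork's actual `config.py`; the lazy model load keeps import-time cost near zero):

```python
import importlib.util

def get_embedding_function(model_name="paraphrase-multilingual-MiniLM-L12-v2"):
    # Return a text-embedding callable when sentence-transformers is installed,
    # or None so callers fall back to the existing English regex path.
    if importlib.util.find_spec("sentence_transformers") is None:
        return None

    def embed(texts):
        # A real implementation would cache the model rather than
        # reloading it on every call.
        from sentence_transformers import SentenceTransformer
        return SentenceTransformer(model_name).encode(texts)

    return embed
```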
Alternatives considered:
- Per-language regex patterns (like MemChinesePalace): Requires ~250 manually curated items per language. Not scalable — adding French/German/Japanese would each need someone to write language-specific keyword lists and regex patterns.
- LLM-in-the-loop classification: Use Claude/GPT to classify content. Best accuracy, but violates MemPalace's "zero API, local only" principle and adds latency/cost.
- External NLP models (spaCy, etc.): Heavier dependency, larger download, and still language-specific (each language needs its own model). The sentence-transformer approach is lighter and inherently multilingual.
The embedding approach is the sweet spot: zero per-language config, local-only, lightweight (~120MB model), and 100% benchmark accuracy.
I have a working implementation in my fork: https://github.com/EndeavorYen/mempalace. Happy to open a PR if the approach looks good, and open to splitting it into smaller PRs or adjusting based on feedback.