
[Feature] Add Multilingual Support  #231

@EndeavorYen

Description

What problem does this solve?

MemPalace currently works only with English. Room classification relies on English keyword lists, memory extraction on English regex patterns, entity detection on the English-only pattern \b[A-Z][a-z]+\b, and spellcheck corrupts CJK text. Non-English users (Chinese, French, Japanese, etc.) get incorrect room assignments, no memory-type detection, and broken spellcheck.
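For illustration (the sentence is invented), the English-only entity pattern finds Latin-script names but misses CJK names entirely, and even accented Latin names slip through:

```python
import re

# The English-only entity pattern described above: one uppercase letter
# followed by lowercase ASCII letters, bounded by word breaks.
ENTITY_PATTERN = re.compile(r"\b[A-Z][a-z]+\b")

# 张伟 never matches (no [A-Z] start); María also fails because the
# accented í falls outside [a-z], so no \b can occur after "Mar".
text = "Alice discussed the roadmap with 张伟 and María."
print(ENTITY_PATTERN.findall(text))  # → ['Alice']
```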

The MemChinesePalace fork (https://github.com/Chandler-Sun/MemChinesePalace) shows community demand exists — but it's a full rewrite that drops many features (onboarding, entity registry, palace graph, etc.).

What's the proposed solution?

Use embedding-based semantic classification instead of per-language keyword/regex patterns. The multilingual embedding model (paraphrase-multilingual-MiniLM-L12-v2, ~120MB, CPU-only) understands 50+ languages natively.
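The core idea can be sketched as plain cosine similarity against label-description embeddings. Everything below is illustrative: `classify` and the toy two-dimensional `toy_embed` table are hypothetical stand-ins for the real paraphrase-multilingual-MiniLM-L12-v2 encoder, which would return 384-dimensional vectors.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def classify(text, label_embeddings, embed):
    """Pick the label whose description embedding is closest to the text.

    `embed` maps text -> vector; with a multilingual sentence encoder,
    any input language works without per-language keyword lists.
    """
    vec = embed(text)
    return max(label_embeddings, key=lambda label: cosine(vec, label_embeddings[label]))

# Toy stand-in vectors for two room descriptions:
rooms = {"kitchen": [1.0, 0.0], "office": [0.0, 1.0]}
toy_embed = {"做饭和菜谱": [0.9, 0.2], "quarterly report": [0.1, 0.8]}.get
print(classify("做饭和菜谱", rooms, toy_embed))       # → kitchen
print(classify("quarterly report", rooms, toy_embed))  # → office
```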

Concrete changes:

  • Room classification: replace TOPIC_KEYWORDS matching with cosine similarity against room description embeddings. Any language works with zero configuration.
  • Memory extraction: replace per-language regex markers with cosine similarity against memory type description embeddings (decision, preference, milestone, problem, emotional). Each paragraph classified independently to preserve multi-type detection.
  • Entity detection: add Chinese name patterns (百家姓 surnames, simplified+traditional). This stays rule-based because NER is fundamentally pattern matching, not classification.
  • Embedding model: configurable via config.json / env var, default multilingual. sentence-transformers is an optional dependency (pip install mempalace[multilingual]). Core functionality works without it (English regex fallback).
  • Spellcheck: auto-skips non-English text via Unicode detection.
  • All 7 ChromaDB consumer modules (searcher, miner, convo_miner, mcp_server, layers, palace_graph, cli) use a centralized get_embedding_function() from config.py.
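The spellcheck auto-skip in the list above amounts to a small Unicode heuristic. A minimal sketch, assuming a codepoint-range check (the function name and the 0.5 threshold are assumptions, not the fork's actual values):

```python
def should_spellcheck(text, threshold=0.5):
    """Run spellcheck only when the text is mostly Latin-script letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    # U+0000..U+024F covers ASCII, Latin-1, and Latin Extended-A/B,
    # so accented French/Spanish/German text still gets spellchecked.
    latin = sum(1 for ch in letters if ch <= "\u024f")
    return latin / len(letters) >= threshold
```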

Benchmark results — 173 test cases across 8 languages (zh-Hans, zh-Hant, en, fr, es, de, ja, ko):

Base: 122/122 (100%)    Extended: 51/51 (100%)    219 unit tests, 0 regressions

One new optional dependency: sentence-transformers>=2.0 (install via pip install mempalace[multilingual]). Without it, the system falls back to English regex — existing behavior unchanged. No external API calls. Everything runs locally. No GPU needed.
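The optional-dependency fallback could look roughly like this. It is a sketch, not the fork's code: `get_embedding_function` matches the name mentioned above, but the `MEMPALACE_EMBEDDING_MODEL` env var name is an assumption, and the model is loaded lazily so importing config.py stays cheap.

```python
import importlib.util
import os

DEFAULT_MODEL = os.environ.get(
    "MEMPALACE_EMBEDDING_MODEL",  # assumed env var name, for illustration
    "paraphrase-multilingual-MiniLM-L12-v2",
)

def get_embedding_function(model_name=DEFAULT_MODEL):
    """Return a texts -> vectors callable, or None when the optional
    sentence-transformers extra is missing (English regex fallback)."""
    if importlib.util.find_spec("sentence_transformers") is None:
        return None  # enable via: pip install mempalace[multilingual]
    cache = {}
    def embed(texts):
        if "model" not in cache:
            # Lazy load: ~120 MB model, CPU-only, fetched once.
            from sentence_transformers import SentenceTransformer
            cache["model"] = SentenceTransformer(model_name)
        return cache["model"].encode(texts)
    return embed
```

Returning None (rather than raising) lets each of the seven consumer modules branch to the existing English regex path with a single check.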

Alternatives considered:

  1. Per-language regex patterns (like MemChinesePalace): Requires ~250 manually curated items per language. Not scalable — adding French/German/Japanese would each need someone to write language-specific keyword lists and regex patterns.

  2. LLM-in-the-loop classification: Use Claude/GPT to classify content. Best accuracy but violates MemPalace's "zero API, local only" principle and adds latency/cost.

  3. External NLP models (spaCy, etc.): Heavier dependency, larger download, and still language-specific (need per-language model). The sentence-transformer approach is lighter and inherently multilingual.

The embedding approach is the sweet spot: zero per-language config, local-only, lightweight (~120MB model), and 100% benchmark accuracy.

I have a working implementation in my fork: https://github.com/EndeavorYen/mempalace. Happy to open a PR if the approach looks good, and open to splitting it into smaller PRs or adjusting based on feedback.

Metadata

    Labels

    area/i18n: Multilingual, Unicode, non-English embeddings
    area/mining: File and conversation mining
    area/search: Search and retrieval
    enhancement: New feature or request
