
[Feature] Add Multilingual Support  #231

@EndeavorYen

Description

What problem does this solve?

MemPalace currently works only with English. Room classification relies on English keyword lists, memory extraction on English regex patterns, entity detection on the English-only pattern \b[A-Z][a-z]+\b, and spellcheck corrupts CJK text. Non-English users (Chinese, French, Japanese, etc.) get incorrect room assignments, no memory-type detection, and broken spellcheck.
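For illustration (the sentence is invented), the English-only entity pattern finds Latin-script names but misses CJK names entirely, and even accented Latin names slip through:

```python
import re

# The English-only entity pattern described above: one uppercase letter
# followed by lowercase ASCII letters, bounded by word breaks.
ENTITY_PATTERN = re.compile(r"\b[A-Z][a-z]+\b")

# 张伟 never matches (no [A-Z] start); María also fails because the
# accented í falls outside [a-z], so no \b can occur after "Mar".
text = "Alice discussed the roadmap with 张伟 and María."
print(ENTITY_PATTERN.findall(text))  # → ['Alice']
```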

The MemChinesePalace fork (https://github.com/Chandler-Sun/MemChinesePalace) shows community demand exists — but it's a full rewrite that drops many features (onboarding, entity registry, palace graph, etc.).

What's the proposed solution?

Use embedding-based semantic classification instead of per-language keyword/regex patterns. The multilingual embedding model (paraphrase-multilingual-MiniLM-L12-v2, ~120MB, CPU-only) understands 50+ languages natively.
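The core idea can be sketched as plain cosine similarity against label-description embeddings. Everything below is illustrative: `classify` and the toy two-dimensional `toy_embed` table are hypothetical stand-ins for the real paraphrase-multilingual-MiniLM-L12-v2 encoder, which would return 384-dimensional vectors.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def classify(text, label_embeddings, embed):
    """Pick the label whose description embedding is closest to the text.

    `embed` maps text -> vector; with a multilingual sentence encoder,
    any input language works without per-language keyword lists.
    """
    vec = embed(text)
    return max(label_embeddings, key=lambda label: cosine(vec, label_embeddings[label]))

# Toy stand-in vectors for two room descriptions:
rooms = {"kitchen": [1.0, 0.0], "office": [0.0, 1.0]}
toy_embed = {"做饭和菜谱": [0.9, 0.2], "quarterly report": [0.1, 0.8]}.get
print(classify("做饭和菜谱", rooms, toy_embed))       # → kitchen
print(classify("quarterly report", rooms, toy_embed))  # → office
```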

Concrete changes:

  • Room classification: replace TOPIC_KEYWORDS matching with cosine similarity against room description embeddings. Any language works with zero configuration.
  • Memory extraction: replace per-language regex markers with cosine similarity against memory type description embeddings (decision, preference, milestone, problem, emotional). Each paragraph classified independently to preserve multi-type detection.
  • Entity detection: add Chinese name patterns (百家姓 surnames, simplified+traditional). This stays rule-based because NER is fundamentally pattern matching, not classification.
  • Embedding model: configurable via config.json / env var, default multilingual. sentence-transformers is an optional dependency (pip install mempalace[multilingual]). Core functionality works without it (English regex fallback).
  • Spellcheck: auto-skips non-English text via Unicode detection.
  • All 7 ChromaDB consumer modules (searcher, miner, convo_miner, mcp_server, layers, palace_graph, cli) use a centralized get_embedding_function() from config.py.
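The spellcheck auto-skip in the list above amounts to a small Unicode heuristic. A minimal sketch, assuming a codepoint-range check (the function name and the 0.5 threshold are assumptions, not the fork's actual values):

```python
def should_spellcheck(text, threshold=0.5):
    """Run spellcheck only when the text is mostly Latin-script letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    # U+0000..U+024F covers ASCII, Latin-1, and Latin Extended-A/B,
    # so accented French/Spanish/German text still gets spellchecked.
    latin = sum(1 for ch in letters if ch <= "\u024f")
    return latin / len(letters) >= threshold
```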

Benchmark results — 173 test cases across 8 languages (zh-Hans, zh-Hant, en, fr, es, de, ja, ko):

Base: 122/122 (100%)    Extended: 51/51 (100%)    219 unit tests, 0 regressions

One new optional dependency: sentence-transformers>=2.0 (install via pip install mempalace[multilingual]). Without it, the system falls back to English regex — existing behavior unchanged. No external API calls. Everything runs locally. No GPU needed.
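The optional-dependency fallback could look roughly like this. It is a sketch, not the fork's code: `get_embedding_function` matches the name mentioned above, but the `MEMPALACE_EMBEDDING_MODEL` env var name is an assumption, and the model is loaded lazily so importing config.py stays cheap.

```python
import importlib.util
import os

DEFAULT_MODEL = os.environ.get(
    "MEMPALACE_EMBEDDING_MODEL",  # assumed env var name, for illustration
    "paraphrase-multilingual-MiniLM-L12-v2",
)

def get_embedding_function(model_name=DEFAULT_MODEL):
    """Return a texts -> vectors callable, or None when the optional
    sentence-transformers extra is missing (English regex fallback)."""
    if importlib.util.find_spec("sentence_transformers") is None:
        return None  # enable via: pip install mempalace[multilingual]
    cache = {}
    def embed(texts):
        if "model" not in cache:
            # Lazy load: ~120 MB model, CPU-only, fetched once.
            from sentence_transformers import SentenceTransformer
            cache["model"] = SentenceTransformer(model_name)
        return cache["model"].encode(texts)
    return embed
```

Returning None (rather than raising) lets each of the seven consumer modules branch to the existing English regex path with a single check.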

Alternatives considered:

  1. Per-language regex patterns (like MemChinesePalace): Requires ~250 manually curated items per language. Not scalable — adding French/German/Japanese would each need someone to write language-specific keyword lists and regex patterns.

  2. LLM-in-the-loop classification: Use Claude/GPT to classify content. Best accuracy but violates MemPalace's "zero API, local only" principle and adds latency/cost.

  3. External NLP models (spaCy, etc.): Heavier dependency, larger download, and still language-specific (need per-language model). The sentence-transformer approach is lighter and inherently multilingual.

The embedding approach is the sweet spot: zero per-language config, local-only, lightweight (~120MB model), and 100% benchmark accuracy.

I have a working implementation in my fork: https://github.com/EndeavorYen/mempalace. Happy to open a PR if the approach looks good, and open to splitting it into smaller PRs or adjusting based on feedback.

Metadata

    Labels

    area/i18n: Multilingual, Unicode, non-English embeddings
    area/mining: File and conversation mining
    area/search: Search and retrieval
    enhancement: New feature or request
