feat: add Brazilian Portuguese support to entity_detector (closes #117) #156
Conversation
PR Review: feat: add Brazilian Portuguese support to entity_detector (closes #117)

Executive Summary
Affected Areas:

Business Impact: Enables person-entity detection in Brazilian Portuguese text and mixed EN/PT-BR corpora. Users mining Portuguese conversations will now see the same quality of entity extraction they get with English content.

Flow Changes:

Ratings
PR Health
Medium Priority Issues

🐛 #1: Portuguese direct-address patterns double-counted in
force-pushed from b4ea25e to 0990c10
Quick check on the asymmetry claim: English
No asymmetry — the two languages produce the same scores for semantically equivalent inputs. The double-counting of greetings is pre-existing design for English and was intentionally mirrored for PT-BR so that behaviour stays consistent across languages. Rebased on latest main. The pt-br entity tests still pass locally.
force-pushed from 0990c10 to 879de92
force-pushed from 4f05ed5 to 3d250ad
…ation

Replace per-language keyword/regex heuristics with embedding-based semantic classification, enabling MemPalace to work with 50+ languages using zero per-language configuration.

Changes:
- Room classification: cosine similarity against room description embeddings
- Memory extraction: embedding-based classification (5 types, any language)
- Entity detection: add Chinese name patterns (百家姓 surnames)
- Spellcheck: auto-skip CJK text via Unicode detection
- Embedding provider: pluggable via get_embedding_function() with caching
  - Default: paraphrase-multilingual-MiniLM-L12-v2 (sentence-transformers)
  - Ollama: "ollama:<model>" prefix (e.g., ollama:qwen3-embedding-8b)
  - Configurable via MEMPALACE_EMBEDDING_MODEL env var or config.json
- Knowledge graph: temporal triples, multi-hop traversal, auto-extraction
- Dialect: CJK bigram extraction for topic keywords
- All ChromaDB consumers route through centralized embedding function

New optional dependency: sentence-transformers>=2.0
Install: pip install mempalace[multilingual]
Without it: English regex fallback (existing behavior unchanged)

Benchmark: 173/173 (100%) across 8 languages (zh-Hans, zh-Hant, en, fr, es, de, ja, ko)
652 tests passing, 0 failures. CI-compatible (multilingual tests skip gracefully when sentence-transformers is not installed).

Closes MemPalace#231. Related: MemPalace#37, MemPalace#50, MemPalace#92, MemPalace#117, MemPalace#156, MemPalace#273.
force-pushed from e15ccd1 to 0afc71f
force-pushed from 0afc71f to 3e9435a
web3guru888
left a comment
Review: Brazilian Portuguese Support for entity_detector
Well-considered i18n addition. The "additive patterns, no language gating" approach is pragmatic and correct — most real-world corpora are mixed-language anyway.
What's done well
Additive design over language detection. Rather than classifying files as English vs Portuguese and switching pattern sets, the PT-BR patterns are merged into _build_patterns() alongside the English ones. This is the right call: our integration processes 540+ discoveries and roughly 15–20% contain mixed-language content. Additive patterns handle this cleanly; a language-switch would miss the overlap.
Regex range extension in extract_candidates. Changing [A-Z] to [A-ZÀ-ÖØ-Þ] and [a-z] to [a-zà-öø-ÿ] is correct ISO Latin-1 supplement coverage. João, Inês, Ângela, and André all get picked up. The test test_detect_entities_picks_up_accented_names verifies this end-to-end.
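The widened class can be sanity-checked in isolation. A standalone sketch of just the character ranges the review describes, not the project's actual `extract_candidates` code:

```python
import re

# Candidate pattern with the Latin-1 Supplement ranges from the review:
# uppercase A-Z plus À-Ö and Ø-Þ, lowercase a-z plus à-ö and ø-ÿ.
CANDIDATE = re.compile(r"\b[A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]{1,19}\b")

text = "João asked Inês whether Ângela and André had met Bob."
print(CANDIDATE.findall(text))
# ['João', 'Inês', 'Ângela', 'André', 'Bob']
```

With the old ASCII-only `[A-Z][a-z]{1,19}` class, the accented names would either be dropped or truncated at the first accented character.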
STOPWORDS additions are appropriate. oi, olá, obrigado/a, caro, cara are all high-frequency PT-BR words that would otherwise score as entity candidates. The accented olá alongside ASCII ola handles both typed forms.
Test coverage is thorough. Eight tests including mixed corpus, pronoun proximity, direct address, dialogue markers, and accented names. test_mixed_english_portuguese_corpus (checking that mixed > English-only person score) is especially good.
Issues found
cara and caro added to STOPWORDS, but they're also in the pattern list. PERSON_VERB_PATTERNS_PTBR includes r"\bcaro\s+{name}\b" and r"\bcara\s+{name}\b" as direct-address markers. If someone is literally named "Cara" or "Caro", those names are now silently dropped by STOPWORDS before they reach pattern scoring. The patterns would never fire. Consider removing these two from STOPWORDS and leaving them only in the direct-address pattern (where they're already context-guarded by the following name).
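Assuming candidates are filtered against STOPWORDS before pattern scoring, as the review describes, the silent drop is easy to reproduce. A minimal standalone sketch with hypothetical helper names, not MemPalace's actual pipeline:

```python
import re

STOPWORDS = {"oi", "olá", "ola", "obrigado", "obrigada", "caro", "cara"}

def extract_candidates(text):
    # Simplified candidate extraction: capitalized Latin words.
    return re.findall(r"\b[A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]{1,19}\b", text)

def surviving_candidates(text):
    # Stopword filtering runs BEFORE pattern scoring, so a person
    # literally named "Cara" never reaches the direct-address patterns.
    return [c for c in extract_candidates(text) if c.lower() not in STOPWORDS]

text = "Cara apresentou o relatório. Obrigado, Cara!"
print(extract_candidates(text))    # ['Cara', 'Obrigado', 'Cara']
print(surviving_candidates(text))  # [] (the name is dropped before scoring)
```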
ama (loves) and quer (wants) are short common verbs that look like collision risks at first glance. The pattern \b{name}\s+ama\b will match "Maria ama" correctly, and since {name} is the escaped entity name, the pattern only fires when the entity name immediately precedes the verb; the real collision risk is low. Not a bug, just worth noting for the next i18n contributor.
No Spanish cognate guard. disse, perguntou, decidiu are distinctly PT-BR. But quer and ama appear in Spanish too (and sabe is identical in Spanish). For a PT-BR-specific PR this is fine, but if ES support is added later, the pattern lists may interact. A comment flagging this would be helpful.
PRONOUN_PATTERNS_PTBR word boundaries check out. r"\bela\b" is correct, and the longer forms r"\bdeles\b" and r"\bdelas\b" also carry \b on both sides. This is actually good. ✓
test_portuguese_direct_address asserts person_score >= 12 — this is a magic number tied to the current scoring weights. If weights change, the test breaks. Consider asserting person_score > 0 and len(patterns["direct"].findall(text)) == 3 separately (the test already does the latter).
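The suggested restructure could look like the following. Hypothetical test body: the scorer here is a toy stand-in, only the assertion shape matters:

```python
import re

def build_direct_pattern(name):
    # Direct-address alternation as described in the reviews.
    n = re.escape(name)
    return re.compile(
        rf"\boi\s+{n}\b|\bol[áa]\s+{n}\b|\bobrigad[oa]\s+{n}\b", re.IGNORECASE
    )

def person_score(text, name):
    # Toy scorer: some weight per direct-address hit; real weights live elsewhere.
    return 4 * len(build_direct_pattern(name).findall(text))

text = "oi Maria! olá Maria, tudo bem? obrigada Maria."
# Weight-agnostic assertions: check the structural signal (3 hits) and the
# sign of the score separately, so changing weights cannot break the test.
assert len(build_direct_pattern("Maria").findall(text)) == 3
assert person_score(text, "Maria") > 0
```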
Language detection is absent by design — but there's no documentation of this decision. A comment in entity_detector.py noting "PT-BR patterns are additive and always active; see issue #117" would help future contributors understand why there's no lang= parameter.
Suggestions
- Remove `cara`/`caro` from STOPWORDS (or add a note that they're intentionally excluded from entity detection since they're in direct-address patterns)
- Replace the magic `>= 12` assertion with `> 0` for score stability
- Add a module comment explaining the additive-patterns design decision
- Consider a test with a Portuguese common noun that should NOT be classified as a person (e.g., a project or tool with a PT-BR name)
Overall
Clean, well-tested i18n work. The additive approach is the right architecture, the regex range extension is correct, and the test suite is more thorough than most i18n PRs. The cara/caro STOPWORDS issue is the only real correctness concern.
APPROVED — cara/caro STOPWORDS issue is worth addressing before merge but not a hard blocker.
Reviewed by MemPalace-AGI — autonomous research system with perfect memory
web3guru888
left a comment
PR #156 — feat: add Brazilian Portuguese support to entity_detector
A well-scoped internationalization addition that extends entity detection to pt-BR corpora. 126 new tests, Unicode-aware candidate extraction, and an additive (non-breaking) pattern strategy. Strong execution on a genuinely useful feature.
What works well
Additive pattern strategy: Appending PTBR patterns to the existing English lists rather than forking detection logic is the right call. Mixed English/Portuguese corpora (very common in Brazilian tech teams) work without any language-classification step — a real-world win. The test test_mixed_english_portuguese_corpus validates this explicitly.
Unicode candidate extraction: The regex expansion from [A-Z][a-z]{1,19} to [A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]{1,19} is correct Latin-1 Supplement coverage. João, Inês, Ângela, and André will all be picked up. The multi-word match regex receives the same treatment consistently — good.
STOPWORDS additions: Adding oi, olá, obrigado/a, caro, and cara prevents common Portuguese greetings from being scored as entity names. Correct and necessary.
direct pattern inline expansion: Rather than creating a new pattern list, the direct regex is extended inline with |\\boi\\s+{n}\\b|\\bol[áa]\\s+{n}\\b|\\bobrigad[oa]\\s+{n}\\b. This is clean and avoids a fourth pattern category. The [áa] alternation handles both accented and ASCII-normalized forms (important for older systems that may strip diacritics).
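The alternation can be checked with a compiled example. A standalone sketch where `{n}` is filled in with an escaped name, as the review describes:

```python
import re

name = re.escape("Maria")
# Inline direct-address expansion from the review, with {n} substituted.
direct = re.compile(
    rf"\boi\s+{name}\b|\bol[áa]\s+{name}\b|\bobrigad[oa]\s+{name}\b",
    re.IGNORECASE,
)

samples = ["oi Maria", "Olá Maria, tudo bem?", "ola maria", "obrigada Maria!"]
print([bool(direct.search(s)) for s in samples])
# [True, True, True, True] -- accented and ASCII-normalized forms both match
print(bool(direct.search("oi Pedro")))  # False: name-anchored, not generic
```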
Test coverage: 126 tests covering: English-only person verbs, Portuguese-only person verbs, pronoun proximity, direct address (3 forms), mixed corpus scoring, dialogue marker detection, detect_entities() integration, and accented names. This is thorough.
Issues / suggestions
PRONOUN_PATTERNS_PTBR creates false positives on Spanish: ela, ele, eles, elas are also valid Spanish words with different meanings, and deles/delas are close to Spanish forms. For a repository used internationally, this could cause over-detection in Spanish-language files. A note in the docstring explaining this tradeoff (and that the patterns are additive, not isolated to pt-BR files) would help future contributors understand the design decision.
cara as STOPWORD: cara is both a pt-BR filler word ("dude/dear") and a valid Italian/Spanish/Portuguese proper-noun component. Adding it as a stopword means a person named Cara in an English document would be missed. Consider scoping this more carefully — or add a comment explaining the tradeoff.
ama pattern: r"\\b{name}\\s+ama\\b" (loves) will match Portuguese entities, and ama also occurs as the tail of English names like Obama and Alabama. The word-boundary anchors mean those tails cannot match a standalone ama, so the residual risk is an English sentence where an entity name happens to be followed by a literal ama token; rare, but worth noting.
No language detection fallback: The additive approach is intentionally language-agnostic, but the PR description could document this explicitly so future contributors know why there is no lang= parameter. Currently the intent is implicit.
olá in STOPWORDS as olá (with accent) + ola (without): Good — both forms are correctly listed. However, o alone is a very common Portuguese article that appears adjacent to proper nouns in patterns like o João fez.... The pattern set does not cover o/a <Name> verb constructions. This is an understandable scope limitation but worth flagging as a follow-up.
Minor
- `shutil` and `tempfile` imports in tests are correct and used; no unused imports.
- `_build_patterns` exported in `__init__` check: ensure it is accessible for the test import to work.
- Test file uses `tempfile.mkdtemp()` with manual cleanup in `finally` — correct pattern.
Verdict
Solid, well-tested i18n addition. The additive strategy is the right architectural choice for a mixed-corpus tool. The cara/ama edge cases are minor and worth a follow-up issue rather than a blocker. Ready for merge with perhaps a brief doc note about the language-agnostic design intent.
Removed. Kept. The other notes (Spanish cognate risk on
Solid additive implementation. A few observations.

What's done well:

One open question:

Test coverage:

This is a genuine addition that benefits any workspace with Portuguese contributors. LGTM.
force-pushed from 4bb281e to b6d597b
The 3+ frequency threshold lives in
force-pushed from cc5f60c to 6e7946a
force-pushed from 5639b00 to a55770a
force-pushed from c3229f9 to c0392be
Move all entity-detection lexical patterns (person verbs, pronouns,
dialogue markers, project verbs, stopwords, candidate character class)
out of hardcoded module-level constants and into the entity section of
each locale's JSON in mempalace/i18n/. Adds a languages parameter to
every public function so callers union patterns across the desired
locales. The default stays ("en",), so all existing callers and tests
behave unchanged.
Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges
patterns across requested languages, dedupes lists, unions stopwords,
and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var
override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic,
Devanagari, CJK) can register their own character classes instead of
being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language
callers don't poison each other's cache slots
Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only
add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that
needed entity_detector changes and inlined a _PTBR variant of every
constant. That doesn't scale past 2-3 languages — every text gets
checked against every language's patterns regardless of relevance, and
candidate extraction still drops accented and non-Latin names.
This PR sets the standard so future locale contributors only edit one
JSON file (no Python changes), and entity detection scales linearly
with how many languages a user actually enabled, not how many ship.
force-pushed from c0392be to 342568a
@igorls Reworked as JSON-only per #911 -- first locale with the entity section. CLI strings, person-verb/pronoun/dialogue patterns, and a Latin+diacritics candidate pattern for accented names (João, Inês, etc.). All CI green. Also added a Cyrillic entity section to #760 (ru.json) following the same pattern.
Heads up: the entity stopwords list here (30 words) is baseline only. Words like "Para", "Sobre", "Entre" at the start of a sentence match the candidate_pattern and produce false positives in entity detection. Probably worth expanding with Portuguese prepositions (para, sobre, entre, desde, contra, perante, etc.) and conjunctions (porém, contudo, embora, enquanto, etc.).
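The false-positive mechanism is reproducible with just the candidate pattern and a stopword filter. A standalone sketch, with the expanded stopword list assumed from the comment above:

```python
import re

CANDIDATE = re.compile(r"\b[A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]{1,19}\b")
BASELINE = {"oi", "olá", "ola", "obrigado", "obrigada"}
EXPANDED = BASELINE | {"para", "sobre", "entre", "porém", "contudo"}

# Sentence-initial prepositions/conjunctions are capitalized, so they
# match the candidate pattern just like a name would.
text = "Para começar, Sobre isso falou Maria. Entre nós, Porém nada mudou."
cands = CANDIDATE.findall(text)

def keep(stopwords):
    return [c for c in cands if c.lower() not in stopwords]

print(keep(BASELINE))  # ['Para', 'Sobre', 'Maria', 'Entre', 'Porém']
print(keep(EXPANDED))  # ['Maria']
```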
Excellent rework, @mvalentsev — clean shape, 128 lines of JSON vs 216 of Python, and you're the first locale using the new entity section. This becomes the reference for other contributors. CI all green against the current develop (with #758/#760 merged).

Two concrete issues I caught running it locally:

1. Typo in `dialogue_patterns[0]`. Current: `"^\">\\s*{name}[:\\s]",` — the stray `\"` means the compiled regex demands a literal quote character before `>`, so markdown quote lines never match. It should be: `"^>\\s*{name}[:\\s]",`. Verified locally.

2. Follow-up on your own stopwords note — concrete list. Your comment already flagged this, and I confirmed it: running the

Since pt-br is the reference implementation and the stopwords list ships with a tangible false-positive rate as-written, I'd prefer to roll this into the same PR rather than defer. Small follow-up commit should do it.

Nice-to-have, not blocking:

Once the two above are addressed I'll merge. Thanks again for pushing through the rework.
force-pushed from 9fd98dc to 540bab2
@igorls Both fixed. Also added the 2nd-person pronouns (você/vocês, seu/sua/seus/suas) while at it. Verified locally.

Heads up: pt-br is not my native language; I relied on LLM assistance for the linguistic choices. If any of the stopwords or verb forms look off to a native speaker, happy to correct.
force-pushed from 540bab2 to e791806
…oses MemPalace#117) CLI strings, AAAK instruction, regex patterns, and entity section with person-verb, pronoun, dialogue, and candidate patterns for Latin+diacritics names (João, Inês, Ângela). Follows the i18n entity framework from MemPalace#911.
- dialogue_patterns[0]: remove stray \" before > (fixes markdown quote matching)
- entity stopwords: add 40 prepositions, conjunctions, and common words to reduce false positives
- pronoun_patterns: add 2nd-person (você/vocês) and possessives (seu/sua/seus/suas)
force-pushed from e791806 to 4221589
What does this PR do?
Closes #117 by adding a Brazilian Portuguese locale (`pt-br.json`) to the i18n module. This is the first non-English locale to include the `entity` section introduced in #911, enabling entity detection for Portuguese text.

Single-file change, no Python modifications.
What's in pt-br.json
CLI strings -- palace terminology (palácio, ala, corredor, armário, gaveta), all CLI messages, AAAK compression instruction, regex patterns for Portuguese topic extraction.
Entity detection (the `entity` section):

- `candidate_pattern` -- Latin+diacritics character class (`[A-ZÀ-Ü][a-zà-ÿ]`) so names like João, Inês, Ângela are extracted as candidates
- `multi_word_pattern` -- same charset for multi-word names
- `person_verb_patterns` -- disse, perguntou, respondeu, contou, riu, sorriu, chorou, sentiu, pensa, quer, ama, odeia, sabe, decidiu, escreveu
- `pronoun_patterns` -- ela/dele/ele/dela + plurals
- `dialogue_patterns` -- Portuguese quoted speech markers
- `direct_address_pattern` -- oi, olá, obrigado/obrigada, caro/cara
- `project_verb_patterns` -- construindo, lançou, implantou, instalou + technical patterns
- `stopwords` (greetings, adverbs, prepositions, conjunctions, determiners, pronouns)

Note: `caro`/`cara` are intentionally NOT in stopwords -- they are valid first names in Portuguese/Italian/English.

How to test
```shell
python -m pytest tests/test_entity_detector.py -v
python -m pytest tests/ --ignore=tests/benchmarks
ruff check .
```

Quick smoke test:
Checklist
- `get_entity_patterns(("en", "pt-br"))`
- `score_entity`
- `ruff check .` clean

Original PR description (before #911 refactor, no longer applies)
What does this PR do?
Closes #117 by extending `entity_detector` so a file written in Brazilian Portuguese is treated the same way an English file is: names get extracted as candidates, and verb / pronoun / dialogue / direct-address patterns contribute to the person-vs-project classification. The change is purely additive, so English-only corpora behave exactly as before.

Concretely:

- `PERSON_VERB_PATTERNS_PTBR`, `PRONOUN_PATTERNS_PTBR`, `DIALOGUE_PATTERNS_PTBR` constants with the Portuguese equivalents of the existing English signals (said/asked/replied/thinks/wants, plus greetings oi/olá/obrigado/caro).
- `_build_patterns` concatenates the English and pt-br lists for the dialogue and person-verb buckets, so every compiled matcher for an entity now covers both languages at once.
- `score_entity` merges the English and pt-br pronoun lists for the proximity check.
- `extract_candidates` widens its Latin-1 character class so accented names like João, Inês, Ângela, and André flow through candidate extraction instead of being silently dropped by an ASCII-only regex.
- `STOPWORDS` gets the Portuguese greeting fillers (oi, olá, ola, obrigado, obrigada, caro, cara) so they do not masquerade as entity candidates when they start sentences.

This approach was replaced after #911 landed -- all patterns now live in `mempalace/i18n/pt-br.json` instead of Python constants. Same detection coverage, zero Python changes.