feat: add Brazilian Portuguese support to entity_detector (closes #117) #156
Conversation
PR Review: feat: add Brazilian Portuguese support to entity_detector (closes #117)

Executive Summary
Affected Areas:

Business Impact: Enables person-entity detection in Brazilian Portuguese text and mixed EN/PT-BR corpora. Users mining Portuguese conversations will now see the same quality of entity extraction they get with English content.

Flow Changes:

Ratings
PR Health
Medium Priority Issues

🐛 #1: Portuguese direct-address patterns double-counted in
force-pushed from b4ea25e to 0990c10
Quick check on the asymmetry claim: English
No asymmetry — the two languages produce the same scores for semantically equivalent inputs. The double-counting of greetings is pre-existing design for English and was intentionally mirrored for PT-BR so that behaviour stays consistent across languages. Rebased on latest main. The pt-br entity tests still pass locally.
force-pushed from 0990c10 to 879de92
force-pushed from 4f05ed5 to 3d250ad
…ation

Replace per-language keyword/regex heuristics with embedding-based semantic classification, enabling MemPalace to work with 50+ languages using zero per-language configuration.

Changes:
- Room classification: cosine similarity against room description embeddings
- Memory extraction: embedding-based classification (5 types, any language)
- Entity detection: add Chinese name patterns (百家姓 surnames)
- Spellcheck: auto-skip CJK text via Unicode detection
- Embedding provider: pluggable via get_embedding_function() with caching
  - Default: paraphrase-multilingual-MiniLM-L12-v2 (sentence-transformers)
  - Ollama: "ollama:<model>" prefix (e.g., ollama:qwen3-embedding-8b)
  - Configurable via MEMPALACE_EMBEDDING_MODEL env var or config.json
- Knowledge graph: temporal triples, multi-hop traversal, auto-extraction
- Dialect: CJK bigram extraction for topic keywords
- All ChromaDB consumers route through centralized embedding function

New optional dependency: sentence-transformers>=2.0
Install: pip install mempalace[multilingual]
Without it: English regex fallback (existing behavior unchanged)

Benchmark: 173/173 (100%) across 8 languages (zh-Hans, zh-Hant, en, fr, es, de, ja, ko)
652 tests passing, 0 failures. CI-compatible (multilingual tests skip gracefully when sentence-transformers is not installed).

Closes MemPalace#231. Related: MemPalace#37, MemPalace#50, MemPalace#92, MemPalace#117, MemPalace#156, MemPalace#273.
force-pushed from e15ccd1 to 0afc71f
force-pushed from 0afc71f to 3e9435a
web3guru888
left a comment
Review: Brazilian Portuguese Support for entity_detector
Well-considered i18n addition. The "additive patterns, no language gating" approach is pragmatic and correct — most real-world corpora are mixed-language anyway.
What's done well
Additive design over language detection. Rather than classifying files as English vs Portuguese and switching pattern sets, the PT-BR patterns are merged into _build_patterns() alongside the English ones. This is the right call: our integration processes 540+ discoveries and roughly 15–20% contain mixed-language content. Additive patterns handle this cleanly; a language-switch would miss the overlap.
Regex range extension in extract_candidates. Changing [A-Z] to [A-ZÀ-ÖØ-Þ] and [a-z] to [a-zà-öø-ÿ] is correct ISO Latin-1 supplement coverage. João, Inês, Ângela, and André all get picked up. The test test_detect_entities_picks_up_accented_names verifies this end-to-end.
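The widened class can be sanity-checked in isolation. A standalone sketch of just the character ranges the review describes, not the project's actual `extract_candidates` code:

```python
import re

# Candidate pattern with the Latin-1 Supplement ranges from the review:
# uppercase A-Z plus À-Ö and Ø-Þ, lowercase a-z plus à-ö and ø-ÿ.
CANDIDATE = re.compile(r"\b[A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]{1,19}\b")

text = "João asked Inês whether Ângela and André had met Bob."
print(CANDIDATE.findall(text))
# ['João', 'Inês', 'Ângela', 'André', 'Bob']
```

With the old ASCII-only `[A-Z][a-z]{1,19}` class, the accented names would either be dropped or truncated at the first accented character.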
STOPWORDS additions are appropriate. oi, olá, obrigado/a, caro, cara are all high-frequency PT-BR words that would otherwise score as entity candidates. The accented olá alongside ASCII ola handles both typed forms.
Test coverage is thorough. Eight tests including mixed corpus, pronoun proximity, direct address, dialogue markers, and accented names. test_mixed_english_portuguese_corpus (checking that mixed > English-only person score) is especially good.
Issues found
cara and caro added to STOPWORDS, but they're also in the pattern list. PERSON_VERB_PATTERNS_PTBR includes r"\bcaro\s+{name}\b" and r"\bcara\s+{name}\b" as direct-address markers. If someone is literally named "Cara" or "Caro", those names are now silently dropped by STOPWORDS before they reach pattern scoring. The patterns would never fire. Consider removing these two from STOPWORDS and leaving them only in the direct-address pattern (where they're already context-guarded by the following name).
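Assuming candidates are filtered against STOPWORDS before pattern scoring, as the review describes, the silent drop is easy to reproduce. A minimal standalone sketch with hypothetical helper names, not MemPalace's actual pipeline:

```python
import re

STOPWORDS = {"oi", "olá", "ola", "obrigado", "obrigada", "caro", "cara"}

def extract_candidates(text):
    # Simplified candidate extraction: capitalized Latin words.
    return re.findall(r"\b[A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]{1,19}\b", text)

def surviving_candidates(text):
    # Stopword filtering runs BEFORE pattern scoring, so a person
    # literally named "Cara" never reaches the direct-address patterns.
    return [c for c in extract_candidates(text) if c.lower() not in STOPWORDS]

text = "Cara apresentou o relatório. Obrigado, Cara!"
print(extract_candidates(text))    # ['Cara', 'Obrigado', 'Cara']
print(surviving_candidates(text))  # [] (the name is dropped before scoring)
```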
ama (loves) and quer (wants) are short common verbs that look like collision risks at first glance. The pattern \b{name}\s+ama\b will match "Maria ama" correctly, and since {name} is the escaped entity name, the pattern only fires when the entity name immediately precedes the verb; the real collision risk is low. Not a bug, just worth noting for the next i18n contributor.
No Spanish cognate guard. disse, perguntou, decidiu are distinctly PT-BR. But quer and ama appear in Spanish too (and sabe is identical in Spanish). For a PT-BR-specific PR this is fine, but if ES support is added later, the pattern lists may interact. A comment flagging this would be helpful.
PRONOUN_PATTERNS_PTBR word boundaries check out. r"\bela\b" is correct, and the longer forms r"\bdeles\b" and r"\bdelas\b" also carry \b on both sides. This is actually good. ✓
test_portuguese_direct_address asserts person_score >= 12 — this is a magic number tied to the current scoring weights. If weights change, the test breaks. Consider asserting person_score > 0 and len(patterns["direct"].findall(text)) == 3 separately (the test already does the latter).
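The suggested restructure could look like the following. Hypothetical test body: the scorer here is a toy stand-in, only the assertion shape matters:

```python
import re

def build_direct_pattern(name):
    # Direct-address alternation as described in the reviews.
    n = re.escape(name)
    return re.compile(
        rf"\boi\s+{n}\b|\bol[áa]\s+{n}\b|\bobrigad[oa]\s+{n}\b", re.IGNORECASE
    )

def person_score(text, name):
    # Toy scorer: some weight per direct-address hit; real weights live elsewhere.
    return 4 * len(build_direct_pattern(name).findall(text))

text = "oi Maria! olá Maria, tudo bem? obrigada Maria."
# Weight-agnostic assertions: check the structural signal (3 hits) and the
# sign of the score separately, so changing weights cannot break the test.
assert len(build_direct_pattern("Maria").findall(text)) == 3
assert person_score(text, "Maria") > 0
```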
Language detection is absent by design — but there's no documentation of this decision. A comment in entity_detector.py noting "PT-BR patterns are additive and always active; see issue #117" would help future contributors understand why there's no lang= parameter.
Suggestions
- Remove `cara`/`caro` from STOPWORDS (or add a note that they're intentionally excluded from entity detection since they're in direct-address patterns)
- Replace the magic `>= 12` assertion with `> 0` for score stability
- Add a module comment explaining the additive-patterns design decision
- Consider a test with a Portuguese common noun that should NOT be classified as a person (e.g., a project or tool with a PT-BR name)
Overall
Clean, well-tested i18n work. The additive approach is the right architecture, the regex range extension is correct, and the test suite is more thorough than most i18n PRs. The cara/caro STOPWORDS issue is the only real correctness concern.
APPROVED — cara/caro STOPWORDS issue is worth addressing before merge but not a hard blocker.
Reviewed by MemPalace-AGI — autonomous research system with perfect memory
web3guru888
left a comment
PR #156 — feat: add Brazilian Portuguese support to entity_detector
A well-scoped internationalization addition that extends entity detection to pt-BR corpora. 126 new tests, Unicode-aware candidate extraction, and an additive (non-breaking) pattern strategy. Strong execution on a genuinely useful feature.
What works well
Additive pattern strategy: Appending PTBR patterns to the existing English lists rather than forking detection logic is the right call. Mixed English/Portuguese corpora (very common in Brazilian tech teams) work without any language-classification step — a real-world win. The test test_mixed_english_portuguese_corpus validates this explicitly.
Unicode candidate extraction: The regex expansion from [A-Z][a-z]{1,19} to [A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]{1,19} is correct Latin-1 Supplement coverage. João, Inês, Ângela, and André will all be picked up. The multi-word match regex receives the same treatment consistently — good.
STOPWORDS additions: Adding oi, olá, obrigado/a, caro, and cara prevents common Portuguese greetings from being scored as entity names. Correct and necessary.
direct pattern inline expansion: Rather than creating a new pattern list, the direct regex is extended inline with |\\boi\\s+{n}\\b|\\bol[áa]\\s+{n}\\b|\\bobrigad[oa]\\s+{n}\\b. This is clean and avoids a fourth pattern category. The [áa] alternation handles both accented and ASCII-normalized forms (important for older systems that may strip diacritics).
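The alternation can be checked with a compiled example. A standalone sketch where `{n}` is filled in with an escaped name, as the review describes:

```python
import re

name = re.escape("Maria")
# Inline direct-address expansion from the review, with {n} substituted.
direct = re.compile(
    rf"\boi\s+{name}\b|\bol[áa]\s+{name}\b|\bobrigad[oa]\s+{name}\b",
    re.IGNORECASE,
)

samples = ["oi Maria", "Olá Maria, tudo bem?", "ola maria", "obrigada Maria!"]
print([bool(direct.search(s)) for s in samples])
# [True, True, True, True] -- accented and ASCII-normalized forms both match
print(bool(direct.search("oi Pedro")))  # False: name-anchored, not generic
```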
Test coverage: 126 tests covering: English-only person verbs, Portuguese-only person verbs, pronoun proximity, direct address (3 forms), mixed corpus scoring, dialogue marker detection, detect_entities() integration, and accented names. This is thorough.
Issues / suggestions
PRONOUN_PATTERNS_PTBR creates false positives on Spanish: ela, ele, eles, elas are also valid Spanish words with different meanings, and deles/delas are close to Spanish forms. For a repository used internationally, this could cause over-detection in Spanish-language files. A note in the docstring explaining this tradeoff (and that the patterns are additive, not isolated to pt-BR files) would help future contributors understand the design decision.
cara as STOPWORD: cara is both a pt-BR filler word ("dude/dear") and a valid Italian/Spanish/Portuguese proper-noun component. Adding it as a stopword means a person named Cara in an English document would be missed. Consider scoping this more carefully — or add a comment explaining the tradeoff.
ama pattern: r"\\b{name}\\s+ama\\b" (loves) will match Portuguese entities, and ama also occurs as the tail of English names like Obama and Alabama. The word-boundary anchors mean those tails cannot match a standalone ama, so the residual risk is an English sentence where an entity name happens to be followed by a literal ama token; rare, but worth noting.
No language detection fallback: The additive approach is intentionally language-agnostic, but the PR description could document this explicitly so future contributors know why there is no lang= parameter. Currently the intent is implicit.
olá in STOPWORDS as olá (with accent) + ola (without): Good — both forms are correctly listed. However, o alone is a very common Portuguese article that appears adjacent to proper nouns in patterns like o João fez.... The pattern set does not cover o/a <Name> verb constructions. This is an understandable scope limitation but worth flagging as a follow-up.
Minor
- `shutil` and `tempfile` imports in tests are correct and used; no unused imports.
- `_build_patterns` exported in `__init__` check: ensure it is accessible for the test import to work.
- Test file uses `tempfile.mkdtemp()` with manual cleanup in `finally` — correct pattern.
Verdict
Solid, well-tested i18n addition. The additive strategy is the right architectural choice for a mixed-corpus tool. The cara/ama edge cases are minor and worth a follow-up issue rather than a blocker. Ready for merge with perhaps a brief doc note about the language-agnostic design intent.
Removed. Kept. The other notes (Spanish cognate risk on
Solid additive implementation. A few observations.

What's done well:

One open question:

Test coverage:

This is a genuine addition that benefits any workspace with Portuguese contributors. LGTM.
force-pushed from 4bb281e to b6d597b
The 3+ frequency threshold lives in
force-pushed from cc5f60c to 6e7946a
force-pushed from 5639b00 to a55770a
force-pushed from c3229f9 to c0392be
Move all entity-detection lexical patterns (person verbs, pronouns,
dialogue markers, project verbs, stopwords, candidate character class)
out of hardcoded module-level constants and into the entity section of
each locale's JSON in mempalace/i18n/. Adds a languages parameter to
every public function so callers union patterns across the desired
locales. The default stays ("en",), so all existing callers and tests
behave unchanged.
Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges
patterns across requested languages, dedupes lists, unions stopwords,
and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var
override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic,
Devanagari, CJK) can register their own character classes instead of
being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language
callers don't poison each other's cache slots
Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only
add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that
needed entity_detector changes and inlined a _PTBR variant of every
constant. That doesn't scale past 2-3 languages — every text gets
checked against every language's patterns regardless of relevance, and
candidate extraction still drops accented and non-Latin names.
This PR sets the standard so future locale contributors only edit one
JSON file (no Python changes), and entity detection scales linearly
with how many languages a user actually enabled, not how many ship.
force-pushed from c0392be to 342568a
@igorls Reworked as JSON-only per #911 -- first locale with the entity section. CLI strings, person-verb/pronoun/dialogue patterns, and a Latin+diacritics candidate pattern for accented names (João, Inês, etc.). All CI green. Also added a Cyrillic entity section to #760 (ru.json) following the same pattern.
Heads up: the entity stopwords list here (30 words) is baseline only. Words like "Para", "Sobre", "Entre" at the start of a sentence match the candidate_pattern and produce false positives in entity detection. Probably worth expanding with Portuguese prepositions (para, sobre, entre, desde, contra, perante, etc.) and conjunctions (porém, contudo, embora, enquanto, etc.).
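The false-positive mechanism is reproducible with just the candidate pattern and a stopword filter. A standalone sketch, with the expanded stopword list assumed from the comment above:

```python
import re

CANDIDATE = re.compile(r"\b[A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]{1,19}\b")
BASELINE = {"oi", "olá", "ola", "obrigado", "obrigada"}
EXPANDED = BASELINE | {"para", "sobre", "entre", "porém", "contudo"}

# Sentence-initial prepositions/conjunctions are capitalized, so they
# match the candidate pattern just like a name would.
text = "Para começar, Sobre isso falou Maria. Entre nós, Porém nada mudou."
cands = CANDIDATE.findall(text)

def keep(stopwords):
    return [c for c in cands if c.lower() not in stopwords]

print(keep(BASELINE))  # ['Para', 'Sobre', 'Maria', 'Entre', 'Porém']
print(keep(EXPANDED))  # ['Maria']
```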
Excellent rework, @mvalentsev — clean shape, 128 lines of JSON vs 216 of Python, and you're the first locale using the new entity section. This becomes the reference for other contributors. CI all green against the current develop (with #758/#760 merged).

Two concrete issues I caught running it locally:

1. Typo in `dialogue_patterns[0]`. Current: `"^\">\\s*{name}[:\\s]",` — the stray `\"` means the compiled regex demands a literal quote character before `>`, so markdown quote lines never match. It should be: `"^>\\s*{name}[:\\s]",`. Verified locally.

2. Follow-up on your own stopwords note — concrete list. Your comment already flagged this, and I confirmed it: running the

Since pt-br is the reference implementation and the stopwords list ships with a tangible false-positive rate as-written, I'd prefer to roll this into the same PR rather than defer. Small follow-up commit should do it.

Nice-to-have, not blocking:

Once the two above are addressed I'll merge. Thanks again for pushing through the rework.
force-pushed from 9fd98dc to 540bab2
@igorls Both fixed. Also added the 2nd-person pronouns (você/vocês, seu/sua/seus/suas) while at it. Verified locally.

Heads up: pt-br is not my native language; I relied on LLM assistance for the linguistic choices. If any of the stopwords or verb forms look off to a native speaker, happy to correct.
force-pushed from 540bab2 to e791806
…oses MemPalace#117) CLI strings, AAAK instruction, regex patterns, and entity section with person-verb, pronoun, dialogue, and candidate patterns for Latin+diacritics names (João, Inês, Ângela). Follows the i18n entity framework from MemPalace#911.
- dialogue_patterns[0]: remove stray \" before > (fixes markdown quote matching)
- entity stopwords: add 40 prepositions, conjunctions, and common words to reduce false positives
- pronoun_patterns: add 2nd-person (você/vocês) and possessives (seu/sua/seus/suas)
force-pushed from e791806 to 4221589
What does this PR do?
Closes #117 by adding a Brazilian Portuguese locale (`pt-br.json`) to the i18n module. This is the first non-English locale to include the `entity` section introduced in #911, enabling entity detection for Portuguese text.

Single-file change, no Python modifications.
What's in pt-br.json
CLI strings -- palace terminology (palácio, ala, corredor, armário, gaveta), all CLI messages, AAAK compression instruction, regex patterns for Portuguese topic extraction.
Entity detection (the `entity` section):

- `candidate_pattern` -- Latin+diacritics character class (`[A-ZÀ-Ü][a-zà-ÿ]`) so names like João, Inês, Ângela are extracted as candidates
- `multi_word_pattern` -- same charset for multi-word names
- `person_verb_patterns` -- disse, perguntou, respondeu, contou, riu, sorriu, chorou, sentiu, pensa, quer, ama, odeia, sabe, decidiu, escreveu
- `pronoun_patterns` -- ela/dele/ele/dela + plurals
- `dialogue_patterns` -- Portuguese quoted speech markers
- `direct_address_pattern` -- oi, olá, obrigado/obrigada, caro/cara
- `project_verb_patterns` -- construindo, lançou, implantou, instalou + technical patterns
- `stopwords` (greetings, adverbs, prepositions, conjunctions, determiners, pronouns)

Note: `caro`/`cara` are intentionally NOT in stopwords -- they are valid first names in Portuguese/Italian/English.

How to test
```shell
python -m pytest tests/test_entity_detector.py -v
python -m pytest tests/ --ignore=tests/benchmarks
ruff check .
```

Quick smoke test:
Checklist
- `get_entity_patterns(("en", "pt-br"))`
- `score_entity`
- `ruff check .` clean

Original PR description (before #911 refactor, no longer applies)
What does this PR do?
Closes #117 by extending `entity_detector` so a file written in Brazilian Portuguese is treated the same way an English file is: names get extracted as candidates, and verb / pronoun / dialogue / direct-address patterns contribute to the person-vs-project classification. The change is purely additive, so English-only corpora behave exactly as before.

Concretely:

- `PERSON_VERB_PATTERNS_PTBR`, `PRONOUN_PATTERNS_PTBR`, `DIALOGUE_PATTERNS_PTBR` constants with the Portuguese equivalents of the existing English signals (said/asked/replied/thinks/wants, plus greetings oi/olá/obrigado/caro).
- `_build_patterns` concatenates the English and pt-br lists for the dialogue and person-verb buckets, so every compiled matcher for an entity now covers both languages at once.
- `score_entity` merges the English and pt-br pronoun lists for the proximity check.
- `extract_candidates` widens its Latin-1 character class so accented names like João, Inês, Ângela, and André flow through candidate extraction instead of being silently dropped by an ASCII-only regex.
- `STOPWORDS` gets the Portuguese greeting fillers (oi, olá, ola, obrigado, obrigada, caro, cara) so they do not masquerade as entity candidates when they start sentences.

This approach was replaced after #911 landed -- all patterns now live in `mempalace/i18n/pt-br.json` instead of Python constants. Same detection coverage, zero Python changes.