
feat: add Brazilian Portuguese support to entity_detector (closes #117)#156

Merged
igorls merged 2 commits into MemPalace:develop from mvalentsev:feat/pt-br-entity-detection
Apr 15, 2026

Conversation

@mvalentsev
Contributor

@mvalentsev mvalentsev commented Apr 7, 2026

What does this PR do?

Closes #117 by adding a Brazilian Portuguese locale (pt-br.json) to the i18n module. This is the first non-English locale to include the entity section introduced in #911, enabling entity detection for Portuguese text.

Single-file change, no Python modifications.

What's in pt-br.json

CLI strings -- palace terminology (palácio, ala, corredor, armário, gaveta), all CLI messages, AAAK compression instruction, regex patterns for Portuguese topic extraction.

Entity detection (the entity section):

  • candidate_pattern -- Latin+diacritics character class ([A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]) so names like João, Inês, Ângela are extracted as candidates
  • multi_word_pattern -- same charset for multi-word names
  • 15 person_verb_patterns -- disse, perguntou, respondeu, contou, riu, sorriu, chorou, sentiu, pensa, quer, ama, odeia, sabe, decidiu, escreveu
  • 8 pronoun_patterns -- ela/dele/ele/dela + plurals
  • 4 dialogue_patterns -- Portuguese quoted speech markers
  • direct_address_pattern -- oi, olá, obrigado/obrigada, caro/cara
  • 12 project_verb_patterns -- construindo, lançou, implantou, instalou + technical patterns
  • 69 Portuguese stopwords (greetings, adverbs, prepositions, conjunctions, determiners, pronouns)

Note: caro/cara are intentionally NOT in stopwords -- they are valid first names in Portuguese/Italian/English.
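
To illustrate how a {name}-templated pattern from the entity section is used, here is a self-contained sketch. The {name} placeholder syntax comes from this PR's pattern lists; the compile-and-match step is an assumption for illustration, not MemPalace's actual loading code:

```python
import re

# Hypothetical illustration: compiling one person-verb template from
# pt-br.json's entity section against a concrete name. The template
# string follows the PR; the compile step here is an assumption.
template = r"\b{name}\s+disse\b"  # "disse" = "said"
pattern = re.compile(template.format(name=re.escape("João")), re.IGNORECASE)

text = "João disse que o deploy terminou."
match = pattern.search(text)
# match is not None -> the person-verb signal fires for João
```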

How to test

python -m pytest tests/test_entity_detector.py -v
python -m pytest tests/ --ignore=tests/benchmarks
ruff check .

Quick smoke test:

from mempalace.entity_detector import extract_candidates, score_entity

text = "João disse oi. João riu. João decidiu. João escreveu."
# English-only: João not found (ASCII-only candidate regex)
assert "João" not in extract_candidates(text, languages=("en",))
# With pt-br: João found
assert "João" in extract_candidates(text, languages=("en", "pt-br"))

Checklist

  • JSON valid, structure matches en.json (all keys present)
  • All {variable} interpolations match en.json
  • Entity patterns load and merge correctly via get_entity_patterns(("en", "pt-br"))
  • Accented names (João, Inês) extracted by pt-br candidate_pattern
  • PT-BR person verbs score correctly in score_entity
  • English-only detection unchanged (regression-clean)
  • Lint: ruff check . clean

Original PR description (before #911 refactor, no longer applies)

What does this PR do?

Closes #117 by extending entity_detector so a file written in Brazilian Portuguese is treated the same way an English file is: names get extracted as candidates, and verb / pronoun / dialogue / direct-address patterns contribute to the person-vs-project classification. The change is purely additive, so English-only corpora behave exactly as before.

Concretely:

  • New PERSON_VERB_PATTERNS_PTBR, PRONOUN_PATTERNS_PTBR, DIALOGUE_PATTERNS_PTBR constants with the Portuguese equivalents of the existing English signals (said / asked / replied / thinks / wants, plus greetings oi / olá / obrigado / caro).
  • _build_patterns concatenates the English and pt-br lists for the dialogue and person-verb buckets, so every compiled matcher for an entity now covers both languages at once.
  • score_entity merges the English and pt-br pronoun lists for the proximity check.
  • extract_candidates widens its Latin-1 character class so accented names like João, Inês, Ângela, and André flow through candidate extraction instead of being silently dropped by an ASCII-only regex.
  • STOPWORDS gets the Portuguese greeting fillers (oi, olá, obrigado, obrigada, caro, cara) so they do not masquerade as entity candidates when they start sentences.

This approach was replaced after #911 landed -- all patterns now live in mempalace/i18n/pt-br.json instead of Python constants. Same detection coverage, zero Python changes.

@bgauryy

bgauryy commented Apr 8, 2026

PR Review: feat: add Brazilian Portuguese support to entity_detector (closes #117)

Executive Summary

| Aspect | Value |
|---|---|
| PR Goal | Extend entity_detector to recognise Brazilian Portuguese person names using pt-br verb, pronoun, dialogue, and direct-address patterns |
| Files Changed | 2 |
| Risk Level | 🟢 LOW - purely additive patterns; English-only corpora unaffected |
| Review Effort | 2 - well-scoped, single-module change with comprehensive tests |
| Recommendation | 💬 COMMENT — one scoring asymmetry worth fixing before merge |

Affected Areas: mempalace/entity_detector.py (pattern constants, extract_candidates, _build_patterns, score_entity), tests/test_entity_detector.py (new file)

Business Impact: Enables person-entity detection in Brazilian Portuguese text and mixed EN/PT-BR corpora. Users mining Portuguese conversations will now see the same quality of entity extraction they get with English content.

Flow Changes: extract_candidates now matches accented Latin-1 characters (À-ÿ). _build_patterns and score_entity operate on combined EN+PT-BR pattern lists, increasing the regex count per entity by ~40%.

Ratings

| Aspect | Score |
|---|---|
| Correctness | 4/5 |
| Security | 5/5 |
| Performance | 5/5 |
| Maintainability | 4/5 |

Medium Priority Issues

🐛 #1: Portuguese direct-address patterns double-counted in person_verbs + direct

Location: mempalace/entity_detector.py, PERSON_VERB_PATTERNS_PTBR (new lines ~86–90) and the _build_patterns direct regex | Confidence: ✅ HIGH

Five Portuguese greetings (oi, olá, obrigado/a, caro/a) appear in both PERSON_VERB_PATTERNS_PTBR and the direct compiled regex. In score_entity, person_verbs matches add +2 per pattern while direct matches add +4 per hit, so "oi Maria" scores 6 points in Portuguese versus 4 points for the semantically identical "hi Maria" in English.

This inflates the person_score for Portuguese direct-address by ~50% compared to English, creating an asymmetric scoring model.

 PERSON_VERB_PATTERNS_PTBR = [
     r"\b{name}\s+disse\b",  # said
     r"\b{name}\s+perguntou\b",  # asked
     r"\b{name}\s+respondeu\b",  # replied
     r"\b{name}\s+contou\b",  # told
     r"\b{name}\s+riu\b",  # laughed
     r"\b{name}\s+sorriu\b",  # smiled
     r"\b{name}\s+chorou\b",  # cried
     r"\b{name}\s+sentiu\b",  # felt
     r"\b{name}\s+pensa\b",  # thinks
     r"\b{name}\s+quer\b",  # wants
     r"\b{name}\s+ama\b",  # loves
     r"\b{name}\s+odeia\b",  # hates
     r"\b{name}\s+sabe\b",  # knows
     r"\b{name}\s+decidiu\b",  # decided
     r"\b{name}\s+escreveu\b",  # wrote
-    r"\boi\s+{name}\b",  # hi
-    r"\bol[áa]\s+{name}\b",  # hello
-    r"\bobrigad[oa]\s+{name}\b",  # thanks
-    r"\bcaro\s+{name}\b",  # dear
-    r"\bcara\s+{name}\b",  # dear (feminine)
 ]

Remove the last 5 entries from PERSON_VERB_PATTERNS_PTBR — they already live in the direct regex where they belong (matching the English pattern of keeping verbs and greetings separate). The test test_portuguese_direct_address asserts person_score >= 12 and will still pass since 3 direct hits × 4 = 12.


Low Priority Issues

🎨 #2: DIALOGUE_PATTERNS_PTBR contains only one pattern

Location: mempalace/entity_detector.py, DIALOGUE_PATTERNS_PTBR | Confidence: ⚠️ MED

English DIALOGUE_PATTERNS has 5 entries covering variations (said, asked, replied, wrote, told in dialogue context). The Portuguese equivalent only has disse. Consider adding perguntou and respondeu in dialogue context for parity:

DIALOGUE_PATTERNS_PTBR = [
    r'"{name}\s+disse',
    r'"{name}\s+perguntou',
    r'"{name}\s+respondeu',
]

🔗 #3: No Portuguese PROJECT_VERB_PATTERNS — project signals absent for pt-br text

Location: mempalace/entity_detector.py, PROJECT_VERB_PATTERNS (unchanged) | Confidence: ⚠️ MED

The PR adds person-detection patterns but leaves project-detection English-only. In a fully Portuguese file, a project entity (e.g. "Construindo o MemPalace") won't get any project-signal boost. This is fine if the current scope is intentionally person-detection only, but worth tracking as a follow-up for completeness.


What's Done Well

  • Unicode regex for accented names: The [A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ] range correctly covers the Latin-1 supplement while excluding the multiplication (×) and division (÷) signs.
  • Test coverage is thorough: 8 test functions covering verbs, pronouns, direct address, mixed corpora, dialogue markers, detect_entities integration, and accented names.
  • Additive design: English-only corpora are completely unaffected since patterns are concatenated, not replaced.
  • Stopword additions: Portuguese filler/greeting words (oi, olá, obrigado/a, caro/a) correctly added to prevent them from being extracted as name candidates.

Created by Octocode MCP https://octocode.ai 🔍🐙

@mvalentsev mvalentsev force-pushed the feat/pt-br-entity-detection branch from b4ea25e to 0990c10 Compare April 9, 2026 06:17
@mvalentsev
Contributor Author

Quick check on the asymmetry claim: English PERSON_VERB_PATTERNS already contains hey, hi, thanks in addition to dear, and the direct regex matches hey, hi, thanks. So "hi Maria" in English scores the same way as "oi Maria" in PT-BR — +2 from person_verbs plus +4 from direct for a total of 6 points. The PR follows that pattern exactly:

| Greeting | person_verbs | direct | Total |
|---|---|---|---|
| "hi Maria" | +2 | +4 | 6 |
| "oi Maria" | +2 | +4 | 6 |
| "dear Maria" | +2 | - | 2 |
| "caro Maria" | +2 | - | 2 |

No asymmetry — the two languages produce the same scores for semantically equivalent inputs. The double-counting of greetings is pre-existing design for English and was intentionally mirrored for PT-BR so that behaviour stays consistent across languages.
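
A quick self-contained check of the numbers in the table above. This is a toy re-implementation of just the +2/+4 weights discussed in this thread, not MemPalace's actual score_entity; pattern lists are trimmed to the greeting case under discussion:

```python
import re

# Toy scorer: +2 per person-verb pattern that hits, +4 per
# direct-address regex hit (weights quoted from this thread).
def toy_person_score(text, name, verb_patterns, direct_pattern):
    score = 0
    for tpl in verb_patterns:
        if re.search(tpl.format(name=re.escape(name)), text, re.IGNORECASE):
            score += 2
    score += 4 * len(re.findall(direct_pattern.format(name=re.escape(name)),
                                text, re.IGNORECASE))
    return score

# Greeting appears in BOTH buckets, in both languages (per the thread).
verbs_en, direct_en = [r"\bhi\s+{name}\b"], r"\bhi\s+{name}\b"
verbs_pt, direct_pt = [r"\boi\s+{name}\b"], r"\boi\s+{name}\b"

print(toy_person_score("hi Maria", "Maria", verbs_en, direct_en))  # 6
print(toy_person_score("oi Maria", "Maria", verbs_pt, direct_pt))  # 6
```

Both languages land on 6, matching the symmetry argument.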

Rebased on latest main. The pt-br entity tests still pass locally.

@mvalentsev mvalentsev force-pushed the feat/pt-br-entity-detection branch from 0990c10 to 879de92 Compare April 9, 2026 16:44
@mvalentsev mvalentsev force-pushed the feat/pt-br-entity-detection branch 3 times, most recently from 4f05ed5 to 3d250ad Compare April 9, 2026 21:05
EndeavorYen added a commit to EndeavorYen/mempalace that referenced this pull request Apr 10, 2026
…ation

Replace per-language keyword/regex heuristics with embedding-based semantic
classification, enabling MemPalace to work with 50+ languages using zero
per-language configuration.

Changes:
- Room classification: cosine similarity against room description embeddings
- Memory extraction: embedding-based classification (5 types, any language)
- Entity detection: add Chinese name patterns (百家姓 surnames)
- Spellcheck: auto-skip CJK text via Unicode detection
- Embedding provider: pluggable via get_embedding_function() with caching
  - Default: paraphrase-multilingual-MiniLM-L12-v2 (sentence-transformers)
  - Ollama: "ollama:<model>" prefix (e.g., ollama:qwen3-embedding-8b)
  - Configurable via MEMPALACE_EMBEDDING_MODEL env var or config.json
- Knowledge graph: temporal triples, multi-hop traversal, auto-extraction
- Dialect: CJK bigram extraction for topic keywords
- All ChromaDB consumers route through centralized embedding function

New optional dependency: sentence-transformers>=2.0
Install: pip install mempalace[multilingual]
Without it: English regex fallback (existing behavior unchanged)

Benchmark: 173/173 (100%) across 8 languages
(zh-Hans, zh-Hant, en, fr, es, de, ja, ko)

652 tests passing, 0 failures. CI-compatible (multilingual tests
skip gracefully when sentence-transformers is not installed).

Closes MemPalace#231. Related: MemPalace#37, MemPalace#50, MemPalace#92, MemPalace#117, MemPalace#156, MemPalace#273.
@mvalentsev mvalentsev force-pushed the feat/pt-br-entity-detection branch 2 times, most recently from e15ccd1 to 0afc71f Compare April 10, 2026 15:52
@mvalentsev mvalentsev force-pushed the feat/pt-br-entity-detection branch from 0afc71f to 3e9435a Compare April 10, 2026 16:43

@web3guru888 web3guru888 left a comment


Review: Brazilian Portuguese Support for entity_detector

Well-considered i18n addition. The "additive patterns, no language gating" approach is pragmatic and correct — most real-world corpora are mixed-language anyway.

What's done well

Additive design over language detection. Rather than classifying files as English vs Portuguese and switching pattern sets, the PT-BR patterns are merged into _build_patterns() alongside the English ones. This is the right call: our integration processes 540+ discoveries and roughly 15–20% contain mixed-language content. Additive patterns handle this cleanly; a language-switch would miss the overlap.

Regex range extension in extract_candidates. Changing [A-Z] to [A-ZÀ-ÖØ-Þ] and [a-z] to [a-zà-öø-ÿ] is correct ISO Latin-1 supplement coverage. João, Inês, Ângela, and André all get picked up. The test test_detect_entities_picks_up_accented_names verifies this end-to-end.
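
The effect of that range extension can be shown in isolation. A minimal sketch using the character classes quoted above (simplified one-word candidate regexes, not the full extract_candidates logic):

```python
import re

# Old ASCII-only candidate class vs the widened Latin-1 class.
ascii_only = re.compile(r"\b[A-Z][a-z]+\b")
latin1 = re.compile(r"\b[A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]+\b")

text = "João encontrou Inês e Ângela."
print(ascii_only.findall(text))  # [] -- accented names silently dropped
print(latin1.findall(text))      # ['João', 'Inês', 'Ângela']
```

The ASCII class fails mid-name (no word boundary before ã/ê), so the accented names never even become candidates.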

STOPWORDS additions are appropriate. oi, olá, obrigado/a, caro, cara are all high-frequency PT-BR words that would otherwise score as entity candidates. The accented olá alongside ASCII ola handles both typed forms.

Test coverage is thorough. Eight tests including mixed corpus, pronoun proximity, direct address, dialogue markers, and accented names. test_mixed_english_portuguese_corpus (checking that mixed > English-only person score) is especially good.

Issues found

cara and caro added to STOPWORDS, but they're also in the pattern list. PERSON_VERB_PATTERNS_PTBR includes r"\bcaro\s+{name}\b" and r"\bcara\s+{name}\b" as direct-address markers. If someone is literally named "Cara" or "Caro", those names are now silently dropped by STOPWORDS before they reach pattern scoring. The patterns would never fire. Consider removing these two from STOPWORDS and leaving them only in the direct-address pattern (where they're already context-guarded by the following name).

ama (loves) and quer (wants) are short common verbs with significant collision risk. The pattern \b{name}\s+ama\b will match "Maria ama" correctly. But {name} here is the escaped entity name, so the collision is actually low — the pattern only fires when the entity name precedes the verb. Not a bug, just worth noting for the next i18n contributor.
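
The point about {name} anchoring the short verbs can be demonstrated directly (names and sentences invented for the example):

```python
import re

# Because the template embeds the escaped entity name, "ama" alone
# cannot trigger a hit -- the name must immediately precede the verb.
tpl = r"\b{name}\s+ama\b"
pat = re.compile(tpl.format(name=re.escape("Maria")))

print(bool(pat.search("Maria ama café.")))  # True  -- name precedes verb
print(bool(pat.search("Ele ama café.")))    # False -- no entity name
```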

No Spanish cognate guard. disse, perguntou, decidiu are distinctly PT-BR. But quer and ama appear in Spanish too (and sabe is identical in Spanish). For a PT-BR-specific PR this is fine, but if ES support is added later, the pattern lists may interact. A comment flagging this would be helpful.

PRONOUN_PATTERNS_PTBR word boundaries check out: r"\bela\b", r"\bdeles\b", and r"\bdelas\b" all anchor \b on both sides, including the longer forms. This is actually good. ✓

test_portuguese_direct_address asserts person_score >= 12 — this is a magic number tied to the current scoring weights. If weights change, the test breaks. Consider asserting person_score > 0 and len(patterns["direct"].findall(text)) == 3 separately (the test already does the latter).

Language detection is absent by design — but there's no documentation of this decision. A comment in entity_detector.py noting "PT-BR patterns are additive and always active; see issue #117" would help future contributors understand why there's no lang= parameter.

Suggestions

  1. Remove cara/caro from STOPWORDS (or add a note that they're intentionally excluded from entity detection since they're in direct-address patterns)
  2. Replace the magic >= 12 assertion with > 0 for score stability
  3. Add a module comment explaining the additive-patterns design decision
  4. Consider a test with a Portuguese common noun that should NOT be classified as a person (e.g., a project or tool with a PT-BR name)

Overall

Clean, well-tested i18n work. The additive approach is the right architecture, the regex range extension is correct, and the test suite is more thorough than most i18n PRs. The cara/caro STOPWORDS issue is the only real correctness concern.

APPROVED. The cara/caro STOPWORDS issue is worth addressing before merge but not a hard blocker.


Reviewed by MemPalace-AGI — autonomous research system with perfect memory


@web3guru888 web3guru888 left a comment


PR #156 - feat: add Brazilian Portuguese support to entity_detector

A well-scoped internationalization addition that extends entity detection to pt-BR corpora. 126 new tests, Unicode-aware candidate extraction, and an additive (non-breaking) pattern strategy. Strong execution on a genuinely useful feature.

What works well

Additive pattern strategy: Appending PTBR patterns to the existing English lists rather than forking detection logic is the right call. Mixed English/Portuguese corpora (very common in Brazilian tech teams) work without any language-classification step — a real-world win. The test test_mixed_english_portuguese_corpus validates this explicitly.

Unicode candidate extraction: The regex expansion from [A-Z][a-z]{1,19} to [A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]{1,19} is correct Latin-1 Supplement coverage. João, Inês, Ângela, and André will all be picked up. The multi-word match regex receives the same treatment consistently — good.

STOPWORDS additions: Adding oi, olá, obrigado/a, caro, and cara prevents common Portuguese greetings from being scored as entity names. Correct and necessary.

direct pattern inline expansion: Rather than creating a new pattern list, the direct regex is extended inline with |\\boi\\s+{n}\\b|\\bol[áa]\\s+{n}\\b|\\bobrigad[oa]\\s+{n}\\b. This is clean and avoids a fourth pattern category. The [áa] alternation handles both accented and ASCII-normalized forms (important for older systems that may strip diacritics).

Test coverage: 126 tests covering: English-only person verbs, Portuguese-only person verbs, pronoun proximity, direct address (3 forms), mixed corpus scoring, dialogue marker detection, detect_entities() integration, and accented names. This is thorough.

Issues / suggestions

PRONOUN_PATTERNS_PTBR creates false positives on Spanish: ela, ele, eles, elas are also valid Spanish words with different meanings, and deles/delas are close to Spanish forms. For a repository used internationally, this could cause over-detection in Spanish-language files. A note in the docstring explaining this tradeoff (and that the patterns are additive, not isolated to pt-BR files) would help future contributors understand the design decision.

cara as STOPWORD: cara is both a pt-BR filler word ("dude/dear") and a valid Italian/Spanish/Portuguese proper-noun component. Adding it as a stopword means a person named Cara in an English document would be missed. Consider scoping this more carefully — or add a comment explaining the tradeoff.

ama pattern: r"\\b{name}\\s+ama\\b" (loves) will match Portuguese entities, but ama is also a common English suffix in names like Obama, Alabama, etc. The word-boundary anchors on {name} protect against this, but the reverse case — a short name like Ana ama (Ana loves) matching a word-boundary fragment in English text — is worth noting.

No language detection fallback: The additive approach is intentionally language-agnostic, but the PR description could document this explicitly so future contributors know why there is no lang= parameter. Currently the intent is implicit.

olá in STOPWORDS as olá (with accent) + ola (without): Good — both forms are correctly listed. However, o alone is a very common Portuguese article that appears adjacent to proper nouns in patterns like o João fez.... The pattern set does not cover o/a <Name> verb constructions. This is an understandable scope limitation but worth flagging as a follow-up.

Minor

  • shutil and tempfile imports in tests are correct and used; no unused imports.
  • _build_patterns exported in __init__ check: ensure it is accessible for the test import to work.
  • Test file uses tempfile.mkdtemp() with manual cleanup in finally — correct pattern.

Verdict

Solid, well-tested i18n addition. The additive strategy is the right architectural choice for a mixed-corpus tool. The cara/ama edge cases are minor and worth a follow-up issue rather than a blocker. Ready for merge with perhaps a brief doc note about the language-agnostic design intent.


Reviewed by MemPalace-AGI — autonomous research system with perfect memory

@mvalentsev
Contributor Author

Removed caro and cara from STOPWORDS with an explanatory comment. They stay in PERSON_VERB_PATTERNS_PTBR as direct-address markers, so "caro Maria" still fires the pattern; they just no longer silently drop a person literally named Cara or Caro at candidate extraction time. Added test_extract_candidates_keeps_cara_and_caro_as_names as a regression guard.

Kept oi / ola / olá / obrigado / obrigada as stopwords -- they're practically never first names in real corpora, and keeping them out cuts candidate noise on PT-BR greetings.

The other notes (Spanish cognate risk on ama / quer / sabe, module-level doc on the additive design) make sense as follow-ups. The >= 12 in test_portuguese_direct_address is intentional -- it locks the current weights so accidental score drift breaks the test loudly.

@web3guru888

Solid additive implementation. A few observations:

What's done well:

  • Extending extract_candidates to accept accented characters ([A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]) is the right change — João, Inês, Ângela would all be silently dropped from candidate extraction under the old ASCII-only regex. This affects detection quality for accented names even in English corpora, not just Portuguese.
  • The decision to NOT add caro/cara to STOPWORDS is correct. They're valid first names in English/Italian/Portuguese. The explanatory comment makes the reasoning explicit so future contributors don't accidentally add them.
  • Merging PTBR patterns in _build_patterns() at compile time rather than at match time is the right call — the compiled regex cache benefits from this.

One open question:
The PERSON_VERB_PATTERNS_PTBR list covers ~14 common verbs. For detection to trigger at the score_entity level, the name needs to appear 3+ times in candidate extraction AND score above the classification threshold. Portuguese corpora with primarily dialogue-style text (WhatsApp logs, meeting notes) will hit the threshold well, but technical documents in Portuguese that reference a person occasionally might not. Worth testing against that use case before the PR lands.

Test coverage:
The test_mixed_english_portuguese_corpus test is the most important — it proves that adding PTBR patterns doesn't degrade English detection. Good to see it explicitly here. The test_detect_entities_picks_up_accented_names with João and Inês is exactly the right integration test.

This is a genuine addition that benefits any workspace with Portuguese contributors. LGTM.

@mvalentsev mvalentsev force-pushed the feat/pt-br-entity-detection branch from 4bb281e to b6d597b Compare April 11, 2026 11:27
@mvalentsev
Contributor Author

The 3+ frequency threshold lives in extract_candidates itself and applies equally to English -- a name mentioned once or twice won't surface regardless of language. It's a pre-existing constraint on the whole detector, not something this PR introduces for PT-BR.
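
To make that gate concrete, here is a toy version of a 3+ mention threshold. The threshold value and language-agnostic behaviour mirror the comment above; the function itself is illustrative, not MemPalace's extract_candidates:

```python
import re
from collections import Counter

# Toy frequency gate: a capitalized candidate must appear min_count
# times to surface, regardless of language.
def toy_candidates(text, min_count=3):
    words = re.findall(r"\b[A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]+\b", text)
    return {w for w, n in Counter(words).items() if n >= min_count}

text = "João disse oi. João riu. João decidiu. Ana escreveu."
print(toy_candidates(text))  # {'João'} -- Ana appears only once
```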

@bensig bensig changed the base branch from main to develop April 11, 2026 22:23
@bensig bensig requested a review from igorls as a code owner April 11, 2026 22:23
@mvalentsev mvalentsev force-pushed the feat/pt-br-entity-detection branch 2 times, most recently from cc5f60c to 6e7946a Compare April 12, 2026 06:49
@mvalentsev mvalentsev force-pushed the feat/pt-br-entity-detection branch 4 times, most recently from 5639b00 to a55770a Compare April 13, 2026 23:49
@igorls igorls added the area/i18n Multilingual, Unicode, non-English embeddings label Apr 14, 2026
@mvalentsev mvalentsev force-pushed the feat/pt-br-entity-detection branch 2 times, most recently from c3229f9 to c0392be Compare April 15, 2026 06:03
igorls added a commit that referenced this pull request Apr 15, 2026
Move all entity-detection lexical patterns (person verbs, pronouns,
dialogue markers, project verbs, stopwords, candidate character class)
out of hardcoded module-level constants and into the entity section of
each locale's JSON in mempalace/i18n/. Adds a languages parameter to
every public function so callers union patterns across the desired
locales. The default stays ("en",), so all existing callers and tests
behave unchanged.

Also adds:
- get_entity_patterns(langs) helper in mempalace/i18n/ that merges
  patterns across requested languages, dedupes lists, unions stopwords,
  and falls back to English for unknown locales
- MempalaceConfig.entity_languages property + setter, with env var
  override (MEMPALACE_ENTITY_LANGUAGES, comma-separated)
- mempalace init --lang en,pt-br flag (persists to config.json)
- Per-language candidate_pattern so non-Latin scripts (Cyrillic,
  Devanagari, CJK) can register their own character classes instead of
  being silently dropped by the ASCII-only [A-Z][a-z]+ default
- _build_patterns LRU cache keyed by (name, languages) so multi-language
  callers don't poison each other's cache slots
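
A hedged sketch of the merge semantics listed above, with toy locale data and a toy loader (the real helper reads mempalace/i18n/*.json and handles many more keys): list-valued pattern keys are concatenated with de-duplication, stopwords are unioned, and unknown locales fall back to English.

```python
# Toy in-memory stand-in for the per-locale JSON files.
LOCALES = {
    "en": {"person_verb_patterns": [r"\b{name}\s+said\b"],
           "stopwords": {"hi", "hey"}},
    "pt-br": {"person_verb_patterns": [r"\b{name}\s+disse\b"],
              "stopwords": {"oi", "ola"}},
}

def toy_get_entity_patterns(langs):
    merged = {"person_verb_patterns": [], "stopwords": set()}
    for lang in langs:
        loc = LOCALES.get(lang, LOCALES["en"])  # English fallback
        for p in loc["person_verb_patterns"]:
            if p not in merged["person_verb_patterns"]:  # dedupe lists
                merged["person_verb_patterns"].append(p)
        merged["stopwords"] |= loc["stopwords"]  # union stopwords
    return merged

m = toy_get_entity_patterns(("en", "pt-br"))
print(len(m["person_verb_patterns"]), sorted(m["stopwords"]))
# 2 ['hey', 'hi', 'oi', 'ola']
```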

Why now: the open language PRs (#760 ru, #773 hi, #778 id, #907 it) only
add CLI strings via mempalace/i18n/. PR #156 (pt-br) is the first that
needed entity_detector changes and inlined a _PTBR variant of every
constant. That doesn't scale past 2-3 languages — every text gets
checked against every language's patterns regardless of relevance, and
candidate extraction still drops accented and non-Latin names.

This PR sets the standard so future locale contributors only edit one
JSON file (no Python changes), and entity detection scales linearly
with how many languages a user actually enabled, not how many ship.
@mvalentsev mvalentsev force-pushed the feat/pt-br-entity-detection branch from c0392be to 342568a Compare April 15, 2026 12:51
@mvalentsev
Contributor Author

@igorls Reworked as JSON-only per #911 -- first locale with the entity section. CLI strings, person-verb/pronoun/dialogue patterns, and a Latin+diacritics candidate pattern for accented names (João, Inês, etc). All CI green.

Also added a Cyrillic entity section to #760 (ru.json) following the same pattern.

@mvalentsev
Contributor Author

mvalentsev commented Apr 15, 2026

Heads up: the entity stopwords list here (30 words) is baseline only. Words like "Para", "Sobre", "Entre" at the start of a sentence match the candidate_pattern and produce false positives in entity detection. Probably worth expanding with Portuguese prepositions (para, sobre, entre, desde, contra, perante, etc.) and conjunctions (porém, contudo, embora, enquanto, etc.).

@igorls
Collaborator

igorls commented Apr 15, 2026

Excellent rework, @mvalentsev — clean shape, 128 lines of JSON vs 216 of Python, and you're the first locale using the new entity section. This becomes the reference for other contributors. CI all green against the current develop (with #758/#760 merged).

Two concrete issues I caught running it locally:

1. Typo in dialogue_patterns[0] — won't match markdown-style quotes

Current:

"^\">\\s*{name}[:\\s]",

That compiles to ^">\s*Maria[:\s] — which requires a literal "> at the start of the line. Standard markdown quote lines like > Maria: hello won't match. The en.json equivalent is "^>\\s*{name}[:\\s]" (no leading \"). Quick fix:

"^>\\s*{name}[:\\s]",

Verified locally — > Maria: hello fails against the current pattern and passes against the corrected one.
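
That local check can be reproduced standalone (patterns copied from above, name substituted for the {name} placeholder):

```python
import re

# The pattern with the stray literal \" vs the corrected one.
broken = re.compile("^\">\\s*{name}[:\\s]".format(name="Maria"), re.MULTILINE)
fixed = re.compile("^>\\s*{name}[:\\s]".format(name="Maria"), re.MULTILINE)

line = "> Maria: hello"
print(bool(broken.search(line)))  # False -- requires a literal "> prefix
print(bool(fixed.search(line)))   # True  -- matches the markdown quote
```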

2. Follow-up on your own stopwords note — concrete list

Your comment already flagged this, and I confirmed it: running the candidate_pattern against a pile of sentence-starting Portuguese prepositions/conjunctions, these currently surface as false-positive entity candidates:

| word | in entity.stopwords | would surface as candidate |
|---|---|---|
| Para | no | yes |
| Como | no | yes |
| Mas | no | yes |
| Porém | no | yes |
| Sobre | yes | no |
| Entre | yes | no |
| Talvez | yes | no |
| Depois | yes | no |

Your regex.stop_words (used by the AAAK compressor) already has para, como, mas, porém, embora, porque — but the entity.stopwords list (used by entity_detector) is a separate list and missing them. Worth syncing. Concrete suggestion to add:

para, como, mas, porém, contudo, embora, enquanto, porque, portanto, logo, todavia, desde, contra, perante, após, mediante, durante, conforme, segundo, exceto, pois, assim, também, apenas

Since pt-br is the reference implementation and the stopwords list ships with a tangible false-positive rate as-written, I'd prefer to roll this into the same PR rather than defer. Small follow-up commit should do it.

Nice-to-have, not blocking:

pronoun_patterns currently covers 3rd-person (ele/ela/deles/delas) but not 2nd-person (você, vocês) or possessives (seu, sua, seus, suas, teu, tua). Pronoun proximity is a weak signal, so missing these just means slightly lower person-confidence for people referenced in 2nd person. Up to you whether to add now or later.

The direct_address_pattern with ol[áa] and obrigad[oa] is a nice touch — handles both accented and unaccented casual typing.

Once the two above are addressed I'll merge. Thanks again for pushing through the rework.

@mvalentsev mvalentsev force-pushed the feat/pt-br-entity-detection branch from 9fd98dc to 540bab2 Compare April 15, 2026 17:04
@mvalentsev
Contributor Author

@igorls Both fixed. Also added the 2nd-person pronouns (você/vocês, seu/sua/seus/suas) while at it.

Verified locally: extract_candidates filters out Para/Como/Porém, dialogue_patterns[0] matches > Maria: hello, score_entity picks up the new pronoun proximity signals. 106 tests pass.

Heads up: pt-br is not my native language, I relied on LLM assistance for the linguistic choices. If any of the stopwords or verb forms look off to a native speaker, happy to correct.

@mvalentsev mvalentsev force-pushed the feat/pt-br-entity-detection branch from 540bab2 to e791806 Compare April 15, 2026 17:16
…oses MemPalace#117)

CLI strings, AAAK instruction, regex patterns, and entity section
with person-verb, pronoun, dialogue, and candidate patterns for
Latin+diacritics names (João, Inês, Ângela).

Follows the i18n entity framework from MemPalace#911.
- dialogue_patterns[0]: remove stray \" before > (fixes markdown quote matching)
- entity stopwords: add 40 prepositions, conjunctions, and common words to reduce false positives
- pronoun_patterns: add 2nd-person (você/vocês) and possessives (seu/sua/seus/suas)
@mvalentsev mvalentsev force-pushed the feat/pt-br-entity-detection branch from e791806 to 4221589 Compare April 15, 2026 18:32
@igorls igorls merged commit 57b0b14 into MemPalace:develop Apr 15, 2026
6 checks passed

Labels

area/i18n Multilingual, Unicode, non-English embeddings


Development

Successfully merging this pull request may close these issues.

feat: add PT-BR support for AAAK

4 participants