Commit a55770a
fix: do not stopword caro/cara -- they are valid first names
caro and cara were in both STOPWORDS and PERSON_VERB_PATTERNS_PTBR as
direct-address markers. Because extract_candidates filters candidates
against STOPWORDS before pattern scoring runs, a person literally
named Cara or Caro (valid English/Italian/Portuguese first names) was
silently dropped from detection.
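
The ordering problem described above can be illustrated with a minimal sketch. The stopword sets, the tokenizing regex, and the body of extract_candidates below are hypothetical stand-ins, not the project's actual code; they only show why filtering before scoring hides a name that is also a stopword.

```python
import re

# Hypothetical stand-ins: the real STOPWORDS set is larger and the real
# extract_candidates does more than this.
STOPWORDS_BEFORE = {"oi", "ola", "obrigado", "obrigada", "caro", "cara"}
STOPWORDS_AFTER = STOPWORDS_BEFORE - {"caro", "cara"}


def extract_candidates(text, stopwords):
    """Return capitalized tokens that survive the stopword filter.

    Anything dropped here never reaches pattern scoring, so a stopword
    that doubles as a first name is lost silently.
    """
    tokens = re.findall(r"\b[A-Z][a-z]+\b", text)
    return [t for t in tokens if t.lower() not in stopwords]


text = "Cara said she would call Ana tomorrow."
print(extract_candidates(text, STOPWORDS_BEFORE))  # ['Ana']
print(extract_candidates(text, STOPWORDS_AFTER))   # ['Cara', 'Ana']
```

With the old set, Cara is removed before any scoring could consider it; with the reduced set it is kept as a candidate.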
Remove caro / cara from STOPWORDS and leave the explanatory comment in
place. The direct-address patterns still fire when these words precede
another name (caro Maria, cara Ana), so PT-BR behaviour is unchanged
for the filler-word case.
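
As a rough sketch of the filler-word case, assuming a direct-address pattern of the form "caro/cara followed by the candidate name" (the actual entries in PERSON_VERB_PATTERNS_PTBR are not shown in this commit page, so the pattern and helper name below are hypothetical):

```python
import re

# Hypothetical pattern template: the real PERSON_VERB_PATTERNS_PTBR
# entries may be shaped differently.
DIRECT_ADDRESS = r"\b(?:caro|cara)\s+{name}\b"


def is_directly_addressed(name, text):
    """True if `name` is immediately preceded by caro/cara in `text`."""
    pattern = DIRECT_ADDRESS.format(name=re.escape(name))
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(is_directly_addressed("Ana", "Cara Ana, tudo bem?"))    # True
print(is_directly_addressed("Maria", "Oi, caro Maria!"))      # True
print(is_directly_addressed("Cara", "Cara Ana, tudo bem?"))   # False
```

The pattern matches on the name that follows the marker, so it does not depend on caro / cara being in the stopword list.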
oi / olá / ola / obrigado / obrigada stay as stopwords -- they are
practically never first names in the corpora this detector targets.
1 parent ac4c0fd commit a55770a
2 files changed
Lines changed: 29 additions & 6 deletions
[Diff content not captured; only hunk positions survive.]
File 1: three hunks at original lines 112-117, 369-376 and 573-584 (10 additions, 6 deletions).
File 2: one hunk at original lines 504-506 (19 additions, 0 deletions).