Commit 4f05ed5
feat: add Brazilian Portuguese support to entity_detector
Closes #117.
Extends the detector so a file written in pt-br is treated the same
way a file in English is: names are extracted as candidates, and
verb / pronoun / dialogue / direct-address patterns contribute to the
person vs project classification. Results for pure-English corpora are
unchanged because every change is additive.
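A minimal sketch of why concatenating pattern lists leaves English matching intact. The template strings, the `PERSON_VERB_PATTERNS_EN` name, and the helpers `build_person_verb_patterns` and `count_hits` are hypothetical stand-ins, not the actual contents of entity_detector; only the idea of appending a pt-br list to the English one comes from this commit.

```python
import re

# Hypothetical, abbreviated template lists; the real constants are longer.
PERSON_VERB_PATTERNS_EN = [r"\b{name}\s+said\b", r"\b{name}\s+asked\b"]
PERSON_VERB_PATTERNS_PTBR = [r"\b{name}\s+disse\b", r"\b{name}\s+perguntou\b"]


def build_person_verb_patterns(name, include_ptbr=True):
    """Compile one matcher per template for the given entity name."""
    templates = list(PERSON_VERB_PATTERNS_EN)
    if include_ptbr:
        # Purely additive: the English templates stay first and untouched.
        templates += PERSON_VERB_PATTERNS_PTBR
    escaped = re.escape(name)
    return [re.compile(t.format(name=escaped), re.IGNORECASE) for t in templates]


def count_hits(name, text, include_ptbr=True):
    """Count how many templates match somewhere in the text."""
    patterns = build_person_verb_patterns(name, include_ptbr)
    return sum(1 for p in patterns if p.search(text))
```

In this sketch an English sentence scores the same with or without the pt-br templates, while a Portuguese sentence only scores once they are appended.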
The concrete pieces:
- New PERSON_VERB_PATTERNS_PTBR, PRONOUN_PATTERNS_PTBR,
DIALOGUE_PATTERNS_PTBR constants with the Portuguese equivalents of
the existing English signals (said / asked / replied / thinks /
wants, plus greetings oi / olá / obrigado / caro).
- _build_patterns concatenates the English and pt-br lists for the
dialogue and person-verb buckets, so _every_ compiled matcher for
an entity now covers both languages.
- score_entity merges the English and pt-br pronoun lists for the
proximity check.
- extract_candidates widens its character class to cover Latin-1
accented letters, so names like João, Inês, Ângela, and André flow
through candidate extraction instead of being silently dropped by an
ASCII-only regex.
- STOPWORDS adds the Portuguese greeting fillers (oi, olá, obrigado,
obrigada, caro, cara) so they do not masquerade as entity
candidates when they start sentences.
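The candidate-extraction and stopword pieces above can be sketched as follows. Both regexes and the function `extract_candidates_sketch` are assumptions for illustration, not the expressions used in entity_detector; the stopword set holds only the six fillers named in this commit.

```python
import re

# The six pt-br greeting fillers named in this commit.
STOPWORDS_PTBR = {"oi", "olá", "obrigado", "obrigada", "caro", "cara"}

# Assumed ASCII-only class: cannot match a word containing an accented letter.
ASCII_NAME = re.compile(r"\b[A-Z][a-z]+\b")

# Assumed widened class: adds Latin-1 accented letters
# (À-Ö and Ø-Þ uppercase; ß-ö and ø-ÿ lowercase). Expects precomposed (NFC) text.
LATIN1_NAME = re.compile(r"\b[A-ZÀ-ÖØ-Þ][a-zß-öø-ÿ]+\b")


def extract_candidates_sketch(text, pattern=LATIN1_NAME):
    """Return capitalized words in order of first appearance, minus stopwords."""
    seen = []
    for word in pattern.findall(text):
        if word.lower() in STOPWORDS_PTBR:
            continue  # greeting filler at sentence start, not an entity
        if word not in seen:
            seen.append(word)
    return seen


sample = "Oi João, obrigado. Inês e Ângela falaram com André."
print(extract_candidates_sketch(sample))              # widened class
print(extract_candidates_sketch(sample, ASCII_NAME))  # ASCII-only class
```

With the ASCII-only pattern, the word boundary cannot fall between a plain letter and an accented one, so the accented names never match at all, which is the silent drop the commit describes.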
The new tests in tests/test_entity_detector.py cover English regression,
pt-br person verbs (with a direct _build_patterns assertion so the
signal source is unambiguous), pt-br pronoun proximity, direct
address, a mixed-language corpus compared against English-only,
Portuguese dialogue markers in quoted speech, and end-to-end
detect_entities runs for both ASCII (Maria) and accented (João,
Inês) names.
1 parent 0fdd086 commit 4f05ed5
2 files changed
Lines changed: 191 additions & 9 deletions