Add Dutch and French to Extractomat by r-kovalch · Pull Request #1 · lang-uk/extractomat

r-kovalch · 2025-08-05T15:24:29Z

This PR extends extractomat's support to Dutch and French.

While experimenting with the ACTER data (notebook), we found that the default implementation of extractomat returns no entities for French or Dutch. For Dutch, the issue was solved by extending the UPOS-to-Penn mapping dictionary. For French, the problem was caused by incorrect processing of space tokens, which we fixed with a simple if statement that filters out spaces during lemmatization. The space token is now allowed in the input and discarded at lemmatization.

TLDR

To introduce French to extractomat, we:

Added the space token "SPACE": "NN" at line 56 of the UPOS mapping
Filtered it out within the get_lemmatized_phrase function
Verified that all tokens were mapped on the ACTER dataset for French (see the notebook for an applied usage example)
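
The French fix can be illustrated with a minimal sketch. It is not the code from this PR: the mapping excerpt, the `Token` stand-in class, and the signature of `get_lemmatized_phrase` are assumptions made for illustration, and only the `"SPACE": "NN"` entry and the function name come from the description above. It assumes a tagger that labels whitespace tokens as `SPACE`, as spaCy does.

```python
from dataclasses import dataclass

# Hypothetical excerpt of a UPOS -> Penn Treebank mapping; the real
# dictionary in extractomat is larger and may differ.
UPOS_TO_PENN = {
    "NOUN": "NN",
    "PROPN": "NNP",
    "ADJ": "JJ",
    "ADP": "IN",
    "SPACE": "NN",  # whitespace tokens are accepted as input
}


@dataclass
class Token:
    """Minimal stand-in for a tagged token (not the library's own type)."""
    text: str
    lemma: str
    pos: str


def get_lemmatized_phrase(tokens):
    """Join the lemmas of a phrase, discarding whitespace tokens."""
    return " ".join(t.lemma for t in tokens if t.pos != "SPACE")


phrase = [
    Token("énergies", "énergie", "NOUN"),
    Token(" ", " ", "SPACE"),
    Token("renouvelables", "renouvelable", "ADJ"),
]
print(get_lemmatized_phrase(phrase))  # énergie renouvelable
```

Mapping `SPACE` keeps the tag lookup from failing on whitespace, while the filter keeps the whitespace out of the lemmatized term.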

To add Dutch to extractomat, we:

Added new entries to the UPOS mapping
Verified that all tokens were mapped on the ACTER dataset for Dutch (see the notebook for an applied usage example)
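
The verification step can be sketched as a coverage check that counts tags with no mapping entry. This is a hedged illustration, not the notebook's code: `find_unmapped_tags`, the small mapping, and the sample tokens are all hypothetical.

```python
from collections import Counter


def find_unmapped_tags(tagged_tokens, mapping):
    """Count the tags in (text, tag) pairs that have no entry in mapping."""
    return Counter(tag for _, tag in tagged_tokens if tag not in mapping)


# Hypothetical mapping excerpt and Dutch sample, for illustration only.
mapping = {"NOUN": "NN", "ADJ": "JJ", "SPACE": "NN"}
sample = [
    ("windturbine", "NOUN"),
    ("grote", "ADJ"),
    ("de", "DET"),
    ("het", "DET"),
]

missing = find_unmapped_tags(sample, mapping)
print(missing)  # Counter({'DET': 2})
```

An empty counter over the whole corpus means every tag is covered; any remaining keys point to the entries that still need to be added.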

Extractomat: Add Dutch and German

r-kovalch and others added 5 commits June 19, 2025 19:11

Added dutch mapping

f124f5f

Added dutch mapping

c2c37b4

Added dutch mapping

dacc87c

Updated matcha.py for extractomat

9183e8d

Merge pull request #1 from r-kovalch/acter

77308f5

Extractomat: Add Dutch and German

r-kovalch wants to merge 5 commits into lang-uk:master from r-kovalch:master