Add Dutch and French to Extractomat #1

Open
r-kovalch wants to merge 5 commits into lang-uk:master from r-kovalch:master

r-kovalch commented Aug 5, 2025

This PR extends extractomat to support Dutch and French.

While experimenting with the ACTER data (notebook), we found that the default implementation of extractomat returns no entities for French or Dutch. For Dutch, the issue was solved by extending the UPOS-to-Penn mapping dictionary. For French, the cause was incorrect processing of space tokens; it was fixed by adding a simple if statement that filters out spaces during lemmatization. The space token is now allowed in the input and discarded at lemmatization.

TLDR

To introduce French to extractomat, we:

  • Added the space token "SPACE": "NN" to the UPOS mapping (line 56)
  • Filtered it out within the get_lemmatized_phrase function
  • Verified that all tokens were mapped on the ACTER dataset for French (see the notebook for an applied usage example)
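
The French change can be illustrated with a minimal sketch. This is not the code from the diff: the `Token` type, the `UPOS_TO_PENN` subset, and the function body are assumptions made for illustration, and only the `"SPACE": "NN"` entry and the idea of skipping space tokens during lemmatization come from the description above.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for a tagged token (not an extractomat type)."""
    text: str
    lemma: str
    upos: str


# Illustrative subset of a UPOS-to-Penn mapping; the real dictionary is larger.
UPOS_TO_PENN = {
    "NOUN": "NN",
    "PROPN": "NNP",
    "ADJ": "JJ",
    "SPACE": "NN",  # entry described in this PR, so space tokens are accepted as input
}


def get_lemmatized_phrase_sketch(tokens: List[Token]) -> str:
    """Join the lemmas of a phrase, discarding space tokens."""
    return " ".join(t.lemma for t in tokens if t.upos != "SPACE")


tokens = [
    Token("énergies", "énergie", "NOUN"),
    Token(" ", " ", "SPACE"),
    Token("renouvelables", "renouvelable", "ADJ"),
]
phrase = get_lemmatized_phrase_sketch(tokens)
print(phrase)
```

Mapping the space token keeps the tag lookup from failing on French input, while the filter keeps whitespace out of the lemmatized phrase.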

To add Dutch to extractomat, we:

  • Added new tokens to the UPOS mapping
  • Verified that all tokens were mapped on the ACTER dataset for Dutch (see notebook for applied example of usage)
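
The verification step for both languages amounts to checking that every tag seen in the data has an entry in the mapping. A possible sketch of such a check follows; the helper name `unmapped_tags`, the sample tags, and the `"SCONJ": "IN"` entry are hypothetical and do not reflect the tags actually added for Dutch.

```python
from collections import Counter
from typing import Dict, Iterable


def unmapped_tags(tags: Iterable[str], mapping: Dict[str, str]) -> Counter:
    """Count observed tags that have no entry in the mapping."""
    return Counter(tag for tag in tags if tag not in mapping)


# Hypothetical starting mapping and tags observed in a corpus.
mapping = {"NOUN": "NN", "ADJ": "JJ"}
observed = ["NOUN", "ADJ", "SCONJ", "NOUN", "SCONJ"]

missing_before = unmapped_tags(observed, mapping)

# Extend the mapping with the missing tag (illustrative choice of Penn tag).
mapping["SCONJ"] = "IN"
missing_after = unmapped_tags(observed, mapping)

print(missing_before, missing_after)
```

An empty counter after extending the mapping corresponds to the "all tokens were mapped" condition described above.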
