Add support for MasakhaPOS Dataset #3247

stefan-it · 2023-05-23T21:54:32Z

Hi,

this PR adds support for the recently proposed MasakhaPOS Dataset.

Details can be found in this tweet.

The dataset is available in this repo: https://github.com/masakhane-io/masakhane-pos

I received a preprint of the paper and wrote unit tests that check the number of parsed sentences in each dataset split for every language.
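As a rough illustration of what such a sentence-count check does, here is a minimal sketch that does not use flair at all. It assumes a two-column layout (token, then tag, separated by whitespace) with blank lines between sentences. The helper names `parse_two_column` and `count_sentences` and the toy data are invented for this example and are not taken from the PR or from MasakhaPOS.

```python
from typing import Dict, List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, upos) pairs


def parse_two_column(text: str) -> List[Sentence]:
    """Parse 'token<whitespace>upos' lines; a blank line ends a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        token, upos = line.split()[:2]
        current.append((token, upos))
    if current:  # the file may not end with a blank line
        sentences.append(current)
    return sentences


def count_sentences(splits: Dict[str, str]) -> Dict[str, int]:
    """Return the number of parsed sentences for every split."""
    return {name: len(parse_two_column(text)) for name, text in splits.items()}


# Invented toy data, NOT real MasakhaPOS content.
TOY_SPLITS: Dict[str, str] = {
    "train": "Hello INTJ\nworld NOUN\n\nIt PRON\nworks VERB\n. PUNCT\n",
    "dev": "Fine ADJ\n",
}

counts = count_sentences(TOY_SPLITS)
```

A real test would compare such counts against the split sizes reported for each language instead of toy values.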

Example usage of MasakhaPOS:

from flair.datasets import MASAKHA_POS

corpus = MASAKHA_POS(languages="bam")

stefan-it · 2023-05-23T21:55:27Z

/cc @dadelani

stefan-it · 2023-05-23T21:56:38Z

@alanakbik Please let me know if the dataset name is ok: it does not quite fit the UD_ naming scheme, because the dataset is not in Universal Dependencies format; it only has token and upos.
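To make the format difference concrete: a token line in the Universal Dependencies CoNLL-U format has ten tab-separated columns, of which a token-plus-upos dataset keeps only two. The sketch below reduces one CoNLL-U token line to that pair. The function name `conllu_to_token_upos` and the example line are invented for illustration and are not part of flair or of this PR.

```python
from typing import Tuple

# The ten columns of a CoNLL-U token line, in order.
CONLLU_FIELDS = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
]


def conllu_to_token_upos(conllu_line: str) -> Tuple[str, str]:
    """Keep only the word form and the universal POS tag of one token line."""
    columns = conllu_line.rstrip("\n").split("\t")
    if len(columns) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    row = dict(zip(CONLLU_FIELDS, columns))
    return row["FORM"], row["UPOS"]


# Invented example line, not taken from any treebank.
example = "1\tdogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_"
pair = conllu_to_token_upos(example)
```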

dadelani · 2023-05-23T21:57:39Z

@stefan-it, the dataset name is MasakhaPOS; the arXiv paper will be out tomorrow

stefan-it · 2023-05-23T22:05:20Z

Thanks @dadelani for the feedback, I have corrected the dataset name now :)

stefan-it · 2023-05-24T08:11:52Z

The preprint is now available here 🤗

stefan-it · 2023-05-24T11:33:44Z

Hi @helpmefindaname, do you happen to have an idea why the poetry stage gets stuck and runs into a timeout:

https://github.com/flairNLP/flair/actions/runs/5062453919/jobs/9099744358?pr=3247

This was already the case yesterday. I've just re-run the build, but got the same error.

helpmefindaname · 2023-05-26T10:44:14Z

The dependency resolution took needlessly long, as it tried out all 300+ boto3 versions with an incompatible dependency before taking the right one.
#3249 should fix the problems.
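A toy model of why this gets slow, assuming a resolver that walks candidate versions from newest to oldest and rejects each incompatible one in turn. Everything here is hypothetical: the version strings, the function `resolve_newest_first`, and the idea of narrowing the candidate list are illustrations only, and this conversation does not state how #3249 actually addresses the problem.

```python
from typing import Callable, List, Optional, Tuple


def resolve_newest_first(
    candidates: List[str], is_compatible: Callable[[str], bool]
) -> Tuple[Optional[str], int]:
    """Return (chosen version, number of candidates tried)."""
    attempts = 0
    for version in candidates:
        attempts += 1
        if is_compatible(version):
            return version, attempts
    return None, attempts


# Hypothetical package: 300 newer releases conflict, one older release fits.
candidates = [f"2.{minor}" for minor in range(300, 0, -1)] + ["1.0"]
compatible = {"1.0"}

# Without any constraint, every conflicting release is tried first.
unconstrained = resolve_newest_first(candidates, lambda v: v in compatible)

# With a narrower (hypothetical) version constraint, the search is short.
narrowed = [v for v in candidates if v.startswith("1.")]
constrained = resolve_newest_first(narrowed, lambda v: v in compatible)
```

In this toy setup the unconstrained search needs 301 attempts and the constrained one needs a single attempt; a real resolver also has to download and inspect metadata for each candidate, which is where the time goes.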

stefan-it · 2023-06-09T09:08:02Z

After #3258 I will do a rebase now :)


alanakbik · 2023-08-11T13:43:56Z

@stefan-it thanks for adding this! And thanks @dadelani for creating this dataset!

stefan-it changed the title ~~Add support for AfricaPOS Dataset~~ Add support for MasakhaPOS Dataset May 23, 2023

stefan-it force-pushed the add-africa-pos-dataset branch from 3515ba9 to 07630c5 June 9, 2023 09:09

stefan-it marked this pull request as draft June 11, 2023 08:41

stefan-it force-pushed the add-africa-pos-dataset branch from ac07402 to 47ca73c July 13, 2023 20:20

stefan-it force-pushed the add-africa-pos-dataset branch from 84b4f75 to 74f8602 August 8, 2023 19:50

stefan-it added 9 commits August 11, 2023 12:06

datasets: include AFRICA_POS implementation

d8fe0b5

datasets: add support for AfricaPOS dataset

2bc445b

tests: adjust test cases for MasakhaPOS dataset

9e1e26e

datasets: fix MASAKHA_POS name

da36e0f

datasets: add support for MasakhaPOS dataset

d077266

tests: adjust test cases for MasakhaPOS dataset

5c53910

datasets: sync with latest MasakhaPOS GitHub version: test splits are currently missing and luo + tsn are missing

d84092c

datasets: some minor work on MasakhaPOS dataset parsing

fccf83b

tests: sync MasakhaPOS tests with upstream repo

5bd4526

stefan-it force-pushed the add-africa-pos-dataset branch from 74f8602 to 5bd4526 August 11, 2023 10:06

stefan-it marked this pull request as ready for review August 11, 2023 10:07

datasets: type -> isinstance fix

2ddae63
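The diff of this commit is not shown in the conversation, so the following is only a generic sketch of what a "type -> isinstance" change usually looks like. It assumes an argument that may be a single language code or a list of codes; `normalize_languages` and `LanguageCode` are hypothetical names, not flair code.

```python
from typing import List, Union


class LanguageCode(str):
    """Hypothetical str subclass, used only to show the difference."""


def normalize_languages(languages: Union[str, List[str]]) -> List[str]:
    """Always return a list of language codes."""
    # isinstance() also matches subclasses of str, whereas an exact
    # comparison such as `type(languages) == str` does not.
    if isinstance(languages, str):
        return [languages]
    return list(languages)


single = normalize_languages("bam")
several = normalize_languages(["bam", "yor"])
subclassed = normalize_languages(LanguageCode("bam"))

# An exact type comparison would miss the subclass instance.
exact_match = type(LanguageCode("bam")) == str
```

Linters commonly flag exact type comparisons for this reason, so such a change is often made to satisfy a style check as much as to change behavior.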

alanakbik merged commit 10a63dd into master Aug 11, 2023
1 check passed

alanakbik deleted the add-africa-pos-dataset branch August 11, 2023 13:44
