[Bug]: Sentence fails because of Segtok bug #3542

MattGPT-ai · 2024-08-31T01:35:05Z

Describe the bug

Some edge-case contractions cause SegTok's split_contractions to return an invalid result, which in turn makes Flair Sentence instances fail to initialize. The SegTok bug is reported here

To Reproduce

from flair.data import Sentence

Sentence("OʼHaraʼs")

Expected behavior

Not sure exactly what we would want here, but maybe ["Ohara", "'s"]

Or perhaps, if the input doesn't fit a predictable regex, we would just return the original string. That is maybe not ideal, but it should never cause a failure.
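A minimal sketch of that fallback idea, independent of Flair and segtok. The names tokens_align, tokenize_with_fallback and broken_tokenizer are hypothetical, and broken_tokenizer is only a stand-in for a tokenizer that emits text not present in the input; it does not reproduce segtok's actual output. The check mirrors the text.index(word, current_offset) lookup from the traceback below.

```python
from typing import Callable, List


def tokens_align(text: str, tokens: List[str]) -> bool:
    """Return True if every token can be found in `text`, in order."""
    offset = 0
    for token in tokens:
        position = text.find(token, offset)
        if position < 0:
            # This is the situation where text.index(...) would raise ValueError.
            return False
        offset = position + len(token)
    return True


def tokenize_with_fallback(text: str, tokenize: Callable[[str], List[str]]) -> List[str]:
    """Use `tokenize`, but fall back to whitespace splitting if its output
    cannot be aligned with the original text."""
    tokens = tokenize(text)
    if tokens_align(text, tokens):
        return tokens
    # Whitespace-split tokens are always substrings of the text, in order.
    return text.split()


def broken_tokenizer(text: str) -> List[str]:
    """Hypothetical tokenizer that rewrites U+02BC, so its tokens no longer match."""
    return text.replace("\u02bc", "'").split()


print(tokenize_with_fallback("O\u02bcHara\u02bcs", broken_tokenizer))
```

With this guard the worst case is a coarser tokenization rather than an exception.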

Logs and Stack traces

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/anaconda3/envs/resume_parser_py310/lib/python3.10/site-packages/flair/data.py", line 776, in __init__
    word_start_position: int = text.index(word, current_offset)
ValueError: substring not found

Screenshots

No response

Additional Context

Segtok as a package does not appear to be maintained, and hasn't been updated since 2019. The SynTok package is its successor (v2), and would be an improvement although I'm not sure it's a total drop-in replacement.

from syntok import tokenizer
tokenizer_instance = tokenizer.Tokenizer()
list(tokenizer_instance.tokenize("OʼHaraʼs"))

[<Token '' : 'O' @ 0>, <Token '' : 'ʼHaraʼs' @ 1>]

Environment

Versions:

Flair: 0.14.0
PyTorch: 2.3.0
Transformers: 4.41.1
GPU: False


helpmefindaname · 2024-09-06T11:53:34Z

Hi @MattGPT-ai
I suppose that bug won't be fixed on our side, but I want to highlight a few options:

Do some data cleaning at the character level: specifically, replace ʼ with ', “ with ", and so on. This will not only prevent this bug but also slightly improve the quality of predictions, as those special characters add no semantic meaning but (slightly) affect the transformer embeddings.
You can use another tokenizer if you like, by passing it through the use_tokenizer argument, i.e. Sentence(..., use_tokenizer=). Implementing your own tokenizer (for example one based on syntok) should be straightforward.
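The character-level cleaning from the first option could look roughly like the sketch below. The function name normalize_quotes and the exact set of characters in the mapping are assumptions for illustration; extend the table to whatever shows up in your data. Every replacement is one character for one character, so offsets into the text stay the same.

```python
# Map typographic apostrophes and quotes to their plain ASCII counterparts.
# The selection of characters here is an assumption, not an exhaustive list.
_QUOTE_MAP = str.maketrans({
    "\u02bc": "'",   # modifier letter apostrophe (the character in this report)
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Return `text` with typographic quotes replaced by ASCII quotes."""
    return text.translate(_QUOTE_MAP)


cleaned = normalize_quotes("O\u02bcHara\u02bcs")
print(cleaned)

# The cleaned string would then be passed on, e.g. Sentence(cleaned).
```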

About switching the default tokenizer: I am talking to @alanakbik about it. I recall that years ago there was a decision against syntok, but I am not sure whether it is still up to date.

helpmefindaname · 2024-09-13T10:36:05Z

#876 (comment) seems to still be current.

MattGPT-ai added the bug label Aug 31, 2024

MattGPT-ai closed this as completed Sep 28, 2024
