[Bug]: Sentence fails because of Segtok bug #3542

Closed
MattGPT-ai opened this issue Aug 31, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@MattGPT-ai
Contributor

Describe the bug

Certain edge-case contractions cause SegTok's split_contractions to return an invalid result, which in turn makes Flair Sentence instances fail to initialize. The bug has also been reported against SegTok.

To Reproduce

from flair.data import Sentence

Sentence("OʼHaraʼs")

Expected behavior

Not sure exactly what we would want here, but maybe ["Ohara", "'s"]

Or, if the input doesn't fit a predictable regex, we could just return the original string. That is maybe not ideal, but it should never cause a failure.

Logs and Stack traces

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/anaconda3/envs/resume_parser_py310/lib/python3.10/site-packages/flair/data.py", line 776, in __init__
    word_start_position: int = text.index(word, current_offset)
ValueError: substring not found
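
The traceback shows the failure coming from text.index(word, current_offset), which raises ValueError whenever a token cannot be found in the original text. A minimal sketch of the "fall back to the original string" idea from Expected behavior follows. It is not Flair's or SegTok's actual code: tokens_align, tokenize_with_fallback, and broken_tokenizer are hypothetical names, and the broken tokenizer's output is made up purely to trigger the fallback.

```python
from typing import Callable, List


def tokens_align(text: str, tokens: List[str]) -> bool:
    """Return True if every token can be located in `text`, in order."""
    current_offset = 0
    for word in tokens:
        # str.find returns -1 instead of raising like str.index does
        start = text.find(word, current_offset)
        if start == -1:
            return False
        current_offset = start + len(word)
    return True


def tokenize_with_fallback(
    text: str, tokenize: Callable[[str], List[str]]
) -> List[str]:
    """Use `tokenize`, but return the whole text as one token if its
    output cannot be aligned with the original text."""
    tokens = tokenize(text)
    return tokens if tokens_align(text, tokens) else [text]


def broken_tokenizer(text: str) -> List[str]:
    # Hypothetical stand-in for a tokenizer that emits a token
    # that is not a substring of its input.
    return ["not-in-text"]


print(tokenize_with_fallback("Mr. O'Hara", str.split))
print(tokenize_with_fallback("Mr. O'Hara", broken_tokenizer))
```

With a well-behaved tokenizer the tokens pass through unchanged; with the broken one the whole input comes back as a single token, so the offset lookup can no longer fail.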

Screenshots

No response

Additional Context

Segtok as a package does not appear to be maintained, and hasn't been updated since 2019. The SynTok package is its successor (v2), and would be an improvement although I'm not sure it's a total drop-in replacement.

from syntok import tokenizer
tokenizer_instance = tokenizer.Tokenizer()
list(tokenizer_instance.tokenize("OʼHaraʼs"))

[<Token '' : 'O' @ 0>, <Token '' : 'ʼHaraʼs' @ 1>]

Environment

Versions:

Flair: 0.14.0
Pytorch: 2.3.0
Transformers: 4.41.1
GPU: False

MattGPT-ai added the bug label Aug 31, 2024
@helpmefindaname
Collaborator

Hi @MattGPT-ai
I suppose that bug won't be fixed on our side, but I want to highlight a few options:

  • Do some data cleaning at the character level: specifically, replace ʼ with ', typographic double quotes with ", and so on. This will not only prevent this bug but also slightly improve prediction quality, as those special characters add no semantic meaning but (slightly) affect the transformer embeddings.
  • You can use another tokenizer if you like by passing the use_tokenizer argument, as in Sentence(..., use_tokenizer=...). Implementing your own tokenizer (for example, one based on syntok) should be straightforward.
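
The character-level cleaning from the first option can be sketched in plain Python. This is an illustration only: normalize_quotes is a hypothetical helper, and the set of characters in the mapping is an assumption, not an exhaustive list of typographic variants.

```python
# Assumed mapping of a few typographic quote characters to ASCII.
QUOTE_MAP = str.maketrans(
    {
        "\u02bc": "'",  # MODIFIER LETTER APOSTROPHE, as in "OʼHaraʼs"
        "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
        "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
        "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
        "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
    }
)


def normalize_quotes(text: str) -> str:
    """Replace the mapped typographic quotes with ASCII equivalents."""
    return text.translate(QUOTE_MAP)


cleaned = normalize_quotes("O\u02bcHara\u02bcs")
print(cleaned)  # O'Hara's
```

The cleaned string can then be passed to Sentence in place of the raw input. How the default tokenizer splits the cleaned text is not checked here.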

About switching the default tokenizer: I am talking to @alanakbik about it. I recall that years ago there was a decision against syntok, but I am not sure whether it is still up to date.

@helpmefindaname
Collaborator

#876 (comment) seems to still be current.
