Describe the bug
Certain edge cases of contractions cause SegTok's split_contractions to return an invalid result, which in turn causes Flair Sentence instances to fail on init. The underlying bug has also been reported to the SegTok project.
To Reproduce
from flair.data import Sentence
Sentence("OʼHaraʼs")
Expected behavior
Not sure exactly what we would want here, but maybe ["Ohara", "'s"]
Alternatively, if the input doesn't fit a predictable regex, we could just return the original string unsplit. That is maybe not ideal, but it should never cause a failure.
Logs and Stack traces
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/user/anaconda3/envs/resume_parser_py310/lib/python3.10/site-packages/flair/data.py", line 776, in __init__
word_start_position: int = text.index(word, current_offset)
ValueError: substring not found
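To illustrate why that line raises, here is a minimal sketch of the same left-to-right offset lookup using only str.index. The align_tokens helper and the "invalid" token list are hypothetical, made up for illustration; they are not Flair's actual code or SegTok's actual output for this input.

```python
from typing import List, Tuple


def align_tokens(text: str, tokens: List[str]) -> List[Tuple[str, int]]:
    """Locate each token in text, scanning left to right (hypothetical helper)."""
    offsets = []
    current_offset = 0
    for word in tokens:
        # str.index raises ValueError if word does not occur at or after current_offset
        start = text.index(word, current_offset)
        offsets.append((word, start))
        current_offset = start + len(word)
    return offsets


text = "OʼHaraʼs"

# Tokens that are exact, in-order substrings of the text align fine.
good = align_tokens(text, [text[0], text[1:]])

# Hypothetical invalid split: the suffix token is emitted twice,
# so the second lookup starts past the end of the text and fails.
failure = None
try:
    align_tokens(text, [text[:6], text[6:], text[6:]])
except ValueError as err:
    failure = str(err)

print(good)
print(failure)
```

Any tokenizer output that is not an in-order sequence of substrings of the input will trip the lookup in the same way, which is why an invalid split_contractions result surfaces as a failure in Sentence.__init__.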
Screenshots
No response
Additional Context
Segtok as a package does not appear to be maintained, and hasn't been updated since 2019. The SynTok package is its successor (v2), and would be an improvement although I'm not sure it's a total drop-in replacement.
from syntok import tokenizer
tokenizer_instance = tokenizer.Tokenizer()
list(tokenizer_instance.tokenize("OʼHaraʼs"))
Hi @MattGPT-ai
I suppose that bug won't be fixed on our side, but I want to highlight a few options:
Clean the data at the character level: specifically, replace ʼ with ' and “ with " and so on. This not only prevents this bug but also slightly improves the quality of predictions, as those special characters add no semantic meaning but (slightly) affect the transformer embeddings.
Use another tokenizer, if you like, by passing the use_tokenizer argument, as in Sentence(..., use_tokenizer=). Implementing your own tokenizer (for example, one based on syntok) should be straightforward.
About switching the default tokenizer: I am talking to @alanakbik about it. I recall that years ago there was a decision against syntok, but I am not sure whether that still holds.
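The character-level cleaning from the first option could look roughly like the sketch below. The exact set of characters to map is an assumption (only ʼ and “ are named above; the others are common typographic variants added for illustration), and normalize_quotes is a hypothetical helper name, not part of Flair.

```python
# Assumed mapping of typographic quote characters to their ASCII equivalents.
CHAR_MAP = str.maketrans({
    "\u02bc": "'",  # modifier letter apostrophe
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Replace typographic apostrophes and quotes with plain ASCII ones."""
    return text.translate(CHAR_MAP)


cleaned = normalize_quotes("O\u02bcHara\u02bcs")
print(cleaned)
```

The cleaned string can then be passed to Sentence as usual. Note that this changes the text itself, so character offsets still line up with the cleaned string but the original characters are not preserved.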
For reference, the syntok snippet under Additional Context returns:
[<Token '' : 'O' @ 0>, <Token '' : 'ʼHaraʼs' @ 1>]
Environment
Versions:
Flair: 0.14.0
PyTorch: 2.3.0
Transformers: 4.41.1
GPU: False