Customize token_match for training #13757
Unanswered
ivan-kleshnin asked this question in Help: Coding & Implementations
Documentation at https://spacy.io/usage/training#custom-tokenizer suggests overriding `Tokenizer` properties instead of recreating it, and that does work in general. However, the following:

```python
import re

from spacy.util import registry


def token_match(text: str) -> re.Match | None:
    # just a matching dummy fn
    return None


@registry.callbacks("customize_tokenizer")
def make_customize_tokenizer():
    def customize_tokenizer(nlp):
        ...
        # add a special case
        nlp.tokenizer.add_special_case("_SPECIAL_", [{"ORTH": "_SPECIAL_"}])
        # the line added on top of the docs example:
        nlp.tokenizer.token_match = token_match

    return customize_tokenizer
```
fails with an error during training.
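For reference, I register the callback from the training config along the lines of the docs example; the relevant part of my `config.cfg` is roughly this (a sketch, the rest of the config is omitted):

```ini
[initialize]

[initialize.before_init]
@callbacks = "customize_tokenizer"
```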
If I fall back to a supposedly lower-level approach, it works fine. If I pass `nlp.tokenizer.token_match = nlp.Defaults.token_match` instead of a custom function, it also works fine.

The docs say the function should be "A function matching the signature of `re.compile(string).match` to find token matches." The signature above,

`def token_match(text: str) -> re.Match | None`

works outside of the training scope, e.g. for an existing model. How can I provide a completely custom function if I want if-else based, fully controlled logic instead of regular expressions?
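For concreteness, here is a sketch of the kind of function I mean (the rules are made-up placeholders, not my real ones). To stay compatible with the documented `re.compile(string).match` signature, it returns a match object spanning the whole string whenever my own if-else checks decide the text is a single token, and `None` otherwise:

```python
import re

# Placeholder pattern used only to produce an `re.Match` covering the whole
# string; the actual decision is made by plain if-else logic below.
_WHOLE_STRING = re.compile(r"(?s).+")


def custom_token_match(text: str) -> re.Match | None:
    # made-up rule: keep "@handle"-style strings as single tokens
    if text.startswith("@") and text[1:].isalnum():
        return _WHOLE_STRING.match(text)
    # made-up rule: keep long hyphenated compounds together
    if text.count("-") >= 2 and not text.endswith("-"):
        return _WHOLE_STRING.match(text)
    # fall through to the tokenizer's normal rules
    return None
```

Outside of training, assigning `nlp.tokenizer.token_match = custom_token_match` on a loaded pipeline behaves as expected; it is only the training setup above that fails.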