PmatchContainer
A class for performing pattern matching.
The easiest way to perform pattern matching is usually with the functions hfst.compile_pmatch_expression and hfst.compile_pmatch_file.
Create a PmatchContainer based on definitions defs.

- defs: A tuple of transducers of type HFST_OLW_TYPE defining how pmatch is done.
An example: suppose we have a file named streets.txt that contains:

```
define CapWord UppercaseAlpha Alpha* ;
define StreetWordFr [{avenue} | {boulevard} | {rue}] ;
define DeFr [ [{de} | {du} | {des} | {de la}] Whitespace ] | [{d'} | {l'}] ;
define StreetFr StreetWordFr (Whitespace DeFr) CapWord+ ;
regex StreetFr EndTag(FrenchStreetName) ;
```

and which has earlier been compiled and stored in the file streets.pmatch.hfst.ol:

```python
defs = hfst.compile_pmatch_file('streets.txt')
ostr = hfst.HfstOutputStream(filename='streets.pmatch.hfst.ol', type=hfst.ImplementationType.HFST_OLW_TYPE)
for tr in defs:
    ostr.write(tr)
ostr.close()
```
We can then read the pmatch definitions from the file and perform string matching with:

```python
istr = hfst.HfstInputStream('streets.pmatch.hfst.ol')
defs = []
while not istr.is_eof():
    defs.append(istr.read())
istr.close()

cont = hfst.PmatchContainer(defs)
assert cont.match("Je marche seul dans l'avenue des Ternes.") == \
    "Je marche seul dans l'<FrenchStreetName>avenue des Ternes</FrenchStreetName>."
```
See also: hfst.compile_pmatch_file, hfst.compile_pmatch_expression
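Where hfst itself is not at hand, the tagging behaviour shown above can be illustrated with a plain-regex stand-in. This is an illustration only: the real matching is done by the compiled pmatch transducers, not by a regular expression, and the character classes below only approximate CapWord.

```python
import re

# Illustration only: a regex stand-in for the pmatch rules in streets.txt,
# showing the tag format that PmatchContainer.match() produces.
STREET_FR = re.compile(
    r"(?:avenue|boulevard|rue)"            # StreetWordFr
    r"(?: (?:de la|des|de|du) | d'| l')?"  # optional Whitespace DeFr
    r"[A-ZÀ-Ý][\w'-]*"                     # CapWord
    r"(?: [A-ZÀ-Ý][\w'-]*)*"               # further CapWords
)

def tag_streets(text):
    # Wrap each match in the tags produced by EndTag(FrenchStreetName).
    return STREET_FR.sub(
        lambda m: "<FrenchStreetName>" + m.group(0) + "</FrenchStreetName>",
        text)

print(tag_streets("Je marche seul dans l'avenue des Ternes."))
```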
Match the string input, returning it with matched substrings tagged (see the example above).
Get the locations of pmatched strings in the string input, where the results are limited as defined by time_cutoff and weight_cutoff.

- input: The input string.
- time_cutoff: Time cutoff, defaults to zero, i.e. no cutoff.
- weight_cutoff: Weight cutoff, defaults to infinity, i.e. no cutoff.

Returns: A tuple of tuples of Location.
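A sketch of how the tuple-of-tuples result could be consumed. The attribute set of hfst.Location is an assumption here (start, length, input and tag), so a namedtuple stand-in is used, and the result data below is fabricated for illustration from the street example above.

```python
from collections import namedtuple

# Stand-in for hfst.Location: the real class is assumed to expose
# start, length, input and tag attributes (an assumption here).
Location = namedtuple("Location", ["start", "length", "input", "tag"])

def tagged_spans(locations, tag):
    """Collect (start, end, matched_text) triples for every Location
    carrying the given tag, from a tuple-of-tuples result."""
    spans = []
    for group in locations:   # one inner tuple per matched position
        for loc in group:     # alternative analyses at that position
            if loc.tag == tag:
                spans.append((loc.start, loc.start + loc.length, loc.input))
    return spans

# Fabricated result for "Je marche seul dans l'avenue des Ternes.":
result = ((Location(22, 17, "avenue des Ternes", "FrenchStreetName"),),)
print(tagged_spans(result, "FrenchStreetName"))
```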
Tokenize input and return a list of tokens, i.e. strings.

- input: The string to be tokenized.
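Tokenization itself needs a compiled pmatch container, but the longest-match idea behind it, including the effect of multicharacter symbols (cf. tokenize_multichar below), can be shown in an hfst-free sketch. The transducer's symbol inventory is modelled here as a plain set of strings, which is an assumption made for illustration only.

```python
def greedy_tokenize(text, symbols):
    """Greedy longest-match tokenization: a simplified, hfst-free sketch.
    Unknown characters fall back to single-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        token = text[i]                   # fall back to one character
        for n in range(len(text) - i, 1, -1):
            if text[i:i + n] in symbols:  # longest known symbol wins
                token = text[i:i + n]
                break
        tokens.append(token)
        i += len(token)
    return tokens

# With a multicharacter symbol available, "don't" comes out as one token:
print(greedy_tokenize("don'tstop", {"don't", "do", "n't", "stop"}))
```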
Tokenize input and get a string representation of the tokenization (essentially the same that the command line tool hfst-tokenize would give).

- input: The input string to be tokenized.
- kwargs: Possible parameters are: output_format, max_weight_classes, dedupe, print_weights, print_all, time_cutoff, verbose, beam, tokenize_multichar.
- output_format: The format of the output; possible values are tokenize, xerox, cg, finnpos, giellacg, conllu and visl; tokenize being the default.
- max_weight_classes: Maximum number of best weight classes to output (where analyses with equal weight constitute a class), defaults to None, i.e. no limit.
- dedupe: Whether duplicate analyses are removed, defaults to False.
- print_weights: Whether weights are printed, defaults to False.
- print_all: Whether nonmatching text is printed, defaults to False.
- time_cutoff: Maximum number of seconds used per input after limiting the search.
- verbose: Whether input is processed verbosely, defaults to True.
- beam: Beam within which analyses must fall to get printed.
- tokenize_multichar: Tokenize input into multicharacter symbols present in the transducer, defaults to False.