Symbols
A transducer maps strings into strings. Strings are tokenized (i.e. divided) into symbols. Each transition in a transducer has an input symbol and an output symbol. If the input symbol of a transition matches a symbol of the input string, that symbol is consumed and the output symbol of the transition is produced.
There are some special symbols, including the epsilon, unknown and identity.
There is also a class of special symbols called flag diacritics. They have the form:
@[PNDRCU][.][A-Z]+([.][A-Z]+)?@
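As a rough illustration of the flag diacritic form given above, the pattern can be checked with Python's standard re module. The helper name is_flag_diacritic is hypothetical and not part of hfst; this is only a sketch of how the pattern reads.

```python
import re

# Pattern copied from the text above. fullmatch() requires the whole
# symbol to match, not just a prefix.
FLAG_DIACRITIC = re.compile(r'@[PNDRCU][.][A-Z]+([.][A-Z]+)?@')

def is_flag_diacritic(symbol):
    """Hypothetical helper: True if symbol has the flag diacritic form."""
    return FLAG_DIACRITIC.fullmatch(symbol) is not None

print(is_flag_diacritic('@U.FOO.ON@'))          # True
print(is_flag_diacritic('@_EPSILON_SYMBOL_@'))  # False
```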
The epsilon consumes or produces no symbol. The unknown and the identity consume or produce any symbol, excluding the epsilon and flag diacritics.
An epsilon on the input side consumes no symbol; an epsilon on the output side produces no symbol. An unknown on the input side matches any symbol; an unknown on the output side produces any symbol. If an unknown is on both sides of a transition, it matches any symbol and produces any symbol other than the one that was matched on the input side. An identity matches any symbol and produces the same symbol. It must always occur on both sides of a transition.
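The rules above can be sketched in plain Python. This is not how hfst implements them: the function apply_transition and its explicit alphabet argument are assumptions made for illustration, and the three constants are simply the internal strings this section gives for the special symbols. The input symbol is assumed to be an ordinary symbol (not an epsilon or a flag diacritic).

```python
EPSILON = '@_EPSILON_SYMBOL_@'
UNKNOWN = '@_UNKNOWN_SYMBOL_@'
IDENTITY = '@_IDENTITY_SYMBOL_@'

def apply_transition(in_sym, out_sym, symbol, alphabet):
    """Hypothetical helper: apply transition in_sym:out_sym to `symbol`.

    Returns None if the transition does not match, otherwise a pair
    (consumed, outputs) where `outputs` is the set of symbols that may
    be produced (empty for an epsilon on the output side).
    """
    # An identity must occur on both sides of a transition.
    if (in_sym == IDENTITY) != (out_sym == IDENTITY):
        raise ValueError('identity must occur on both sides')

    # Input side: does the transition apply, and does it consume the symbol?
    if in_sym == EPSILON:
        consumed = False
    elif in_sym in (UNKNOWN, IDENTITY) or in_sym == symbol:
        consumed = True
    else:
        return None

    # Output side: which symbols can be produced?
    if out_sym == EPSILON:
        outputs = set()
    elif out_sym == IDENTITY:
        outputs = {symbol}
    elif out_sym == UNKNOWN:
        outputs = set(alphabet)
        if in_sym == UNKNOWN:
            # unknown-to-unknown never produces the matched symbol
            outputs.discard(symbol)
    else:
        outputs = {out_sym}
    return consumed, outputs

alphabet = {'a', 'b', 'c'}
consumed, outputs = apply_transition(UNKNOWN, UNKNOWN, 'a', alphabet)
print(consumed, sorted(outputs))  # True ['b', 'c']
```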
Various functions differ in the way they handle special symbols, both in semantics and outward appearance.
The internal string representation for epsilon is @_EPSILON_SYMBOL_@, for unknown @_UNKNOWN_SYMBOL_@ and for identity @_IDENTITY_SYMBOL_@. They are available as hfst.EPSILON, hfst.UNKNOWN and hfst.IDENTITY.
These strings are used when referring to those symbols in individual transitions, e.g.
import hfst
fsm = hfst.HfstIterableTransducer()
fsm.add_state(1)
fsm.add_state(2)
fsm.set_final_weight(2, 0.5)
fsm.add_transition(0, 1, hfst.EPSILON, hfst.UNKNOWN)
fsm.add_transition(1, 2, hfst.IDENTITY, hfst.IDENTITY)
or reading and printing transitions in AT&T format (also in prolog format):
0 1 @_EPSILON_SYMBOL_@ @_UNKNOWN_SYMBOL_@ 0.0
1 2 @_IDENTITY_SYMBOL_@ @_IDENTITY_SYMBOL_@ 0.0
2 0.5
There is also a shorter string for epsilon in AT&T format: @0@.
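To make the AT&T layout concrete, here is a minimal pure-Python reader for text of the shape shown above. It is only a sketch: parse_att is a hypothetical helper, not the hfst reader, and it assumes whitespace-separated fields, so symbols containing whitespace are not handled. It also normalizes the short form @0@ to the internal epsilon string.

```python
EPSILON = '@_EPSILON_SYMBOL_@'

def parse_att(text):
    """Hypothetical helper: parse AT&T-style lines.

    Returns (transitions, finals), where transitions is a list of
    (source, target, input, output, weight) tuples and finals maps
    final states to their weights.
    """
    transitions = []
    finals = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) >= 4:
            # Transition line: source target input output [weight]
            source, target, isym, osym = fields[:4]
            weight = float(fields[4]) if len(fields) > 4 else 0.0
            isym = EPSILON if isym == '@0@' else isym
            osym = EPSILON if osym == '@0@' else osym
            transitions.append((int(source), int(target), isym, osym, weight))
        else:
            # Final state line: state [weight]
            weight = float(fields[1]) if len(fields) > 1 else 0.0
            finals[int(fields[0])] = weight
    return transitions, finals

att = """0 1 @0@ @_UNKNOWN_SYMBOL_@ 0.0
1 2 @_IDENTITY_SYMBOL_@ @_IDENTITY_SYMBOL_@ 0.0
2 0.5
"""
transitions, finals = parse_att(att)
print(finals)  # {2: 0.5}
```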
The syntax of regular expressions (hfst.regex, hfst.compile_lexc_file, hfst.compile_xfst_file) follows the Xerox formalism, where the following symbols are used instead: 0 for epsilon, and ? for unknown and identity.
On one side of a transition only (e.g. ?:foo or foo:?), ? means the unknown. As a single symbol, ? means an identity-to-identity transition.
On both sides of a transition, ?:? means the combination of unknown-to-unknown AND identity-to-identity transitions.
If only the unknown-to-unknown transition is needed, it can be given as the subtraction [?:? - ?]. Some examples:
hfst.regex('0:foo') # epsilon to "foo"
hfst.regex('foo:0') # "foo" to epsilon
hfst.regex('?:foo') # any symbol (including "foo") to "foo"
hfst.regex('foo:?') # "foo" to any symbol (including "foo")
hfst.regex('?:?') # any symbol to any symbol
hfst.regex('?') # any symbol to the same symbol
hfst.regex('?:? - ?') # any symbol to any other symbol
Note that unknowns and identities are expanded with the symbols that the expression becomes aware of during its compilation:
hfst.regex('?') # same as [?]
hfst.regex('? foo') # same as [[?|foo] foo]
hfst.regex('? foo bar:?') # same as [[?|foo|bar] foo [bar:?|bar:bar|bar:foo]]
However, remember that unknowns and identities are not expanded with epsilons and flag diacritics:
hfst.regex('?') # same as [?]
hfst.regex('? "@U.FOO.ON@"') # same as [? "@U.FOO.ON@"]
hfst.regex('?:foo 0:bar') # same as [[?:foo|foo:foo|bar:foo] 0:bar]
Also note that flag diacritics contain @, so they must be escaped with quotes if used in regular expressions, e.g. "@U.FOO.ON@".
In lookup, the epsilon is printed as an empty string, and unknowns and identities are printed as the symbols they are matched with (todo: for OL format, unknowns on output side are not expanded):
>>> tr = hfst.regex('foo:0 bar:? ?')
>>> print(tr.lookup('foobara'))
(('bara', 0.0), ('fooa', 0.0))
In extract_paths, epsilon, unknown and identity are printed as such:
>>> tr = hfst.regex('foo:0 bar:? ?')
>>> print(tr.extract_paths())
{'foobar@_IDENTITY_SYMBOL_@': [('@_EPSILON_SYMBOL_@@_UNKNOWN_SYMBOL_@@_IDENTITY_SYMBOL_@', 0.0), ('@_EPSILON_SYMBOL_@bar@_IDENTITY_SYMBOL_@', 0.0), ('@_EPSILON_SYMBOL_@foo@_IDENTITY_SYMBOL_@', 0.0)],
'foobarfoo': [('@_EPSILON_SYMBOL_@@_UNKNOWN_SYMBOL_@foo', 0.0), ('@_EPSILON_SYMBOL_@barfoo', 0.0), ('@_EPSILON_SYMBOL_@foofoo', 0.0)],
'foobarbar': [('@_EPSILON_SYMBOL_@@_UNKNOWN_SYMBOL_@bar', 0.0), ('@_EPSILON_SYMBOL_@barbar', 0.0), ('@_EPSILON_SYMBOL_@foobar', 0.0)]}
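If the literal epsilon strings in such output get in the way, they can be removed afterwards with ordinary string handling. The sketch below works on a plain dict shaped like the output above; strip_epsilons is a hypothetical helper, not an hfst function, and the sample data is a subset of the result shown above.

```python
EPSILON = '@_EPSILON_SYMBOL_@'

def strip_epsilons(paths):
    """Hypothetical helper: remove epsilon strings from input and output strings.

    `paths` maps input strings to sequences of (output, weight) pairs.
    Inputs that become equal after stripping are merged.
    """
    cleaned = {}
    for inp, outputs in paths.items():
        key = inp.replace(EPSILON, '')
        bucket = cleaned.setdefault(key, [])
        for out, weight in outputs:
            bucket.append((out.replace(EPSILON, ''), weight))
    return cleaned

paths = {
    'foobarfoo': [('@_EPSILON_SYMBOL_@barfoo', 0.0),
                  ('@_EPSILON_SYMBOL_@foofoo', 0.0)],
}
print(strip_epsilons(paths))
# {'foobarfoo': [('barfoo', 0.0), ('foofoo', 0.0)]}
```

Unknown and identity strings are left untouched here, since they stand for symbols that are not known until lookup.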
In lookup, flag diacritics (TODO)
>>> regexp_pass = hfst.regex('"@U.FOO.ON@" foo:bar "@U.FOO.ON@"')
>>> regexp_fail = hfst.regex('"@U.FOO.ON@" foo:bar "@U.FOO.OFF@"')
When extracting paths, flag diacritics are by default obeyed and not printed:
>>> regexp_pass = hfst.regex('"@U.FOO.ON@" foo:bar "@U.FOO.ON@"')
>>> regexp_fail = hfst.regex('"@U.FOO.ON@" foo:bar "@U.FOO.OFF@"')
>>> print(regexp_pass.extract_paths())
{'foo': [('bar', 0.0)]}
>>> print(regexp_fail.extract_paths())
{}
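The difference between the two expressions above comes from the U (unification) flags: the second occurrence of a feature must carry the same value as the first. The following sketch checks only that rule on a list of symbols. It assumes the usual Xerox-style reading of U flags, ignores the other flag types (P, N, D, R, C), and the name obeys_u_flags is hypothetical rather than part of hfst.

```python
import re

# U flags with both a feature and a value, e.g. @U.FOO.ON@
U_FLAG = re.compile(r'@U[.]([A-Z]+)[.]([A-Z]+)@')

def obeys_u_flags(symbols):
    """Hypothetical helper: True if all U flags on the path unify."""
    values = {}  # feature -> value set by the first U flag seen
    for symbol in symbols:
        match = U_FLAG.fullmatch(symbol)
        if match is None:
            continue  # ordinary symbol or another flag type
        feature, value = match.groups()
        # setdefault stores the value on first sight and returns the
        # stored value afterwards, so a mismatch means unification fails.
        if values.setdefault(feature, value) != value:
            return False
    return True

print(obeys_u_flags(['@U.FOO.ON@', 'foo', '@U.FOO.ON@']))   # True
print(obeys_u_flags(['@U.FOO.ON@', 'foo', '@U.FOO.OFF@']))  # False
```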
Use hfst.tokenized_fst if you need special or other multicharacter symbols:
>>> tr = hfst.tokenized_fst(('@U.FOO.ON@',('foo','bar'),'@U.FOO.ON@'))
>>> print(tr)
0 1 @U.FOO.ON@ @U.FOO.ON@ 0
1 2 foo bar 0
2 3 @U.FOO.ON@ @U.FOO.ON@ 0
3 0
In hfst.compile_sfst_file, unknowns and identities cannot be used as they are not part of the SFST formalism. Also, the symbol for epsilon is <> and agreement variables must be used instead of flag diacritics.
In hfst.substitute, all special symbols are supported. For epsilon, unknown and identity, the internal representations (@_EPSILON_SYMBOL_@ etc.) are recognized. The function just performs a simple symbol substitution without considering any semantics.
In regular expressions:
f o o # three consecutive transitions with symbols 'f', 'o' and 'o'
{foo} # the same as above
foo # one transition with symbol 'foo'
In lookup, the input is tokenized using longest matching but using only symbols that occur on the input side of the transducer.
Example 1: we are looking up the string foo in the transducer [foo:bar] | [?:B ?:A ?:R]. The string is first tokenized as the single symbol {'foo'} because that multicharacter symbol occurs on the input side of the transducer. The lookup then matches the input side of the expression [foo:bar] and produces the output bar. (If the input had been tokenized as {'f','o','o'}, the result would have been BAR. However, multicharacter symbols take precedence, which is why they are used in the first place.) On the other hand, the input foofoofoo produces the output BAR because it is tokenized as {'foo','foo','foo'} and matches the input side of the expression [?:B ?:A ?:R].
Example 2: we are looking up the string foo in the transducer [f:0 o:0 o:foo]. The string is first tokenized as {'f','o','o'} because these are the only symbols that occur on the input side of the transducer. The lookup then matches the whole expression and produces the output foo. (If the input had been tokenized as the single symbol {'foo'}, it would not have matched the expression. Both Xerox's xfst and hfst-xfst consider that this would not be the correct interpretation.)
Example 3: we are looking up the string foo in the transducer [foo:bar]|[f:B o:A o:R]. The string is first tokenized as the single symbol {'foo'}, which matches the input side of the expression [foo:bar] and produces the output bar. The part [f:B o:A o:R] is basically redundant in this expression from the point of view of lookup.
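The tokenization used in the three examples can be approximated with a greedy longest-match loop over the symbols that occur on the input side. This is a sketch, not the hfst tokenizer: the function tokenize and the fallback to single characters for text that matches no known symbol are assumptions made for illustration.

```python
def tokenize(text, input_symbols):
    """Hypothetical helper: greedy longest-match tokenization.

    At each position the longest symbol from `input_symbols` that
    matches is taken; if none matches, one character is taken.
    """
    # Longest symbol length to try (at least 1 so the loop always advances).
    longest = max([1] + [len(s) for s in input_symbols])
    tokens = []
    pos = 0
    while pos < len(text):
        for length in range(min(longest, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if length == 1 or candidate in input_symbols:
                tokens.append(candidate)
                pos += length
                break
    return tokens

print(tokenize('foo', {'foo'}))            # ['foo']           (Example 1)
print(tokenize('foofoofoo', {'foo'}))      # ['foo', 'foo', 'foo']
print(tokenize('foo', {'f', 'o'}))         # ['f', 'o', 'o']   (Example 2)
print(tokenize('foo', {'foo', 'f', 'o'}))  # ['foo']           (Example 3)
```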