
Symbols

eaxelson edited this page Nov 16, 2017 · 21 revisions

Symbols in HFST

A transducer maps strings into strings. Strings are tokenized (i.e. divided) into symbols. Each transition in a transducer has an input symbol and an output symbol. If the input symbol of a transition matches the next symbol of the input string, that symbol is consumed and the transition's output symbol is produced.
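The mapping described above can be sketched with a toy deterministic transducer in plain Python. This is only an illustration of the idea: the names `TRANSITIONS`, `FINAL_STATES` and `transduce` are hypothetical and not part of HFST, and HFST does not store transducers this way.

```python
# Toy illustration only; this is not how HFST represents transducers.
# Each entry maps (state, input symbol) to (next state, output symbol).
TRANSITIONS = {
    (0, "c"): (1, "d"),
    (1, "a"): (2, "o"),
    (2, "t"): (3, "g"),
}
FINAL_STATES = {3}


def transduce(symbols):
    """Map a tokenized input to its output symbols, or None if rejected."""
    state = 0
    output = []
    for symbol in symbols:
        step = TRANSITIONS.get((state, symbol))
        if step is None:
            return None  # no transition matches this input symbol
        state, out_symbol = step
        output.append(out_symbol)  # the transition's output symbol is produced
    # the input is accepted only if it ends in a final state
    return output if state in FINAL_STATES else None


print(transduce(["c", "a", "t"]))  # ['d', 'o', 'g']
print(transduce(["c", "a"]))       # None
```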

There are three special symbols: epsilon, unknown and identity. An epsilon on the input side consumes no symbol; an epsilon on the output side produces no symbol. An unknown on the input side matches any symbol; an unknown on the output side produces any symbol. If an unknown is on both sides of a transition, it matches any symbol and produces any symbol other than the one that was matched on the input side. The identity matches any symbol and produces the same symbol. It must always occur on both sides of a transition. There is also a class of special symbols called flag diacritics. They are of the form @[PNDRCU][.][A-Z]+([.][A-Z]+)?@.
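The flag diacritic pattern given above can be checked with Python's standard `re` module. This is a minimal sketch that uses the pattern exactly as written in the text; the helper name `is_flag_diacritic` is an assumption made for illustration and is not an HFST function.

```python
import re

# Pattern copied from the text above; it is not taken from the HFST sources.
FLAG_DIACRITIC = re.compile(r"@[PNDRCU][.][A-Z]+([.][A-Z]+)?@")


def is_flag_diacritic(symbol):
    """Return True if the whole symbol has the form of a flag diacritic."""
    return FLAG_DIACRITIC.fullmatch(symbol) is not None


print(is_flag_diacritic("@U.FOO.ON@"))          # True
print(is_flag_diacritic("@_EPSILON_SYMBOL_@"))  # False
```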

Various functions differ in the way they handle special symbols, both in semantics and outward appearance.

Creating transitions from scratch and converting between AT&T and binary formats

The internal string representation for epsilon is @_EPSILON_SYMBOL_@ (hfst.EPSILON), for unknown @_UNKNOWN_SYMBOL_@ (hfst.UNKNOWN) and for identity @_IDENTITY_SYMBOL_@ (hfst.IDENTITY). These strings are used when referring to those symbols in individual transitions, e.g.

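import hfst

fsm = hfst.HfstBasicTransducer()  # the initial state 0 exists by default
fsm.add_state(1)
fsm.add_state(2)
fsm.set_final_weight(2, 0.5)  # state 2 is final with weight 0.5
fsm.add_transition(0, 1, hfst.EPSILON, hfst.UNKNOWN)    # epsilon to unknown
fsm.add_transition(1, 2, hfst.IDENTITY, hfst.IDENTITY)  # identity to identity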

or when reading and printing transitions in AT&T format (the same holds for prolog format):

0 1 @_EPSILON_SYMBOL_@ @_UNKNOWN_SYMBOL_@ 0.0
1 2 @_IDENTITY_SYMBOL_@ @_IDENTITY_SYMBOL_@ 0.0
2 0.5

There is also a shorter string for epsilon in AT&T format, @0@.
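To show how the two spellings of epsilon relate, here is a small plain-Python sketch that reads one AT&T line and normalizes @0@ to the long form. It is not HFST's own reader: the function name `parse_att_line` is hypothetical, and the sketch assumes that fields are separated by whitespace and that no symbol contains whitespace.

```python
EPSILON = "@_EPSILON_SYMBOL_@"


def parse_att_line(line):
    """Parse one AT&T line into a ('transition', ...) or ('final', ...) tuple."""
    fields = line.split()  # assumption: no symbol contains whitespace
    if len(fields) in (1, 2):
        # final state line: state number and optional weight
        weight = float(fields[1]) if len(fields) == 2 else 0.0
        return ("final", int(fields[0]), weight)
    if len(fields) in (4, 5):
        # transition line: source, target, input, output and optional weight
        source, target, isym, osym = fields[:4]
        weight = float(fields[4]) if len(fields) == 5 else 0.0
        # @0@ is the short spelling of epsilon
        isym = EPSILON if isym == "@0@" else isym
        osym = EPSILON if osym == "@0@" else osym
        return ("transition", int(source), int(target), isym, osym, weight)
    raise ValueError("not an AT&T line: %r" % line)


print(parse_att_line("0 1 @0@ @_UNKNOWN_SYMBOL_@ 0.0"))
print(parse_att_line("2 0.5"))
```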

Regular expressions

The syntax of regular expressions (hfst.regex, hfst.compile_lexc_file, hfst.compile_xfst_file) follows the Xerox formalism, where different symbols are used: 0 for epsilon, and ? for both unknown and identity. When ? appears on only one side of a transition (e.g. ?:foo or foo:?), it means the unknown. As a single symbol, ? means an identity-to-identity transition. On both sides of a transition (?:?), it means the combination of unknown-to-unknown AND identity-to-identity transitions. If only the unknown-to-unknown transition is needed, it can be given as the subtraction [?:? - ?]. Some examples:

hfst.regex('0:foo')   # epsilon to "foo"
hfst.regex('foo:0')   # "foo" to epsilon
hfst.regex('?:foo')   # any symbol to "foo"
hfst.regex('foo:?')   # "foo" to any symbol
hfst.regex('?:?')     # any symbol to any symbol
hfst.regex('?')       # any symbol to the same symbol
hfst.regex('?:? - ?') # any symbol to any other symbol

Note that unknowns and identities are expanded with the symbols that the expression becomes aware of during its compilation:

hfst.regex('?')           # equal to [?]
hfst.regex('? foo')       # equal to [[?|foo] foo]
hfst.regex('? foo bar:?') # equal to [[?|foo|bar] foo [bar:?|bar:bar|bar:foo]]
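The expansion shown in these examples can be imitated in plain Python. The sketch below only mirrors the behaviour described on this page and is not HFST's actual implementation; the function name `expand_pair` is hypothetical, and the handling of unknown-to-unknown follows the description that it never maps a symbol to itself.

```python
UNKNOWN = "@_UNKNOWN_SYMBOL_@"
IDENTITY = "@_IDENTITY_SYMBOL_@"


def expand_pair(isym, osym, known):
    """Return the set of symbol pairs that isym:osym stands for."""
    known = set(known)
    if isym == IDENTITY and osym == IDENTITY:
        # identity stays, and every known symbol maps to itself
        return {(IDENTITY, IDENTITY)} | {(s, s) for s in known}
    # an unknown on either side is joined by all known symbols
    inputs = {isym} | (known if isym == UNKNOWN else set())
    outputs = {osym} | (known if osym == UNKNOWN else set())
    pairs = {(i, o) for i in inputs for o in outputs}
    if isym == UNKNOWN and osym == UNKNOWN:
        # unknown-to-unknown never maps a symbol to itself
        pairs -= {(s, s) for s in known}
    return pairs


# bar:? with known symbols foo and bar, as in the last example above
print(sorted(expand_pair("bar", UNKNOWN, {"foo", "bar"})))
```

With the known symbols foo and bar, the call yields the three pairs bar:?, bar:bar and bar:foo, matching the last line of the examples.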

Also note that flag diacritics contain @ so they must be escaped with quotes if used in regular expressions, e.g. "@U.FOO.ON@".

Lookup and string extraction

In lookup, epsilon is printed as the empty string, and unknowns and identities are printed as the symbols that they are matched with (todo: for OL format, unknowns on the output side are not expanded):

>>> tr = hfst.regex('foo:0 bar:? ?')
>>> print(tr.lookup('foobara'))
(('bara', 0.0), ('fooa', 0.0))

In extract_paths, epsilon, unknown and identity are printed as such:

>>> tr = hfst.regex('foo:0 bar:? ?')
>>> print(tr.extract_paths())
{'foobar@_IDENTITY_SYMBOL_@': [('@_EPSILON_SYMBOL_@@_UNKNOWN_SYMBOL_@@_IDENTITY_SYMBOL_@', 0.0), ('@_EPSILON_SYMBOL_@bar@_IDENTITY_SYMBOL_@', 0.0), ('@_EPSILON_SYMBOL_@foo@_IDENTITY_SYMBOL_@', 0.0)],
 'foobarfoo': [('@_EPSILON_SYMBOL_@@_UNKNOWN_SYMBOL_@foo', 0.0), ('@_EPSILON_SYMBOL_@barfoo', 0.0), ('@_EPSILON_SYMBOL_@foofoo', 0.0)],
 'foobarbar': [('@_EPSILON_SYMBOL_@@_UNKNOWN_SYMBOL_@bar', 0.0), ('@_EPSILON_SYMBOL_@barbar', 0.0), ('@_EPSILON_SYMBOL_@foobar', 0.0)]}
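If the epsilons in such output are unwanted, they can be removed afterwards with ordinary string handling. The following is a minimal sketch that assumes the dictionary shape printed above (input string mapped to a list of output and weight pairs); the helper name `strip_epsilons` is hypothetical and not part of HFST.

```python
EPSILON = "@_EPSILON_SYMBOL_@"


def strip_epsilons(paths):
    """Remove epsilon strings from the inputs and outputs of extracted paths."""
    cleaned = {}
    for input_string, outputs in paths.items():
        key = input_string.replace(EPSILON, "")
        # inputs that become equal after stripping share one entry
        cleaned.setdefault(key, []).extend(
            (output.replace(EPSILON, ""), weight) for output, weight in outputs
        )
    return cleaned


# Two of the paths printed above, copied by hand
paths = {
    "foobarfoo": [
        ("@_EPSILON_SYMBOL_@@_UNKNOWN_SYMBOL_@foo", 0.0),
        ("@_EPSILON_SYMBOL_@barfoo", 0.0),
    ]
}
print(strip_epsilons(paths))
# {'foobarfoo': [('@_UNKNOWN_SYMBOL_@foo', 0.0), ('barfoo', 0.0)]}
```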

Converting strings into transducers

In hfst.fst() and hfst.fsa(), unknowns and identities are possible, but they are not expanded and they must be given pre-tokenised (this is true for all special and multi-character symbols).
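What pre-tokenising means can be sketched with a longest-match tokenizer in plain Python. This is only an illustration: the function name `tokenize` is hypothetical, it is not HFST's own tokenizer, and how the resulting symbols are passed on to hfst.fst() or hfst.fsa() is not shown here.

```python
def tokenize(text, multichar_symbols):
    """Split text into symbols, preferring the longest multi-character symbol."""
    # try longer symbols first; ignore empty strings, which would never advance
    symbols = sorted((s for s in multichar_symbols if s), key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for symbol in symbols:
            if text.startswith(symbol, i):
                tokens.append(symbol)
                i += len(symbol)
                break
        else:
            # no multi-character symbol starts here: take a single character
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("foo@_IDENTITY_SYMBOL_@", {"foo", "@_IDENTITY_SYMBOL_@"}))
# ['foo', '@_IDENTITY_SYMBOL_@']
```

Without the multi-character symbols, the same string would be split into single characters, which is why special symbols cannot be recognised in an untokenised string.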

Other functions

In hfst.compile_sfst_file, unknowns and identities cannot be used as they are not part of the SFST formalism. Also, the symbol for epsilon is <> and agreement variables must be used instead of flag diacritics.