There are a number of issues with the collation rules in ICU syntax that it would be good to resolve. A short example might help. Here is the first line of a simple sort order specification: a/A aa á/Á, and here is the start of the ICU-style collation tailoring generated from it: [before 1] [first regular] < a\/A << aa << á\/Á.
ICU's rule parser distinguishes strings from syntactic elements. Thus < is a syntactic element, as is /. Unescaped, a/A is parsed as three elements (a, / and A), which form an expansion that effectively says: sort a after the previous element with an A appended. If, on the other hand, the / is escaped, as in a\/A (as in the generated LDML), the / is treated as part of the string and the whole is parsed as the single string a/A, which is not what is wanted either. The correct way to interpret / in the simple ordering is as a third-level (tertiary) difference, so a/A should convert to a <<< A.
In general, this means that:
syntactic parts of the collation rule should not be escaped
syntax characters that occur inside collation element strings should be escaped
I think this means you can't just run the whole collation rule through a general escaper/unescaper. Instead, the escaping needs to be inserted when the collation rule is generated from the simple rules. That is, the ICU generator should produce syntactically correct ICU tailoring from the get-go, and that tailoring should just be copied into the LDML inside a CDATA section. No extra escaping is needed beyond what ICU wants to see.
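As a rough illustration of generating the tailoring this way, here is a minimal Python sketch. It is not the SIL.WritingSystems generator: the function names `escape_element` and `simple_to_icu` are hypothetical, and the sketch assumes that in the simple format a new line starts a primary difference, whitespace separates secondary differences, and / separates tertiary differences. It also assumes that every ASCII punctuation character counts as ICU rule syntax and can be backslash-escaped inside an element. The reset prefix ([before 1] [first regular]) and minimisation are left out.

```python
import string

# Assumption: all ASCII punctuation/symbol characters are reserved rule syntax,
# so they are escaped whenever they appear inside a collation element string.
_SYNTAX_CHARS = set(string.punctuation)


def escape_element(text):
    """Backslash-escape syntax characters inside one collation element."""
    return "".join("\\" + ch if ch in _SYNTAX_CHARS else ch for ch in text)


def simple_to_icu(simple_rules):
    """Convert simple sort rules into a chain of ICU relations.

    The relation operators are emitted unescaped; only the element
    strings themselves pass through escape_element().
    """
    parts = []
    for line in simple_rules.splitlines():
        for g_index, group in enumerate(line.split()):
            for e_index, element in enumerate(group.split("/")):
                if e_index > 0:
                    op = "<<<"   # after a "/": tertiary difference
                elif g_index > 0:
                    op = "<<"    # after whitespace: secondary difference
                else:
                    op = "<"     # start of a line: primary difference
                parts.append(f"{op} {escape_element(element)}")
    return " ".join(parts)


print(simple_to_icu("a/A aa á/Á"))
```

The point of the design is that the operators never pass through the escaper, so the / of the simple format becomes <<< rather than \/, while a syntax character that really belongs to an element (for example the hyphen in a-b) is still escaped.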
And just to rub it in. The current LDML collation rules, therefore, are junky and cannot be used by any other tools.
For example, when I read in LDML from DBL bundles, I dump the ICU collation and regenerate it (complete with minimisation) from the simple order. I notice that SIL.WritingSystems does the same in ignoring the ICU tailoring, which could explain why the generated ICU rules aren't getting any testing?
SIL.WritingSystem/LdmlCollationParser.cs. The output is a simple copy of the data directly, but the parser does some transformation of the tailoring string. I wonder if it should be the other way around and the generation from Simple to ICU would do all the escaping. The mapping from LDML to ICU is 1:1 with no transformation needed.
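To illustrate the 1:1 idea, the following Python sketch reads a tailoring string out of a CDATA section with the standard library XML parser and returns it without any unescaping step. The sample LDML fragment, its rule string (including the leading & reset marker) and the helper name `read_icu_rules` are assumptions made for this example, not taken from SIL.WritingSystems or from a real DBL bundle.

```python
import xml.etree.ElementTree as ET

# Hypothetical LDML fragment; the rule string is illustrative only.
LDML_SAMPLE = """<ldml>
  <collations>
    <collation type="standard">
      <cr><![CDATA[&[before 1] [first regular] < a <<< A << aa << á <<< Á]]></cr>
    </collation>
  </collations>
</ldml>"""


def read_icu_rules(ldml_text, collation_type="standard"):
    """Return the text of <cr> for the given collation type, unchanged."""
    root = ET.fromstring(ldml_text)
    for collation in root.iter("collation"):
        if collation.get("type") == collation_type:
            cr = collation.find("cr")
            if cr is not None and cr.text is not None:
                # The XML parser already yields the CDATA content verbatim,
                # so the string can be handed to ICU as-is.
                return cr.text
    return None


print(read_icu_rules(LDML_SAMPLE))
```

If the generator has already produced valid ICU syntax, nothing more than this copy is needed on the reading side.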