- From: Jim Melton <jim.melton@acm.org>
- Date: Fri, 09 Apr 2004 14:32:02 -0600
- To: "Michael Kay" <mhk@mhk.me.uk>
- Cc: <public-qt-comments@w3.org>
Thanks to Mike for this excellent analysis and experiment. I (speaking as
an individual who cares deeply about internationalization and collations)
concur with his conclusions and his reasoning.
Jim
At 12:38 PM 4/9/2004 Friday, Michael Kay wrote:
>In message:
>http://lists.w3.org/Archives/Public/public-qt-comments/2004Feb/0972.html
>
>Henry Zongaro raised the question of ignorable collation units and their
>effect on functions such as contains() and substring-before().
>
>I've been doing a bit of investigation as to what Java does.
>
>Using the collation (which maps in a fairly obvious way to a Java
>comparator)
>
>let $coll := "http://saxon.sf.net/collation?lang=en;strength=primary"
>
>I get
>
>compare("in-scope", "inscope", $coll) = 0
>
>so it appears this is a collation in which hyphen is "ignorable".
>
>But it turns out that Java is actually generating 8 collation units for the
>first string, and only 7 for the second. It is treating the strings as equal
>because (I think) the difference between "-" and "" is a tertiary
>difference, and tertiary differences are ignored when the collation strength
>is primary.
>
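
For anyone who wants to reproduce the unit counts, here is a minimal sketch that goes straight to java.text rather than through Saxon. It assumes that the Saxon collation URI above corresponds to Collator.getInstance(Locale.ENGLISH) at PRIMARY strength, and that the JDK returns a RuleBasedCollator for that locale; the class and method names are made up for illustration. The message above reports 8 units for "in-scope", 7 for "inscope", and a compare() result of 0.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

class CollationUnitCount {

    // Assumption: the English collator supplied by the JDK is a
    // RuleBasedCollator. The Collator API does not promise this, so the
    // cast would fail on an implementation that returns something else.
    static RuleBasedCollator primaryEnglish() {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        c.setStrength(Collator.PRIMARY);
        return (RuleBasedCollator) c;
    }

    // Count the collation elements ("collation units") the collator
    // generates for a string, regardless of comparison strength.
    static int countElements(RuleBasedCollator collator, String s) {
        CollationElementIterator it = collator.getCollationElementIterator(s);
        int n = 0;
        while (it.next() != CollationElementIterator.NULLORDER) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        RuleBasedCollator coll = primaryEnglish();
        System.out.println("units(in-scope) = " + countElements(coll, "in-scope"));
        System.out.println("units(inscope)  = " + countElements(coll, "inscope"));
        System.out.println("compare         = " + coll.compare("in-scope", "inscope"));
    }
}
```
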
>Using the same collation, I get:
>
>contains("in-scope", "-", $coll) = true
>contains("in-scope", "inscope", $coll) = false
>substring-before("in-scope", "-", $coll) = "in"
>substring-after("in-scope", "-", $coll) = "scope"
>
>So as far as I can see, Java side-steps the problem in the Unicode algorithm
>that the comment refers to. The hyphen is not really an "ignorable"
>character at all; it generates a collation unit that is ignored at certain
>levels. Therefore, the fact that hyphen is ignored in equality testing at a
>certain level does not affect the results of the contains() function and its
>friends, which produce the expected results.
>
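
One way to see why contains() and compare() can disagree is to treat contains() as a search for one sequence of collation units inside another. The sketch below does exactly that with java.text. It is an illustration under stated assumptions, not a description of how Saxon implements these functions: it compares complete collation elements without masking them by strength, it assumes the JDK's English collator is a RuleBasedCollator, and the class and method names are hypothetical.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class CollationContainsSketch {

    // Assumption: Collator.getInstance(Locale.ENGLISH) is a RuleBasedCollator.
    static RuleBasedCollator primaryEnglish() {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        c.setStrength(Collator.PRIMARY);
        return (RuleBasedCollator) c;
    }

    // Collect the raw collation elements generated for a string.
    static List<Integer> elements(RuleBasedCollator collator, String s) {
        List<Integer> out = new ArrayList<>();
        CollationElementIterator it = collator.getCollationElementIterator(s);
        for (int e = it.next(); e != CollationElementIterator.NULLORDER; e = it.next()) {
            out.add(e);
        }
        return out;
    }

    // Hypothetical contains(): true if the needle's element sequence occurs
    // as a contiguous run inside the haystack's element sequence.
    static boolean contains(RuleBasedCollator collator, String haystack, String needle) {
        List<Integer> h = elements(collator, haystack);
        List<Integer> n = elements(collator, needle);
        return Collections.indexOfSubList(h, n) >= 0;
    }

    public static void main(String[] args) {
        RuleBasedCollator coll = primaryEnglish();
        System.out.println("compare  = " + coll.compare("in-scope", "inscope"));
        System.out.println("contains(in-scope, -)       = " + contains(coll, "in-scope", "-"));
        System.out.println("contains(in-scope, inscope) = " + contains(coll, "in-scope", "inscope"));
    }
}
```

Because the hyphen contributes a unit of its own, the units for "inscope" never appear as a contiguous run inside those for "in-scope", even though compare() skips that unit at primary strength.
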
>Having established that Java has no problem handling ignorables here, I'm
>not sure what our specs need to say about the situation. I think it's a
>non-problem and we should avoid mentioning it. (It's interesting, though,
>that A eq B can be true while contains(A, B) is false under the same
>collation.)
>
>Michael Kay
>(personal contribution)
========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL) Phone: +1.801.942.0144
Oracle Corporation Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive Standards email: jim dot melton at acm dot org
Sandy, UT 84093-1063 Personal email: jim at melton dot name
USA Fax : +1.801.942.3345
========================================================================
= Facts are facts. However, any opinions expressed are the opinions =
= only of myself and may or may not reflect the opinions of anybody =
= else with whom I may or may not have discussed the issues at hand. =
========================================================================
Received on Friday, 9 April 2004 16:37:58 UTC