- From: Jim Melton <jim.melton@acm.org>
- Date: Fri, 09 Apr 2004 14:32:02 -0600
- To: "Michael Kay" <mhk@mhk.me.uk>
- Cc: <public-qt-comments@w3.org>
Thanks to Mike for this excellent analysis and experiment. I (speaking as an individual who cares deeply about internationalization and collations) concur with his conclusions and his reasoning. Jim At 12:38 PM 4/9/2004 Friday, Michael Kay wrote: >In message: >http://lists.w3.org/Archives/Public/public-qt-comments/2004Feb/0972.html > >Henry Zongaro raised the question of ignorable collation units and their >effect on functions such as contains() and substring-before(). > >I've been doing a bit of investigation as to what Java does. > >Using the collation (which maps in a fairly obvious way to a Java >comparator) > >let $coll := "http://saxon.sf.net/collation?lang=en;strength=primary" > >I get > >compare("in-scope", "inscope", $coll) = 0 > >so it appears this is a collation in which hyphen is "ignorable". > >But it turns out that Java is actually generating 8 collation units for the >first string, and only 7 for the second. It is treating the strings as equal >because (I think) the difference between "-" and "" is a tertiary >difference, and tertiary differences are ignored when the collation strength >is primary. > >Using the same collation, I get: > >contains("in-scope", "-", $coll) = true >contains("in-scope", "inscope", $coll) = false >substring-before("in-scope", "-") = "in" >substring-after("in-scope", "-") = "scope" > >So as far as I can see, Java side-steps the problem in the Unicode algorithm >that the comment refers to. The hyphen is not really an "ignorable" >character at all, it generates a collation unit which is ignored at certain >levels. Therefore, the fact that hyphen is ignored in equality testing at a >certain level does not affect the results of the contains() function and its >friends, which produce the expected results. > >Having established that Java has no problem handling ignorables here, I'm >not sure what our specs need to say about the situation. I think it's a >non-problem and we should avoid mentioning it. (It's interesting, though, >that A eq B can be true, when contains(A,B) is false, under the same >collation). > >Michael Kay >(personal contribution) ======================================================================== Jim Melton --- Editor of ISO/IEC 9075-* (SQL) Phone: +1.801.942.0144 Oracle Corporation Oracle Email: jim dot melton at oracle dot com 1930 Viscounti Drive Standards email: jim dot melton at acm dot org Sandy, UT 84093-1063 Personal email: jim at melton dot name USA Fax : +1.801.942.3345 ======================================================================== = Facts are facts. However, any opinions expressed are the opinions = = only of myself and may or may not reflect the opinions of anybody = = else with whom I may or may not have discussed the issues at hand. = ========================================================================
Received on Friday, 9 April 2004 16:37:58 UTC