Re: [F&O] IBM-FO-104: Description of substring matching should account for ignorable collation units

Thanks to Mike for this excellent analysis and experiment.  I (speaking as 
an individual who cares deeply about internationalization and collations) 
concur with his conclusions and his reasoning.

Jim

At 12:38 PM 4/9/2004 Friday, Michael Kay wrote:

>In message:
>http://lists.w3.org/Archives/Public/public-qt-comments/2004Feb/0972.html
>
>Henry Zongaro raised the question of ignorable collation units and their
>effect on functions such as contains() and substring-before().
>
>I've been doing a bit of investigation as to what Java does.
>
>Using the collation (which maps in a fairly obvious way to a Java
>comparator)
>
>let $coll := "http://saxon.sf.net/collation?lang=en;strength=primary"
>
>I get
>
>compare("in-scope", "inscope", $coll) = 0
>
>so it appears this is a collation in which hyphen is "ignorable".
>
>But it turns out that Java is actually generating 8 collation units for the
>first string, and only 7 for the second. It is treating the strings as equal
>because (I think) the difference between "-" and "" is a tertiary
>difference, and tertiary differences are ignored when the collation strength
>is primary.
>
>Using the same collation, I get:
>
>contains("in-scope", "-", $coll) = true
>contains("in-scope", "inscope", $coll) = false
>substring-before("in-scope", "-", $coll) = "in"
>substring-after("in-scope", "-", $coll) = "scope"
>
>So as far as I can see, Java side-steps the problem in the Unicode algorithm
>that the comment refers to. The hyphen is not really an "ignorable"
>character at all, it generates a collation unit which is ignored at certain
>levels. Therefore, the fact that hyphen is ignored in equality testing at a
>certain level does not affect the results of the contains() function and its
>friends, which produce the expected results.
>
>Having established that Java has no problem handling ignorables here, I'm
>not sure what our specs need to say about the situation. I think it's a
>non-problem and we should avoid mentioning it. (It's interesting, though,
>that A eq B can be true, when contains(A,B) is false, under the same
>collation).
>
>Michael Kay
>(personal contribution)
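
For anyone who wants to repeat the experiment quoted above outside Saxon,
the following is a minimal sketch in plain Java. It is not Saxon's code.
It assumes that Collator.getInstance(Locale.ENGLISH) at PRIMARY strength
behaves like the collation URI in the message, and that the JDK returns a
RuleBasedCollator (true of the standard JDK, but not guaranteed by the
API). The class name IgnorableDemo and the helpers elements() and
containsByElements() are hypothetical names introduced for illustration;
containsByElements() is only one possible way to define substring matching
over collation elements.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class IgnorableDemo {

    // Assumption: an English collator at PRIMARY strength stands in for
    // "http://saxon.sf.net/collation?lang=en;strength=primary".
    static RuleBasedCollator primaryEnglish() {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        c.setStrength(Collator.PRIMARY);
        return (RuleBasedCollator) c; // assumes the JDK's default implementation
    }

    // Collect the raw collation elements Java generates for a string.
    static List<Integer> elements(RuleBasedCollator coll, String s) {
        List<Integer> out = new ArrayList<>();
        CollationElementIterator it = coll.getCollationElementIterator(s);
        for (int e = it.next(); e != CollationElementIterator.NULLORDER; e = it.next()) {
            out.add(e);
        }
        return out;
    }

    // Hypothetical contains(): true when the element sequence of 'sub'
    // occurs contiguously inside the element sequence of 'text'.
    // Elements are compared whole, so the strength setting is not consulted.
    static boolean containsByElements(RuleBasedCollator coll, String text, String sub) {
        return Collections.indexOfSubList(elements(coll, text), elements(coll, sub)) >= 0;
    }

    public static void main(String[] args) {
        RuleBasedCollator coll = primaryEnglish();

        // The message above reports that these two strings compare as equal.
        System.out.println("compare: " + coll.compare("in-scope", "inscope"));

        // The message above reports 8 collation units versus 7.
        System.out.println("units:   " + elements(coll, "in-scope").size()
                + " vs " + elements(coll, "inscope").size());

        // Substring matching over whole elements, for comparison with
        // the contains() results quoted in the message.
        System.out.println("contains '-':       " + containsByElements(coll, "in-scope", "-"));
        System.out.println("contains 'inscope': " + containsByElements(coll, "in-scope", "inscope"));
    }
}
```

Because compare() honours the strength setting while the element-level
match looks at whole collation elements, the two can disagree for the same
pair of strings, which is the "A eq B but not contains(A,B)" effect noted
in the message.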

========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
Oracle Corporation        Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive      Standards email: jim dot melton at acm dot org
Sandy, UT 84093-1063              Personal email: jim at melton dot name
USA                                                Fax : +1.801.942.3345
========================================================================
=  Facts are facts.  However, any opinions expressed are the opinions  =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
======================================================================== 

Received on Friday, 9 April 2004 16:37:58 UTC