W3C home > Mailing lists > Public > public-qt-comments@w3.org > April 2004

Re: [F&O] IBM-FO-104: Description of substring matching should account for ignorable collations units

From: Jim Melton <jim.melton@acm.org>
Date: Fri, 09 Apr 2004 14:32:02 -0600
Message-Id: <>
To: "Michael Kay" <mhk@mhk.me.uk>
Cc: <public-qt-comments@w3.org>

Thanks to Mike for this excellent analysis and experiment.  I (speaking as 
an individual who cares deeply about internationalization and collations) 
concur with his conclusions and his reasoning.


At 12:38 PM 4/9/2004 Friday, Michael Kay wrote:

>In message:
>Henry Zongaro raised the question of ignorable collation units and their
>effect on functions such as contains() and substring-before().
>I've been doing a bit of investigation as to what Java does.
>Using the collation (which maps in a fairly obvious way to a Java
>let $coll := "http://saxon.sf.net/collation?lang=en;strength=primary"
>I get
>compare("in-scope", "inscope", $coll) = 0
>so it appears this is a collation in which hyphen is "ignorable".
>But it turns out that Java is actually generating 8 collation units for the
>first string, and only 7 for the second. It is treating the strings as equal
>because (I think) the difference between "-" and "" is a tertiary
>difference, and tertiary differences are ignored when the collation strength
>is primary.
>Using the same collation, I get:
>contains("in-scope", "-", $coll) = true
>contains("in-scope", "inscope", $coll) = false
>substring-before("in-scope", "-") = "in"
>substring-after("in-scope", "-") = "scope"
>So as far as I can see, Java side-steps the problem in the Unicode algorithm
>that the comment refers to. The hyphen is not really an "ignorable"
>character at all, it generates a collation unit which is ignored at certain
>levels. Therefore, the fact that hyphen is ignored in equality testing at a
>certain level does not affect the results of the contains() function and its
>friends, which produce the expected results.
>Having established that Java has no problem handling ignorables here, I'm
>not sure what our specs need to say about the situation. I think it's a
>non-problem and we should avoid mentioning it. (It's interesting, though,
>that A eq B can be true, when contains(A,B) is false, under the same
>Michael Kay
>(personal contribution)

Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
Oracle Corporation        Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive      Standards email: jim dot melton at acm dot org
Sandy, UT 84093-1063              Personal email: jim at melton dot name
USA                                                Fax : +1.801.942.3345
=  Facts are facts.  However, any opinions expressed are the opinions  =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
Received on Friday, 9 April 2004 16:37:58 UTC

This archive was generated by hypermail 2.4.0 : Friday, 17 January 2020 16:56:56 UTC