Re: [F&O] IBM-FO-104: Description of substring matching should account for ignorable collation units

In message:
http://lists.w3.org/Archives/Public/public-qt-comments/2004Feb/0972.html

Henry Zongaro raised the question of ignorable collation units and their
effect on functions such as contains() and substring-before().

I've been doing a bit of investigation into what Java does.

Using the collation (which maps in a fairly obvious way to a Java
comparator)

let $coll := "http://saxon.sf.net/collation?lang=en;strength=primary"

I get 

compare("in-scope", "inscope", $coll) = 0

so it appears this is a collation in which hyphen is "ignorable".

But it turns out that Java is actually generating 8 collation units for the
first string, and only 7 for the second. It is treating the strings as equal
because (I think) the difference between "-" and "" is a tertiary
difference, and tertiary differences are ignored when the collation strength
is primary.
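
For anyone who wants to reproduce this outside Saxon, here is a minimal
sketch using the JDK classes directly. It assumes that
Collator.getInstance(Locale.ENGLISH) with strength PRIMARY is a fair
stand-in for the Saxon collation URI above, and that the JDK returns a
RuleBasedCollator (the cast fails otherwise). The class and method names
are made up for illustration.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

class CollationProbe {
    // Assumption: the JDK's English collator at PRIMARY strength approximates
    // "http://saxon.sf.net/collation?lang=en;strength=primary".
    static RuleBasedCollator primaryEnglish() {
        // getInstance returns a fresh copy, so changing its strength is safe.
        RuleBasedCollator c = (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        c.setStrength(Collator.PRIMARY);
        return c;
    }

    // Counts the collation elements the collator generates for a string.
    static int countElements(RuleBasedCollator c, String s) {
        CollationElementIterator it = c.getCollationElementIterator(s);
        int n = 0;
        while (it.next() != CollationElementIterator.NULLORDER) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        RuleBasedCollator c = primaryEnglish();
        System.out.println("compare:  " + c.compare("in-scope", "inscope"));
        System.out.println("in-scope: " + countElements(c, "in-scope") + " elements");
        System.out.println("inscope:  " + countElements(c, "inscope") + " elements");
    }
}
```

The element count does not depend on the strength setting; strength only
affects which parts of each element compare() takes into account.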

Using the same collation, I get:

contains("in-scope", "-", $coll) = true
contains("in-scope", "inscope", $coll) = false
substring-before("in-scope", "-", $coll) = "in"
substring-after("in-scope", "-", $coll) = "scope"

So as far as I can see, Java side-steps the problem in the Unicode algorithm
that the comment refers to. The hyphen is not really an "ignorable"
character at all; it generates a collation unit that is ignored at certain
levels. Therefore, the fact that hyphen is ignored in equality testing at a
certain level does not affect the results of the contains() function and its
friends, which produce the expected results.
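
One way to see why these results fall out is to treat contains() as a
search for the collation elements of the second string as a contiguous run
within the elements of the first. The sketch below does exactly that. It is
an illustration under my own assumptions, not a description of how Saxon
implements contains(): it compares whole elements regardless of strength,
and the class and method names are hypothetical.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class ElementContains {
    // Assumption: the JDK's English collator is a RuleBasedCollator.
    static RuleBasedCollator english() {
        return (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
    }

    // Returns the collation elements generated for a string, in order.
    static List<Integer> elements(RuleBasedCollator c, String s) {
        List<Integer> out = new ArrayList<>();
        CollationElementIterator it = c.getCollationElementIterator(s);
        for (int e = it.next(); e != CollationElementIterator.NULLORDER; e = it.next()) {
            out.add(e);
        }
        return out;
    }

    // True if the needle's elements occur as a contiguous run in the haystack's.
    // An empty needle matches at position 0.
    static boolean contains(RuleBasedCollator c, String haystack, String needle) {
        return Collections.indexOfSubList(elements(c, haystack), elements(c, needle)) >= 0;
    }

    public static void main(String[] args) {
        RuleBasedCollator c = english();
        System.out.println(contains(c, "in-scope", "-"));
        System.out.println(contains(c, "in-scope", "inscope"));
    }
}
```

Because the hyphen contributes an element of its own, it can be found as a
substring, and it also interrupts the run that "inscope" would otherwise
match.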

Having established that Java has no problem handling ignorables here, I'm
not sure what our specs need to say about the situation. I think it's a
non-problem and we should avoid mentioning it. (It's interesting, though,
that under the same collation A eq B can be true while contains(A, B) is
false.)

Michael Kay
(personal contribution)

Received on Friday, 9 April 2004 14:38:19 UTC