RE: F&O comments: collations, code points, and comparisons from Kay, Michael on 2002-12-11 (public-qt-comments@w3.org from December 2002)

From: Kay, Michael <Michael.Kay@softwareag.com>
Date: Wed, 11 Dec 2002 12:01:21 +0100
To: xquery@attbi.com, public-qt-comments@w3.org
Cc: mrys@microsoft.com
Message-ID: <DFF2AC9E3583D511A21F0008C7E621060453DE9A@daemsg02.software-ag.de>

Personal replies to some of your points...

> 
> Mostly editorial comments on the F&O Nov 15 draft (these also 
> still apply to the internal Dec 10 draft; section numbers 
> refer to the Dec 10 draft for your convenience).
> 
> 
> - 6.3.1: The definition of compare() explains what happens 
> when one string differs in length from the other; but this 
> should be up to the collation.

I've made this point in the past, and I agree with it. I think we have now
established that functions like contains() and starts-with() do need a
collation that has this property (described in the last NOTE in section
6.3), but functions that purely compare for equality and ordering do not.
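
As a rough illustration of that distinction, the Python sketch below models a collation as a hypothetical sort-key function. The NOTE in section 6.3 is not reproduced here, so the property it requires is only approximated. Equality and ordering need just the keys of the two whole strings, whereas a starts-with test has to apply the collation to substrings. The names casefold_key, compare, and starts_with are invented for this example and are not the F&O definitions.

```python
def casefold_key(s: str) -> str:
    """Hypothetical collation: a sort key that ignores case."""
    return s.casefold()


def compare(a: str, b: str, key=casefold_key) -> int:
    """Return -1, 0, or 1 using only the keys of the two whole strings."""
    ka, kb = key(a), key(b)
    return (ka > kb) - (ka < kb)


def starts_with(a: str, b: str, key=casefold_key) -> bool:
    """True if some prefix of a equals b under the collation.

    Unlike compare(), this applies the collation to substrings of a,
    which is an extra demand on the collation.
    """
    target = key(b)
    return any(key(a[:i]) == target for i in range(len(a) + 1))


print(compare("Apple", "apple"))            # 0: equal under this collation
print(starts_with("Hello world", "HELLO"))  # True
```

A collation that cannot be applied sensibly to substrings could still support compare() here, but not starts_with().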

> 
> - 6.4.6, 6.4.7, 6.4.14: Surrogate pairs are irrelevant.  
> You've already defined things in terms of code points -- so 
> the underlying bytes (and therefore, surrogate pairs) never 
> come into play.

Technically, you are correct that this note is redundant. However, since so
many other programming languages that claim to have Unicode support actually
treat a char as a 16-bit code unit rather than a Unicode character, I think
it's important that we make this point. Some XSLT 1.0 implementations are
non-compliant in this area and it's very useful to have a definitive
statement in the spec that proves they are wrong.
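
The difference between counting code points and counting 16-bit code units can be sketched in Python, whose str type is indexed by code point. The helper names below are invented for this illustration; the UTF-16 count is obtained by encoding the string, which is roughly what a language with a 16-bit char type would report as the length.

```python
def code_point_length(s: str) -> int:
    """Length in Unicode code points (Python's len() on str)."""
    return len(s)


def utf16_code_unit_length(s: str) -> int:
    """Length in 16-bit code units, as seen by a UTF-16 based string type."""
    return len(s.encode("utf-16-le")) // 2


# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane,
# so UTF-16 needs a surrogate pair (two code units) to represent it.
clef = "\U0001D11E"

print(code_point_length(clef))       # 1
print(utf16_code_unit_length(clef))  # 2
```

A function defined in terms of code points should report a length of 1 for this string; an implementation that answers 2 is counting code units.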

>
> - 9.2.1, 10.2.1, 12.1.1: should all compare according to the 
> context collation

9.2.1: QNames should NOT be compared using a collation; they should be
compared using Unicode code points, as described in the XML 1.0 (or perhaps
XML 1.1) specification.
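
A minimal sketch of what code point comparison means in practice, assuming a QName is modelled as a (namespace URI, local name) pair. The QName class and qname_equal function are hypothetical names for this example. Python compares str values code point by code point, so no normalization or case folding is applied.

```python
import unicodedata
from typing import NamedTuple


class QName(NamedTuple):
    """Hypothetical model of a QName as an expanded name."""
    namespace_uri: str
    local_name: str


def qname_equal(a: QName, b: QName) -> bool:
    """Equal only if both parts match code point for code point."""
    return a == b


ns = "http://example.com/ns"  # placeholder namespace for the example
precomposed = QName(ns, "caf\u00e9")  # e-acute as a single code point
decomposed = QName(ns, "cafe\u0301")  # 'e' followed by a combining acute

print(qname_equal(precomposed, decomposed))  # False
print(unicodedata.normalize("NFC", decomposed.local_name)
      == precomposed.local_name)             # True
```

A natural-language collation might well treat the two spellings as equal; code point comparison does not.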

10.2.1: There is still some debate about exactly how anyURIs should be
compared, for example how escapes are handled. We're monitoring the
discussion on this in the W3C TAG. However, URIs are not natural language
text and it certainly doesn't make sense to use the same algorithm as when
comparing strings.
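
To show why the handling of escapes matters, the sketch below contrasts two candidate rules for URI equality. Neither is presented as what the specification requires; the function names and the example URIs are assumptions made for illustration only.

```python
from urllib.parse import unquote


def code_point_equal(a: str, b: str) -> bool:
    """Candidate rule 1: compare the URI strings exactly as written."""
    return a == b


def unescaped_equal(a: str, b: str) -> bool:
    """Candidate rule 2: decode %HH escapes first, then compare."""
    return unquote(a) == unquote(b)


escaped = "http://example.com/%7Euser"
literal = "http://example.com/~user"

print(code_point_equal(escaped, literal))  # False
print(unescaped_equal(escaped, literal))   # True
```

The two rules disagree on this pair, which is the kind of question still open in the discussion mentioned above.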

12.1.1: NOTATION is an XML concept (and a pretty obscure one at that) and we
should follow the XML rules for comparison, which are based on code point
comparison. 

>
> - 6.3, etc.: As Jeni Tennison already brought up [1], URIs as 
> collation names are unusual (and not even followed by the 
> draft itself).  Although the idea has merit for WS-I, almost 
> every collation implementation I can find uses RFC 1766 
> (locale names like en-US and fr-FR).  Perhaps some 
> implementations will invent a URI syntax for their 
> collations, but I expect most Java and .NET implementations 
> will rely on java.text.RuleBasedCollator and 
> System.Globalization.CultureInfo, both of which are based on 
> RFC 1766.  If you're going to insist on URIs, then at least 
> make the draft examples consistent with that.

We've been through a few rounds on this and no-one has come up with a
satisfactory alternative. Locale names do not identify collations, they only
identify communities that may have preferences for a particular collation.
Within a locale such as en-GB, you will find that lexicographers,
geographers, and compilers of telephone directories use completely different
collations. So a locale name can only be a hint.
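
The point can be sketched with two simplified filing conventions that could both plausibly be used within a single locale. The key functions, the table name, and the labels "word-by-word" and "letter-by-letter" are hypothetical, and real collations are far more elaborate than these.

```python
def word_by_word_key(s: str) -> str:
    """Spaces are significant, so a shorter first word files first."""
    return s.casefold()


def letter_by_letter_key(s: str) -> str:
    """Spaces and punctuation are ignored when ordering."""
    return "".join(ch for ch in s.casefold() if ch.isalnum())


# One locale, more than one collation: the locale name alone does not
# say which of these the user wants.
COLLATIONS_FOR_LOCALE = {
    "en-GB": {
        "word-by-word": word_by_word_key,
        "letter-by-letter": letter_by_letter_key,
    },
}

names = ["Newark", "New York"]
for label, key in COLLATIONS_FOR_LOCALE["en-GB"].items():
    print(label, sorted(names, key=key))
```

The two conventions put the same pair of names in opposite orders, so a bare "en-GB" cannot determine the result.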

I think that all the examples do use valid anyURI values (or at least,
strings that can be cast to anyURI). The big question in this area is issue
44, which asks about the meaning of relative URIs and suggests that we
should require the anyURI to be absolute.

For use in XSLT, it would be much more consistent with existing practice to
use a QName, but it would be difficult to define a meaning for this outside
the context of a stylesheet.

I think the biggest problem we face in this area is how to achieve a level
of interoperability. I hope that vendors will provide mechanisms that allow
the URI used in a query to be defined by the user and mapped to some
collation offered by the implementation - see the way saxon:collation works
in Saxon 7.x for an example of how this might be done.
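
One possible shape for such a mechanism is a user-populated table from URI to collation. The sketch below is an assumption for illustration and does not describe how saxon:collation actually works; CollationRegistry, its methods, and the example.com URI are all invented names.

```python
from typing import Callable, Dict

SortKey = Callable[[str], str]


class CollationRegistry:
    """Hypothetical mapping from collation URIs to sort-key functions."""

    def __init__(self) -> None:
        self._by_uri: Dict[str, SortKey] = {}

    def register(self, uri: str, key: SortKey) -> None:
        """Bind a user-chosen URI to a collation the implementation offers."""
        self._by_uri[uri] = key

    def resolve(self, uri: str) -> SortKey:
        """Look up a collation; unknown URIs are an error, not a guess."""
        try:
            return self._by_uri[uri]
        except KeyError:
            raise KeyError(f"no collation registered for {uri!r}") from None


registry = CollationRegistry()
caseless_uri = "http://example.com/collation/caseless"  # placeholder URI
registry.register(caseless_uri, str.casefold)

print(sorted(["b", "A", "c"], key=registry.resolve(caseless_uri)))
```

Because the query refers only to the URI, the same query could run on another implementation once the user binds that URI to an equivalent collation there.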

> 
> - Speaking of Jeni's prior feedback, I'd like to echo the 
> request for title-case().  

My Personal View Is That Title Case Is Used Only In North America, and we
are trying to restrict ourselves to functions that have global appeal. But
in the end, deciding whether to include or exclude particular functions is a
matter of judgement.

Michael Kay
Received on Wednesday, 11 December 2002 06:01:28 UTC