RE: PLease define 'collation'

From: Jim Melton <jim.melton@acm.org>
Date: Wed, 09 Jun 2004 18:27:36 -0600
Message-Id: <6.0.0.22.2.20040609180312.03466540@gmstimap.oraclecorp.com>
To: Igor Hersht <igorh@ca.ibm.com>
Cc: "Michael Kay" <mhk@mhk.me.uk>, ashokmalhotra@alum.mit.edu, public-qt-comments@w3.org, Stephen.Buxton@oracle.com

Gentlepeople,

My apologies for entering this discussion rather late, even though it is 
one of my favorite topics; I've been out of the country on business and am 
just now catching up on email.

The most pithy definition of "collation" that I can devise would read 
something like this:

    collation: A specification of the manner in which character strings are
    compared and, by extension, ordered.

That definition says absolutely nothing about the technology used to 
perform the comparisons/orderings, nor about how to specify a collation in 
any particular context.  I think those omissions are a strength of the 
definition.  Such a definition does not preclude collations based on the 
Unicode Collation Algorithm (UCA), proprietary mechanisms, or even 
so-called "phone book" collations.  I, in agreement with the XML Query WG 
(and, presumably, the XSL WG), would oppose any definition that might 
preclude some collation that our implementations might use or that our 
customers might demand.
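
To illustrate how little the terse definition commits to, here is a minimal sketch in Python that treats a collation as nothing more than a comparison function. The type alias Collation and the two example collations are hypothetical names invented for this sketch; neither is the UCA or any vendor's mechanism, and a "phone book" collation would fit the same shape.

```python
from functools import cmp_to_key
from typing import Callable, Iterable

# Assumption for this sketch: a collation is any function that compares two
# strings and returns a negative, zero, or positive integer.
Collation = Callable[[str, str], int]


def codepoint_collation(a: str, b: str) -> int:
    """Hypothetical collation: compare by Unicode code point."""
    return (a > b) - (a < b)


def caseless_collation(a: str, b: str) -> int:
    """Hypothetical collation: compare after case folding."""
    fa, fb = a.casefold(), b.casefold()
    return (fa > fb) - (fa < fb)


def order(strings: Iterable[str], collation: Collation) -> list[str]:
    """Ordering follows "by extension" from the comparison."""
    return sorted(strings, key=cmp_to_key(collation))


if __name__ == "__main__":
    words = ["b", "A", "a", "B"]
    print(order(words, codepoint_collation))
    print(order(words, caseless_collation))
```

Nothing in order() depends on how a given collation reaches its answer, which is the point of leaving the technology out of the definition.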

Igor, you have raised some interesting points, but I don't think that we 
are in any disagreement about the goals.  Nonetheless, I think that I do 
disagree with your statement that the UCA "cannot be implemented correctly 
when you compare or match just parts of the strings (represented by the 
collation units)".  Perhaps it's a matter of interpretation, because I 
believe that such comparisons can be done, but (as I think you said) the 
collation units for the entire set of strings must be computed in order for 
them to be done.  You argue that this cannot be implemented in a reasonable 
period of time, but others may well disagree (indeed, some may already have 
implemented such facilities), so this is not a useful argument against such 
a requirement.  As Mike Kay said, "Whether a real collation actually 
operates in this way is irrelevant, it only needs to produce the same 
results as if it did so".

I especially disagree with your statement that "Just anoter example from 
the Unicode specs which theoretically cannot be implemented (for contains 
or any other collation function) using just collation elements".  I am 
convinced, after inspecting UTR #10 and spending a bit of time thinking 
about this, that matching such as that required by fn:contains() can 
readily be implemented using just collation elements.  It might (or might 
not) be claimed that doing so would be time-consuming or perhaps 
inefficient, but "theoretically cannot be implemented" is very hard to 
swallow.  That seems tantamount to a claim that the UCA cannot be 
implemented ("for...any other collation function"), even in theory, which 
is patently absurd.  Surely fn:compare($arg1, $arg2, $collation) is 
"any...collation function".  Do you really mean to imply that it is 
theoretically impossible to implement that function?
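
The claim about fn:contains() can be sketched as follows: map both strings to sequences of collation elements, then look for one sequence as a contiguous run inside the other. This is only an illustration under loud assumptions. The function collation_elements below is a toy stand-in (decompose, drop combining marks, case-fold), not the mapping defined by UTR #10, and it ignores contractions, expansions, and multiple weight levels, which are exactly the cases under discussion in this thread.

```python
import unicodedata


def collation_elements(s: str) -> list[str]:
    """Toy stand-in for a collation element mapping (NOT the UCA).

    Decomposes the string, drops combining marks, and case-folds what is
    left, so that accents and case are ignored in comparisons.
    """
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return list(base.casefold())


def contains(haystack: str, needle: str) -> bool:
    """Return True if the needle's elements occur as a contiguous run
    within the haystack's elements."""
    h = collation_elements(haystack)
    n = collation_elements(needle)
    if not n:
        return True  # an empty needle matches anywhere
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))


if __name__ == "__main__":
    print(contains("Résumé writing", "resume"))
    print(contains("abc", "bd"))
```

A real implementation need not work this way internally; it only needs to produce the same results as if it did.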

Mike also asked what assumptions a system can make about a collation,
such as transitivity, symmetry, etc.  This is an area fraught with peril, 
except when the nature of the collation is generally known.  For example, I 
share your belief that collations based on the UCA are transitive, 
symmetrical, etc., but I can easily imagine other collations that do not 
share all of, perhaps any of, those properties.  That makes it dangerous 
for a system to make universal assumptions.  Of course, a partial solution 
(which I think I could support) to this problem is to say that the results 
of collations that do not support those properties are implementation-defined.
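
As a rough illustration of why universal assumptions are risky, the sketch below tests a comparison function for symmetry and transitivity over a finite sample of strings. The names check_properties and cyclic_collation are hypothetical, and the cyclic collation is deliberately contrived; passing the check on a sample proves nothing about all strings, while failing it is conclusive.

```python
from itertools import product
from typing import Callable, Sequence


def sign(n: int) -> int:
    return (n > 0) - (n < 0)


def check_properties(compare: Callable[[str, str], int],
                     samples: Sequence[str]) -> dict[str, bool]:
    """Check symmetry and transitivity of compare over the given samples."""
    # Symmetry here means: swapping the arguments flips the sign.
    symmetric = all(
        sign(compare(a, b)) == -sign(compare(b, a))
        for a, b in product(samples, repeat=2)
    )
    # Transitivity: a <= b and b <= c must imply a <= c.
    transitive = all(
        compare(a, c) <= 0
        for a, b, c in product(samples, repeat=3)
        if compare(a, b) <= 0 and compare(b, c) <= 0
    )
    return {"symmetric": symmetric, "transitive": transitive}


# Contrived, hypothetical collation whose "less than" relation is a cycle.
_LESS = {("rock", "paper"), ("paper", "scissors"), ("scissors", "rock")}


def cyclic_collation(a: str, b: str) -> int:
    if a == b:
        return 0
    return -1 if (a, b) in _LESS else 1


if __name__ == "__main__":
    print(check_properties(cyclic_collation, ["rock", "paper", "scissors"]))
```

A sort driven by such a collation has no well-defined result, which is the situation the implementation-defined escape hatch would cover.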

(Mike, with respect, I am troubled by your lengthy, almost algorithmic, 
definition of a collation, in part because it seems to presume something 
very like the UCA, but also in part because I see no need for such detailed 
semantics to be included in the definition.  I strongly prefer my much more 
terse and general definition.  For similar reasons, I am uncomfortable with 
Ashok's proposed definition; again, it goes into too much detail and I 
don't think we need to provide a tutorial on the possible behaviors that 
collations can be built to provide.)

Hope this helps,
    Jim

========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
Oracle Corporation        Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive      Standards email: jim dot melton at acm dot org
Sandy, UT 84093-1063              Personal email: jim at melton dot name
USA                                                Fax : +1.801.942.3345
========================================================================
=  Facts are facts.  However, any opinions expressed are the opinions  =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
======================================================================== 
Received on Wednesday, 9 June 2004 20:29:38 UTC
