
RE: PLease define 'collation'

From: Michael Rys <mrys@microsoft.com>
Date: Wed, 9 Jun 2004 17:32:41 -0700
Message-ID: <EB0A327048144442AFB15FCE18DC96C703200B46@RED-MSG-31.redmond.corp.microsoft.com>
To: "Jim Melton" <jim.melton@acm.org>, "Igor Hersht" <igorh@ca.ibm.com>
Cc: "Michael Kay" <mhk@mhk.me.uk>, <ashokmalhotra@alum.mit.edu>, <public-qt-comments@w3.org>, <Stephen.Buxton@oracle.com>
I like your definition.


Best regards



PS: Jim, what happened to you in China? We seem to start agreeing too
much :-)



From: public-qt-comments-request@w3.org
[mailto:public-qt-comments-request@w3.org] On Behalf Of Jim Melton
Sent: Wednesday, June 09, 2004 5:28 PM
To: Igor Hersht
Cc: Michael Kay; ashokmalhotra@alum.mit.edu; public-qt-comments@w3.org;
Subject: RE: PLease define 'collation'



My apologies for entering this discussion rather late, even though it is
one of my favorite topics; I've been out of the country on business and
am just now catching up on email. 

The most pithy definition of "collation" that I can devise would read
something like this: 

collation: A specification of the manner in which character strings are
compared and, by extension, ordered. 

That definition says absolutely nothing about the technology used to
perform the comparisons/orderings, nor about how to specify a collation
in any particular context.  I think those omissions are a strength of
the definition.  Such a definition does not preclude collations based on
the Unicode Collation Algorithm (UCA), proprietary mechanisms, or even
so-called "phone book" collations.  I, in agreement with the XML Query
WG (and, presumably, the XSL WG), would oppose any definition that might
preclude some collation that our implementations might use or that our
customers might demand. 
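
To make the definition concrete, here is a minimal sketch in Python that
treats a collation as nothing more than a comparison function over
strings, with ordering derived from it.  The names codepoint_collation,
caseless_collation and sort_with are hypothetical and chosen only for
illustration; neither function is meant to stand for the UCA or for any
real product's collation.

```python
from functools import cmp_to_key


def codepoint_collation(a: str, b: str) -> int:
    """Hypothetical collation: compare by Unicode code point order."""
    return (a > b) - (a < b)


def caseless_collation(a: str, b: str) -> int:
    """Hypothetical collation: compare after case folding."""
    x, y = a.casefold(), b.casefold()
    return (x > y) - (x < y)


def sort_with(collation, strings):
    """Ordering follows from comparison: sort using the given collation."""
    return sorted(strings, key=cmp_to_key(collation))


words = ["banana", "apple", "Cherry"]
by_codepoint = sort_with(codepoint_collation, words)
by_caseless = sort_with(caseless_collation, words)
print(by_codepoint)  # ['Cherry', 'apple', 'banana']
print(by_caseless)   # ['apple', 'banana', 'Cherry']
```

The same list comes out in a different order under each collation, and
the sorting code needs to know nothing about how either comparison is
carried out.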

Igor, you have raised some interesting points, but I don't think that we
are in any disagreement about the goals.  Nonetheless, I think that I do
disagree with your statement that the UCA "cannot be implemented
correctly when you compare or match just parts of the strings
(represented by the collation units)".  Perhaps it's a matter of
interpretation, because I believe that such comparisons can be done, but
(as I think you said) the collation units for the entire set of strings
must be computed in order for them to be done.  You argue that this
cannot be implemented in a reasonable period of time, but others may
well disagree (indeed, some may already have implemented such
facilities), so this is not a useful argument against such a
requirement.  As Mike Kay said, "Whether a real collation actually
operates in this way is irrelevant, it only needs to produce the same
results as if it did so".

I especially disagree with your statement that "Just anoter example from
the Unicode specs which theoretically cannot be implemented (for
contains or any other collation function) using just collation
elements".  I am convinced, after inspecting UTR #10 and spending a bit
of time thinking about this, that matching such as that required by
fn:contains() can readily be implemented using just collation elements.
It might (or might not) be claimed that doing so would be time-consuming
or perhaps inefficient, but "theoretically cannot be implemented" is
very hard to swallow.  That seems tantamount to a claim that the UCA
cannot be implemented ("for...any other collation function"), even in
theory, which is patently absurd.  Surely fn:compare($arg1, $arg2,
$collation) is "any...collation function".  Do you really mean to imply
that it is theoretically impossible to implement that function? 
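
As a rough illustration of matching over collation elements alone, the
Python sketch below maps each string to a sequence of elements and then
looks for the needle's sequence as a contiguous run inside the
haystack's.  The mapping toy_elements is a hypothetical stand-in (one
case-folded element per character, hyphens treated as ignorable); it is
far simpler than the element tables of UTR #10 and ignores the finer
points of real UCA matching, so read it as a sketch under those
assumptions rather than as an implementation of fn:contains().

```python
def toy_elements(s: str) -> list:
    """Hypothetical element mapping: case-folded characters, hyphens ignored."""
    return [ch.casefold() for ch in s if ch != "-"]


def contains(haystack: str, needle: str, elements=toy_elements) -> bool:
    """True if the needle's elements occur contiguously in the haystack's."""
    h = elements(haystack)
    n = elements(needle)
    if not n:
        # A needle with no elements matches trivially.
        return True
    # Compare every window of len(n) elements against the needle.
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))


print(contains("Co-operate", "COOP"))  # True
print(contains("collation", "LAT"))    # True
print(contains("collation", "late"))   # False
```

The search is quadratic in the worst case, which bears on the efficiency
argument but not on whether the function can be implemented at all.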

Mike also asked what assumptions a system can make about a collation,
such as transitivity, symmetry, etc.  This is an area fraught
with peril, except when the nature of the collation is generally known.
For example, I share your belief that collations based on the UCA are
transitive, symmetrical, etc., but I can easily imagine other collations
that do not share all of, perhaps any of, those properties.  That makes
it dangerous for a system to make universal assumptions.  Of course, a
partial solution (which I think I could support) to this problem is to
say that the results of collations that do not support those properties
are implementation-defined. 
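
The danger can be shown with a small Python sketch that tests a
comparison function for these properties over a finite sample of
strings.  Everything here is an assumption made for illustration: the
helper names are hypothetical, "symmetry" is read as compare(a, b) being
the negation of compare(b, a), and fuzzy_length is an invented collation
that treats strings whose lengths differ by at most one as equal.
Passing on a sample proves nothing in general; a failure, however, is a
genuine counterexample.

```python
from itertools import product


def sign(n: int) -> int:
    return (n > 0) - (n < 0)


def is_antisymmetric(cmp, sample) -> bool:
    """Check that cmp(a, b) is always the negation of cmp(b, a)."""
    return all(sign(cmp(a, b)) == -sign(cmp(b, a))
               for a, b in product(sample, repeat=2))


def is_transitive(cmp, sample) -> bool:
    """Check that a <= b and b <= c never coincide with a > c."""
    for a, b, c in product(sample, repeat=3):
        if cmp(a, b) <= 0 and cmp(b, c) <= 0 and cmp(a, c) > 0:
            return False
    return True


def codepoint(a: str, b: str) -> int:
    """Hypothetical collation: plain code point comparison."""
    return (a > b) - (a < b)


def fuzzy_length(a: str, b: str) -> int:
    """Hypothetical collation: lengths within 1 of each other are equal."""
    d = len(a) - len(b)
    return 0 if abs(d) <= 1 else sign(d)


sample = ["a", "bb", "ccc"]
print(is_transitive(codepoint, sample))        # True
print(is_transitive(fuzzy_length, sample))     # False
print(is_antisymmetric(fuzzy_length, sample))  # True
```

Under fuzzy_length, "a" equals "bb" and "bb" equals "ccc", yet "a" sorts
before "ccc", so a system that assumed transitivity would give
inconsistent results with it.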

(Mike, with respect, I am troubled by your lengthy, almost algorithmic,
definition of a collation, in part because it seems to presume something
very like the UCA, but also in part because I see no need for such
detailed semantics to be included in the definition.  I strongly prefer
my much more terse and general definition.  For similar reasons, I am
uncomfortable with Ashok's proposed definition; again, it goes into too
much detail and I don't think we need to provide a tutorial on the
possible behaviors that collations can be built to provide.)

Hope this helps,

Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
Oracle Corporation        Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive      Standards email: jim dot melton at acm dot org
Sandy, UT 84093-1063              Personal email: jim at melton dot name
USA                                                Fax : +1.801.942.3345
=  Facts are facts.  However, any opinions expressed are the opinions  =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
Received on Wednesday, 9 June 2004 20:33:09 UTC
