RE: PLease define 'collation' from Michael Kay on 2004-06-10 (public-qt-comments@w3.org from June 2004)

From: Michael Kay <mhk@mhk.me.uk>
Date: Thu, 10 Jun 2004 09:12:12 +0100
To: "'Jim Melton'" <jim.melton@acm.org>, "'Igor Hersht'" <igorh@ca.ibm.com>
Cc: <ashokmalhotra@alum.mit.edu>, <public-qt-comments@w3.org>, <Stephen.Buxton@oracle.com>
Message-Id: <20040610081252.3A0CBA09DA@frink.w3.org>
I don't think my attempt at a definition is necessarily the last word on the
matter.
 
Let's try to put this into context. We've made a lot of attempts in the
language design to achieve things like making "eq" transitive, so that
optimizers can do their stuff. If we want to do optimizations like using an
index to support a "starts-with" query, then we need to make some
assumptions about what a collation can and cannot do: a collation that
returned true for the expression startswith("A", "Z", "collation-uri") would
make any such optimizations fraught, to say the least.
 
In my product I allow people to create user-defined collations as
implementations of the class java.text.Collator. That class definition in
itself imposes almost no constraints. I need to tell people "Saxon is making
these assumptions about collations, it's your responsibility if you
implement your own collation to make sure that it satisfies these
constraints". For example, an antistrophic collation (one that sorts strings
in alphabetical order of the reversed string - very useful for crosswords)
would satisfy the ordering rules, but what should "starts-with" mean under
such a collation? My attempt at a definition gives one possible answer to
that question. 
 
Perhaps this is an implementation problem: I can make any rules I like for
what kind of collation my implementation will handle. But I feel it's more
than that. It's not enough for us to specify that starts-with(A, B, C) means
whatever the collation C says it means - which is essentially what your
definition is doing. If that were the case, we might as well not bother
having a standard starts-with function. I think for example that users are
entitled to assume that if starts-with(A, B, C) is true then contains(A, B,
C) is also true, for any collation. I'm searching for a way of expressing
that expectation, and others like it, in a more rigorous definition of
"collation".
 
I had assumed that Oracle's comment asking for a definition of "collation"
was complaining that our current definition is too woolly.
 
Michael Kay


  _____  

From: Jim Melton [mailto:jim.melton@acm.org] 
Sent: 10 June 2004 00:28
To: Igor Hersht
Cc: Michael Kay; ashokmalhotra@alum.mit.edu; public-qt-comments@w3.org;
Stephen.Buxton@oracle.com
Subject: RE: PLease define 'collation'


Gentlepeople,

My apologies for entering this discussion rather late, even though it is one
of my favorite topics; I've been out of the country on business and am just
now catching up on email. 

The most pithy definition of "collation" that I can devise would read
something like this: 

collation: A specification of the manner in which character strings are
compared and, by extension, ordered. 

That definition says absolutely nothing about the technology used to perform
the comparisons/orderings, nor about how to specify a collation in any
particular context.  I think those omissions are a strength of the
definition.  Such a definition does not preclude collations based on the
Unicode Collation Algorithm (UCA), proprietary mechanisms, or even so-called
"phone book" collations.  I, in agreement with the XML Query WG (and,
presumably, the XSL WG), would oppose any definition that might preclude
some collation that our implementations might use or that our customers
might demand. 

Igor, you have raised some interesting points, but I don't think that we are
in any disagreement about the goals.  Nonetheless, I think that I do
disagree with your statement that the UCA "cannot be implemented correctly
when you compare or match just parts of the strings (represented by the
collation units)".  Perhaps it's a matter of interpretation, because I
believe that such comparisons can be done, but (as I think you said) the
collation units for the entire set of strings must be computed in order for
them to be done.  You argue that this cannot be implemented in a reasonable
period of time, but others may well disagree (indeed, some may already have
implemented such facilities), so this is not a useful argument against such
a requirement.  As Mike Kay said, "Whether a real collation actually
operates in this way is irrelevant, it only needs to produce the same
results as if it did so".

I especially disagree with your statement that "Just anoter example from the
Unicode specs which theoretically cannot be implemented (for contains or any
other collation function) using just collation elements".  I am convinced,
after inspecting UTR #10 and spending a bit of time thinking about this,
that matching such as that required by fn:contains() can readily be
implemented using just collation elements.  It might (or might not) be
claimed that doing so would be time-consuming or perhaps inefficient, but
"theoretically cannot be implemented" is very hard to swallow.  That seems
tantamount to a claim that the UCA cannot be implemented ("for...any other
collation function"), even in theory, which is patently absurd.  Surely
fn:compare($arg1, $arg2, $collation) is "any...collation function".  Do you
really mean to imply that it is theoretically impossible to implement that
function? 

Mike also asked about what assumptions can a system make about a collation,
such as transitivity, symmetry, etc.  This is an area fraught with peril,
except when the nature of the collation is generally known.  For example, I
share your belief that collations based on the UCA are transitive,
symmetrical, etc., but I can easily imagine other collations that do not
share all of, perhaps any of, those properties.  That makes it dangerous for
a system to make universal assumptions.  Of course, a partial solution
(which I think I could support) to this problem is to say that the results
of collations that do not support those properties is
implementation-defined. 

(Mike, with respect, I am troubled by your lengthy, almost algorithmic,
definition of a collation, in part because it seems to presume something
very like the UCA, but also in part because I see no need for such detailed
semantics to be included in the definition.  I strongly prefer my much more
terse and general definition.  For similar reasons, I am uncomfortable with
Ashok's proposed definition; again, it goes into too much detail and I don't
think we need to provide a tutorial on the possible behaviors that
collations can be built to provide.)

Hope this helps,
   Jim


========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
Oracle Corporation        Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive      Standards email: jim dot melton at acm dot org
Sandy, UT 84093-1063              Personal email: jim at melton dot name
USA                                                Fax : +1.801.942.3345
========================================================================
=  Facts are facts.  However, any opinions expressed are the opinions  =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
========================================================================
Received on Thursday, 10 June 2004 04:12:52 UTC