RE: PLease define 'collation'

> The most pithy definition of "collation" that I can devise would read
> something like this:
> collation: A specification of the manner in which character strings
> are compared and, by extension, ordered.

This is very similar to what I was trying to say.  I think that there is
one point missing here: string matching.  The string "a" could be equal
to both "a" and "a-".  Having just comparison would not give us
unambiguous matching.
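
To make the point concrete, here is a minimal sketch in Python. It assumes a toy collation in which the hyphen is ignorable; the names IGNORABLE, sort_key, equal and matching_spans are hypothetical and are not taken from any specification. It shows that equality alone leaves the extent of a match open.

```python
# Toy collation (an assumption for illustration): hyphens carry no weight.
IGNORABLE = {"-"}


def sort_key(s):
    """Drop ignorable characters; what is left is the comparison key."""
    return "".join(ch for ch in s if ch not in IGNORABLE)


def equal(a, b):
    """Two strings are equal under the toy collation if their keys agree."""
    return sort_key(a) == sort_key(b)


def matching_spans(haystack, needle):
    """Return every (start, end) slice of haystack that compares equal to needle."""
    return [
        (i, j)
        for i in range(len(haystack))
        for j in range(i + 1, len(haystack) + 1)
        if equal(haystack[i:j], needle)
    ]


print(equal("a", "a-"))            # True
print(matching_spans("a-b", "a"))  # [(0, 1), (0, 2)]
```

Both "a" and "a-" are valid matches for "a" inside "a-b", so a matching rule (shortest, longest, or something else) has to be stated in addition to the comparison.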

I would say
collation: A specification of the manner in which character strings
are compared (and, by extension, ordered) and matched.

(Maybe it could be expressed in better English.)

Defining the collation functions is a separate issue.
I think that the definitions of the functions
fn:contains, fn:starts-with, fn:ends-with,
fn:substring-before and fn:substring-after are quite obvious
in terms of string matching.


Igor Hersht
XSLT Development
IBM Canada Ltd., 8200 Warden Avenue, Markham, Ontario L6G 1C7
Office D2-260, Phone (905)413-3240 ; FAX  (905)413-4839


                                                                           
Jim Melton <jim.melton@acm.org>
06/09/2004 08:27 PM

To:      Igor Hersht/Toronto/IBM@IBMCA
cc:      "Michael Kay" <mhk@mhk.me.uk>, ashokmalhotra@alum.mit.edu,
         public-qt-comments@w3.org, Stephen.Buxton@oracle.com
Subject: RE: PLease define 'collation'




Gentlepeople,

My apologies for entering this discussion rather late, even though it is
one of my favorite topics; I've been out of the country on business and am
just now catching up on email.

The most pithy definition of "collation" that I can devise would read
something like this:
      collation: A specification of the manner in which character strings
      are compared and, by extension, ordered.
That definition says absolutely nothing about the technology used to
perform the comparisons/orderings, nor about how to specify a collation in
any particular context.  I think those omissions are a strength of the
definition.  Such a definition does not preclude collations based on the
Unicode Collation Algorithm (UCA), proprietary mechanisms, or even
so-called "phone book" collations.  I, in agreement with the XML Query WG
(and, presumably, the XSL WG), would oppose any definition that might
preclude some collation that our implementations might use or that our
customers might demand.
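
As an illustration only, the definition above can be read as "a collation is a comparison function, and ordering follows from it". The Python sketch below assumes that reading; the two collations shown (codepoint_collation and case_blind_collation) are hypothetical examples, not collations named anywhere in the specifications.

```python
from functools import cmp_to_key


def codepoint_collation(a, b):
    """Compare by Unicode code point; returns -1, 0 or 1."""
    return (a > b) - (a < b)


def case_blind_collation(a, b):
    """Compare after lowercasing, so "A" and "a" are equal."""
    return codepoint_collation(a.lower(), b.lower())


def order(strings, collation):
    """Ordering 'by extension': sort using nothing but the comparison."""
    return sorted(strings, key=cmp_to_key(collation))


words = ["b", "A", "a", "B"]
print(order(words, codepoint_collation))   # ['A', 'B', 'a', 'b']
print(order(words, case_blind_collation))  # ['A', 'a', 'b', 'B']
```

Nothing in order() depends on how a given collation arrives at its answer, which is the sense in which the definition stays neutral about technology.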

Igor, you have raised some interesting points, but I don't think that we
are in any disagreement about the goals.  Nonetheless, I think that I do
disagree with your statement that the UCA "cannot be implemented correctly
when you compare or match just parts of the strings (represented by the
collation units)".  Perhaps it's a matter of interpretation, because I
believe that such comparisons can be done, but (as I think you said) the
collation units for the entire set of strings must be computed in order for
them to be done.  You argue that this cannot be implemented in a reasonable
period of time, but others may well disagree (indeed, some may already have
implemented such facilities), so this is not a useful argument against such
a requirement.  As Mike Kay said, "Whether a real collation actually
operates in this way is irrelevant, it only needs to produce the same
results as if it did so".

I especially disagree with your statement that "Just anoter example from
the Unicode specs which theoretically cannot be implemented (for contains
or any other collation function) using just collation elements".  I am
convinced, after inspecting UTR #10 and spending a bit of time thinking
about this, that matching such as that required by fn:contains() can
readily be implemented using just collation elements.  It might (or might
not) be claimed that doing so would be time-consuming or perhaps
inefficient, but "theoretically cannot be implemented" is very hard to
swallow.  That seems tantamount to a claim that the UCA cannot be
implemented ("for...any other collation function"), even in theory, which
is patently absurd.  Surely fn:compare($arg1, $arg2, $collation) is
"any...collation function".  Do you really mean to imply that it is
theoretically impossible to implement that function?
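
A rough sketch of what "using just collation elements" could look like, in Python. The weights here are an assumption made for illustration (one primary weight per letter, shared across case, with the hyphen ignorable); they are not UCA data, and collation_elements and contains are hypothetical names rather than the fn:contains of the specification.

```python
def collation_elements(s):
    """Map a string to a list of toy collation elements.

    Assumed rules: hyphens are ignorable and produce no element;
    every other character contributes the code points of its
    lowercase form, so case differences disappear.
    """
    elements = []
    for ch in s:
        if ch == "-":
            continue
        elements.extend(ord(c) for c in ch.lower())
    return elements


def contains(haystack, needle):
    """True if needle's elements occur contiguously in haystack's elements."""
    h = collation_elements(haystack)
    n = collation_elements(needle)
    if not n:
        return True  # a needle with no elements matches trivially
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))


print(contains("Co-operate", "COOP"))  # True
print(contains("co-operate", "cap"))   # False
```

The elements for both whole strings are computed first and only then compared piecewise. A real implementation is free to do something faster, as long as it produces the same results.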

Mike also asked what assumptions a system can make about a collation,
such as transitivity, symmetry, etc.  This is an area fraught with peril,
except when the nature of the collation is generally known.  For example, I
share your belief that collations based on the UCA are transitive,
symmetrical, etc., but I can easily imagine other collations that do not
share all of, perhaps any of, those properties.  That makes it dangerous
for a system to make universal assumptions.  Of course, a partial solution
(which I think I could support) to this problem is to say that the results
of collations that do not support those properties are
implementation-defined.
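
For what it is worth, these properties can at least be spot-checked over a sample of strings. The Python sketch below is only that, a check over samples and not a proof. It assumes "symmetry" means that compare(a, b) is the mirror image of compare(b, a). All names in it are hypothetical, and cyclic_collation is a deliberately ill-behaved comparison invented for the example.

```python
from itertools import product


def sign(n):
    return (n > 0) - (n < 0)


def is_symmetric(compare, samples):
    """compare(a, b) must be the mirror image of compare(b, a)."""
    return all(
        sign(compare(a, b)) == -sign(compare(b, a))
        for a, b in product(samples, repeat=2)
    )


def is_transitive(compare, samples):
    """If a <= b and b <= c, then a <= c must hold as well."""
    return not any(
        compare(a, b) <= 0 and compare(b, c) <= 0 and compare(a, c) > 0
        for a, b, c in product(samples, repeat=3)
    )


def codepoint_collation(a, b):
    return (a > b) - (a < b)


# An invented, cyclic comparison: each entry sorts after the one it "beats".
BEATS = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}


def cyclic_collation(a, b):
    if a == b:
        return 0
    return 1 if (a, b) in BEATS else -1


samples = ["rock", "paper", "scissors"]
print(is_symmetric(codepoint_collation, samples))   # True
print(is_transitive(codepoint_collation, samples))  # True
print(is_symmetric(cyclic_collation, samples))      # True
print(is_transitive(cyclic_collation, samples))     # False
```

A sort driven by something like cyclic_collation has no well-defined result, which is the kind of case that could be left implementation-defined.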

(Mike, with respect, I am troubled by your lengthy, almost algorithmic,
definition of a collation, in part because it seems to presume something
very like the UCA, but also in part because I see no need for such detailed
semantics to be included in the definition.  I strongly prefer my much more
terse and general definition.  For similar reasons, I am uncomfortable with
Ashok's proposed definition; again, it goes into too much detail and I
don't think we need to provide a tutorial on the possible behaviors that
collations can be built to provide.)

Hope this helps,
   Jim


========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
Oracle Corporation        Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive      Standards email: jim dot melton at acm dot org
Sandy, UT 84093-1063              Personal email: jim at melton dot name
USA                                                Fax : +1.801.942.3345
========================================================================
=  Facts are facts.  However, any opinions expressed are the opinions  =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
========================================================================

Received on Thursday, 10 June 2004 12:08:53 UTC