RE: PLease define 'collation'

I am trying to have a collation definition which
1. Includes Unicode Collation ?definition?.
2. Collation libraries already being used and written (that DO follow
  UCA notion) could be used without a serious rework.
3. Collation definition be flexible enough to include collation
definitions, which DO NOT
 follow UCA notation (e.g. codepoint collation).

> Do you see a place where our definition would you preclude
>from doing so?

Yes. Unicode string comparison or matching (all XSLT collation functions)
cannot be implemented correctly when you compare or match just parts of the
strings (represented by the collation units). You have to have something
(sort key) which
represents whole string and calculated according to the collation
parameters.


Igor Hersht
XSLT Development
IBM Canada Ltd., 8200 Warden Avenue, Markham, Ontario L6G 1C7
Office D2-260, Phone (905)413-3240 ; FAX  (905)413-4839


                                                                           
             "Michael Rys"                                                 
             <mrys@microsoft.                                              
             com>                                                       To 
                                      Igor Hersht/Toronto/IBM@IBMCA,       
             06/09/2004 04:39         "Michael Kay" <mhk@mhk.me.uk>        
             PM                                                         cc 
                                      <ashokmalhotra@alum.mit.edu>,        
                                      <public-qt-comments@w3.org>,         
                                      <Stephen.Buxton@oracle.com>          
                                                                   Subject 
                                      RE: PLease define 'collation'        
                                                                           
                                                                           
                                                                           
                                                                           
                                                                           
                                                                           




I think the point we are trying to make is that there are collation
libraries already being used and written that do not follow UCA notion
of a collation algorithm. Since we want to enable their use, our
definition must not preclude them. Nobody forces you to implement them,
you can use the JDK or ICU version as well.

Do you see a place where our definition would you preclude from doing
so?

Thanks
Michael

> -----Original Message-----
> From: public-qt-comments-request@w3.org [mailto:public-qt-comments-
> request@w3.org] On Behalf Of Igor Hersht
> Sent: Wednesday, June 09, 2004 1:33 PM
> To: Michael Kay
> Cc: ashokmalhotra@alum.mit.edu; public-qt-comments@w3.org;
> Stephen.Buxton@oracle.com
> Subject: RE: PLease define 'collation'
>
>
>
>
>
>
> > Our collations aren't restricted to conform to the
> >Unicode Collation Algorithm. This is a much more
> >abstract definition of what a collation is.
>
> I agree that a collation definition should not be restricted to
> the Unicode collation (there is no such restriction in my proposal).
> In my opinion the definition should also:
> 1. Include the Unicode collation.
> 2. Be implementable within a reasonable time limits.
> 3. Have reasonable performance.
>
>
> >I don't think there's any question that any conceivable ordering
> >of strings can be achieved by mapping the strings to sequences of
> integers
> >(indeed, single integers would do perfectly well)
> Of course, The only problem here is performance. One would need to
> recalculate all sequence elements on any string added.
> We wouldn?t need such recalculations if we used sorting keys
> (which is a byte array).
>
> > The question that arises is whether this also supports
> >contains. I would argue that it does, by construction.
>
> That is where I disagree.
> The only mapping from a string to a sequence of integers
> (which is defined in the Unicode Collation Algorithm and implemented
> in the Java JDK and the ICU) is mapping from this sting to the
collation
> elements. Implementing our own mapping would mean rewriting the
Unicode
> collation functionality and cannot be implemented within a reasonable
time
> limits.  Using collation elements (not sorting key) for comprising
> and string matching has fundamental problems I mentioned in my
previous
> note => cannot be implemented within a reasonable time limits. Just
anoter
> example from the Unicode specs which theoretically cannot be
implemented
> (for contains or any other collation function) using just collation
> elements.
>
> ?Characters may also have different case mappings, depending on the
> context.
> For example, U+03A3  capital sigma lowercases to U+03C3  small sigma
if it
> is followed by another letter, but lowercases to U+03C2  small final
sigma
> if it is not.?
> I assume here Greek collation with case-order ignored.
>
> >These are presumably not the existing functions
> >called fn:compare and fn:match? Are these standard functions that we
> >define,or are they functions peculiar to each collation?
>
> I think it could be good idea to have such functions exposed
> publicly. But it is not that important. This is just 2 functions
> with 3 parameters.
> > I can't infer from your definition any constraints on what a
collation
> >produces, e.g. any relationship between equals and contains.
>
> contains is always true if  equals is true.
>
>
>
> Igor Hersht
> XSLT Development
> IBM Canada Ltd., 8200 Warden Avenue, Markham, Ontario L6G 1C7
> Office D2-260, Phone (905)413-3240 ; FAX  (905)413-4839
>
>
>
>              "Michael Kay"
>              <mhk@mhk.me.uk>
>
To
>              06/08/2004 07:57         Igor Hersht/Toronto/IBM@IBMCA
>              PM
cc
>                                       <public-qt-comments@w3.org>,
>                                       <ashokmalhotra@alum.mit.edu>,
>                                       <Stephen.Buxton@oracle.com>
>
Subject
>                                       RE: PLease define 'collation'
>
>
>
>
>
>
>
>
>
>
> >
> > >A collation is a mapping from strings to sequences of integers,
> > >referred to as collation units. This mapping can be described as a
> >  >function C(xs:string)->xs:integer*.
> >
> > >Two strings are considered equal if they map to the same
> > >sequence of collation units.
> >
> > Problems.
> >
> > The definition does not correspond the Unicode Collation
> > Algorithm (see
> > http://www.unicode.org/unicode/reports/tr10/#Main_Algorithm).
> > The algorithm consist of 3 steps
> > Step 1. Produce a normalized form of each input string.
> > Step 2. The collation element array is built
> > Step 3. The sort key is formed.
> > String comprising and matching based on the calculated values of
> > the sort key.
>
> Our collations aren't restricted to conform to the Unicode Collation
> Algorithm. This is a much more abstract definition of what a collation
is.
> >
> > As far as I understand  the term ?collation unit? should corresponds
> > the Unicode term ?collation element?.  The step 3 is missing in the
> > collation
> > definition . Doing comprising based only on the collation
> > element is incorrect. The main problem is that collation elements
> > represent just part of the string. The sort key represents
> > whole string.
>
> I think this is partly because UCA is a parameterisable algorithm,
whereas
> our concept of a collation has all parameters already bound. I don't
think
> there's any question that any conceivable ordering of strings can be
> achieved by mapping the strings to sequences of integers (indeed,
single
> integers would do perfectly well) and then sorting the numeric
> representations. The question that arises is whether this also
supports
> contains. I would argue that it does, by construction.
> >
> > Some of the problems (e.g. Contextual Sensitivity see 1.3 Contextual
> > Sensitivity)
> > would require very serious rework of the existing collation
> > implementations
> > (the rework could be more time consuming than the rest of
> > XSLT2.0 and XPath
> > functionality). Just an example from this chapter (which
> > works fine with
> > the
> > ICU Unicode Collation Algorithm based implementation).
> > ?In French and a few other languages, however, it is the last accent
> > difference
> > that determines the order?
> > Normal Accent Ordering  {'c','o','t', 0x00EA}  < {'c',0x00F4,'t',
'e'}
> > French Accent Ordering  {'c','o','t', 0x00EA}  > {'c',0x00F4,'t',
'e'}
>
> I think you are looking at this too much from an implementation
> perspective.
> I was trying to find an abstract definition of what a collation is. As
I
> said, for equality and ordering it would be quite enough to say that a
> collation is a mapping from strings to integers (the position in the
final
> sequence). I was hoping that by saying it is a mapping to a sequence
of
> integers then we can also imply some properties of functions like
> contains(), for example that contains(A,B) is true if A=B is true, and
> that
> startswith(A, B) implies A <= B.
>
> The question emerged recently of whether a collation is guaranteed to
> implement equality in a way that is transitive. Under my definition,
it
> is.
> We haven't previously had a definition that imposed this constraint.
> >
> > There is another simple solution which is to create  sort key from
the
> > collation
> > elements and map the key to an integer sequence. I don?t
> > think that the
> > collation
> > definition make sense here.
>
> >
> >
> > Solution
> >
> > 2. Common collation definitions
> > A collation is a parameter for 2 functions fn:compare and fn:match.
>
> I'm not following you. These are presumably not the existing functions
> called fn:compare and fn:match? Are these standard functions that we
> define,
> or are they functions peculiar to each collation?
>
> I can't infer from your definition any constraints on what a collation
> produces, e.g. any relationship between equals and contains.
>
> Michael Kay
>
> >
> > fn:compare( $arg1  as xs:string?, $arg2  as xs:string?,
> > $collation  as xs:string) as xs:integer
> > Two strings (arg1 and arg2) are considered equal if
> > fn:compare returns 0.
> > A string arg1 is considered greater than a string arg2 if
fn:compare()
> > return is more than 0.
> >
> > fn:match( $arg1  as xs:string?, $arg2  as xs:string?, $collation  as
> > xs:string) as xs:integer*
> > fn:match returns 2 integer sequence  (S(start,end)) or an
> > empty ESsequence.
> >
> > Definitions of fn:contains, fn:starts-with, fn:ends-with,
> > fn:substring-before and fn:substring-after are quiet obvious in
terms
> > of fn:match.
> >
> > fn:contains return true if fn:match returns not empty sequence.
> > fn:starts-with return true if fn:match returns not empty sequence
> > and start = 0.
> > fn:ends-with return true if fn:match returns not empty sequence
> > and end = arg1.length() -1.
> > fn:substring-before an empty string if fn:match returns an
> > empty sequence.
> > Otherwise fn:substring-before returns arg1.substring(0,start).
> > fn:substring-afret an empty string if fn:match returns an
> > empty sequence.
> > Otherwise fn:substring-before returns
> > arg1.substring(end, arg1.length()).
> >
> > 2.1 Specific collations
> > Specific collations should be defined as definitions for
> > fn:compare and fn:match functions.
> >
> >
> > 2.1.1 Unicode collation
> > fn:compare and fn:match for the Unicode collation.
> > defined in http://www.unicode.org/unicode/reports/tr10.
> > The Unicode collation define different kinds of matching.
> > I think we can use an ambiguous -minimal one the(see 8 Searching and
> > Matching).
> > It could be good idea to define the maximal one (seems to
> > corresponds a
> > common
> > sense). The match is maximal if for all positive i and j,
> > there is no match
> > at Q[s-i,e+j].
> >
> > 2.1.2 The Unicode codepoint collation
> > fn:compare and fn:match is obvious here and corresponds and
> > corresponds to
> > the
> > Michael Kay's one.
> >
> > 2.1.3  Other collations.
> > Other collations are implementation defined.
> >
> >
> >
> > Igor Hersht
> > XSLT Development
> > IBM Canada Ltd., 8200 Warden Avenue, Markham, Ontario L6G 1C7
> > Office D2-260, Phone (905)413-3240 ; FAX  (905)413-4839
> >
>
>
>

Received on Wednesday, 9 June 2004 17:15:07 UTC