RE: PLease define 'collation' from Igor Hersht on 2004-06-09 (public-qt-comments@w3.org from June 2004)

From: Igor Hersht <igorh@ca.ibm.com>
Date: Wed, 9 Jun 2004 16:33:01 -0400
To: "Michael Kay" <mhk@mhk.me.uk>
Cc: ashokmalhotra@alum.mit.edu, public-qt-comments@w3.org, Stephen.Buxton@oracle.com
Message-ID: <OF13CA33B4.E0BCD1DE-ON85256EAE.005633EA-85256EAE.0070E20F@ca.ibm.com>
> Our collations aren't restricted to conform to the
>Unicode Collation Algorithm. This is a much more
>abstract definition of what a collation is.

I agree that a collation definition should not be restricted to
the Unicode collation (there is no such restriction in my proposal).
In my opinion the definition should also:
1. Include the Unicode collation.
2. Be implementable within a reasonable time limits.
3. Have reasonable performance.


>I don't think there's any question that any conceivable ordering
>of strings can be achieved by mapping the strings to sequences of integers
>(indeed, single integers would do perfectly well)
Of course, The only problem here is performance. One would need to
recalculate all sequence elements on any string added.
We wouldn?t need such recalculations if we used sorting keys
(which is a byte array).

> The question that arises is whether this also supports
>contains. I would argue that it does, by construction.

That is where I disagree.
The only mapping from a string to a sequence of integers
(which is defined in the Unicode Collation Algorithm and implemented
in the Java JDK and the ICU) is mapping from this sting to the collation
elements. Implementing our own mapping would mean rewriting the Unicode
collation functionality and cannot be implemented within a reasonable time
limits.  Using collation elements (not sorting key) for comprising
and string matching has fundamental problems I mentioned in my previous
note => cannot be implemented within a reasonable time limits. Just anoter
example from the Unicode specs which theoretically cannot be implemented
(for contains or any other collation function) using just collation
elements.

?Characters may also have different case mappings, depending on the
context.
For example, U+03A3  capital sigma lowercases to U+03C3  small sigma if it
is followed by another letter, but lowercases to U+03C2  small final sigma
if it is not.?
I assume here Greek collation with case-order ignored.

>These are presumably not the existing functions
>called fn:compare and fn:match? Are these standard functions that we
>define,or are they functions peculiar to each collation?

I think it could be good idea to have such functions exposed
publicly. But it is not that important. This is just 2 functions
with 3 parameters.
> I can't infer from your definition any constraints on what a collation
>produces, e.g. any relationship between equals and contains.

contains is always true if  equals is true.



Igor Hersht
XSLT Development
IBM Canada Ltd., 8200 Warden Avenue, Markham, Ontario L6G 1C7
Office D2-260, Phone (905)413-3240 ; FAX  (905)413-4839


                                                                           
             "Michael Kay"                                                 
             <mhk@mhk.me.uk>                                               
                                                                        To 
             06/08/2004 07:57         Igor Hersht/Toronto/IBM@IBMCA        
             PM                                                         cc 
                                      <public-qt-comments@w3.org>,         
                                      <ashokmalhotra@alum.mit.edu>,        
                                      <Stephen.Buxton@oracle.com>          
                                                                   Subject 
                                      RE: PLease define 'collation'        
                                                                           
                                                                           
                                                                           
                                                                           
                                                                           
                                                                           




>
> >A collation is a mapping from strings to sequences of integers,
> >referred to as collation units. This mapping can be described as a
>  >function C(xs:string)->xs:integer*.
>
> >Two strings are considered equal if they map to the same
> >sequence of collation units.
>
> Problems.
>
> The definition does not correspond the Unicode Collation
> Algorithm (see
> http://www.unicode.org/unicode/reports/tr10/#Main_Algorithm).
> The algorithm consist of 3 steps
> Step 1. Produce a normalized form of each input string.
> Step 2. The collation element array is built
> Step 3. The sort key is formed.
> String comprising and matching based on the calculated values of
> the sort key.

Our collations aren't restricted to conform to the Unicode Collation
Algorithm. This is a much more abstract definition of what a collation is.
>
> As far as I understand  the term ?collation unit? should corresponds
> the Unicode term ?collation element?.  The step 3 is missing in the
> collation
> definition . Doing comprising based only on the collation
> element is incorrect. The main problem is that collation elements
> represent just part of the string. The sort key represents
> whole string.

I think this is partly because UCA is a parameterisable algorithm, whereas
our concept of a collation has all parameters already bound. I don't think
there's any question that any conceivable ordering of strings can be
achieved by mapping the strings to sequences of integers (indeed, single
integers would do perfectly well) and then sorting the numeric
representations. The question that arises is whether this also supports
contains. I would argue that it does, by construction.
>
> Some of the problems (e.g. Contextual Sensitivity see 1.3 Contextual
> Sensitivity)
> would require very serious rework of the existing collation
> implementations
> (the rework could be more time consuming than the rest of
> XSLT2.0 and XPath
> functionality). Just an example from this chapter (which
> works fine with
> the
> ICU Unicode Collation Algorithm based implementation).
> ?In French and a few other languages, however, it is the last accent
> difference
> that determines the order?
> Normal Accent Ordering  {'c','o','t', 0x00EA}  < {'c',0x00F4,'t', 'e'}
> French Accent Ordering  {'c','o','t', 0x00EA}  > {'c',0x00F4,'t', 'e'}

I think you are looking at this too much from an implementation
perspective.
I was trying to find an abstract definition of what a collation is. As I
said, for equality and ordering it would be quite enough to say that a
collation is a mapping from strings to integers (the position in the final
sequence). I was hoping that by saying it is a mapping to a sequence of
integers then we can also imply some properties of functions like
contains(), for example that contains(A,B) is true if A=B is true, and that
startswith(A, B) implies A <= B.

The question emerged recently of whether a collation is guaranteed to
implement equality in a way that is transitive. Under my definition, it is.
We haven't previously had a definition that imposed this constraint.
>
> There is another simple solution which is to create  sort key from the
> collation
> elements and map the key to an integer sequence. I don?t
> think that the
> collation
> definition make sense here.

>
>
> Solution
>
> 2. Common collation definitions
> A collation is a parameter for 2 functions fn:compare and fn:match.

I'm not following you. These are presumably not the existing functions
called fn:compare and fn:match? Are these standard functions that we
define,
or are they functions peculiar to each collation?

I can't infer from your definition any constraints on what a collation
produces, e.g. any relationship between equals and contains.

Michael Kay

>
> fn:compare( $arg1  as xs:string?, $arg2  as xs:string?,
> $collation  as xs:string) as xs:integer
> Two strings (arg1 and arg2) are considered equal if
> fn:compare returns 0.
> A string arg1 is considered greater than a string arg2 if fn:compare()
> return is more than 0.
>
> fn:match( $arg1  as xs:string?, $arg2  as xs:string?, $collation  as
> xs:string) as xs:integer*
> fn:match returns 2 integer sequence  (S(start,end)) or an
> empty ESsequence.
>
> Definitions of fn:contains, fn:starts-with, fn:ends-with,
> fn:substring-before and fn:substring-after are quiet obvious in terms
> of fn:match.
>
> fn:contains return true if fn:match returns not empty sequence.
> fn:starts-with return true if fn:match returns not empty sequence
> and start = 0.
> fn:ends-with return true if fn:match returns not empty sequence
> and end = arg1.length() -1.
> fn:substring-before an empty string if fn:match returns an
> empty sequence.
> Otherwise fn:substring-before returns arg1.substring(0,start).
> fn:substring-afret an empty string if fn:match returns an
> empty sequence.
> Otherwise fn:substring-before returns
> arg1.substring(end, arg1.length()).
>
> 2.1 Specific collations
> Specific collations should be defined as definitions for
> fn:compare and fn:match functions.
>
>
> 2.1.1 Unicode collation
> fn:compare and fn:match for the Unicode collation.
> defined in http://www.unicode.org/unicode/reports/tr10.
> The Unicode collation define different kinds of matching.
> I think we can use an ambiguous -minimal one the(see 8 Searching and
> Matching).
> It could be good idea to define the maximal one (seems to
> corresponds a
> common
> sense). The match is maximal if for all positive i and j,
> there is no match
> at Q[s-i,e+j].
>
> 2.1.2 The Unicode codepoint collation
> fn:compare and fn:match is obvious here and corresponds and
> corresponds to
> the
> Michael Kay's one.
>
> 2.1.3  Other collations.
> Other collations are implementation defined.
>
>
>
> Igor Hersht
> XSLT Development
> IBM Canada Ltd., 8200 Warden Avenue, Markham, Ontario L6G 1C7
> Office D2-260, Phone (905)413-3240 ; FAX  (905)413-4839
>
Received on Wednesday, 9 June 2004 16:33:43 UTC