W3C home > Mailing lists > Public > public-qt-comments@w3.org > June 2004

RE: PLease define 'collation'

From: Michael Kay <mhk@mhk.me.uk>
Date: Wed, 9 Jun 2004 00:57:23 +0100
To: "'Igor Hersht'" <igorh@ca.ibm.com>
Cc: <public-qt-comments@w3.org>, <ashokmalhotra@alum.mit.edu>, <Stephen.Buxton@oracle.com>
Message-Id: <20040608235800.D8F2BA23FE@frink.w3.org>

> 
> >A collation is a mapping from strings to sequences of integers,
> >referred to as collation units. This mapping can be described as a
>  >function C(xs:string)->xs:integer*.
> 
> >Two strings are considered equal if they map to the same
> >sequence of collation units.
> 
> Problems.
> 
> The definition does not correspond the Unicode Collation
> Algorithm (see
> http://www.unicode.org/unicode/reports/tr10/#Main_Algorithm).
> The algorithm consist of 3 steps
> Step 1. Produce a normalized form of each input string.
> Step 2. The collation element array is built
> Step 3. The sort key is formed.
> String comprising and matching based on the calculated values of
> the sort key.

Our collations aren't restricted to conform to the Unicode Collation
Algorithm. This is a much more abstract definition of what a collation is.
> 
> As far as I understand  the term ?collation unit? should corresponds
> the Unicode term ?collation element?.  The step 3 is missing in the
> collation
> definition . Doing comprising based only on the collation
> element is incorrect. The main problem is that collation elements
> represent just part of the string. The sort key represents 
> whole string.

I think this is partly because UCA is a parameterisable algorithm, whereas
our concept of a collation has all parameters already bound. I don't think
there's any question that any conceivable ordering of strings can be
achieved by mapping the strings to sequences of integers (indeed, single
integers would do perfectly well) and then sorting the numeric
representations. The question that arises is whether this also supports
contains. I would argue that it does, by construction.
> 
> Some of the problems (e.g. Contextual Sensitivity see 1.3 Contextual
> Sensitivity)
> would require very serious rework of the existing collation 
> implementations
> (the rework could be more time consuming than the rest of 
> XSLT2.0 and XPath
> functionality). Just an example from this chapter (which 
> works fine with
> the
> ICU Unicode Collation Algorithm based implementation).
> ?In French and a few other languages, however, it is the last accent
> difference
> that determines the order?
> Normal Accent Ordering  {'c','o','t', 0x00EA}  < {'c',0x00F4,'t', 'e'}
> French Accent Ordering  {'c','o','t', 0x00EA}  > {'c',0x00F4,'t', 'e'}

I think you are looking at this too much from an implementation perspective.
I was trying to find an abstract definition of what a collation is. As I
said, for equality and ordering it would be quite enough to say that a
collation is a mapping from strings to integers (the position in the final
sequence). I was hoping that by saying it is a mapping to a sequence of
integers then we can also imply some properties of functions like
contains(), for example that contains(A,B) is true if A=B is true, and that
startswith(A, B) implies A <= B.

The question emerged recently of whether a collation is guaranteed to
implement equality in a way that is transitive. Under my definition, it is.
We haven't previously had a definition that imposed this constraint.
> 
> There is another simple solution which is to create  sort key from the
> collation
> elements and map the key to an integer sequence. I don?t 
> think that the
> collation
> definition make sense here.

> 
> 
> Solution
> 
> 2. Common collation definitions
> A collation is a parameter for 2 functions fn:compare and fn:match.

I'm not following you. These are presumably not the existing functions
called fn:compare and fn:match? Are these standard functions that we define,
or are they functions peculiar to each collation?

I can't infer from your definition any constraints on what a collation
produces, e.g. any relationship between equals and contains.

Michael Kay

> 
> fn:compare( $arg1  as xs:string?, $arg2  as xs:string?,
> $collation  as xs:string) as xs:integer
> Two strings (arg1 and arg2) are considered equal if 
> fn:compare returns 0.
> A string arg1 is considered greater than a string arg2 if fn:compare()
> return is more than 0.
> 
> fn:match( $arg1  as xs:string?, $arg2  as xs:string?, $collation  as
> xs:string) as xs:integer*
> fn:match returns 2 integer sequence  (S(start,end)) or an 
> empty ESsequence.
> 
> Definitions of fn:contains, fn:starts-with, fn:ends-with,
> fn:substring-before and fn:substring-after are quiet obvious in terms
> of fn:match.
> 
> fn:contains return true if fn:match returns not empty sequence.
> fn:starts-with return true if fn:match returns not empty sequence
> and start = 0.
> fn:ends-with return true if fn:match returns not empty sequence
> and end = arg1.length() -1.
> fn:substring-before an empty string if fn:match returns an 
> empty sequence.
> Otherwise fn:substring-before returns arg1.substring(0,start).
> fn:substring-afret an empty string if fn:match returns an 
> empty sequence.
> Otherwise fn:substring-before returns
> arg1.substring(end, arg1.length()).
> 
> 2.1 Specific collations
> Specific collations should be defined as definitions for
> fn:compare and fn:match functions.
> 
> 
> 2.1.1 Unicode collation
> fn:compare and fn:match for the Unicode collation.
> defined in http://www.unicode.org/unicode/reports/tr10.
> The Unicode collation define different kinds of matching.
> I think we can use an ambiguous -minimal one the(see 8 Searching and
> Matching).
> It could be good idea to define the maximal one (seems to 
> corresponds a
> common
> sense). The match is maximal if for all positive i and j, 
> there is no match
> at Q[s-i,e+j].
> 
> 2.1.2 The Unicode codepoint collation
> fn:compare and fn:match is obvious here and corresponds and 
> corresponds to
> the
> Michael Kay's one.
> 
> 2.1.3  Other collations.
> Other collations are implementation defined.
> 
> 
> 
> Igor Hersht
> XSLT Development
> IBM Canada Ltd., 8200 Warden Avenue, Markham, Ontario L6G 1C7
> Office D2-260, Phone (905)413-3240 ; FAX  (905)413-4839
> 
Received on Tuesday, 8 June 2004 19:58:01 UTC

This archive was generated by hypermail 2.3.1 : Wednesday, 7 January 2015 15:45:19 UTC