From: Michael Kay <mhk@mhk.me.uk>
Date: Thu, 3 Jun 2004 09:57:29 +0100
To: "'Ashok Malhotra'" <ashokmalhotra@alum.mit.edu>, "'Stephen Buxton'" <Stephen.Buxton@oracle.com>

Message-Id: <20040603085807.3411AA0A6F@frink.w3.org>
```
>
> A collation is an artefact that assigns a sort order for
> every codepoint or sequence of codepoints.  The sort order is
> implementation dependent and can vary widely.  The sort order
> may or may not distinguish between lower and
> upper-case variants of codepoints or accented and unaccented
> versions of codepoints.  It may also choose to ignore
> punctuation between certain codepoints.
>

As David points out, I think the definitions we have are already
significantly better than this. We could try something more formal, as
follows:

A collation is a mapping from strings to sequences of integers, referred to
as collation units. This mapping can be described as a function
C(xs:string)->xs:integer*.

Two strings are considered equal if they map to the same sequence of
collation units. Thus P eq Q under collation C is true if and only if
deep-equals(C(P), C(Q)).

A string P is considered greater than a string Q if the sequence of
collation units corresponding to P is lexicographically greater than the
sequence of collation units corresponding to Q. More specifically, P gt Q
under collation C is true if and only if <some \$n in 1 to count(C(P))
satisfies (C(P)[\$n] gt C(Q)[\$n] and every \$m in 1 to (\$n - 1) satisfies
C(P)[\$m] eq C(Q)[\$m])>.

A string P contains string Q at position N under collation C if and only if
deep-equals(subsequence(C(P), N, count(C(Q))), C(Q)).

contains(A, B, C) is true if there exists a position N such that A contains
B at position N under collation C.

starts-with(A, B, C) is true if A contains B at position 1 under collation
C.

ends-with(A, B, C) is true if A contains B at position (count(C(A)) -
count(C(B)) + 1) under collation C.

substring-before(A, B, C) returns "" if contains(A, B, C) is false;
otherwise it determines the smallest value N such that A contains B at
position N under collation C, and then returns the longest string R such
that deep-equals(C(R), subsequence(C(A), 1, N -1).

substring-after(A, B, C) returns "" if contains(A, B, C) is false; otherwise
it determines the smallest value N such that A contains B at position N
under collation C, and then returns the longest string R such that
deep-equals(C(R), subsequence(C(A), N + count(C(B))).

The Unicode codepoint collation is the collation whose mapping function is
fn:string-to-codepoints().

["the longest string R" catches the ambiguity noted by Henry Z, in the case
where some characters map to a zero-length sequence of collation units. The
ambiguity could equally be resolved by writing "the shortest string R"]

Note that this definition gets rid of the idea that some collations don't
support contains(). A valid collation is the mapping "January"->1,
"February"->2, etc, and with such a collation each string will contain
itself and will not contain any other string.

Michael Kay

N.B.: the above definitions need to be carefully checked.
```
Received on Thursday, 3 June 2004 04:58:07 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Tuesday, 27 March 2012 18:14:34 GMT