RE: PLease define 'collation' from Michael Kay on 2004-06-03 (public-qt-comments@w3.org from June 2004)

From: Michael Kay <mhk@mhk.me.uk>
Date: Thu, 3 Jun 2004 09:57:29 +0100
To: "'Ashok Malhotra'" <ashokmalhotra@alum.mit.edu>, "'Stephen Buxton'" <Stephen.Buxton@oracle.com>
Cc: <public-qt-comments@w3.org>
Message-Id: <20040603085807.3411AA0A6F@frink.w3.org>

> 
> A collation is an artefact that assigns a sort order for 
> every codepoint or sequence of codepoints.  The sort order is 
> implementation dependent and can vary widely.  The sort order 
> may or may not distinguish between lower and 
> upper-case variants of codepoints or accented and unaccented 
> versions of codepoints.  It may also choose to ignore 
> punctuation between certain codepoints. 
> 

As David points out, I think the definitions we have are already
significantly better than this. We could try something more formal, as
follows:

A collation is a mapping from strings to sequences of integers, referred to
as collation units. This mapping can be described as a function
C(xs:string)->xs:integer*. 

Two strings are considered equal if they map to the same sequence of
collation units. Thus P eq Q under collation C is true if and only if
deep-equals(C(P), C(Q)). 

A string P is considered greater than a string Q if the sequence of
collation units corresponding to P is lexicographically greater than the
sequence of collation units corresponding to Q. More specifically, P gt Q
under collation C is true if and only if <some $n in 1 to count(C(P))
satisfies (C(P)[$n] gt C(Q)[$n] and every $m in 1 to ($n - 1) satisfies
C(P)[$m] eq C(Q)[$m])>. 

A string P contains string Q at position N under collation C if and only if
deep-equals(subsequence(C(P), N, count(C(Q))), C(Q)).

contains(A, B, C) is true if there exists a position N such that A contains
B at position N under collation C.

starts-with(A, B, C) is true if A contains B at position 1 under collation
C.

ends-with(A, B, C) is true if A contains B at position (count(C(A)) -
count(C(B)) + 1) under collation C.

substring-before(A, B, C) returns "" if contains(A, B, C) is false;
otherwise it determines the smallest value N such that A contains B at
position N under collation C, and then returns the longest string R such
that deep-equals(C(R), subsequence(C(A), 1, N -1).

substring-after(A, B, C) returns "" if contains(A, B, C) is false; otherwise
it determines the smallest value N such that A contains B at position N
under collation C, and then returns the longest string R such that
deep-equals(C(R), subsequence(C(A), N + count(C(B))).

The Unicode codepoint collation is the collation whose mapping function is
fn:string-to-codepoints().

["the longest string R" catches the ambiguity noted by Henry Z, in the case
where some characters map to a zero-length sequence of collation units. The
ambiguity could equally be resolved by writing "the shortest string R"]

Note that this definition gets rid of the idea that some collations don't
support contains(). A valid collation is the mapping "January"->1,
"February"->2, etc, and with such a collation each string will contain
itself and will not contain any other string.

Michael Kay

N.B.: the above definitions need to be carefully checked.

Received on Thursday, 3 June 2004 04:58:07 UTC