- From: Michael Kay <mhk@mhk.me.uk>
- Date: Thu, 3 Jun 2004 09:57:29 +0100
- To: "'Ashok Malhotra'" <ashokmalhotra@alum.mit.edu>, "'Stephen Buxton'" <Stephen.Buxton@oracle.com>
- Cc: <public-qt-comments@w3.org>

> > A collation is an artefact that assigns a sort order for > every codepoint or sequence of codepoints. The sort order is > implementation dependent and can vary widely. The sort order > may or may not distinguish between lower and > upper-case variants of codepoints or accented and unaccented > versions of codepoints. It may also choose to ignore > punctuation between certain codepoints. > As David points out, I think the definitions we have are already significantly better than this. We could try something more formal, as follows: A collation is a mapping from strings to sequences of integers, referred to as collation units. This mapping can be described as a function C(xs:string)->xs:integer*. Two strings are considered equal if they map to the same sequence of collation units. Thus P eq Q under collation C is true if and only if deep-equals(C(P), C(Q)). A string P is considered greater than a string Q if the sequence of collation units corresponding to P is lexicographically greater than the sequence of collation units corresponding to Q. More specifically, P gt Q under collation C is true if and only if <some $n in 1 to count(C(P)) satisfies (C(P)[$n] gt C(Q)[$n] and every $m in 1 to ($n - 1) satisfies C(P)[$m] eq C(Q)[$m])>. A string P contains string Q at position N under collation C if and only if deep-equals(subsequence(C(P), N, count(C(Q))), C(Q)). contains(A, B, C) is true if there exists a position N such that A contains B at position N under collation C. starts-with(A, B, C) is true if A contains B at position 1 under collation C. ends-with(A, B, C) is true if A contains B at position (count(C(A)) - count(C(B)) + 1) under collation C. substring-before(A, B, C) returns "" if contains(A, B, C) is false; otherwise it determines the smallest value N such that A contains B at position N under collation C, and then returns the longest string R such that deep-equals(C(R), subsequence(C(A), 1, N -1). substring-after(A, B, C) returns "" if contains(A, B, C) is false; otherwise it determines the smallest value N such that A contains B at position N under collation C, and then returns the longest string R such that deep-equals(C(R), subsequence(C(A), N + count(C(B))). The Unicode codepoint collation is the collation whose mapping function is fn:string-to-codepoints(). ["the longest string R" catches the ambiguity noted by Henry Z, in the case where some characters map to a zero-length sequence of collation units. The ambiguity could equally be resolved by writing "the shortest string R"] Note that this definition gets rid of the idea that some collations don't support contains(). A valid collation is the mapping "January"->1, "February"->2, etc, and with such a collation each string will contain itself and will not contain any other string. Michael Kay N.B.: the above definitions need to be carefully checked.

Received on Thursday, 3 June 2004 04:58:07 UTC