draft-newman-i18n-comparator from Mark Davis on 2004-10-19 (public-ietf-collation@w3.org from October 2004)

From: Mark Davis <mark.davis@jtcsv.com>
Date: Tue, 19 Oct 2004 10:19:48 -0700
To: <public-ietf-collation@w3.org>, "Martin Duerst" <duerst@w3.org>
Cc: "Chris Newman" <Chris.Newman@Sun.COM>, <cldr@unicode.org>
Message-ID: <00da01c4b5ff$d6602690$336d3009@sanjose.ibm.com>

I made a quick pass through the draft at
http://www.w3.org/2004/08/ietf-collation. Here are some high-level comments.

1. I agree with others that the statement of the algorithm needs to be in
terms of collating Unicode code points. It could be applied to a bag o'
bytes tagged with an IANA charset tag, but the interpretation should always
be in terms of how those bytes map to Unicode.
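
To illustrate the point, here is a minimal sketch, assuming Python's codec
names stand in for IANA charset tags and that plain code point order stands
in for a real collation. The names to_code_points and compare_tagged are
made up for this example and do not come from the draft.

```python
from typing import List

def to_code_points(raw: bytes, charset: str) -> List[int]:
    # Assumption: the charset tag is one that Python's codec registry accepts.
    return [ord(ch) for ch in raw.decode(charset)]

def compare_tagged(a: bytes, a_charset: str, b: bytes, b_charset: str) -> int:
    # Interpret both byte sequences as Unicode first, then compare.
    # Code point order is only a placeholder for the actual collation.
    ca = to_code_points(a, a_charset)
    cb = to_code_points(b, b_charset)
    return (ca > cb) - (ca < cb)
```

With this, the single byte E9 tagged iso-8859-1 and the two bytes C3 A9
tagged utf-8 compare as equal, since both map to U+00E9.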

2. The scope needs to be clarified. I break it down as:
comparison - given two strings, determine whether the first is greater
than, less than, or equal to the second.
(The current draft splits out equality; I'm not sure why that should be
distinguished, but it would be OK to do that too.)

sort key generation - given a string, generate a sequence of bytes called a
sort key. A binary comparison of two sort keys (for the same collation) gives
the same result as the comparison above applied to the original strings.

substring matching - given a key, a target, and an offset within that
target, provide two testing operations:
  whether the key matches a substring of the target starting at the offset
(and, if so, return the end-offset in the target)
  whether the key matches a substring of the target ending at the offset
(and, if so, return the start-offset in the target)
The matching must be coordinated with the comparison operation. [See also
http://www.unicode.org/reports/tr10/#Searching]

substring searching - given a key, a target, and a range (pair of offsets)
in the target, provide two operations:
  find the first offset pair in the range where the key matches
  find the last offset pair in the range where the key matches.
The matching must be coordinated with the comparison operation. [See also
http://www.unicode.org/reports/tr10/#Searching]
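
The four operations above can be sketched as follows. This is only a toy
under stated assumptions: the "collation" is case-insensitive code point
order (via casefold), matching is brute force, and "last" is taken to mean
the match with the greatest end offset. All function names are hypothetical,
not taken from the draft or from UTS #10.

```python
from typing import Optional, Tuple

def sort_key(s: str) -> bytes:
    # Toy collation: ignore case, then order by code point.
    return s.casefold().encode("utf-8")

def compare(a: str, b: str) -> int:
    # Defined through the sort keys, so the two can never disagree.
    ka, kb = sort_key(a), sort_key(b)
    return (ka > kb) - (ka < kb)

def match_at(key: str, target: str, offset: int) -> Optional[int]:
    # Does key match a substring starting at offset? Return its end offset.
    for end in range(offset, len(target) + 1):
        if compare(key, target[offset:end]) == 0:
            return end
    return None

def match_ending_at(key: str, target: str, offset: int) -> Optional[int]:
    # Does key match a substring ending at offset? Return its start offset.
    for start in range(offset, -1, -1):
        if compare(key, target[start:offset]) == 0:
            return start
    return None

def find_first(key: str, target: str, lo: int, hi: int) -> Optional[Tuple[int, int]]:
    # First (start, end) pair inside [lo, hi] where the key matches.
    for start in range(lo, hi + 1):
        end = match_at(key, target[:hi], start)
        if end is not None:
            return (start, end)
    return None

def find_last(key: str, target: str, lo: int, hi: int) -> Optional[Tuple[int, int]]:
    # Last (start, end) pair inside [lo, hi], scanning end offsets downward.
    for end in range(hi, lo - 1, -1):
        start = match_ending_at(key, target[lo:], end - lo)
        if start is not None:
            return (start + lo, end)
    return None
```

Note that the matched length need not equal the key length: under this toy
collation the key "SS" matches the single character "ß" in "Straße", which
is why the operations have to return an offset rather than let the caller
assume one.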

3. There are a few more conditions on comparison that must be added. In
particular, it must be transitive. However, I think the conditions on NULL
are too much; it should be perfectly legal for an implementation to throw an
error for NULL (as opposed to empty).
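
As a rough sketch of both points, assuming a toy case-insensitive
comparison: the comparator below rejects NULL (None) but accepts the empty
string, and a brute-force helper checks transitivity over a sample set. The
names compare, fuzzy_compare, and is_transitive are made up for this
illustration.

```python
from itertools import product

def compare(a, b):
    # NULL is an error here; the empty string is an ordinary operand.
    if a is None or b is None:
        raise ValueError("NULL operand")
    ka, kb = a.casefold(), b.casefold()
    return (ka > kb) - (ka < kb)

def fuzzy_compare(a, b):
    # Deliberately broken: lengths within 1 of each other count as "equal".
    d = len(a) - len(b)
    return 0 if abs(d) <= 1 else (1 if d > 0 else -1)

def is_transitive(cmp, samples):
    # a <= b and b <= c must imply a <= c for every triple of samples.
    for a, b, c in product(samples, repeat=3):
        if cmp(a, b) <= 0 and cmp(b, c) <= 0 and cmp(a, c) > 0:
            return False
    return True
```

The fuzzy comparator fails the check on "a", "ab", "abc": each neighbor
pair compares equal, yet "abc" is greater than "a". Passing the check on a
sample set does not prove transitivity in general, of course.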

4. The logical mechanism in i;ascii-numeric does not work for sort keys,
since you don't have the length of the other string available. So the
logical specification needs to be different.
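
One possible way to get the same ordering out of sort keys, offered only as
a sketch: put the digit count into the key itself, so that a binary compare
never needs the other string's length. The assumptions here are mine, not
the draft's: only the leading run of ASCII digits counts, leading zeros are
ignored, the run is shorter than 2^32 - 1 digits, and a string with no
leading digit sorts after every number.

```python
import struct

def leading_digits(s: str) -> str:
    # Return the leading run of ASCII digits, possibly empty.
    i = 0
    while i < len(s) and s[i] in "0123456789":
        i += 1
    return s[:i]

def numeric_sort_key(s: str) -> bytes:
    digits = leading_digits(s)
    if not digits:
        # Assumption: non-numeric strings sort after all numbers.
        return b"\xff\xff\xff\xff"
    digits = digits.lstrip("0") or "0"
    # Big-endian length prefix: a shorter number always sorts first;
    # numbers of equal length fall back to digit-by-digit order.
    return struct.pack(">I", len(digits)) + digits.encode("ascii")
```

With this encoding the key for "9" sorts before the key for "10", and "007"
and "7" get identical keys.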

5. Basic collation:
> For the normalization step,
> <http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt> is used.
This is bad; by now that is quite an old version of Unicode. For
Nameprep collation it is fine to tie the version back, but not for basic
collation.

5b. The options in http://www.unicode.org/reports/tr35/#<collations>,
"Collation Settings", need to be added, with the corresponding syntax
defined. An implicit feature of UTS #35 is the locale id, so that
definitely needs to be there.

5c. Using only level 1 for equality checking will surprise a great many
people. That would make "Durst" = "DÜRST".
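
To make the surprise concrete, here is a rough approximation of a
level-1-only key. It is not the UCA: it simply decomposes, drops combining
marks, and case-folds, which is my stand-in for "ignore accent and case
differences". The function names are hypothetical.

```python
import unicodedata

def primary_key(s: str) -> str:
    # Approximate level 1: strip accents (combining marks) and case.
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()

def multilevel_key(s: str):
    # Crude stand-in for using more levels: break primary ties on the
    # normalized original, so accent and case differences still count.
    return (primary_key(s), unicodedata.normalize("NFD", s))
```

Under primary_key, "Durst" and "DÜRST" are equal; under multilevel_key they
are not.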

This is just a quick first pass; I don't want to get down into the details
yet. I am cc'ing the Unicode CLDR technical committee also.

Mark

Received on Tuesday, 19 October 2004 17:19:54 UTC