- From: Mark Davis <mark.davis@jtcsv.com>
- Date: Tue, 19 Oct 2004 10:19:48 -0700
- To: <public-ietf-collation@w3.org>, "Martin Duerst" <duerst@w3.org>
- Cc: "Chris Newman" <Chris.Newman@Sun.COM>, <cldr@unicode.org>
I made a quick pass through the draft at http://www.w3.org/2004/08/ietf-collation. Here are some high-level comments.

1. I agree with others that the statement of the algorithm needs to be in terms of collating Unicode code points. It could be applied to a bag o' bytes tagged with an IANA charset tag, but the interpretation should always be in terms of how those bytes map to Unicode.

2. The scope needs to be clarified. I break it down as:

   comparison - given two strings, determine whether the ordering is greater, less, or equal. (The current draft splits out equality; not sure why that should be distinguished, but it would be ok to do that too.)

   sortkey generation - given a string, generate a sequence of bytes called a sort key. Two sort keys (for the same collation) will binary compare the same as the comparison above.

   substring matching - given a key, a target, and an offset within that target, provide two testing operations:
     - the key matches a substring of the target starting at the offset (and return the end-offset in the target)
     - the key matches a substring of the target ending at the offset (and return the start-offset in the target)
   The matching must be coordinated with the comparison operation. [See also http://www.unicode.org/reports/tr10/#Searching]

   substring searching - given a key, a target and a range (pair of offsets) in the target, provide two operations:
     - the first offset pair in the range where the key matches
     - the last offset pair in the range where the key matches
   The matching must be coordinated with the comparison operation. [See also http://www.unicode.org/reports/tr10/#Searching]

3. There are a few more conditions on comparison that must be added. In particular, it must be transitive. However, I think the conditions on NULL are too much; it should be perfectly legal for an implementation to throw an error for NULL (as opposed to empty).

4. The logical mechanism in i;ascii-numeric does not work for sort keys, since you don't have the length of the other string available. So the logical specification needs to be different.

5. Basic collation:

   > For the normalization step, <http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt> is used.

   This is bad; that is a quite old version of Unicode by this time. For Nameprep collation it is fine to tie the version back, but not for basic collation.

5b. The options in http://www.unicode.org/reports/tr35/#<collations>, "Collation Settings" need to be added, and should have the corresponding syntax defined. An implicit feature of UTS #35 is the locale id, so that definitely needs to be there.

5c. Using only level 1 for equality checking will surprise a great many people. That would make "Durst" = "DÜRST".

This is just a quick first pass; I don't want to get down into the details yet. I am cc'ing the Unicode CLDR technical committee also.

Mark
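
To make the concern in 5c concrete, here is a minimal Python sketch. It is not the UCA and not anything specified in the draft: `primary_key`, `accent_sensitive_key`, and `compare` are hypothetical names, and "level 1" is only approximated by decomposing to NFD, dropping combining marks, and case-folding. It also shows the sort key property from point 2, since the comparison is defined as a binary compare of the two keys.

```python
import unicodedata


def primary_key(s: str) -> bytes:
    """Rough stand-in for a level-1 sort key (assumption, not the real UCA).

    Accents and case are discarded, so only base letters contribute.
    """
    decomposed = unicodedata.normalize("NFD", s)
    # combining() returns a nonzero class for combining marks such as U+0308.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold().encode("utf-8")


def accent_sensitive_key(s: str) -> bytes:
    """Rough stand-in for a key that keeps accents but still ignores case."""
    return unicodedata.normalize("NFD", s.casefold()).encode("utf-8")


def compare(a: str, b: str, key=primary_key) -> int:
    """Return -1, 0, or 1 by binary comparison of the two sort keys."""
    ka, kb = key(a), key(b)
    return (ka > kb) - (ka < kb)


if __name__ == "__main__":
    # Equal at the approximated level 1: the umlaut and the case are ignored.
    print(compare("Durst", "DÜRST"))                        # 0
    # No longer equal once accents are taken into account.
    print(compare("Durst", "DÜRST", accent_sensitive_key))  # nonzero
```

A real implementation would take its weights from a collation table rather than from normalization tricks; the sketch only shows why equality at the primary level treats the two strings as the same.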
Received on Tuesday, 19 October 2004 17:19:54 UTC