Re: comments on draft-newman-i18n-comparator-05.txt from Mark Davis on 2005-09-22 (public-ietf-collation@w3.org from September 2005)

From: Mark Davis <mark.davis@icu-project.org>
Date: Thu, 22 Sep 2005 07:49:53 -0700
To: Arnt Gulbrandsen <arnt@gulbrandsen.priv.no>
CC: Martin Duerst <duerst@it.aoyama.ac.jp>, Philip Guenther <guenther+collation@sendmail.com>, public-ietf-collation@w3.org
Message-ID: <4332C491.7060801@icu-project.org>
 > First, I was using the word protocol in the draft's sense, which is so
 > wide that I had protocol for lunch today ;)

If so, then it needs to be clear from the text that 'protocol' is using 
such a broad sense; examples would help.

On the issue of charsets: the biggest problem currently with using 
charsets other than Unicode is that common IANA identifiers for charsets 
are ambiguous (see http://www.w3.org/TR/japanese-xml/#ambiguity_of_yen). 
So two implementations that think they are getting the same results from 
comparison may not be, because they are using variant mapping tables.
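
A small sketch of that ambiguity, using two made-up mapping tables that sit behind the same charset label. The tables and the names TABLE_A, TABLE_B and decode are illustrative assumptions, not real IANA or vendor tables; they only model the yen sign versus backslash case described at the URL above.

```python
# Hypothetical: two implementations that both claim the same charset label
# but map byte 0x5C differently (backslash vs. yen sign).
TABLE_A = {0x5C: "\u005C"}  # REVERSE SOLIDUS
TABLE_B = {0x5C: "\u00A5"}  # YEN SIGN


def decode(octets: bytes, table: dict) -> str:
    """Map each byte through `table`, falling back to the ASCII identity."""
    return "".join(table.get(b, chr(b)) for b in octets)


data = b"100\x5c"
reference = "100\u00A5"  # what a Unicode-side peer might hold

# Same octets, same label, different comparison outcomes.
equal_under_a = decode(data, TABLE_A) == reference
equal_under_b = decode(data, TABLE_B) == reference
print(equal_under_a, equal_under_b)
```

Neither implementation is "wrong" by its own table, which is why the disagreement is hard to detect from the charset identifier alone.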

Arnt Gulbrandsen wrote:
> 
> (I am amazed. Quick response on a collation question.)
> 
> Martin Duerst writes:
> 
>> At 17:24 05/09/22, Arnt Gulbrandsen wrote:
>> >1. Collators should get octet strings from the protocol.
>>
>> Sorry, but this assumption isn't generally true. In XQuery, collations 
>> are always applied to (Unicode) character strings. This is somewhat 
>> due to the fact that XQuery isn't a protocol. But there is no need to 
>> restrict the use of collators to protocols.
> 
> 
> I agree and disagree.
> 
> First, I was using the word protocol in the draft's sense, which is so 
> wide that I had protocol for lunch today ;)

> 
> Second, while others may indeed use collators, what RFCs define is 
> protocols. Anything else that can get a free ride is a (highly 
> desirable) added bonus. We can care about XQuery.
> 
>> Up to now, my assumption was that all collations operate on character 
>> strings,
> 
> 
> (mine too)
> 
>> and that the 'octet' collation was either a bad name or the exception 
>> that proved the rule (until recently, I didn't get much of an answer 
>> on that from Chris).
> 
> 
> i;octet, ascii-numeric and Cyrus' date collator (for sieve) persuade me 
> that this isn't so. The very raison d'etre for a collator is that it is 
> NOT strcmp(). The collator draft/RFC defines a small API, and a collator 
> is something that implements that API on a given data type. That data 
> type may be "Turkish unicode text" or it may be "email addresses" or it 
> may be "numbers" or it may be "arbitrary octet strings" or it may be "US 
> street addresses" or it may be "Swiss telephone directory entries".
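
To make the "small API implemented on a given data type" idea concrete, here is a minimal sketch. The class and method names are assumptions chosen for illustration and are not taken from the draft; DigitsCollator is a simplified stand-in for a numeric collator, not the registered definition.

```python
from abc import ABC, abstractmethod


class Collator(ABC):
    """Hypothetical collator interface: octets in, decisions out."""

    @abstractmethod
    def is_valid(self, octets: bytes) -> bool:
        """Say whether `octets` belongs to this collator's data type."""

    @abstractmethod
    def key(self, octets: bytes):
        """Return a sort key; only called on valid input."""

    def compare(self, a: bytes, b: bytes) -> int:
        """Return -1, 0 or 1, or raise if either input is out of domain."""
        if not (self.is_valid(a) and self.is_valid(b)):
            raise ValueError("input outside this collator's domain")
        ka, kb = self.key(a), self.key(b)
        return (ka > kb) - (ka < kb)


class OctetCollator(Collator):
    """Arbitrary octet strings: everything is valid, bytes compare as-is."""

    def is_valid(self, octets: bytes) -> bool:
        return True

    def key(self, octets: bytes):
        return octets


class DigitsCollator(Collator):
    """Simplified numeric collator: non-empty runs of ASCII digits only."""

    def is_valid(self, octets: bytes) -> bool:
        return octets.isdigit()

    def key(self, octets: bytes):
        return int(octets.decode("ascii"))


print(OctetCollator().compare(b"10", b"9"))   # byte-wise ordering
print(DigitsCollator().compare(b"10", b"9"))  # numeric ordering
```

The two collators receive the same octets and disagree on the order, which is the point: the data type, not strcmp(), decides.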
> 
>> No, my understanding would be that they have to parse the *character* 
>> string and work on the resulting value. (note that any serious 
>> collator has to in one way or another parse the string, e.g. to 
>> separate base letter, diacritics, and case, or whatever).
> 
> 
> If the collator gets a character string from the protocol, then a) the 
> protocol either cannot work on octets, or b) it has to weed out octet 
> strings that don't correspond to character strings before using the 
> collator.
> 
> B is a design violation in my view. It implies that something outside 
> the collator performs a duty specified by the collator.
> 
> A is useless to IMAP. IMAP clients _can_ work on non-text body parts.
> 
> My conclusion is that it's better to define collation in terms of the 
> octet and encoding.
> 
> [ about date-time collation ]
> 
>> Still, the input would be a character string, wouldn't it?
> 
> 
> It would be an octet string. Whether it also would be a character string 
> is open to discussion.
> 
> If we say collators get character strings, then the protocol has to know 
> the character encoding used. In the case of unicode/utf-8, the protocol 
> has to parse the octet string, make sure there are no illegal sequences, 
> make sure the octet string does not end in the middle of a utf-8 
> character, and only then can it give the character string to the 
> collator. For some implementations this is not an added burden, for 
> others it is.
> 
> If we say octet strings, then "illegal UTF-8 sequence" becomes the same 
> error as "non-digit in number" and "month > 12 in date-time" and so on. 
> I think that's an attractive regularity. In that case, the collator 
> defines what its legal input is, and the collator checks that its input 
> is legal.
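
A rough sketch of that regularity, where each collator checks its own octet input and every kind of bad input surfaces as one error type. The names InvalidInput, utf8_text_key, date_key and checked are hypothetical, the YYYY-MM-DD date form is an assumption, and returning the decoded string as a key is only a placeholder for real text collation.

```python
from datetime import date


class InvalidInput(ValueError):
    """Hypothetical single error type: 'not in this collator's domain'."""


def utf8_text_key(octets: bytes) -> str:
    """Text collator front end: reject illegal or truncated UTF-8."""
    try:
        return octets.decode("utf-8")  # strict decoding by default
    except UnicodeDecodeError as exc:
        raise InvalidInput("illegal UTF-8 sequence") from exc


def date_key(octets: bytes) -> date:
    """Date collator front end: reject anything that is not YYYY-MM-DD."""
    try:
        year, month, day = (int(p) for p in octets.decode("ascii").split("-"))
        return date(year, month, day)  # raises ValueError for month > 12
    except ValueError as exc:  # also covers UnicodeDecodeError
        raise InvalidInput("not a valid date") from exc


def checked(key, octets: bytes) -> bool:
    """The protocol side needs only one question: is this input legal?"""
    try:
        key(octets)
        return True
    except InvalidInput:
        return False


print(checked(utf8_text_key, b"\xe2\x82"))  # truncated UTF-8 character
print(checked(date_key, b"2005-13-01"))     # month > 12
print(checked(date_key, b"2005-09-22"))
```

The caller never needs to know which encoding or syntax a collator expects; it only hands over octets and handles one failure mode.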
> 
> Of course, implementations are free to optimise by converting/checking 
> input anywhere else. All this affects is the definition held by IANA.
> 
>> (internal date formats that are not character strings are usually 
>> constructed so that sorting is trivial, i.e. no parsing needed).
> 
> 
> Cyrus' example wasn't. He specifically mentioned using only the date 
> part of a date-time. String-based equality testing won't do that.
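
A minimal sketch of why string equality is not enough here. It assumes ISO 8601-style date-time input, and date_part is a hypothetical helper, not the definition of Cyrus' collator.

```python
from datetime import datetime


def date_part(octets: bytes):
    """Parse an ISO 8601-style date-time and keep only the calendar date."""
    return datetime.fromisoformat(octets.decode("ascii")).date()


a = b"2005-09-22T07:49:53"
b = b"2005-09-22T14:50:34"

same_string = a == b                      # octet-for-octet equality
same_date = date_part(a) == date_part(b)  # equality on the date part only
print(same_string, same_date)
```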
> 
>> I'm not sure I agree. It looks like an interesting generalization, but 
>> I don't think we need to go that far just to solve the i;octet issue. 
>> Also, it no longer covers the issue of using a numeric collator for 
>> cases such as XQuery, and even simple cases such as a Unix sort 
>> command (imagine that it would come with an option to specify a 
>> collator for a field).
> 
> 
> I don't understand what you're trying to say here.
> 
>> If you expand the model, there are a lot of other cases where formats 
>> may not match or there is a domain problem.
>>
>> Thinking about it a bit in the last few days, the i;octet collator's 
>> problem isn't the lack of domains, it's that there are two domains for 
>> it. As an example, consider a set of strings encoded in UTF-16-BE. 
>> Should i;octet be applied to the raw binary form, or should it be 
>> applied after converting to UTF-8? The latter results in a simple 
>> ordering by Unicode codepoint, the former doesn't.
> 
> 
> The former, because in the statement "compare these two strings using 
> i;octet" there is no implication that both strings are UTF-16-BE.
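
The two orderings under discussion can be seen with one BMP character and one supplementary character. This is only an illustration of the byte-order difference; the choice of sample characters is mine.

```python
# U+FF5E is a high BMP character; U+10000 is the first supplementary one.
strings = ["\uFF5E", "\U00010000"]

# Raw UTF-16-BE octets: U+10000 becomes the surrogate pair D8 00 DC 00,
# which sorts before FF 5E.
by_utf16be = sorted(strings, key=lambda s: s.encode("utf-16-be"))

# UTF-8 octets: EF BD 9E sorts before F0 90 80 80, matching codepoint order.
by_utf8 = sorted(strings, key=lambda s: s.encode("utf-8"))

by_codepoint = sorted(strings, key=lambda s: [ord(c) for c in s])

print(by_utf8 == by_codepoint)     # octet order on UTF-8 follows codepoints
print(by_utf16be == by_codepoint)  # octet order on UTF-16-BE does not
```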
> 
> Using an IMAP example, it's not unreasonable to say "find the messages 
> which have a bodypart whose first four bytes are 0xFF 0xD8 0xFF 0xE0". 
> To do that, we need a collator that does not assume its input to use any 
> particular encoding. i;octet is the natural candidate.
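
A sketch of that search as a pure octet operation. The function name, the sample bodyparts and the dictionary layout are hypothetical; the four bytes are the ones given above, which is how a JPEG/JFIF file typically begins.

```python
PREFIX = bytes([0xFF, 0xD8, 0xFF, 0xE0])


def octet_prefix_match(body: bytes, prefix: bytes) -> bool:
    """Compare leading octets only; no character encoding is assumed."""
    return body[:len(prefix)] == prefix


# Hypothetical bodyparts: one binary, one plain text.
bodyparts = {
    "photo": b"\xff\xd8\xff\xe0\x00\x10JFIF\x00",
    "note": b"Hello, world",
}

matches = [name for name, body in bodyparts.items()
           if octet_prefix_match(body, PREFIX)]
print(matches)

# The same octets are not valid UTF-8, so a collator that insisted on
# character strings could not even accept the search key.
try:
    PREFIX.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
print(decodable)
```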
> 
>> We definitely need a predefined (and hopefully easy to understand) 
>> name for the latter. If there is any protocol/format/language that 
>> needs the former, I think they should get it, but they should have to 
>> explicitly mention that, and they should be aware of the fact that 
>> they are committing a layer violation.
> 
> 
> I honestly don't see a layer violation.
> 
>> This is not just theory: Many implementations these days read in data 
>> and convert it to (their preferred form of) Unicode before doing 
>> anything else with it, and having to reconstruct the original octet 
>> sequence may be impossible or extremely annoying.
> 
> 
> I agree that many implementations convert all text to unicode on input; 
> I've written enough myself. (I happen to know that both the MUA and MTA 
> I use do that.) I do not think that all _input_ is converted to unicode. 
> i;octet is there for when we need to sort or compare data without 
> assuming that it's text.
> 
> For something like XQuery, it is my understanding that all its input may 
> be text. For a "protocol" which never operates on non-text, i;octet (and 
> other non-text collators) are out of scope. I suppose some unicode-hater 
> will define collators on e.g. GB18030. For an implementation that never 
> supports GB18030, such collators are also out of scope.
> 
> Right now, I believe it's very difficult to escape implementing i;octet. 
> I guess that needs changing.
> 
> Arnt
> 
> 
> 
>
Received on Thursday, 22 September 2005 14:50:34 UTC