- From: Arnt Gulbrandsen <arnt@gulbrandsen.priv.no>
- Date: Thu, 22 Sep 2005 14:48:43 +0200
- To: Martin Duerst <duerst@it.aoyama.ac.jp>
- Cc: Philip Guenther <guenther+collation@sendmail.com>, public-ietf-collation@w3.org
(I am amazed. Quick response on a collation question.)

Martin Duerst writes:
> At 17:24 05/09/22, Arnt Gulbrandsen wrote:
> >1. Collators should get octet strings from the protocol.
>
> Sorry, but this assumption isn't generally true. In XQuery, collations
> are always applied to (Unicode) character strings. This is somewhat
> due to the fact that XQuery isn't a protocol. But there is no need to
> restrict the use of collators to protocols.

I agree and disagree.

First, I was using the word protocol in the draft's sense, which is so
wide that I had protocol for lunch today ;)

Second, while others may indeed use collators, what RFCs define is
protocols. Anything else that can get a free ride is a (highly
desirable) added bonus. We can care about XQuery.

> Up to now, my assumption was that all collations operate on character
> strings,

(mine too)

> and that the 'octet' collation was either a bad name or the exception
> that proved the rule (until recently, I didn't get much of an answer
> on that from Chris).

i;octet, ascii-numeric and Cyrus' date collator (for sieve) persuade me
that this isn't so.

The very raison d'etre for a collator is that it is NOT strcmp(). The
collator draft/RFC defines a small API, and a collator is something that
implements that API on a given data type. That data type may be "Turkish
unicode text" or it may be "email addresses" or it may be "numbers" or
it may be "arbitrary octet strings" or it may be "US street addresses"
or it may be "Swiss telephone directory entries".

> No, my understanding would be that they have to parse the *character*
> string and work on the resulting value. (note that any serious
> collator has to in one way or another parse the string, e.g. to
> separate base letter, diacritics, and case, or whatever).

If the collator gets a character string from the protocol, then a) the
protocol either cannot work on octets, or b) it has to weed out octet
strings that don't correspond to character strings before using the
collator.

B is a design violation in my view. It implies that something outside
the collator performs a duty specified by the collator.

A is useless to IMAP. IMAP clients _can_ work on non-text body parts.

My conclusion is that it's better to define collation in terms of the
octet and encoding.

[ about date-time collation ]

> Still, the input would be a character string, wouldn't it?

It would be an octet string. Whether it also would be a character string
is open to discussion.

If we say collators get character strings, then the protocol has to know
the character encoding used. In the case of unicode/utf-8, the protocol
has to parse the octet string, make sure there are no illegal sequences,
make sure the octet string does not end in the middle of a utf-8
character, and only then can it give the character string to the
collator. For some implementations this is not an added burden, for
others it is.

If we say octet strings, then "illegal UTF-8 sequence" becomes the same
error as "non-digit in number" and "month > 12 in date-time" and so on.
I think that's an attractive regularity. In that case, the collator
defines what its legal input is, and the collator checks that its input
is legal.

Of course, implementations are free to optimise by converting/checking
input anywhere else. All this affects is the definition held by IANA.

> (internal date formats that are not character strings are usually
> constructed so that sorting is trivial, i.e. no parsing needed).

Cyrus' example wasn't. He specifically mentioned using only the date
part of a date-time. String-based equality testing won't do that.

> I'm not sure I agree. It looks like an interesting generalization, but
> I don't think we need to go that far just to solve the i;octet issue.
> Also, it no longer covers the issue of using a numeric collator for
> cases such as XQuery, and even simple cases such as a Unix sort
> command (imagine that it would come with an option to specify a
> collator for a field).

I don't understand what you're trying to say here.

> If you expand the model, there are a lot of other cases where formats
> may not match or there is a domain problem.
>
> Thinking about it a bit in the last few days, the i;octet collator's
> problem isn't the lack of domains, it's that there are two domains
> for it. As an example, consider a set of strings encoded in
> UTF-16-BE. Should i;octet be applied to the raw binary form, or
> should it be applied after converting to UTF-8. The latter results in
> a simple ordering by Unicode codepoint, the former doesn't.

The former, because in the statement "compare these two strings using
i;octet" there is no implication that both strings are UTF-16-BE.

Using an IMAP example, it's not unreasonable to say "find the messages
which have a bodypart whose first four bytes are 0xFF 0xD8 0xFF 0xE0".
To do that, we need a collator that does not assume its input to use any
particular encoding. i;octet is the natural candidate.

> We definitely need a predefined (and hopefully easy to understand)
> name for the latter. If there is any protocol/format/language that
> needs the former, I think they should get it, but they should have to
> explicitly mention that, and they should be aware of the fact that
> they are committing a layer violation.

I honestly don't see a layer violation.

> This is not just theory: Many implementations these days read in data
> and convert it to (their preferred form of) Unicode before doing
> anything else with it, and having to reconstruct the original octet
> sequence may be impossible or extremely annoying.

I agree that many implementations convert all text to unicode on input;
I've written enough myself. (I happen to know that both the MUA and MTA
I use do that.)

I do not think that all _input_ is converted to unicode. i;octet is
there for when we need to sort or compare data without assuming that
it's text.

For something like XQuery, it is my understanding that all its input may
be text. For a "protocol" which never operates on non-text, i;octet (and
other non-text collators) are out of scope.

I suppose some unicode-hater will define collators on e.g. GB18030. For
an implementation that never supports GB18030, such collators are also
out of scope.

Right now, I believe it's very difficult to escape implementing i;octet.
I guess that needs changing.

Arnt
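
A minimal sketch of the octet-string model argued for in this message, in Python. It is an illustration only, not the API from the collation draft: the names `CollationError`, `octet_key`, `utf8_codepoint_key`, `strict_numeric_key` and `compare` are all hypothetical, and the strict numeric collator here is an assumption made for the example, not the registered ascii-numeric definition. Each collator takes octets, defines its own legal input, and reports illegal input through one shared error.

```python
class CollationError(ValueError):
    """Hypothetical error: an operand is outside the collator's domain."""


def octet_key(data: bytes) -> bytes:
    # i;octet-style collator: every octet string is legal input and
    # no character encoding is assumed.
    return data


def utf8_codepoint_key(data: bytes) -> str:
    # Hypothetical text collator: the collator itself, not the protocol,
    # rejects illegal or truncated UTF-8.
    try:
        return data.decode("utf-8")  # Python compares str by code point
    except UnicodeDecodeError as exc:
        raise CollationError("illegal UTF-8 sequence") from exc


def strict_numeric_key(data: bytes) -> int:
    # Hypothetical numeric collator: only ASCII digits are legal input.
    if not data.isdigit():
        raise CollationError("non-digit in number")
    return int(data)


def compare(key, a: bytes, b: bytes) -> int:
    # Returns -1, 0 or 1; raises CollationError on illegal input.
    ka, kb = key(a), key(b)
    return (ka > kb) - (ka < kb)


# The IMAP example: test the first four octets of a body part.
JPEG_MAGIC = bytes([0xFF, 0xD8, 0xFF, 0xE0])


def starts_with_jpeg_magic(body: bytes) -> bool:
    return compare(octet_key, body[:4], JPEG_MAGIC) == 0
```

Under these assumptions, the UTF-16-BE point from the quoted text can be checked directly: comparing U+FF5E and U+10000 with `octet_key` on their UTF-16-BE octets gives the opposite order from comparing their UTF-8 octets, and only the UTF-8 result matches code point order.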
Received on Thursday, 22 September 2005 12:53:31 UTC