- From: Mark Davis <mark.davis@icu-project.org>
- Date: Thu, 22 Sep 2005 07:49:53 -0700
- To: Arnt Gulbrandsen <arnt@gulbrandsen.priv.no>
- CC: Martin Duerst <duerst@it.aoyama.ac.jp>, Philip Guenther <guenther+collation@sendmail.com>, public-ietf-collation@w3.org
> First, I was using the word protocol in the draft's sense, which is so
> wide that I had protocol for lunch today ;)

If so, then it needs to be clear from the text that 'protocol' is using
such a broad sense; examples would help.

On the issue of charsets: the biggest problem currently with using
charsets other than Unicode is that common IANA identifiers for charsets
are ambiguous (see http://www.w3.org/TR/japanese-xml/#ambiguity_of_yen).
So two implementations that think they are getting the same results from
comparison may not be, because they are using variant mapping tables.

Arnt Gulbrandsen wrote:

> (I am amazed. Quick response on a collation question.)
>
> Martin Duerst writes:
>
>> At 17:24 05/09/22, Arnt Gulbrandsen wrote:
>> >1. Collators should get octet strings from the protocol.
>>
>> Sorry, but this assumption isn't generally true. In XQuery, collations
>> are always applied to (Unicode) character strings. This is somewhat
>> due to the fact that XQuery isn't a protocol. But there is no need to
>> restrict the use of collators to protocols.
>
> I agree and disagree.
>
> First, I was using the word protocol in the draft's sense, which is so
> wide that I had protocol for lunch today ;)

If so, then it needs to be clear from the text that 'protocol' is using
such a broad sense; examples would help.

> Second, while others may indeed use collators, what RFCs define is
> protocols. Anything else that can get a free ride is a (highly
> desirable) added bonus. We can care about XQuery.
>
>> Up to now, my assumption was that all collations operate on character
>> strings,
>
> (mine too)
>
>> and that the 'octet' collation was either a bad name or the exception
>> that proved the rule (until recently, I didn't get much of an answer
>> on that from Chris).
>
> i-octet, ascii-numeric and Cyrus' date collator (for sieve) persuade me
> that this isn't so. The very raison d'etre for a collator is that it is
> NOT strcmp().
> The collator draft/RFC defines a small API, and a collator is
> something that implements that API on a given data type. That data
> type may be "Turkish unicode text" or it may be "email addresses" or it
> may be "numbers" or it may be "arbitrary octet strings" or it may be "US
> street addresses" or it may be "Swiss telephone directory entries".
>
>> No, my understanding would be that they have to parse the *character*
>> string and work on the resulting value. (note that any serious
>> collator has to in one way or another parse the string, e.g. to
>> separate base letter, diacritics, and case, or whatever).
>
> If the collator gets a character string from the protocol, then a) the
> protocol either cannot work on octets, or b) it has to weed out octet
> strings that don't correspond to character strings before using the
> collator.
>
> B is a design violation in my view. It implies that something outside
> the collator performs a duty specified by the collator.
>
> A is useless to IMAP. IMAP clients _can_ work on non-text body parts.
>
> My conclusion is that it's better to define collation in terms of the
> octet and encoding.
>
> [ about date-time collation ]
>
>> Still, the input would be a character string, wouldn't it?
>
> It would be an octet string. Whether it also would be a character string
> is open to discussion.
>
> If we say collators get character strings, then the protocol has to know
> the character encoding used. In the case of unicode/utf-8, the protocol
> has to parse the octet string, make sure there are no illegal sequences,
> make sure the octet string does not end in the middle of a utf-8
> character, and only then can it give the character string to the
> collator. For some implementations this is not an added burden, for
> others it is.
>
> If we say octet strings, then "illegal UTF-8 sequence" becomes the same
> error as "non-digit in number" and "month > 12 in date-time" and so on.
> I think that's an attractive regularity.
> In that case, the collator defines what its legal input is, and the
> collator checks that its input is legal.
>
> Of course, implementations are free to optimise by converting/checking
> input anywhere else. All this affects is the definition held by IANA.
>
>> (internal date formats that are not character strings are usually
>> constructed so that sorting is trivial, i.e. no parsing needed).
>
> Cyrus' example wasn't. He specifically mentioned using only the date
> part of a date-time. String-based equality testing won't do that.
>
>> I'm not sure I agree. It looks like an interesting generalization, but
>> I don't think we need to go that far just to solve the i;octet issue.
>> Also, it no longer covers the issue of using a numeric collator for
>> cases such as XQuery, and even simple cases such as a Unix sort
>> command (imagine that it would come with an option to specify a
>> collator for a field).
>
> I don't understand what you're trying to say here.
>
>> If you expand the model, there are a lot of other cases where formats
>> may not match or there is a domain problem.
>>
>> Thinking about it a bit in the last few days, the i;octet collator's
>> problem isn't the lack of domains, it's that there are two domains for
>> it. As an example, consider a set of strings encoded in UTF-16-BE.
>> Should i;octet be applied to the raw binary form, or should it be
>> applied after converting to UTF-8? The latter results in a simple
>> ordering by Unicode codepoint, the former doesn't.
>
> The former, because in the statement "compare these two strings using
> i;octet" there is no implication that both strings are UTF-16-BE.
>
> Using an IMAP example, it's not unreasonable to say "find the messages
> which have a bodypart whose first four bytes are 0xFF 0xD8 0xFF 0xE0".
> To do that, we need a collator that does not assume its input to use any
> particular encoding. i;octet is the natural candidate.
>
>> We definitely need a predefined (and hopefully easy to understand)
>> name for the latter. If there is any protocol/format/language that
>> needs the former, I think they should get it, but they should have to
>> explicitly mention that, and they should be aware of the fact that
>> they are committing a layer violation.
>
> I honestly don't see a layer violation.
>
>> This is not just theory: Many implementations these days read in data
>> and convert it to (their preferred form of) Unicode before doing
>> anything else with it, and having to reconstruct the original octet
>> sequence may be impossible or extremely annoying.
>
> I agree that many implementations convert all text to unicode on input;
> I've written enough myself. (I happen to know that both the MUA and MTA
> I use do that.) I do not think that all _input_ is converted to unicode.
> i;octet is there for when we need to sort or compare data without
> assuming that it's text.
>
> For something like XQuery, it is my understanding that all its input may
> be text. For a "protocol" which never operates on non-text, i;octet (and
> other non-text collators) are out of scope. I suppose some unicode-hater
> will define collators on e.g. GB18030. For an implementation that never
> supports GB18030, such collators are also out of scope.
>
> Right now, I believe it's very difficult to escape implementing i;octet.
> I guess that needs changing.
>
> Arnt
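
The two technical points in the quoted exchange (a collator that takes octet strings and checks its own input, and the two possible domains for i;octet on UTF-16-BE data) can be illustrated with a minimal Python sketch. It is not taken from the collation draft: the names InvalidInput, octet_compare, utf8_codepoint_compare and numeric_compare are hypothetical, and the -1/0/1 return convention is an assumption made for the illustration.

```python
class InvalidInput(ValueError):
    """Hypothetical error: the octets are outside the collator's domain."""


def _sign(a, b) -> int:
    # Three-way comparison result: -1, 0 or 1.
    return (a > b) - (a < b)


def octet_compare(a: bytes, b: bytes) -> int:
    # In the spirit of i;octet: every octet string is legal input and
    # the comparison is bytewise, with no encoding assumed.
    return _sign(a, b)


def utf8_codepoint_compare(a: bytes, b: bytes) -> int:
    # The collator itself rejects illegal or truncated UTF-8 sequences.
    try:
        sa, sb = a.decode("utf-8"), b.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise InvalidInput("illegal UTF-8 sequence") from exc
    # Python compares str values by code point.
    return _sign(sa, sb)


def numeric_compare(a: bytes, b: bytes) -> int:
    # "Non-digit in number" is reported as the same kind of error.
    if not (a.isdigit() and b.isdigit()):
        raise InvalidInput("non-digit in number")
    return _sign(int(a), int(b))


# The "two domains" question: U+FF5E versus U+10000.
x, y = "\uff5e", "\U00010000"
raw_utf16be = octet_compare(x.encode("utf-16-be"), y.encode("utf-16-be"))
via_utf8 = octet_compare(x.encode("utf-8"), y.encode("utf-8"))
```

Under these assumptions raw_utf16be is 1 and via_utf8 is -1: the raw UTF-16-BE octets put the surrogate pair for U+10000 (0xD8 0x00 ...) before U+FF5E (0xFF 0x5E), while the UTF-8 octets follow code point order. The same bytewise octet_compare also covers the IMAP case of matching a body part's first four bytes against 0xFF 0xD8 0xFF 0xE0.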
Received on Thursday, 22 September 2005 14:50:34 UTC