- From: Philip Guenther <guenther+collation@sendmail.com>
- Date: Wed, 21 Sep 2005 22:15:20 -0700
- To: public-ietf-collation@w3.org
[The following is rather rambling and mostly centers around the problems created by mixing 'characters' in the abstract with the "i;octet" collation. This all makes my head hurt, so please forgive its fragmented nature. Those who saw an earlier version of this message will find the first paragraph much clearer now.]

If I'm following the history of this document correctly, it appears that the character strings referred to throughout the document are all assumed to be in Unicode. Prior to revision -02, everything was in UTF-8, so that was clearly the intent then. At the least, it isn't actually specified what input range a collation is required to define its behavior on, or what the default handling is for characters which are not covered by the collation's definition.

This problem can be seen in the definition of "en;ascii-casemap". Is it an error to supply characters outside the US-ASCII charset to that collation? It seems to be defined as operating on octets instead of characters, given how it's defined in terms of the "i;octet" collation.

Similarly, the fallback to "i;octet" when sorting invalid strings (section 4.3, paragraph 2) doesn't really make sense when sorting is operating on characters instead of octets. This is especially confusing when section 5.7 says that it's protocol dependent whether "i;octet" is supported.

Speaking of 5.7, it needs to say that _if_ a protocol permits the use of the "i;octet" collation, then the protocol must specify how the octets to be compared are produced. While protocols that only operate on textual data can simply say "encode characters in UTF-8", protocols that can match against non-textual data may want to match against the 'raw octets' of the data, as seen in the IMAP example in section 4.3 of draft-ietf-imapext-i18n-05. Or is the model that the octets of the GIF are converted (interpreted?) as characters so that they can be fed to the collation...which then converts (reinterprets?) them back as octets?
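
To make the octets-versus-characters distinction concrete, here is a rough Python sketch. The function names are invented for illustration and do not come from the draft. It assumes that "i;octet" is a plain octet-by-octet comparison, that a text-only protocol produces octets by encoding in UTF-8, and that the casemap collation folds a-z to A-Z octet by octet before applying "i;octet".

```python
def octet_compare(a: bytes, b: bytes) -> int:
    """Assumed "i;octet" behavior: octet-by-octet ordering, giving -1, 0, or 1."""
    return (a > b) - (a < b)


def text_to_octets(s: str) -> bytes:
    """One rule a text-only protocol might pick: encode characters as UTF-8."""
    return s.encode("utf-8")


def ascii_casemap_octets(data: bytes) -> bytes:
    """Assumed casemap step: fold octets 0x61-0x7A (a-z) to A-Z, leave the rest alone."""
    return bytes(b - 0x20 if 0x61 <= b <= 0x7A else b for b in data)


def ascii_casemap_compare(a: bytes, b: bytes) -> int:
    """Casemap collation expressed purely in terms of the octet comparison."""
    return octet_compare(ascii_casemap_octets(a), ascii_casemap_octets(b))


# Textual data: the protocol has to say how characters become octets.
text_result = ascii_casemap_compare(text_to_octets("Hello"), text_to_octets("HELLO"))

# Non-textual data: raw octets go in directly, with no character step at all.
raw_result = ascii_casemap_compare(b"GIF89a", b"gif89A")

# Non-ASCII input is neither rejected nor folded; it just passes through.
non_ascii_result = ascii_casemap_compare(text_to_octets("É"), text_to_octets("é"))
```

Under those assumptions the first two comparisons report a match, while "É" and "é" do not match, because their UTF-8 octets fall outside the folded range. Nothing in the sketch ever needs a character, which is exactly why the open questions are whether non-ASCII input is an error and who decides how the octets are produced.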

I hit this trying to figure out what the sieve 'body' extension needs to say about matching against non-textual parts. Both there and in IMAP SEARCH, it can be argued that searches against non-textual parts should only match when the "i;octet" collation is being used. Other collations require characters, and the octets of, for example, an image/jpeg part have no encoding to map them to characters.

Actually, one could argue that "i;octet" should _only_ be used on non-textual data and that the trivial comparison of characters should be called something like "i;trivial", because it isn't comparing octets, but that's a significant break with the past.

Section 3.1 should specify that collation names are case-sensitive, or at least I've been _assuming_ they are.

Philip Guenther
Received on Thursday, 22 September 2005 05:15:50 UTC