From: Philip Guenther <guenther+collation@sendmail.com>
Date: Mon, 17 Oct 2005 02:30:48 -0700
To: Arnt Gulbrandsen <arnt@gulbrandsen.priv.no>
Cc: Martin Duerst <duerst@it.aoyama.ac.jp>, public-ietf-collation@w3.org
Arnt Gulbrandsen <arnt@gulbrandsen.priv.no> writes:
>Martin Duerst writes:
>> At 17:24 05/09/22, Arnt Gulbrandsen wrote:
>> >1. Collators should get octet strings from the protocol.
...
>> Up to now, my assumption was that all collations operate on character
>> strings,
>
>(mine too)
>
>> and that the 'octet' collation was either a bad name or the exception
>> that proved the rule (until recently, I didn't get much of an answer
>> on that from Chris).
>
>i-octet, ascii-numeric and Cyrus' date collator (for sieve) persuade me
>that this isn't so. The very raison d'etre for a collator is that it is
>NOT strcmp(). The collator draft/RFC defines a small API, and a
>collator is something that implements that API on a given data type.
>That data type may be "Turkish unicode text" or it may be "email
>addresses" or it may be "numbers" or it may be "arbitrary octet
>strings" or it may be "US street addresses" or it may be "Swiss
>telephone directory entries".

Yes and no.  With the possible exception of i;octet, all the collations
mentioned or documented operate on character strings, in the sense that
the charset of the input data must be taken into account.  For example,
the i;ascii-numeric comparator should treat the three-character string
"127" as equal to the number 127 whether the string was originally
encoded as 3 octets in US-ASCII, 6 octets in UTF-16BE, or 16 octets in
UTF-32 (w/BOM).  In addition, IMHO, that comparator should not fail on
any input that is a valid character string with at least one leading
digit, even if the succeeding text cannot all be mapped into US-ASCII.
I can't tell whether that's true with the current text of section 9.1.1.
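To make the charset-independence point concrete, here is a minimal
sketch (in Python; the function name is illustrative and taken from no
spec) of the i;ascii-numeric behaviour argued for above:

```python
# Hypothetical sketch of the i;ascii-numeric behaviour described above:
# the comparator sees abstract characters, so the original encoding of
# the string (US-ASCII, UTF-16BE, UTF-32, ...) must not matter, and
# trailing non-US-ASCII text must not cause a failure.

def ascii_numeric_key(s: str):
    """Return the value of the leading US-ASCII digit prefix of s,
    or None when there is no leading digit."""
    i = 0
    while i < len(s) and s[i] in "0123456789":
        i += 1
    return int(s[:i]) if i else None

# The same character string decoded from three different encodings
# yields the same key: 3 octets of US-ASCII, 6 octets of UTF-16BE,
# or 16 octets of UTF-32 with a BOM all decode to the string "127".
samples = [(b"127", "us-ascii"),
           (b"\x001\x002\x007", "utf-16-be"),
           (b"\xff\xfe\x00\x00" + "127".encode("utf-32-le"), "utf-32")]
for raw, enc in samples:
    assert ascii_numeric_key(raw.decode(enc)) == 127

# A valid character string with a leading digit but non-US-ASCII
# trailing text still compares, rather than failing:
assert ascii_numeric_key("127\u00e9") == 127
```

The sketch deliberately does not accept non-ASCII digit forms, matching
the parenthetical point about i;ascii-numeric below.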
(Note that I _don't_ think that i;ascii-numeric should map the various
digit forms in Unicode down to the US-ASCII digits before performing
its comparison, mainly because such a definition would conflict with
all current practice for the comparator, but also because a comparator
that operated correctly on all the digits in Unicode (is bidi
processing needed too?) should have a name like "i;basic-numeric".)

So, for most comparators, there's an obvious and critical charset
handling step that must take place on the protocol side of the picture.
It takes place over there because how the charset is specified for any
given datum is protocol-specific, and the protocol needs to specify the
handling for data where the decoding/charset-mapping step fails.  For
example, the I-D of the Sieve revision (draft-ietf-sieve-3028bis-04)
says:

   Comparisons are performed in Unicode.  Implementations convert text
   from header fields in all charsets [MIME3] to Unicode as input to
   the comparator (see 2.7.3).  Implementations MUST be capable of
   converting US-ASCII, ISO-8859-1, the US-ASCII subset of ISO-8859-*
   character sets, and UTF-8.  Text that the implementation cannot
   convert to Unicode for any reason MAY be treated as plain US-ASCII
   (including any [MIME3] syntax) or processed according to local
   conventions.  An encoded NUL octet (character zero) SHOULD NOT cause
   early termination of the header content being compared against.

While the first sentence of that may be contentious (it used to say
they're performed in UTF-8), I think it's clear that the description of
the decoding, the suggested handling of conversion failure, and the
list of mandatory-to-implement charsets all belong here, in the
"protocol" spec, and not in the generic comparator doc, and certainly
not in the individual comparator descriptions.  With the exception of
i;octet, the comparator definitions can all be written to be
encoding-agnostic, operating on Unicode codepoints.
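A minimal sketch of that protocol-side decoding step, assuming the
fallback policy the quoted draft permits (the function name is
illustrative, not from any spec):

```python
# Hypothetical sketch of the protocol-side conversion step described
# above: decode octets to Unicode before any comparator other than
# i;octet sees them, with a fallback modeled on the draft's "MAY be
# treated as plain US-ASCII" language.

def decode_for_comparator(raw: bytes, charset: str) -> str:
    """Convert protocol octets in the named charset to Unicode."""
    try:
        text = raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Conversion failed (unknown charset or invalid octets): fall
        # back to plain US-ASCII, replacing undecodable octets rather
        # than aborting the comparison.
        text = raw.decode("ascii", errors="replace")
    # An encoded NUL SHOULD NOT cause early termination, so NULs are
    # deliberately left in place rather than truncated at.
    return text
```

Everything here is protocol policy, which is why it belongs in the
protocol spec rather than in the comparator definitions.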
Indeed, the basic and nameprep collations already are specified that
way.

The trick with i;octet is that we want to be able to use it on
non-textual data, where there's no charset by which the octets can be
interpreted.  The protocol may still need to specify that the octet
string being matched against is the result of decoding an encoding
(say, the Content-Transfer-Encoding of a MIME part), but it's
definitely providing an octet stream sans charset.

To summarize, I think the following are clear requirements:

- for protocols that support matching against non-textual data (e.g.,
  Sieve 'body' extension, IMAP w/COMPARATOR):
  - the i;octet comparator MUST perform the obvious direct operation
  - the protocol profile MUST specify what decodings are performed to
    get the octets to match against (e.g., C-T-E decoding)
- for all protocols, when matching against textual data with a
  comparator _other_ than i;octet:
  - charsets must be converted to Unicode
  - the protocol profile must specify what charsets are mandatory to
    implement

I'm 99% sure that non-textual data should be skipped/never-matched when
using a comparator other than i;octet.  Performing such a comparison
would implicitly require interpreting the non-textual data as text in
some charset, which just doesn't have any real utility, IMHO.

What isn't obvious to me is how i;octet should behave with textual
data.  I can see two choices:

1) convert other charsets to Unicode, then match against the UTF-8
   encoding of that result
2) do no charset conversion; match against the raw (decoded) octets

I believe the existing practice in Sieve and ACAP is choice (1).  IMHO,
choice (2) is more consistent by virtue of clearly setting i;octet at a
level below charsets, but doing so leaves no way to perform a
comparison where charsets are converted but no canonicalization of the
characters is done.

>> No, my understanding would be that they have to parse the *character*
>> string and work on the resulting value.
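The two i;octet choices above can be sketched side by side (function
names are illustrative only, not from any spec):

```python
# Hypothetical sketch contrasting the two possible i;octet behaviours
# for textual data discussed above.

def i_octet_choice1(raw: bytes, charset: str, key: bytes) -> bool:
    """Choice (1): convert the charset to Unicode, then match against
    the UTF-8 encoding of the result (existing Sieve/ACAP practice)."""
    return raw.decode(charset).encode("utf-8") == key

def i_octet_choice2(raw: bytes, charset: str, key: bytes) -> bool:
    """Choice (2): ignore the charset; match the raw (decoded) octets,
    putting i;octet at a level below charsets."""
    return raw == key

# For non-UTF-8 input the two choices disagree:
latin1 = "caf\u00e9".encode("latin-1")   # b'caf\xe9'
utf8 = "caf\u00e9".encode("utf-8")       # b'caf\xc3\xa9'
assert i_octet_choice1(latin1, "latin-1", utf8)      # converts, then matches
assert not i_octet_choice2(latin1, "latin-1", utf8)  # raw octets differ
assert i_octet_choice2(latin1, "latin-1", latin1)    # raw octets match
```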
>> (note that any serious collator has to in one way or another parse
>> the string, e.g. to separate base letter, diacritics, and case, or
>> whatever).
>
>If the collator gets a character string from the protocol, then a) the
>protocol either cannot work on octets, or b) it has to weed out octet
>strings that don't correspond to character strings before using the
>collator.
>
>B is a design violation in my view. It implies that something outside
>the collator performs a duty specified by the collator.

I see three levels of consideration.  From lowest to highest:

- transfer decoding.  This generates a string of octets and is
  completely protocol-specific, including the specification of how
  invalid encodings (e.g., a '$' in a base64 encoded part) should be
  handled.  This step is trivial for ACAP, as it's binary-clean, but
  IMAP and Sieve have to consider Content-Transfer-Encoding and
  RFC 2047 tokens here.

- charset (encoding) conversion.  Again, this is protocol-specific.
  For example, _everything_ in ACAP is in UTF-8, while IMAP and Sieve
  have to pull the charset from the Content-Type MIME field or from
  RFC 2047 tokens.  The protocol should specify what to do if the
  octet string isn't actually valid in the charset encoding it claims
  to be in (e.g., octet 0x80 in UTF-8).

- parsing of strings.  This is specific to the comparator.
  i;ascii-numeric looks at numeric prefixes, while a date comparator
  would perform some sort of parse (ISO 8601?  RFC 2822?  RFC 3339?)
  to determine what date is represented by the string.

Philip Guenther
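[Editorial note: the three levels described in the message compose
naturally as a pipeline.  A minimal sketch, with illustrative function
names and one possible policy choice at each protocol-specific level:]

```python
# Hypothetical sketch of the three levels as a composed pipeline.
import base64

def transfer_decode(body: bytes, cte: str) -> bytes:
    """Level 1: transfer decoding, protocol-specific.  Python's
    b64decode with validate=False (the default) discards invalid
    characters such as '$' -- one possible protocol choice."""
    return base64.b64decode(body) if cte == "base64" else body

def charset_decode(octets: bytes, charset: str) -> str:
    """Level 2: charset conversion to Unicode, protocol-specific,
    including the policy for invalid octets (here: replacement, so
    0x80 in UTF-8 does not abort the match)."""
    return octets.decode(charset, errors="replace")

def ascii_numeric_parse(text: str):
    """Level 3: comparator-specific parsing; i;ascii-numeric reads
    the leading US-ASCII digit prefix."""
    i = 0
    while i < len(text) and text[i] in "0123456789":
        i += 1
    return int(text[:i]) if i else None

# "MTI3" is base64 for the octets b"127"; the three levels compose:
value = ascii_numeric_parse(
    charset_decode(transfer_decode(b"MTI3", "base64"), "us-ascii"))
assert value == 127
```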
Received on Monday, 17 October 2005 09:31:27 UTC