Re: comments on draft-newman-i18n-comparator-05.txt

Arnt Gulbrandsen <arnt@gulbrandsen.priv.no> writes:
>Martin Duerst writes:
>> At 17:24 05/09/22, Arnt Gulbrandsen wrote:
>> >1. Collators should get octet strings from the protocol.
...
>> Up to now, my assumption was that all collations operate on character 
>> strings,
>
>(mine too)
>
>> and that the 'octet' collation was either a bad name or the exception 
>> that proved the rule (until recently, I didn't get much of an answer 
>> on that from Chris).
>
>i-octet, ascii-numeric and Cyrus' date collator (for sieve) persuade me 
>that this isn't so. The very raison d'etre for a collator is that it is 
>NOT strcmp(). The collator draft/RFC defines a small API, and a 
>collator is something that implements that API on a given data type. 
>That data type may be "Turkish unicode text" or it may be "email 
>addresses" or it may be "numbers" or it may be "arbitrary octet 
>strings" or it may be "US street addresses" or it may be "Swiss 
>telephone directory entries".

Yes and no.  With the possible exception of i;octet, all the collations
mentioned or documented operate on character strings, in the sense that
the charset of the input data must be taken into account.

For example, the i;ascii-numeric comparator should treat the
three-character string "127" as equal to the number 127 whether the
string was originally encoded as 3 octets in US-ASCII, 6 octets in
UTF-16BE, or 16 octets in UTF-32 (w/BOM).  In addition, IMHO, that
comparator should not fail on any input that was a valid character
string with at least one leading digit, even if the succeeding text
cannot all be mapped into US-ASCII.  I can't tell whether that's true
with the current text of section 9.1.1.
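
To make that concrete, here's a rough sketch (Python, with made-up
names) of what I have in mind once the protocol side has already
decoded the datum to a Unicode string; the "no leading digit means
positive infinity" rule is my assumption about the intended 9.1.1
semantics, not a quote from the draft:

    import re

    # Sketch only: the protocol side has already decoded the raw octets
    # (whatever the original charset was) into a Python str.
    _LEADING_ASCII_DIGITS = re.compile(r'[0-9]+')

    def ascii_numeric_value(s):
        # Strings with no leading US-ASCII digit are treated as
        # "positive infinity" -- an assumption, not draft text.
        m = _LEADING_ASCII_DIGITS.match(s)
        return int(m.group(0)) if m else float('inf')

    def ascii_numeric_equal(a, b):
        return ascii_numeric_value(a) == ascii_numeric_value(b)

    # "127" compares equal no matter how the string was originally
    # encoded, and text following the digits doesn't cause a failure.
    assert ascii_numeric_equal("127", "0127 apples")
    assert ascii_numeric_equal("127", "127\u00e9 and more")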

(Note that I _don't_ think that i;ascii-numeric should map the various
digit forms in Unicode down to the US-ASCII digits before performing its
comparison, mainly because such a definition would conflict with all
current practice for the comparator, but also because a comparator that
operated correctly on all the digits in Unicode (is bidi processing
needed too?) should have a name like "i;basic-numeric".)


So, for most comparators, there's an obvious and critical charset
handling step that must take place on the protocol side of the picture.
It takes place over there because how the charset is specified for any
given datum is protocol-specific, and because the protocol needs to
specify the handling for data where the decoding/charset-mapping step
fails.  For example, the I-D of the Sieve revision
(draft-ietf-sieve-3028bis-04) says:
      Comparisons are performed in Unicode.  Implementations convert
      text from header fields in all charsets [MIME3] to Unicode as
      input to the comparator (see 2.7.3).  Implementations MUST be
      capable of converting US-ASCII, ISO-8859-1, the US-ASCII subset of
      ISO-8859-* character sets, and UTF-8.  Text that the
      implementation cannot convert to Unicode for any reason MAY be
      treated as plain US-ASCII (including any [MIME3] syntax) or
      processed according to local conventions.  An encoded NUL octet
      (character zero) SHOULD NOT cause early termination of the header
      content being compared against.

While the first sentence of that may be contentious (it used to say
they're performed in UTF-8), I think it's clear that the description of
the decoding, the suggested handling of conversion failure, and the list
of mandatory-to-implement charsets all belong here, in the "protocol"
spec, and not in the generic comparator doc and certainly not in the
individual comparator descriptions.
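
As a concrete (if simplified) picture of that protocol-side step,
here's a Python sketch of the conversion a Sieve-like implementation
might do before invoking any comparator; the fallback policy is just
one of the MAYs from the quoted text, and the function name is mine:

    # Minimal sketch: decode header octets labelled with some charset
    # into Unicode; if the charset is unknown or the octets are invalid
    # in it, fall back to treating the text as plain US-ASCII (one of
    # the options the quoted Sieve text permits).
    def header_text_to_unicode(raw_octets, charset):
        try:
            return raw_octets.decode(charset)
        except (LookupError, UnicodeDecodeError):
            return raw_octets.decode('ascii', errors='replace')

    # The comparator itself never sees a charset; it just gets the str.
    print(header_text_to_unicode(b'caf\xc3\xa9', 'utf-8'))     # cafe with e-acute
    print(header_text_to_unicode(b'caf\xe9', 'x-unknown-cs'))  # caf + replacement char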


With the exception of i;octet, the comparator definitions can all be
written to be encoding-agnostic, operating on Unicode codepoints.
Indeed, the basic and nameprep collations already are specified that
way.  The trick with i;octet is that we want to be able to use it on
non-textual data, where there's no charset by which the octets can be
interpreted.  The protocol may still need to specify that the octet
string being matched against is the result of decoding an encoding (say,
the Content-Transfer-Encoding of a MIME part), but it's definitely
providing an octet stream sans charset.
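
Under that reading, i;octet itself is about as simple as it gets; a
Python sketch (illustrative only, applied after whatever transfer
decoding the protocol calls for):

    # i;octet as a plain octet-string operation: no charset is ever
    # consulted, so it works identically on text and on binary data.
    def i_octet_compare(a, b):
        """Order two octet strings; return -1, 0, or +1."""
        if a == b:
            return 0
        return -1 if a < b else 1   # Python bytes compare octet-by-octet

    assert i_octet_compare(b'\x00\xff', b'\x00\xff') == 0
    assert i_octet_compare(b'abc', b'abd') < 0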


To summarize, I think the following are clear requirements:

- for protocols that support matching against non-textual data (e.g.,
  Sieve 'body' extension, IMAP w/COMPARATOR):
   - the i;octet comparator MUST perform the obvious direct operation
   - the protocol profile MUST specify what decodings are performed to
     get the octets to match against (e.g., C-T-E decoding)
- for all protocols, when matching against textual data with a
  comparator _other_ than i;octet
   - charsets must be converted to Unicode
   - the protocol profile must specify what charsets are mandatory to
     implement

I'm 99% sure that non-textual data should be skipped/never-matched when
using a comparator other than i;octet.  Performing such a comparison
would implicitly require interpreting the non-textual data as being text
in some charset, which just doesn't have any real utility, IMHO.
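
In code, the dispatch I'm arguing for comes out something like this
(Python sketch; the function name and the "None means never matches"
convention are mine, not the draft's):

    # Protocol-side decision about what, if anything, a comparator sees.
    def datum_for_comparator(comparator_name, datum_octets, charset):
        """Return the value to hand to the comparator, or None for
        "skip / never matches".  charset is None for non-textual data."""
        if charset is None:
            # Non-textual data: i;octet gets the raw octets, anything
            # else never matches.
            return datum_octets if comparator_name == 'i;octet' else None
        # Textual data: the charset conversion happens here, on the
        # protocol side, before the comparator sees anything.  (What
        # i;octet should get for textual data is the open question
        # discussed just below.)
        return datum_octets.decode(charset)

    # A JPEG body part is only ever matched by i;octet:
    assert datum_for_comparator('i;ascii-casemap', b'\xff\xd8\xff', None) is None
    assert datum_for_comparator('i;octet', b'\xff\xd8\xff', None) == b'\xff\xd8\xff'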

What isn't obvious to me is how i;octet should behave with textual
data.  I can see two choices:
1) convert other charsets to Unicode, then match against the UTF-8
   encoding of that result
2) do no charset conversion; match against the raw (decoded) octets

I believe the existing practice in Sieve and ACAP is choice (1).  IMHO,
choice (2) is more consistent by virtue of clearly setting i;octet at a
level below charsets, but doing so leaves no way to perform a comparison
where charsets are converted but no canonicalization of the characters
is done.
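
The difference between the two choices is easiest to see with a
non-UTF-8 input; a quick Python illustration (the Latin-1 example is
mine):

    # A header value ("cafe" with e-acute) arriving as ISO-8859-1 octets.
    raw = b'caf\xe9'

    # Choice (1): convert to Unicode, then match against the UTF-8
    # encoding of the result.
    key_choice_1 = raw.decode('iso-8859-1').encode('utf-8')   # b'caf\xc3\xa9'

    # Choice (2): no charset conversion; match against the raw octets.
    key_choice_2 = raw                                        # b'caf\xe9'

    # So an i;octet match key of b'caf\xc3\xa9' (UTF-8) succeeds under
    # (1) but not under (2), and a key of b'caf\xe9' does the opposite.
    assert key_choice_1 != key_choice_2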


>> No, my understanding would be that they have to parse the *character* 
>> string and work on the resulting value. (note that any serious 
>> collator has to in one way or another parse the string, e.g. to 
>> separate base letter, diacritics, and case, or whatever).
>
>If the collator gets a character string from the protocol, then a) the 
>protocol either cannot work on octets, or b) it has to weed out octet 
>strings that don't correspond to character strings before using the 
>collator.
>
>B is a design violation in my view. It implies that something outside 
>the collator performs a duty specified by the collator.

I see three levels of consideration.  From lowest to highest:

 - transfer decoding.  This generates a string of octets and is
   completely protocol-specific, including the specification of how
   invalid encodings (e.g., a '$' in a base64-encoded part) should be
   handled.  This step is trivial for ACAP, as it's binary-clean, but
   IMAP and Sieve have to consider Content-Transfer-Encoding and RFC
   2047 tokens here.

 - charset (encoding) conversion.  Again, this is protocol-specific.
   For example, _everything_ in ACAP is in UTF-8, while IMAP and Sieve
   have to pull the charset from the Content-Type MIME field or from RFC
   2047 tokens.  The protocol should specify what to do if the octet
   string isn't actually valid in the charset encoding it claims to be
   in (e.g., octet 0x80 in UTF-8).

 - parsing of strings.  This is specific to the comparator.
   i;ascii-numeric looks at numeric prefixes, while a date comparator
   would perform some sort of parse (ISO 8601?  RFC 2822?  RFC 3339?) to
   determine what date is represented by the string.
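
Strung together, the three levels look roughly like this (Python
sketch; the base64 content and the choice of ISO 8601 for the date
comparator are placeholders, not anything the drafts specify):

    import base64
    from datetime import date

    # Level 1: transfer decoding -- protocol-specific (here, a base64
    # Content-Transfer-Encoding).
    octets = base64.b64decode(b'MjAwNS0xMC0xNw==')   # -> b'2005-10-17'

    # Level 2: charset conversion -- also protocol-specific (say the
    # part was labelled charset=utf-8).
    text = octets.decode('utf-8')                    # -> '2005-10-17'

    # Level 3: parsing -- comparator-specific.  A date comparator might
    # parse ISO 8601; which format it accepts is its own business.
    value = date.fromisoformat(text)                 # -> date(2005, 10, 17)
    print(value)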


Philip Guenther
