comments on draft-newman-i18n-comparator-05.txt from Philip Guenther on 2005-09-22 (public-ietf-collation@w3.org from September 2005)

From: Philip Guenther <guenther+collation@sendmail.com>
Date: Wed, 21 Sep 2005 22:15:20 -0700
To: public-ietf-collation@w3.org
Message-Id: <200509220515.j8M5FKS5027426@lab.smi.sendmail.com>

[The following is rather rambling and mostly centers around the
problems created by the mixing of 'characters' in the abstract with
the "i;octet" collation.  This all makes my head hurt, so please
forgive its fragmented nature.  Those who saw an earlier version
of this message will find the first paragraph much clearer now.]


If I'm following the history of this document correctly, it appears
that the character strings referred to throughout the document are
all assumed to be in Unicode.  Previous to revision -02, everything
was in UTF-8, so that was clearly the intent then.  At the least,
it isn't actually specified what input range a collation is required
to define its behavior on or what the default handling is for
characters which are not covered by the collation's definition.

This problem can be seen in the definition of "en;ascii-casemap".
Is it an error to supply characters outside the US-ASCII charset
to that collation?  It seems to be defined as operating on octets
instead of characters, given how it's defined in terms of the
"i;octet" collation.
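
To make the ambiguity concrete, here is a minimal Python sketch.  It
assumes that "en;ascii-casemap" means "map the octets for a-z to A-Z,
then compare with i;octet"; that is one reading of the draft, not a
quotation from it, and the function names are hypothetical.

```python
def i_octet_cmp(a: bytes, b: bytes) -> int:
    """Compare two octet strings octet by octet; return -1, 0, or 1."""
    return (a > b) - (a < b)


def ascii_casemap(octets: bytes) -> bytes:
    """Map octets 0x61-0x7A (a-z) to 0x41-0x5A (A-Z); leave all others alone."""
    return bytes(o - 0x20 if 0x61 <= o <= 0x7A else o for o in octets)


def ascii_casemap_cmp(a: bytes, b: bytes) -> int:
    """Assumed definition: case-map both inputs, then apply i;octet."""
    return i_octet_cmp(ascii_casemap(a), ascii_casemap(b))


# Pure US-ASCII input: case differences are ignored.
assert ascii_casemap_cmp(b"Hello", b"HELLO") == 0

# Non-ASCII input, here encoded as UTF-8 (an assumption): the octets
# above 0x7F pass through untouched, so the two cases of the same
# letter compare as different.
assert ascii_casemap_cmp("é".encode("utf-8"), "É".encode("utf-8")) != 0
```

Under this reading the collation never rejects anything: input outside
US-ASCII is silently compared as raw octets, which is exactly the
behavior the draft leaves unstated.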

Similarly, the fallback to "i;octet" when sorting invalid strings
(section 4.3, paragraph 2) doesn't really make sense when sorting
is operating on characters instead of octets.  This is especially
confusing when section 5.7 says that it's protocol dependent whether
"i;octet" is supported.

Speaking of 5.7, it needs to say that _if_ a protocol permits the
use of the "i;octet" collation, then the protocol must specify how
the octets to be compared are produced.  While protocols that only
operate on textual data can simply say "encode characters in UTF-8",
protocols that can match against non-textual data may want to match
against the 'raw octets' of the data, as seen in the IMAP example
in section 4.3 of draft-ietf-imapext-i18n-05.  Or is the model that
the octets of the GIF are converted (interpreted?) as characters
so that they can be fed to the collation...which then converts
(reinterprets?) them back as octets?
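
A small Python sketch of why the protocol has to say how the octets
are produced: the same character string yields different octets under
different encodings, so an "i;octet" comparison gives different
answers depending on that choice.  The helper name is hypothetical and
the two encodings are only examples.

```python
def i_octet_equal(a: bytes, b: bytes) -> bool:
    """Equality under an octet-by-octet comparison."""
    return a == b


text = "café"

# The same four characters, turned into octets two different ways.
as_utf8 = text.encode("utf-8")      # 5 octets: "é" becomes 0xC3 0xA9
as_latin1 = text.encode("latin-1")  # 4 octets: "é" becomes 0xE9

# Identical as character strings, different as octet strings.
assert not i_octet_equal(as_utf8, as_latin1)

# Once both sides agree on one encoding, the comparison is well defined.
assert i_octet_equal(as_utf8, "café".encode("utf-8"))
```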

I hit this trying to figure out what the Sieve 'body' extension
needs to say about matching against non-textual parts.  Both there
and in IMAP SEARCH, it can be argued that searches against non-textual
parts should only match when the "i;octet" collation is being used.
Other collations require characters, and the octets of, for example,
an image/jpeg part, have no encoding to map them to characters.
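
The rule argued for above could be sketched in Python as follows.
This is only an illustration under stated assumptions: part_matches
and its parameters are hypothetical, "i;octet" is taken to match the
raw octets of the part, and str.casefold() stands in for whatever a
real character-based collation would do.

```python
from typing import Optional


def part_matches(octets: bytes, charset: Optional[str],
                 key: bytes, collation: str) -> bool:
    """Substring match against one MIME part (hypothetical helper).

    charset is None for non-textual parts such as image/jpeg.
    key is assumed to be UTF-8 when a character collation is used.
    """
    if collation == "i;octet":
        # Raw octet match; no characters are involved at all.
        return key in octets
    if charset is None:
        # No encoding maps these octets to characters, so a
        # character-based collation can never match.
        return False
    # Stand-in for a character-based collation.
    text = octets.decode(charset)
    return key.decode("utf-8").casefold() in text.casefold()


# First octets of a typical JPEG/JFIF file; not valid UTF-8.
jpeg = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"

assert part_matches(jpeg, None, b"JFIF", "i;octet")
assert not part_matches(jpeg, None, b"jfif", "en;ascii-casemap")
assert part_matches(b"Hello World", "utf-8", b"hello", "en;ascii-casemap")
```

Trying jpeg.decode("utf-8") raises UnicodeDecodeError, which is the
practical form of "no encoding to map them to characters".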


Actually, one could argue that "i;octet" should _only_ be used on
non-textual data and the trivial comparison of characters should
be called something like "i;trivial", because it isn't comparing
octets, but that's a significant break with the past.


Section 3.1 should specify that collation names are case-sensitive,
or at least I've been _assuming_ they are.


Philip Guenther

Received on Thursday, 22 September 2005 05:15:50 UTC