Re: comments on draft-newman-i18n-comparator-05.txt from Dave Cridland on 2005-10-20 (public-ietf-collation@w3.org from October 2005)

From: Dave Cridland <dave@cridland.net>
Date: Thu, 20 Oct 2005 11:39:28 +0100
To: Philip Guenther <guenther+collation@sendmail.com>
Cc: Martin Duerst <duerst@it.aoyama.ac.jp>, <public-ietf-collation@w3.org>, Arnt Gulbrandsen <arnt@gulbrandsen.priv.no>
Message-Id: <4865.1129804769.554912@peirce.dave.cridland.net>
On Mon Oct 17 10:30:48 2005, Philip Guenther wrote:
> Arnt Gulbrandsen <arnt@gulbrandsen.priv.no> writes:
> >Martin Duerst writes:
> >> At 17:24 05/09/22, Arnt Gulbrandsen wrote:
> >> >1. Collators should get octet strings from the protocol.
> ...
> >> Up to now, my assumption was that all collations operate on
> >> character strings,
> >
> >(mine too)
> >
> >> and that the 'octet' collation was either a bad name or the
> >> exception that proved the rule (until recently, I didn't get
> >> much of an answer on that from Chris).
> >
> >i-octet, ascii-numeric and Cyrus' date collator (for sieve)
> >persuade me that this isn't so. The very raison d'etre for a
> >collator is that it is NOT strcmp(). The collator draft/RFC
> >defines a small API, and a collator is something that implements
> >that API on a given data type. That data type may be "Turkish
> >unicode text" or it may be "email addresses" or it may be
> >"numbers" or it may be "arbitrary octet strings" or it may be "US
> >street addresses" or it may be "Swiss telephone directory entries".
> 
> Yes and no.  With the possible exception of i;octet, all the collations
> mentioned or documented operate on character strings, in the sense that
> the charset of the input data must be taken into account.
> 
> 
The charset isn't taken into account, actually. Comparators in ACAP
operate on octet strings, and they have an explicit octet-by-octet
decoding rule, plus a defined handler for when the decoding fails
(either they stop processing the input at that point, or they pass it
through).

So if someone were to store a UCS-2 numeric string in ACAP, the output
of an "i;ascii-numeric" comparator would be NIL, rather than an integer.

If you prefer, the charset is implicit in the definition of the 
comparator, in this case "ascii", hence the name.
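
To make that concrete, here is a rough Python sketch of an
"i;ascii-numeric"-style decoder. It is only an illustration: the
function name is made up, None stands in for NIL, the rule used is
"stop at the first octet that is not a US-ASCII digit", and the UCS-2
value is assumed to be big-endian.

```python
from typing import Optional


def ascii_numeric_value(octets: bytes) -> Optional[int]:
    """Read the leading run of US-ASCII digits from an octet string.

    Returns None (standing in for NIL) when the value does not begin
    with a US-ASCII digit. Decoding stops at the first non-digit octet.
    """
    end = 0
    while end < len(octets) and 0x30 <= octets[end] <= 0x39:
        end += 1
    if end == 0:
        return None
    return int(octets[:end].decode("ascii"))


ascii_stored = "42".encode("ascii")     # b"42"
# For these characters UTF-16BE and big-endian UCS-2 are the same octets.
ucs2_stored = "42".encode("utf-16-be")  # b"\x004\x002"

print(ascii_numeric_value(ascii_stored))  # 42
print(ascii_numeric_value(ucs2_stored))   # None
```

The comparator never asks what charset the value is in; the leading
0x00 octet simply fails the digit test.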

"i;octet" is only unique in as much as it has no charset at all, 
otherwise it's exactly the same as any other comparator.

> To summarize, I think the following are clear requirements:
> 
> - for protocols that support matching against non-textual data (e.g.,
>   Sieve 'body' extension, IMAP w/COMPARATOR):
>    - the i;octet comparator MUST perform the obvious direct operation
>    - the protocol profile MUST specify what decodings are performed to
>      get the octets to match against (e.g., C-T-E decoding)

Agreed.


> - for all protocols, when matching against textual data with a
>   comparator _other_ than i;octet
>    - charsets must be converted to Unicode
>    - the protocol profile must specify what charsets are mandatory to
>      implement
> 
> 
Disagreed.

Thus far, every comparator effectively defines the character set, if
any, on which it operates. It's still reasonable for comparators to be
usable on character strings, but they should have the same semantics
when they are. So we should probably propose that, for character
string orientated protocols, there is an additional step of encoding
to UTF-8, which may well be elided in practice for the majority of
comparators.

In other words, the method for performing a substring match using
"i;octet" on two character strings is to first encode both strings as
UTF-8, and then perform an octet substring match.
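
A minimal Python sketch of that method, with hypothetical function
names. The loop at the end also compares the result against a plain
character substring match on a few well-formed strings:

```python
def octet_substring(haystack: bytes, needle: bytes) -> bool:
    """i;octet-style substring match: raw octets, no charset involved."""
    return needle in haystack


def octet_substring_on_text(haystack: str, needle: str) -> bool:
    """Character-string entry point: encode to UTF-8, then match octets."""
    return octet_substring(haystack.encode("utf-8"), needle.encode("utf-8"))


samples = [
    ("na\u00efve caf\u00e9", "\u00efve"),  # non-ASCII needle, present
    ("na\u00efve", "i"),                   # plain "i" is not in "naive" with diaeresis
    ("comparator", "para"),                # pure ASCII
]
for haystack, needle in samples:
    # The octet match on the UTF-8 encodings agrees with the character match.
    assert octet_substring_on_text(haystack, needle) == (needle in haystack)
```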

In practice, this is identical to performing a character substring
match on the two character strings, so there's no need to actually
perform the encoding. It's the same with "i;ascii-casemap"[1] and
"i;ascii-numeric", too. (Hoorah!)

The difference comes when UCS-2 data is stored in ACAP, which is
possible, although not recommended. The result would be that
"i;octet" and "i;ascii-casemap" worked as expected, but
"i;ascii-numeric" wouldn't.

Thus an API needs to operate on octet strings, but would presumably
also provide a means to operate on character strings. That entry point
may be optimized, but it needn't be: it need only encode to UTF-8 and
pass the result on to the comparator.

What you can't do is go the other way around and decode octets to
characters first, since comparators can operate on data which is only
partially textual, or possibly not textual at all.
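
As a sketch of that layering in Python (all names here are
hypothetical, and the casemap rule is assumed to fold only US-ASCII
letters): the comparators take octets, and the character-string entry
point is a thin wrapper that encodes to UTF-8 first.

```python
from typing import Callable

# An octet-level comparator operation: two octet strings in, bool out.
OctetEqual = Callable[[bytes, bytes], bool]


def octet_equal(a: bytes, b: bytes) -> bool:
    """i;octet-style EQUAL: direct octet comparison."""
    return a == b


def ascii_casemap_equal(a: bytes, b: bytes) -> bool:
    """i;ascii-casemap-style EQUAL: bytes.upper() folds only ASCII a-z."""
    return a.upper() == b.upper()


def equal_on_text(comparator: OctetEqual, a: str, b: str) -> bool:
    """Character-string wrapper: encode to UTF-8, pass on to the comparator."""
    return comparator(a.encode("utf-8"), b.encode("utf-8"))


print(equal_on_text(ascii_casemap_equal, "Dave", "DAVE"))
# The octet-level call still works on data that is not valid text at all.
print(ascii_casemap_equal(b"\xff\x00abc", b"\xff\x00ABC"))
```

A wrapper in the opposite direction, decoding octets to characters
before comparing, would have to reject the second value outright.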


> I'm 99% sure that non-textual data should be skipped/never-matched when
> using a comparator other than i;octet.  Performing such a comparison
> would implicitly require interpreting the non-textual data as being text
> in some charset, which just doesn't have any real utility, IMHO.
> 
> 
Some attributes in ACAP are defined as, for instance, containing NUL
octets. It is still useful to apply comparators to these, because they
could contain textual strings somewhere.

It's perfectly legal to compare, for instance, "NUL HERE\0" with "nul 
here\0" using EQUAL with "i;ascii-casemap", and get a match. A real 
world example is searching for email addresses in the addressbook 
dataset - email attributes contain an email address followed by an 
optional NUL and a usage, so to find them, you do an "i;ascii-casemap"
search with both EQUAL and PREFIX, the latter with a NUL appended to
the search key.
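
Roughly, in Python (a sketch only: the function names, the example.org
addresses and the "work" usage string are made up for illustration,
and the casemap is assumed to fold only US-ASCII letters):

```python
def casemap(octets: bytes) -> bytes:
    """Fold ASCII a-z to A-Z; every other octet, NUL included, passes through."""
    return octets.upper()


def equal(value: bytes, key: bytes) -> bool:
    return casemap(value) == casemap(key)


def prefix(value: bytes, key: bytes) -> bool:
    return casemap(value).startswith(casemap(key))


def matches_address(attr_value: bytes, address: bytes) -> bool:
    """Match 'address' alone, or 'address NUL usage'."""
    return equal(attr_value, address) or prefix(attr_value, address + b"\x00")


print(equal(b"NUL HERE\x00", b"nul here\x00"))
print(matches_address(b"user@example.org\x00work", b"User@Example.ORG"))
print(matches_address(b"user@example.org.uk", b"user@example.org"))
```

The appended NUL is what stops the PREFIX search from matching a
longer address that merely starts with the same characters.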


> What isn't obvious to me is how i;octet should behave with textual
> data.  I can see two choices:
> 1) convert other charsets to Unicode, then match against the UTF-8
>    encoding of that result
> 2) do no charset conversion; match against the raw (decoded) octets
> 
> I believe the existing practice in Sieve and ACAP is choice (1).  IMHO,

You're assuming that ACAP has any notion of the character set of an 
attribute value. This is an incorrect assumption - attribute values 
are octet strings, and whilst most defined attributes happen to be 
UTF-8 or US-ASCII, not all of them are.

>  - charset (encoding) conversion.  Again, this is protocol-specific.
>    For example, _everything_ in ACAP is in UTF-8, while IMAP and Sieve

No, not true. ACAP leans toward, and encourages, UTF-8, but doesn't 
enforce it for the majority of attributes. Everything is, in fact, an 
octet string. Attributes such as "entry" are actually UTF-8 encoded, 
but "subdataset" and "modtime" are US-ASCII.

Right now, comparators work on octet strings in ACAP - all of them. 
It's too late to change, and there's no need - those comparators that 
do extract characters out of the octet string are defined to operate 
on a particular character set.

That character set happens to be UTF-8 compatible, so that seems to 
me to be the way to go.

Dave.

[1] - "i;ascii-casemap", not "en;ascii-casemap", for two reasons: 
First, that's what it was originally called. Second, because it would 
be naïve to think that English has no accented characters - US 
English happens not to, so feel free to call it "en-us;ascii-casemap" 
if you really insist, but personally, I think "ascii" is enough of a 
hint to suggest that it probably cannot case-fold accented 
characters...
-- 
           You see things; and you say "Why?"
   But I dream things that never were; and I say "Why not?"
    - George Bernard Shaw
Received on Thursday, 20 October 2005 10:39:43 UTC