Re: comments on draft-newman-i18n-comparator-05.txt from Martin Duerst on 2005-09-22 (public-ietf-collation@w3.org from September 2005)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Thu, 22 Sep 2005 19:34:41 +0900
To: Arnt Gulbrandsen <arnt@gulbrandsen.priv.no>, public-ietf-collation@w3.org
Cc: Philip Guenther <guenther+collation@sendmail.com>
Message-Id: <6.0.0.20.2.20050922191320.09496ec0@localhost>

At 17:24 05/09/22, Arnt Gulbrandsen wrote:

 >So, to answer Philip.
 >
 >I read the draft and pondered your confusion, but I didn't really 
understand until Cyrus talked about date collation. Thank you for 
uncovering this.
 >
 >1. Collators should get octet strings from the protocol.

Sorry, but this assumption isn't generally true. In XQuery,
collations are always applied to (Unicode) character strings.
This is somewhat due to the fact that XQuery isn't a protocol.
But there is no need to restrict the use of collators to protocols.

 >2. Collators operate on a collator-specified type. Those (most?) 
collators which operate on character strings

Up to now, my assumption was that all collations operate on character
strings, and that the 'octet' collation was either a bad name or
the exception that proved the rule (until recently, I didn't get
much of an answer on that from Chris).

 > have to convert the octet string to a character string. (For example, a 
collator which operates on unicode strings has to decode UTF-8 before it 
can sort.)

Well, the sorting routine may be using UTF-8 internally ("Unicode string"
doesn't imply a specific internal representation at all).

 >Some collators don't operate on character strings. Ascii-numeric is a 
case in point. Those have to parse the octet string and work on 
the resulting value.

No, my understanding would be that they have to parse the
*character* string and work on the resulting value.
(Note that any serious collator has to parse the string in
one way or another, e.g. to separate base letter,
diacritics, and case, or whatever.)
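
To illustrate the point about parsing the character string, here is a
minimal Python sketch. It is not the draft's definition of ascii-numeric:
the function name numeric_key is hypothetical, and the rule that a string
made of anything other than ASCII digits is outside the domain is a
simplifying assumption made only for this example.

```python
def numeric_key(text: str) -> int:
    """Parse a character string of ASCII digits into an unbounded integer.

    Assumption for this sketch: an empty string, or any character other
    than the ASCII digits 0-9, puts the string outside the domain.
    """
    if not text or not all("0" <= ch <= "9" for ch in text):
        raise ValueError(f"not a digit string: {text!r}")
    return int(text)  # Python integers are not limited to 32 or 64 bits


# The key works on characters, so the original wire encoding is irrelevant.
from_utf8 = [b"10".decode("utf-8"), b"9".decode("utf-8"), b"007".decode("utf-8")]
from_utf16 = [s.encode("utf-16-be").decode("utf-16-be") for s in from_utf8]

sorted_from_utf8 = sorted(from_utf8, key=numeric_key)
sorted_from_utf16 = sorted(from_utf16, key=numeric_key)
```

Both lists sort to the same order (7, 9, 10 by value), because the
collator only ever sees the decoded characters.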

 >Cyrus Daboo mentioned a collator which sorts dates. That collator has to 
specify a date format (perhaps by reference), parse that format, and 
sort/compare the dates in its internal format.

Still, the input would be a character string, wouldn't it?
(internal date formats that are not character strings
are usually constructed so that sorting is trivial, i.e.
no parsing needed).
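
A small Python sketch of the same idea for dates. The day/month/year
format used here is an assumption chosen for the example (a real date
collator would have to specify its own format), and date_key is a
hypothetical name. The input is a character string; once it is parsed
into an internal date value, comparison is trivial.

```python
from datetime import date, datetime


def date_key(text: str) -> date:
    """Parse a DD/MM/YYYY character string into a date value."""
    return datetime.strptime(text, "%d/%m/%Y").date()


samples = ["22/09/2005", "01/10/2005", "15/01/2006"]

by_text = sorted(samples)                # plain character-by-character order
by_date = sorted(samples, key=date_key)  # order of the parsed date values
```

Here by_text puts 01/10/2005 first, while by_date puts 22/09/2005 first,
so the collator has to parse; the internal date values themselves need
no further work to compare.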

 >The ascii-numeric collator needs rewriting so it speaks of numeric 
comparison, rather than digit strings. No logical change, just a change of 
wording to emphasise the numeric nature of the objects more than the ASCII 
representation. (I'll specify unbounded integers. Not 32-bit, not 64-bit.)

I'm not sure I agree. It looks like an interesting generalization,
but I don't think we need to go that far just to solve the i;octet
issue. Also, it no longer covers the issue of using a numeric
collator for cases such as XQuery, and even simple cases such
as a Unix sort command (imagine that it would come with an
option to specify a collator for a field).

 >3. Any implementation is of course free to optimise. This is about the 
specification of collators only.
 >
 >I'll rewrite the draft to improve ascii-numeric, describe the split, 
specify what happens when the octet string doesn't follow the collator's 
expected format or isn't within the collator's domain,

If you expand the model, there are a lot of other cases where formats
may not match or there is a domain problem.

Thinking about it a bit in the last few days, the i;octet collator's
problem isn't the lack of domains, it's that there are two domains
for it. As an example, consider a set of strings encoded in UTF-16-BE.
Should i;octet be applied to the raw binary form, or should it be
applied after converting to UTF-8? The latter results in a simple
ordering by Unicode codepoint, the former doesn't.
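
The difference can be checked with a short Python sketch. The helper
octet_sort is a hypothetical name for this example, not anything defined
by the draft; it simply orders strings by the octets of a given encoding.
The sample uses U+FF5E and U+10000, since characters outside the BMP are
encoded with surrogates (D800-DFFF) in UTF-16 and so sort below
U+E000-U+FFFF at the octet level.

```python
def octet_sort(strings, encoding):
    """Order strings by the raw octets of their encoded form."""
    return sorted(strings, key=lambda s: s.encode(encoding))


samples = ["\U00010000", "\uFF5E", "a"]

# Python compares str values code point by code point.
by_codepoint = sorted(samples)

by_utf8 = octet_sort(samples, "utf-8")         # 61 < EF BD 9E < F0 90 80 80
by_utf16be = octet_sort(samples, "utf-16-be")  # 00 61 < D8 00 DC 00 < FF 5E

print(by_utf8 == by_codepoint)     # UTF-8 octet order matches code point order
print(by_utf16be == by_codepoint)  # UTF-16BE octet order does not
```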

We definitely need a predefined (and hopefully easy to understand)
name for the latter. If there is any protocol/format/language that
needs the former, I think they should get it, but they should have
to explicitly mention that, and they should be aware of the fact
that they are committing a layer violation.

This is not just theory: Many implementations these days read
in data and convert it to (their preferred form of) Unicode
before doing anything else with it, and having to reconstruct
the original octet sequence may be impossible or extremely
annoying.

Regards,    Martin.

Received on Thursday, 22 September 2005 10:36:46 UTC