- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Thu, 22 Sep 2005 19:34:41 +0900
- To: Arnt Gulbrandsen <arnt@gulbrandsen.priv.no>, public-ietf-collation@w3.org
- Cc: Philip Guenther <guenther+collation@sendmail.com>
At 17:24 05/09/22, Arnt Gulbrandsen wrote:

>So, to answer Philip.
>
>I read the draft and pondered your confusion, but I didn't really understand until Cyrus talked about date collation. Thank you for uncovering this.
>
>1. Collators should get octet strings from the protocol.

Sorry, but this assumption isn't generally true. In XQuery, collations are always applied to (Unicode) character strings. This is somewhat due to the fact that XQuery isn't a protocol. But there is no need to restrict the use of collators to protocols.

>2. Collators operate on a collator-specified type. Those (most?) collators which operate on character strings

Up to now, my assumption was that all collations operate on character strings, and that the 'octet' collation was either a bad name or the exception that proved the rule (until recently, I didn't get much of an answer on that from Chris).

> have to convert the octet string to a character string. (For example, a collator which operates on unicode strings has to decode UTF-8 before it can sort.)

Well, the sorting routine may be using UTF-8 internally ("Unicode string" doesn't imply a specific internal representation at all).

>Some collators don't operate on character strings. Ascii-numeric is a case in point. Those have to parse the octet string and work on the resulting value.

No, my understanding would be that they have to parse the *character* string and work on the resulting value. (Note that any serious collator has to parse the string in one way or another, e.g. to separate base letter, diacritics, and case, or whatever.)

>Cyrus Daboo mentioned a collator which sorts dates. That collator has to specify a date format (perhaps by reference), parse that format, and sort/compare the dates in its internal format.

Still, the input would be a character string, wouldn't it? (Internal date formats that are not character strings are usually constructed so that sorting is trivial, i.e. no parsing is needed.)
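To make the "parse the *character* string" point concrete, here is a minimal sketch of a numeric collator layered on character strings. It is not the draft's definition of ascii-numeric: the function names are hypothetical, and both the choice to read only the leading run of ASCII digits and the choice to sort non-numeric strings last are assumptions made for illustration.

```python
import re

# Leading run of ASCII digits; everything after it is ignored (assumption).
_LEADING_DIGITS = re.compile(r"[0-9]+")


def numeric_sort_key(text: str):
    """Sort key for a character string, compared by numeric value.

    Python integers are unbounded, so there is no 32-bit or 64-bit limit.
    Strings without leading digits sort after all numbers (assumption).
    """
    match = _LEADING_DIGITS.match(text)
    if match is None:
        return (True, 0)
    return (False, int(match.group()))


def numeric_sort_key_from_octets(data: bytes, charset: str):
    """Protocol-side wrapper: decode the octets first, then collate.

    The charset must be known from the protocol; the collator itself
    never sees the octets.
    """
    return numeric_sort_key(data.decode(charset))


values = ["10", "9", "abc", "007"]
ordered = sorted(values, key=numeric_sort_key)

# The same character string gives the same key whatever its encoding was.
same_key = (
    numeric_sort_key_from_octets("10".encode("utf-8"), "utf-8")
    == numeric_sort_key_from_octets("10".encode("utf-16-be"), "utf-16-be")
)
```

With the collator defined this way, a non-protocol user such as XQuery or a sort utility can call `numeric_sort_key` directly, while a protocol adds the decoding step in front of it.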
>The ascii-numeric collator needs rewriting so it speaks of numeric comparison, rather than digit strings. No logical change, just a change of wording to emphasise the numeric nature of the objects more than the ASCII representation. (I'll specify unbounded integers. Not 32-bit, not 64-bit.)

I'm not sure I agree. It looks like an interesting generalization, but I don't think we need to go that far just to solve the i;octet issue. Also, it no longer covers the use of a numeric collator in cases such as XQuery, or even simple cases such as a Unix sort command (imagine that it came with an option to specify a collator for a field).

>3. Any implementation is of course free to optimise. This is about the specification of collators only.
>
>I'll rewrite the draft to improve ascii-numeric, describe the split, specify what happens when the octet string doesn't follow the collator's expected format or isn't within the collator's domain,

If you expand the model, there are a lot of other cases where formats may not match or there is a domain problem.

Thinking about it a bit in the last few days, the i;octet collator's problem isn't the lack of domains, it's that there are two domains for it. As an example, consider a set of strings encoded in UTF-16BE. Should i;octet be applied to the raw binary form, or should it be applied after converting to UTF-8? The latter results in a simple ordering by Unicode codepoint, the former doesn't. We definitely need a predefined (and hopefully easy to understand) name for the latter. If there is any protocol/format/language that needs the former, I think they should get it, but they should have to explicitly mention that, and they should be aware of the fact that they are committing a layer violation.
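The difference between the two domains can be checked with a short sketch. The two sample characters are chosen for illustration only: U+FF5E lies above the surrogate range in the BMP, and U+10000 is the first supplementary character, which UTF-16 encodes as a surrogate pair starting with the octet D8.

```python
# Two one-character strings: U+FF5E (BMP) and U+10000 (supplementary).
strings = ["\uFF5E", "\U00010000"]

# Reference ordering: by Unicode code point.
by_codepoint = sorted(strings, key=lambda s: [ord(c) for c in s])

# Octet comparison after converting to UTF-8:
#   U+FF5E  -> EF BD 9E
#   U+10000 -> F0 90 80 80
by_utf8_octets = sorted(strings, key=lambda s: s.encode("utf-8"))

# Octet comparison on the raw UTF-16BE form:
#   U+FF5E  -> FF 5E
#   U+10000 -> D8 00 DC 00
by_utf16be_octets = sorted(strings, key=lambda s: s.encode("utf-16-be"))

print(by_utf8_octets == by_codepoint)     # UTF-8 octet order follows code points
print(by_utf16be_octets == by_codepoint)  # raw UTF-16BE octet order does not
```

Octet comparison of UTF-8 reproduces code point order, while octet comparison of the raw UTF-16BE form places U+10000 before U+FF5E.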
This is not just theory: Many implementations these days read in data and convert it to (their preferred form of) Unicode before doing anything else with it, and having to reconstruct the original octet sequence may be impossible or extremely annoying.

Regards,
Martin.
Received on Thursday, 22 September 2005 10:36:46 UTC