- From: Martin J. Duerst <duerst@w3.org>
- Date: Wed, 12 Apr 2000 15:41:23 +0900
- To: John Cowan <jcowan@reutershealth.com>, Rick Jelliffe <ricko@gate.sinica.edu.tw>
- Cc: xml-editor@w3.org, yergeau@alis.com, w3c-i18n-ig@w3.org
At 00/04/05 12:09 -0400, John Cowan wrote:
>Rick Jelliffe wrote:
>
> > UTF-7 can be handled by a smarter routine: as long as the label is present
> > it can be reliably detected.

Given that UTF-7 is more or less deprecated, and that we should try to keep
the interactions between the XML spec and the various 'charset's to a minimum,
I think it's not worth requiring any more complicated algorithm.

>Unfortunately no. UTF-7 in effect defines two representations: plain ASCII
>(except for the "+" character) and plus-minus-wrapped-Base64-encoded, e.g.
>"+Jjo-" for U+263A.
>
>Unlike UTF-8 and friends, either representation may be used for most
>ASCII characters, including those in the encoding declaration, and in zillions
>of different ways. The encoding declaration
>
> <?xml version="1.0" encoding="utf-7"?>
>
>can be encoded as:
>or even
>
> +ADwAPwB4AG0AbAAgAHYAZQByAHMAaQBvAG4APQAiADEALgAwACIAIABlAG4AYwBvAG
> QAaQBuAGcAPQAiAHUAdABmAC0ANwAiAD8APg-

So if you want to produce UTF-7 that is correctly labeled and is encoded
so that it passes the ASCII-based heuristics, this is always possible
using the first of your alternatives. My guess is also that converters
to UTF-7 use +-Base64 defensively rather than aggressively, so that this
is a minor problem in practice.

In light of this, I consider the errata text at:
http://www.w3.org/XML/xml-19980210-errata#E44

"Also, because of the overloaded usage it makes of ASCII-valued bytes,
the UTF-7 encoding may fail to be reliably detected."

as perfectly appropriate.

Regards, Martin.
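To make the point above concrete, here is a minimal sketch in Python, assuming the standard "utf-7" codec and a deliberately naive ASCII-prefix check. The names looks_like_xml_declaration, PLAIN and WRAPPED are hypothetical and come from neither the XML spec nor the errata; the check only stands in for the kind of ASCII-based heuristic discussed, it is not the spec's detection algorithm.

```python
# Two UTF-7 byte sequences for the same encoding declaration.
DECLARATION = '<?xml version="1.0" encoding="utf-7"?>'

# Plain form: every character is written directly as ASCII.
PLAIN = DECLARATION.encode("ascii")

# Base64-wrapped form quoted in the message, with the line break removed.
WRAPPED = (
    b"+ADwAPwB4AG0AbAAgAHYAZQByAHMAaQBvAG4APQAiADEALgAwACIAIABlAG4AYwBvAG"
    b"QAaQBuAGcAPQAiAHUAdABmAC0ANwAiAD8APg-"
)


def looks_like_xml_declaration(data: bytes) -> bool:
    """Naive ASCII-based check (an assumption made for this sketch)."""
    return data.startswith(b"<?xml")


for label, raw in (("plain", PLAIN), ("wrapped", WRAPPED)):
    # Both forms decode to the same declaration, but only the plain
    # form is visible to a check that looks at raw ASCII bytes.
    same_text = raw.decode("utf-7") == DECLARATION
    print(label, looks_like_xml_declaration(raw), same_text)
```

Both byte sequences decode to the same declaration, yet only the plain one is recognised by the byte-level check. A producer that writes the declaration in the plain form stays detectable, whereas the wrapped form would be found only by decoding UTF-7 first.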
Received on Wednesday, 12 April 2000 04:12:46 UTC