
About UTF-7 (was: Re: I18N issues with the XML Specification)

From: Martin J. Duerst <duerst@w3.org>
Date: Wed, 12 Apr 2000 15:41:23 +0900
Message-Id: <4.2.0.58.J.20000412151859.033ea5c0@sh.w3.mag.keio.ac.jp>
To: John Cowan <jcowan@reutershealth.com>, Rick Jelliffe <ricko@gate.sinica.edu.tw>
Cc: xml-editor@w3.org, yergeau@alis.com, w3c-i18n-ig@w3.org

At 00/04/05 12:09 -0400, John Cowan wrote:
>Rick Jelliffe wrote:
>
> > UTF-7 can be handled by a smarter routine: as long as the label is present
> > it can be reliably detected.

Given that UTF-7 is more or less deprecated, and that we should try
to keep the interactions between the XML spec and various 'charset's
to a minimum, I don't think it's worth requiring any more complicated
algorithm.

>Unfortunately no.  UTF-7 in effect defines two representations: plain ASCII
>(except for the "+" character) and plus-minus-wrapped-Base64-encoded, e.g.
>"+Jjo-" for U+263A.
>
>Unlike UTF-8 and friends, either representation may be used for most
>ASCII characters, including those in the encoding declaration, and in zillions
>of different ways.  The encoding declaration
>
>         <?xml version="1.0" encoding="utf-7"?>
>
>can be encoded as:

>or even
>
>         +ADwAPwB4AG0AbAAgAHYAZQByAHMAaQBvAG4APQAiADEALgAwACIAIABlAG4AYwBvAG
>         QAaQBuAGcAPQAiAHUAdABmAC0ANwAiAD8APg-

So if you want to produce UTF-7 that is correctly labeled and
is encoded so that it passes the ASCII-based heuristics, this
is always possible using the first of your alternatives.
My guess is also that converters to UTF-7 use +-Base64 in a
defensive, rather than in an aggressive way, so that it's
a minor problem in practice.
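
To make the point concrete, here is a small Python sketch. It relies on
Python's built-in "utf-7" codec; the helper passes_ascii_heuristic is a
hypothetical stand-in for an ASCII-based check on the first bytes of an
entity, not the exact procedure from the XML spec. It shows that the fully
Base64-wrapped form quoted above and the plain ASCII form decode to the
same declaration, but only the plain form is seen by the ASCII check.

```python
# Illustrative sketch only; the names below are assumptions, not spec text.
DECLARATION = '<?xml version="1.0" encoding="utf-7"?>'

# The fully Base64-wrapped form quoted above, joined onto one line.
WRAPPED = (
    b"+ADwAPwB4AG0AbAAgAHYAZQByAHMAaQBvAG4APQAiADEALgAwACIAIABlAG4AYwBvAG"
    b"QAaQBuAGcAPQAiAHUAdABmAC0ANwAiAD8APg-"
)

# The plain form: every character of the declaration sent as ASCII.
PLAIN = DECLARATION.encode("ascii")


def passes_ascii_heuristic(data: bytes) -> bool:
    """Hypothetical ASCII-based check: does the entity start with '<?xml'?"""
    return data.startswith(b"<?xml")


# Both byte sequences are UTF-7 encodings of the same declaration.
decoded_wrapped = WRAPPED.decode("utf-7")
decoded_plain = PLAIN.decode("utf-7")

# The short example from the quoted text: "+Jjo-" is U+263A.
smiley = b"+Jjo-".decode("utf-7")

if __name__ == "__main__":
    print(decoded_wrapped == DECLARATION, passes_ascii_heuristic(WRAPPED))
    print(decoded_plain == DECLARATION, passes_ascii_heuristic(PLAIN))
    print(hex(ord(smiley)))
```

A producer that writes the declaration in the plain form stays detectable;
a consumer cannot rely on every producer doing so, which is what the
erratum text says.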

In light of this, I consider the errata text at:

http://www.w3.org/XML/xml-19980210-errata#E44

"Also, because of the overloaded usage it makes of ASCII-valued bytes,
      the UTF-7 encoding may fail to be reliably detected."

as perfectly appropriate.



Regards,  Martin.
Received on Wednesday, 12 April 2000 04:12:46 GMT
