About UTF-7 (was: Re: I18N issues with the XML Specification)

At 00/04/05 12:09 -0400, John Cowan wrote:
>Rick Jelliffe wrote:
>
> > UTF-7 can be handled by a smarter routine: as long as the label is present
> > it can be reliably detected.

Given that UTF-7 is more or less deprecated, and that we should try
to keep the interactions between the XML spec and various 'charset's
to a minimum, I don't think it's worth requiring a more complicated
algorithm.

>Unfortunately no.  UTF-7 in effect defines two representations: plain ASCII
>(except for the "+" character) and plus-minus-wrapped-Base64-encoded, e.g.
>"+Jjo-" for U+263A.
>
>Unlike UTF-8 and friends, either representation may be used for most
>ASCII characters, including those in the encoding declaration, and in zillions
>of different ways.  The encoding declaration
>
>         <?xml version="1.0" encoding="utf-7"?>
>
>can be encoded as:

>or even
>
>         +ADwAPwB4AG0AbAAgAHYAZQByAHMAaQBvAG4APQAiADEALgAwACIAIABlAG4AYwBvAG
>         QAaQBuAGcAPQAiAHUAdABmAC0ANwAiAD8APg-
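
The equivalence of the two representations can be checked directly. The
following is a minimal sketch, assuming Python's built-in "utf-7" codec
(the names WRAPPED and PLAIN are only illustrative):

```python
# The fully Base64-wrapped form quoted above, joined into a single line.
WRAPPED = (
    b"+ADwAPwB4AG0AbAAgAHYAZQByAHMAaQBvAG4APQAiADEALgAwACIAIABlAG4AYwBvAG"
    b"QAaQBuAGcAPQAiAHUAdABmAC0ANwAiAD8APg-"
)

# The same declaration written as plain ASCII, which is also valid UTF-7
# because it contains no "+" character.
PLAIN = b'<?xml version="1.0" encoding="utf-7"?>'

decoded_wrapped = WRAPPED.decode("utf-7")
decoded_plain = PLAIN.decode("utf-7")

# Both byte sequences decode to the same character string.
print(decoded_wrapped == decoded_plain)

# The single-character example from the quoted text: "+Jjo-" is U+263A.
smiley = b"+Jjo-".decode("utf-7")
print(f"U+{ord(smiley):04X}")
```

Two byte streams that share no bytes beyond a few incidental ones thus
carry the same declaration, which is the point being made above.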

So if you want to produce UTF-7 that is correctly labeled and
is encoded so that it passes the ASCII-based heuristics, this
is always possible using the first of your alternatives.
My guess is also that converters to UTF-7 use the "+...-" Base64
form defensively rather than aggressively, so that this is
a minor problem in practice.
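
To illustrate what "passing the ASCII-based heuristics" means here, a
rough sketch in Python follows. The function sniff_ascii_declaration is
hypothetical and much simpler than the detection described in the XML
spec; it only assumes that a detector reads the encoding declaration as
ASCII bytes:

```python
import re
from typing import Optional

# Simplified, hypothetical ASCII-based check: the document must start with
# the literal bytes "<?xml" and carry an encoding pseudo-attribute.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_ascii_declaration(data: bytes) -> Optional[str]:
    """Return the declared encoding name in lower case, or None if no
    ASCII-readable declaration is found at the start of the data."""
    match = _DECL.match(data)
    return match.group(1).decode("ascii").lower() if match else None


# Plain ASCII form of the declaration (valid UTF-7, no "+" present).
PLAIN = b'<?xml version="1.0" encoding="utf-7"?>'

# Fully Base64-wrapped form of the same declaration.
WRAPPED = (
    b"+ADwAPwB4AG0AbAAgAHYAZQByAHMAaQBvAG4APQAiADEALgAwACIAIABlAG4AYwBvAG"
    b"QAaQBuAGcAPQAiAHUAdABmAC0ANwAiAD8APg-"
)

plain_result = sniff_ascii_declaration(PLAIN)      # label is readable
wrapped_result = sniff_ascii_declaration(WRAPPED)  # label is hidden
print(plain_result, wrapped_result)
```

A producer that writes the declaration in plain ASCII stays detectable by
such a check; only the wrapped form defeats it.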

In light of this, I consider the errata text at:

http://www.w3.org/XML/xml-19980210-errata#E44

"Also, because of the overloaded usage it makes of ASCII-valued bytes,
      the UTF-7 encoding may fail to be reliably detected."

to be perfectly appropriate.



Regards,  Martin.

Received on Wednesday, 12 April 2000 04:12:46 UTC