
About UTF-7 (was: Re: I18N issues with the XML Specification)

From: Martin J. Duerst <duerst@w3.org>
Date: Wed, 12 Apr 2000 15:41:23 +0900
To: John Cowan <jcowan@reutershealth.com>, Rick Jelliffe <ricko@gate.sinica.edu.tw>
Cc: xml-editor@w3.org, yergeau@alis.com, w3c-i18n-ig@w3.org
At 00/04/05 12:09 -0400, John Cowan wrote:
>Rick Jelliffe wrote:
> > UTF-7 can be handled by a smarter routine: as long as the label is present
> > it can be reliably detected.

Given that UTF-7 is more or less deprecated, and that we should try
to keep the interactions between the XML spec and various 'charset's
to a minimum, I think it's not worth requiring any more complicated

>Unfortunately no.  UTF-7 in effect defines two representations: plain ASCII
>(except for the "+" character) and plus-minus-wrapped-Base64-encoded, e.g.
>"+Jjo-" for U+263A.
>Unlike UTF-8 and friends, either representation may be used for most
>ASCII characters, including those in the encoding declaration, and in zillions
>of different ways.  The encoding declaration
>         <?xml version="1.0" encoding="utf-7"?>
>can be encoded as:

>or even

So if you want to produce UTF-7 that is correctly labeled and
is encoded so that it passes the ASCII-based heuristics, this
is always possible using the first of your alternatives.
My guess is also that converters to UTF-7 use +-Base64 in a
defensive rather than an aggressive way, so this is
a minor problem in practice.
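
A minimal Python sketch of this point, assuming Python's built-in "utf-7"
codec. The helper names (utf7_shift_all, ascii_sniff) and the all-Base64
variant are hypothetical illustrations; they are not the alternatives quoted
above, and ascii_sniff is only a stand-in for an ASCII-based heuristic.

```python
import base64

DECL = '<?xml version="1.0" encoding="utf-7"?>'


def utf7_shift_all(text: str) -> bytes:
    """Hypothetical 'aggressive' encoder: put the whole text in one shifted run.

    UTF-7 shifted runs are '+', then Base64 of the UTF-16BE code units
    without '=' padding, then '-'.
    """
    b64 = base64.b64encode(text.encode("utf-16-be")).rstrip(b"=")
    return b"+" + b64 + b"-"


def ascii_sniff(data: bytes) -> bool:
    """Stand-in for an ASCII-based heuristic: look for a literal '<?xml'."""
    return data.startswith(b"<?xml")


# Defensive form: every character of the declaration written directly.
plain = DECL.encode("ascii")

# Aggressive form: the same declaration hidden inside +...- Base64.
wrapped = utf7_shift_all(DECL)

# Both byte sequences are UTF-7 for the same declaration ...
same_text = plain.decode("utf-7") == wrapped.decode("utf-7") == DECL

# ... but only the directly written one is visible to the ASCII-based check.
plain_detected = ascii_sniff(plain)
wrapped_detected = ascii_sniff(wrapped)

# The example from the quoted text: "+Jjo-" is U+263A.
smiley = b"+Jjo-".decode("utf-7")
```

Under these assumptions, a producer that writes the declaration directly
stays detectable, while the wrapped form decodes to the same text but is
missed by the sniff.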

In light of this, I consider the errata text at:


"Also, because of the overloaded usage it makes of ASCII-valued bytes,
      the UTF-7 encoding may fail to be reliably detected."

as perfectly appropriate.

Regards,  Martin.
Received on Wednesday, 12 April 2000 04:12:46 UTC
