Re: I18N issues with the XML Specification from John Cowan on 2000-04-05 (xml-editor@w3.org from April to June 2000)

From: John Cowan <jcowan@reutershealth.com>
Date: Wed, 05 Apr 2000 12:09:35 -0400
To: Rick Jelliffe <ricko@gate.sinica.edu.tw>
CC: xml-editor@w3.org, yergeau@alis.com, w3c-i18n-ig@w3.org
Message-ID: <38EB653F.AFF0EA39@reutershealth.com>

Rick Jelliffe wrote:

> UTF-7 can be handled by a smarter routine: as long as the label is present
> it can be reliably detected. Rather than say that UTF-7 may be unreliable,
> it would be better to put in an example of how it can be detected
> reliably, or to remain silent. It is not the general algorithm (find
> signature, read text according to encoding family, parse the text to
> find encoding attribute) that is faulty, it is that for UTF-7 the last
> stage (parsing) is not specified in this version of Appendix F. (UTF-7
> text can still be parsed as ASCII but using different delimiter
> recognition, surely.)

Unfortunately no.  UTF-7 in effect defines two representations: plain ASCII
(except for the "+" character) and a Base64-encoded form wrapped in "+" and
"-", e.g. "+Jjo-" for U+263A.

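To make the Base64 form concrete, here is a minimal Python sketch (an
illustration only, not text from any spec) that builds it by hand: take the
big-endian UTF-16 bytes, Base64-encode them, drop the "=" padding, and wrap
the result in "+" and "-".  The helper name utf7_shifted is hypothetical.

```python
import base64

def utf7_shifted(text: str) -> bytes:
    """Encode text entirely in UTF-7's Base64 ("shifted") form."""
    utf16 = text.encode("utf-16-be")            # big-endian UTF-16 code units
    b64 = base64.b64encode(utf16).rstrip(b"=")  # UTF-7 omits the "=" padding
    return b"+" + b64 + b"-"

smiley = utf7_shifted("\u263a")
print(smiley)
# Python's built-in UTF-7 codec reverses the hand-built form.
assert smiley.decode("utf-7") == "\u263a"
```
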
In UTF-7, unlike UTF-8 and friends, either representation may be used for
most ASCII characters, including those in the encoding declaration, and in
zillions of different ways.  The encoding declaration

	<?xml version="1.0" encoding="utf-7"?>

can be encoded as:

	<?xml version="1.0" encoding="+AHUAdABmAC0ANw-"?>

or

	+ADwAPwB4AG0AbA- version="1.0" encoding="utf-7"?>

or even

	+ADwAPwB4AG0AbAAgAHYAZQByAHMAaQBvAG4APQAiADEALgAwACIAIABlAG4AYwBvAG
	QAaQBuAGcAPQAiAHUAdABmAC0ANwAiAD8APg-
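
As a quick check, the following Python sketch decodes all four byte
sequences with Python's built-in "utf-7" codec and confirms that they yield
the same declaration.  The variable names are mine; the byte strings are the
ones given above, with the last one joined across its line break.

```python
DECLARATION = '<?xml version="1.0" encoding="utf-7"?>'

variants = [
    b'<?xml version="1.0" encoding="utf-7"?>',
    b'<?xml version="1.0" encoding="+AHUAdABmAC0ANw-"?>',
    b'+ADwAPwB4AG0AbA- version="1.0" encoding="utf-7"?>',
    b'+ADwAPwB4AG0AbAAgAHYAZQByAHMAaQBvAG4APQAiADEALgAwACIAIABlAG4AYwBvAG'
    b'QAaQBuAGcAPQAiAHUAdABmAC0ANwAiAD8APg-',
]

# Every variant decodes to the identical declaration.
decoded = [raw.decode("utf-7") for raw in variants]
assert all(text == DECLARATION for text in decoded)

# An ASCII-minded parser looks for these literal byte strings.
starts_with_xml_decl = [raw.startswith(b"<?xml") for raw in variants]
has_literal_label = [b'encoding="utf-7"' in raw for raw in variants]
print(starts_with_xml_decl)
print(has_literal_label)
```

Only the first variant has both the literal "<?xml" opening and the literal
encoding label, which is all that an ASCII-based scan can rely on.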

> Why is it true that external parsed entities in UTF-16 may begin with any
> character?

The nature of an external parsed entity is that although it has to be
balanced with respect to tags, it may begin with character data.
External parsed entities must match the production rule "content".
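
A small Python illustration, under assumptions of my own: the entity text is
invented, and it is stored as big-endian UTF-16 with no byte order mark and
no text declaration (whether that is allowed is a separate question).  The
first four bytes then bear no resemblance to the "<?" pattern a detector
would look for.

```python
# Hypothetical external parsed entity: balanced tags, leading character data.
entity = "Hello, <b>world</b>"

first4 = entity.encode("utf-16-be")[:4]   # no BOM is added by this codec
print(first4.hex())

# "<?" in big-endian UTF-16 would be 00 3C 00 3F; this entity starts with "He".
assert first4 != bytes.fromhex("003c003f")
```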

> That is a bug which should be fixed up. In the absence of
> overriding higher-level out-of-band signalling, an XML entity must be
> required to identify its encoding unambiguously.

Impossible in principle.  If you know absolutely nothing about the
encoding, you cannot even read the encoding declaration.  Autodetection is,
and can only be, a partial solution.

> The wrong thing to do
> would be to say "Autodetection is unreliable"--it must be reliable, and
> the rest of XML 1.0 must not have anything that prevents it from being
> reliable.

That is not XML 1.0 as currently specified.

> To put it another way, if a character encoding cannot reliably be
> autodetected, it should be banned from being used with XML. But I have
> still yet to find any encodings that fit into this category.

At present, autodetection handles only:

	UTF-8 (by default),
	various UTF-16 flavors (perhaps only UTF-16, maybe UTF-16BE/LE as well),
	various UTF-32 (UCS-4) flavors,
	ASCII-compatible encodings (guaranteed to encode the declaration in ASCII),
	EBCDIC encodings. 

This leaves UTF-7 out, since it is not guaranteed to encode the encoding
declaration in ASCII.
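
A rough Python sketch of that kind of detection follows.  It is not the
normative algorithm: the function name, the family labels, and the exact
signature table are assumptions of mine, modelled on the byte patterns
usually cited for Appendix F, and should be checked against the spec.

```python
def sniff_family(prefix: bytes) -> str:
    """Guess an encoding family from the first bytes of an entity."""
    signatures = [
        # Byte order marks.  The 4-byte UCS-4 marks come first, because
        # FF FE 00 00 also begins with the UTF-16 mark FF FE.
        (b"\x00\x00\xfe\xff", "UCS-4 big-endian (BOM)"),
        (b"\xff\xfe\x00\x00", "UCS-4 little-endian (BOM)"),
        (b"\xfe\xff", "UTF-16 big-endian (BOM)"),
        (b"\xff\xfe", "UTF-16 little-endian (BOM)"),
        (b"\xef\xbb\xbf", "UTF-8 (BOM)"),
        # No BOM: the start of "<?xml" as each family would lay it out.
        (b"\x00\x00\x00\x3c", "UCS-4 big-endian"),
        (b"\x3c\x00\x00\x00", "UCS-4 little-endian"),
        (b"\x00\x3c\x00\x3f", "UTF-16 big-endian"),
        (b"\x3c\x00\x3f\x00", "UTF-16 little-endian"),
        (b"\x3c\x3f\x78\x6d", "ASCII-compatible (read the declaration)"),
        (b"\x4c\x6f\xa7\x94", "EBCDIC (read the declaration)"),
    ]
    for signature, family in signatures:
        if prefix.startswith(signature):
            return family
    return "UTF-8 (default)"

print(sniff_family(b'<?xml version="1.0" encoding="iso-8859-1"?>'))
print(sniff_family(b'+ADwAPwB4AG0AbA- version="1.0" encoding="utf-7"?>'))
```

The second call falls through to the UTF-8 default: nothing in the first
bytes of a Base64-wrapped UTF-7 declaration matches any signature.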

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)

Received on Wednesday, 5 April 2000 12:09:31 UTC