
Re: I18N issues with the XML Specification

From: Rick Jelliffe <ricko@gate.sinica.edu.tw>
Date: Thu, 6 Apr 2000 02:48:28 +0800 (CST)
To: yergeau@alis.com
cc: xml-editor@w3.org, w3c-i18n-ig@w3.org
Message-ID: <Pine.GSO.4.21.0004060109280.21048-100000@gate>
On Wed, 5 Apr 2000, John Cowan wrote:

> Unfortunately no.  UTF-7 in effect defines two representations: plain ASCII
> (except for the "+" character) and plus-minus-wrapped-Base64-encoded, e.g.
> "+Jjo-" for U+263A.

I don't think that makes a difference to my point. If the document was
created so that the writer generated the header
<?xml version="1.0" encoding="UTF-7"?>
at the start, then the only thing that could make autodetection
unreliable is the existence of another into-ASCII encoding that
encodes its XML declaration with exactly the same ASCII characters.

"Unreliable" cannot mean "sometimes it will not work" because by that
definition all non UTF encodings are unreliable. "Unreliable" can only
mean "sometimes the wrong encoding will be detected" which does not
seem to be the case at all. (Except for one case below)  
So that makes two objections I suppose: first that "unreliable" is
the wrong term, and second that in any case it is not true: it is
possible to add code that would always detect that UTF-7 was being used.

Again, my point is that taking Appendix F as somehow limiting the
techniques that can be used for autodetection on the XML header is
bogus.  Autodetection relies on the document being unambiguously marked up
with enough bytes at the start to allow autodetection. It is explicit
and never involves guesswork.

In the particular case of UTF-7, if there is a + before the first ?>, then
preprocess it through a UTF-7 decoder and see if the correct header
emerges. 100% reliable.
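A minimal sketch of that extra decoding stage, in Python (the function name, the 128-byte prefix length, and the test bytes are my own choices, not anything in the spec -- Python's built-in utf-7 codec stands in for the decoder):

```python
def sniff_utf7(data: bytes) -> bool:
    """Extra decoding stage for UTF-7: if a '+' appears in the first
    bytes of the entity, run them through a UTF-7 decoder and see
    whether the expected XML declaration emerges."""
    head = data[:128]  # more than enough bytes for any declaration
    if b'+' not in head:
        return False   # plain-ASCII case: normal detection applies
    try:
        decoded = head.decode('utf-7')
    except UnicodeDecodeError:
        # a real implementation would retry at a shift-sequence
        # boundary rather than give up on a truncated prefix
        return False
    return decoded.startswith('<?xml') and 'UTF-7' in decoded

# A declaration with every delimiter Base64-wrapped
# ('+ADw-' is '<', '+ACI-' is '"', '+AD4-' is '>'):
wrapped = b'+ADw-?xml version=+ACI-1.0+ACI- encoding=+ACI-UTF-7+ACI-?+AD4-'
```

The plain-ASCII form of a UTF-7 declaration contains no '+', so it falls through to the ordinary ASCII-family detection; only the Base64-wrapped form needs this extra stage.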

So instead of "this algorithm is not reliable", it should be "some
encodings (i.e. UTF-7) may require an extra decoding stage for
autodetection".  That is quite different.

Otherwise, what is being done is saying
 1) Autodetection algorithm is completely described by Appendix F
 2) Appendix F algorithm does not handle some character encodings
 3) Therefore autodetection does not cope with some character encodings 
But the first step is wrong.

> > Why is it true that external parsed entities in UTF-16 may begin with any
> > character?
> The nature of an external parsed entity is that although it has to be
> balanced with respect to tags, it may begin with character data.
> External parsed entities must match the production rule "content".

4.3.1 says "External parsed entities may each begin with a text
declaration".  Entity handling occurs prior to parsing. Therefore
autodetection must occur first. 

So if an external parsed entity does switch to UTF-8 or UTF-16 with no
xml header, is there any known string of code points which could confuse
things?  Yes: if the entity started with the UTF-7 data given by John
above, then it could be misdetected as a UTF-7 XML header
by a processor that understood UTF-7.
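The ambiguity is easy to show with John's example bytes: the same octets are valid UTF-8 character data and also valid UTF-7 for quite different characters, so an unlabelled entity gives the detector nothing to choose between them with (the sentence wrapped around the +Jjo- sequence is my own):

```python
entity = b'+Jjo- is how UTF-7 writes a smiley'

# Read as UTF-8, these bytes are ordinary character data:
as_utf8 = entity.decode('utf-8')
# Read as UTF-7, the same bytes decode '+Jjo-' to U+263A:
as_utf7 = entity.decode('utf-7')
```

With an explicit text declaration at the front there is nothing to misread; only the defaulting case is ambiguous.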

So the lesson is that anyone who is worried that the start of an
external parsed XML entity might be misread as a UTF-7-encoded encoding PI
should make sure the entity starts with an explicit XML encoding PI.
This problem arises from allowing a default encoding: if the data is
labelled explicitly there is no problem.

In fact, what John is saying is not that UTF-7 detection is unreliable,
but that UTF-8 defaulting is (in at least one rare case) wrong.

> > That is a bug which should be fixed up. In the absense of
> > overriding higher-level out-of-band signalling, an XML entity must be
> > required to identify its encoding unambiguously.
> Impossible in principle.  If you know absolutely nothing about the
> encoding, you cannot even read the encoding declaration.  Autodetection is
> and can be only a partial solution.

Rubbish. XML should be based on only allowing encodings that can be
autodetected. Infeasible encodings should not be allowed--if they do 

> > The wrong thing to do
> > would be to say "Autodetection is unreliable"--it must be reliable, and
> > the rest of XML 1.0 must not have anything that prevents it from being
> > reliable.
> That is not XML 1.0.

As an official member comment from Academia Sinica, anything in XML 1.0
that suggests otherwise should be regarded as an error and fixed. Any new
text or errata must not give the impression that there is any known
situation in which it is not possible to mark a document up with a correct
encoding declaration, or in which a receiving processor will be confused
by a correctly labelled document.

Furthermore, I would ask that Appendix F strongly recommend the use of
an XML header in all parsed entities, to prevent the problem with
the unreliability of UTF-8 defaulting, among other reasons.

> > To put it another way, if a character encoding cannot reliably be
> > autodetected, it should be banned from being used with XML. But I have
> > still yet to find any encodings that fit into this category.
> At present, autodetection handles only:
> 	UTF-8 (by default),
> 	various UTF-16 flavors (perhaps only UTF-16, maybe UTF-16BE/LE as well),
> 	various UTF-32 (UCS-4) flavors,
> 	ASCII-compatible encodings (guaranteed to encode the declaration in ASCII),
> 	EBCDIC encodings. 
> This leaves UTF-7 out, since it is not guaranteed to encode the encoding declaration
> in ASCII.

Wrong, for the reasons above. Appendix F is not normative; it does not
define or limit autodetection.

As long as the header has been added, the only case in which an encoding
is not reliably detected is when there are two encodings whose relevant XML
declarations are encoded with exactly the same bytes: but the chances of
this are slim. Detection can always be reliable: it reports either "I know
this" or "I don't know this", and it need never report the former when the
latter is true.
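A sketch of what "I know this / I don't know this" detection looks like, using byte patterns from Appendix F (the function and its return convention are my own; a real detector would also cover the UCS-4 flavours and the byte-order-mark variants):

```python
def autodetect(data: bytes):
    """Return the detected encoding family, or None for
    "I don't know this" -- never a guess."""
    if data.startswith(b'\xfe\xff'):
        return 'UTF-16BE'                 # big-endian byte order mark
    if data.startswith(b'\xff\xfe'):
        return 'UTF-16LE'                 # little-endian byte order mark
    if data.startswith(b'\x00<\x00?'):
        return 'UTF-16BE'                 # '<?' without a BOM
    if data.startswith(b'<\x00?\x00'):
        return 'UTF-16LE'
    if data.startswith(b'<?xml'):
        return 'ASCII-family'             # read the declaration to narrow it
    if data.startswith(b'\x4c\x6f\xa7\x94'):
        return 'EBCDIC-family'            # '<?xm' in EBCDIC
    return None                           # admit ignorance, never misdetect
```

The point of the None branch is exactly the one above: an honest detector can always refuse to answer; it need never give a wrong answer.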

I think the confusion arises from reading too much into the sentence in
Appendix F that "Because the contents of the encoding declaration are
restricted to ASCII characters, a processor can reliably read the entire
encoding declaration as soon as it has detected which family of encodings
is in use", taking it to mean that autodetection promises to be a
function which always produces a result.

To prove that autodetection is, in some circumstance, unreliable it is not
enough to show that one algorithm has a limit; it must be shown that there
are ambiguous encodings. And even in that case (which I doubt exists) the
solution is merely that the rarer of the two encodings cannot be used for XML.

So, instead of the UTF-7 comments, it would be better to say in paragraph
1 of Appendix F:
 "each implementation is assumed to support and autodetect only
 a finite set of character encodings, and the XML encoding declaration is
 restricted in position and content in order to make it feasible to
 autodetect the character encoding in use in each entity in normal
 cases"
(I.e., add "and autodetect".)

Rick Jelliffe
Received on Wednesday, 5 April 2000 14:49:35 GMT
