Re: I18N issues with the XML Specification from John Cowan on 2000-04-05 (xml-editor@w3.org from April to June 2000)

From: John Cowan <jcowan@reutershealth.com>
Date: Wed, 05 Apr 2000 16:06:33 -0400
To: Rick Jelliffe <ricko@gate.sinica.edu.tw>
CC: yergeau@alis.com, xml-editor@w3.org, w3c-i18n-ig@w3.org
Message-ID: <38EB9CC9.3DC08D63@reutershealth.com>

The XML Rec says (clause 4.3.3):

# Although an XML processor is required to read only entities in the
# UTF-8 and UTF-16 encodings, it is recognized that other encodings
# are used around the world, and it may be desired for XML processors
# to read entities that use them.

This is not, nor should it be, limited to any particular set of encodings,
past, present, or future.  Nor is there any requirement that the bytes
representing the XML document are the only possible source of information
on the encoding.

Rick Jelliffe wrote:

> "Unreliable" cannot mean "sometimes it will not work" because by that
> definition all non-UTF encodings are unreliable. "Unreliable" can only
> mean "sometimes the wrong encoding will be detected" which does not
> seem to be the case at all. (Except for one case below)

In principle and as XML 1.0 is written, there might be an encoding
named "UTF-+ADc-" in which case there would be no straightforward
way of discriminating between it and UTF-7 to a processor which understood
both.
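
To make the ambiguity concrete, here is a small sketch using Python's
built-in "utf-7" codec (an illustration only; the encoding name
"UTF-+ADc-" is hypothetical, as above). The same bytes give one label
when read as ASCII and another when read as UTF-7:

```python
# The bytes of the hypothetical encoding label, as they would appear
# inside an encoding declaration.
label = b"UTF-+ADc-"

# Read as plain ASCII, the label is taken literally.
as_ascii = label.decode("ascii")

# Read as UTF-7, "+ADc-" is a base64 run that encodes U+0037 ("7").
as_utf7 = label.decode("utf-7")

print(repr(as_ascii), repr(as_utf7))
```

A processor that knew both encodings would have to decide between the
two readings from something other than the label bytes themselves.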

> So that makes two objections I suppose: first that "unreliable" is
> the wrong term, and second that in any case it is not true: it is
> possible to add code that would always detect that UTF-7 was being used.

True but not relevant.  The meaning is that the procedure *of Appendix F*
does not reliably detect UTF-7.

> Again, my point is that taking Appendix F as somehow limiting the
> techniques that can be used for autodetection on the XML header is
> bogus.

Agreed.

> Autodetection relies on the document being unambiguously marked up
> with enough bytes at the start to allow autodetection. It never goes into
> guesswork and it is explicit.

That is true of Appendix F autodetection, which is explicitly described
as non-normative.  The most that is said is that autodetection is
"not entirely hopeless".

> In the particular case of UTF-7, if there is a + before the first ?>, then
> preprocess it through a UTF-7 decoder and see if the correct header
> emerges. 100% reliable.

Autodetection of UTF-7 is certainly possible, but not by the method of
Appendix F.
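
A minimal sketch of the check Rick describes, assuming Python's
built-in "utf-7" codec. The function name `sniff_utf7` and the
simplified declaration pattern are my own for illustration; nothing
here comes from the XML Rec, and the sketch assumes a literal "?>"
is present in the byte stream:

```python
import re

# Simplified pattern for an XML declaration carrying an encoding name.
# This is an assumption for the sketch, not the full XMLDecl production.
_DECL = re.compile(r'<\?xml\s[^>]*encoding\s*=\s*["\']([^"\']+)["\']')

def sniff_utf7(data: bytes) -> bool:
    """Return True if the bytes up to the first '?>' contain a '+' and,
    once decoded as UTF-7, yield a declaration naming UTF-7."""
    end = data.find(b"?>")
    if end == -1:
        return False
    head = data[:end + 2]
    if b"+" not in head:
        return False
    try:
        text = head.decode("utf-7")
    except UnicodeDecodeError:
        return False
    m = _DECL.match(text)
    return m is not None and m.group(1).upper() == "UTF-7"

# A declaration with "<", "=" and the quotes written as base64 runs.
sample = b'+ADw-?xml version+AD0AIg-1.0+ACI- encoding+AD0AIg-UTF-7+ACI-?>'
print(sniff_utf7(sample))
```

Note that a plain-ASCII declaration naming the hypothetical encoding
"UTF-+ADc-" would also pass this check, so the sketch does not settle
the ambiguity mentioned earlier.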

> > At present, autodetection handles only:
> >
> >       UTF-8 (by default),
> >       various UTF-16 flavors (perhaps only UTF-16, maybe UTF-16BE/LE as well),
> >       various UTF-32 (UCS-4) flavors,
> >       ASCII-compatible encodings (guaranteed to encode the declaration in ASCII),
> >       EBCDIC encodings.
> >
> > This leaves UTF-7 out, since it is not guaranteed to encode the encoding declaration
> > in ASCII.
> 
> Wrong, for the reasons above. Annex F is not normative, it does not define
> or limit autodetection.

By "autodetection" I meant "autodetection by the method of Appendix F".
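
For reference, a rough sketch of first-bytes detection in the style of
Appendix F. The byte patterns are as I recall them from the table
there and should be checked against the Rec; the function name
`guess_family` and the returned labels are my own:

```python
def guess_family(data: bytes) -> str:
    """Guess an encoding family from the first bytes of an entity."""
    head = data[:4]
    # UCS-4 patterns for "<" in the four possible byte orders.
    if head == b"\x00\x00\x00\x3c":
        return "UCS-4 big-endian"
    if head == b"\x3c\x00\x00\x00":
        return "UCS-4 little-endian"
    if head in (b"\x00\x00\x3c\x00", b"\x00\x3c\x00\x00"):
        return "UCS-4 unusual byte order"
    # UTF-16 with a byte order mark.
    if head[:2] == b"\xfe\xff":
        return "UTF-16 big-endian"
    if head[:2] == b"\xff\xfe":
        return "UTF-16 little-endian"
    # UTF-16 without a byte order mark: "<?" in either byte order.
    if head == b"\x00\x3c\x00\x3f":
        return "UTF-16 big-endian, no BOM"
    if head == b"\x3c\x00\x3f\x00":
        return "UTF-16 little-endian, no BOM"
    # "<?xm" in ASCII: read the declaration to find the actual encoding.
    if head == b"\x3c\x3f\x78\x6d":
        return "ASCII-compatible"
    # "<?xm" in EBCDIC.
    if head == b"\x4c\x6f\xa7\x94":
        return "EBCDIC"
    # Nothing matched: treated as UTF-8 without an encoding declaration.
    return "UTF-8 (default)"

print(guess_family(b'<?xml version="1.0"?>'))
print(guess_family(b'+ADw-?xml version+AD0AIg-1.0+ACI-?>'))
```

An entity whose "<" is written as a UTF-7 base64 run matches none of
the patterns and falls through to the UTF-8 default, which is the
sense in which this method leaves UTF-7 out.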

> To prove that autodetection is, in some circumstance, unreliable it is not
> enough to show that one algorithm has a limit, it must be shown that there
> are ambiguous encodings. And even in that case (which I doubt exists) the
> solution is merely that the rarer of the encodings cannot be used for XML.

Any encoding that the processor accepts can be used for XML.

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)

Received on Wednesday, 5 April 2000 16:06:54 UTC