- From: Rick Jelliffe <ricko@gate.sinica.edu.tw>
- Date: Thu, 6 Apr 2000 02:48:28 +0800 (CST)
- To: yergeau@alis.com
- cc: xml-editor@w3.org, w3c-i18n-ig@w3.org
On Wed, 5 Apr 2000, John Cowan wrote:

> Unfortunately no. UTF-7 in effect defines two representations: plain ASCII
> (except for the "+" character) and plus-minus-wrapped-Base64-encoded, e.g.
> "+Jjo-" for U+263A. ...
> or even
>
> +ADwAPwB4AG0AbAAgAHYAZQByAHMAaQBvAG4APQAiADEALgAwACIAIABlAG4AYwBvAG
> QAaQBuAGcAPQAiAHUAdABmAC0ANwAiAD8APg-

I don't think that makes a difference to my point. If the document was
created so that the writer generated the header
<?xml version="1.0" encoding="UTF-7"?> at the start, then the only thing
that could make autodetection unreliable is the existence of another
into-ASCII encoding that encoded its XML declaration with exactly the same
ASCII characters.

"Unreliable" cannot mean "sometimes it will not work", because by that
definition all non-UTF encodings are unreliable. "Unreliable" can only mean
"sometimes the wrong encoding will be detected", which does not seem to be
the case at all. (Except for one case below.)

So that makes two objections, I suppose: first, that "unreliable" is the
wrong term, and second, that in any case it is not true: it is possible to
add code that would always detect that UTF-7 was being used.

Again, my point is that taking Appendix F as somehow limiting the
techniques that can be used for autodetection on the XML header is bogus.
Autodetection relies on the document being unambiguously marked up with
enough bytes at the start to allow autodetection. It never resorts to
guesswork and it is explicit. In the particular case of UTF-7, if there is
a "+" before the first "?>", then preprocess the entity through a UTF-7
decoder and see if the correct header emerges. 100% reliable.

So instead of "this algorithm is not reliable", it should be "some
encodings (e.g. UTF-7) may require an extra decoding stage for
autodetection". That is quite different.

Otherwise, what is being done is saying
 1) the autodetection algorithm is completely described by the Appendix F
    algorithm;
 2) the Appendix F algorithm does not handle some character encodings;
 3) therefore autodetection does not cope with some character encodings.
But the first step is wrong.
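To make that extra decoding stage concrete, here is a rough sketch in
Python. It is purely illustrative: the function name, the byte window and
the regular expression are my own choices, not anything taken from
Appendix F or from any existing processor.

    import re

    def detect_utf7_declaration(head: bytes) -> bool:
        """Return True if a UTF-7 decoding pass over the first bytes of an
        entity yields an XML declaration that itself says encoding="utf-7".

        Illustrative sketch only; nothing here is mandated by the XML spec
        or taken from any particular processor.
        """
        head = head[:1024]              # a bounded prefix is plenty
        plain_end = head.find(b"?>")    # declaration already visible in ASCII?
        shift = head.find(b"+")         # the UTF-7 shift-in character
        if shift == -1 or (plain_end != -1 and plain_end < shift):
            return False                # no "+" before the first "?>"
        # The extra stage: run the prefix through a UTF-7 decoder and see
        # whether the correct header emerges.
        text = head.decode("utf_7", errors="ignore")
        return re.match(r'<\?xml\s[^?]*encoding\s*=\s*["\']utf-7["\']',
                        text, re.IGNORECASE) is not None

Fed John's fully Base64-shifted example above, the decoding pass recovers
<?xml version="1.0" encoding="utf-7"?> and the entity is detected as UTF-7;
fed anything that does not decode to such a header, it simply answers "no"
and the ordinary cases are unaffected.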
> > Why is it true that external parsed entities in UTF-16 may begin with any
> > character?
>
> The nature of an external parsed entity is that although it has to be
> balanced with respect to tags, it may begin with character data.
> External parsed entities must match the production rule "content".

4.3.1 says "External parsed entities may each begin with a text
declaration". Entity handling occurs prior to parsing. Therefore
autodetection must occur first. So if an external parsed entity does switch
to UTF-8 or UTF-16 with no XML header, is there any known string of
code-points which could confuse things? Yes: if the entity started with the
UTF-7 data given by John above, then it could be misdetected as a UTF-7 XML
header by a processor that understood UTF-7.

So the lesson is that anyone who is worried that the start of an external
parsed XML entity might be mistaken for an XML encoding PI encoded in UTF-7
should make sure the entity starts with an explicit XML encoding PI. This
problem arises from allowing a default encoding: if the data is labelled
explicitly there is no problem. In fact, what John is saying is not that
UTF-7 detection is unreliable, but that UTF-8 defaulting is (in at least
one rare case) wrong.

> > That is a bug which should be fixed up. In the absence of
> > overriding higher-level out-of-band signalling, an XML entity must be
> > required to identify its encoding unambiguously.
>
> Impossible in principle. If you know absolutely nothing about the
> encoding, you cannot even read the encoding declaration. Autodetection is
> and can be only a partial solution.

Rubbish. XML should be based on only allowing encodings that can be
autodetected. Infeasible encodings should not be allowed--if they do exist.

> > The wrong thing to do
> > would be to say "Autodetection is unreliable"--it must be reliable, and
> > the rest of XML 1.0 must not have anything that prevents it from being
> > reliable.
>
> That is not XML 1.0.

As an official member comment from Academia Sinica, anything in XML 1.0
that suggests otherwise should be regarded as an error and fixed. Any new
text or errata must not give the impression that there is any known
situation in which it is impossible to mark a document up with a correct
encoding declaration that a receiving processor can read without confusion.

Furthermore, I would ask that Appendix F strongly recommend the use of an
XML header in all parsed entities, to prevent the problem with the
unreliability of UTF-8 defaulting, among other reasons.

> > To put it another way, if a character encoding cannot reliably be
> > autodetected, it should be banned from being used with XML. But I have
> > still yet to find any encodings that fit into this category.
>
> At present, autodetection handles only:
>
> UTF-8 (by default),
> various UTF-16 flavors (perhaps only UTF-16, maybe UTF-16BE/LE as well),
> various UTF-32 (UCS-4) flavors,
> ASCII-compatible encodings (guaranteed to encode the declaration in ASCII),
> EBCDIC encodings.
>
> This leaves UTF-7 out, since it is not guaranteed to encode the encoding
> declaration in ASCII.

Wrong, for the reasons above. Appendix F is not normative; it does not
define or limit autodetection. As long as the header has been added, the
only case in which detection is unreliable is when two encodings encode the
relevant XML declaration with exactly the same bytes: and the chances of
this are slim. Detection can always answer reliably, either "I know this"
or "I don't know this": it need never say the former when the latter is
true.

I think the confusion arises from reading too much into the sentence in
Appendix F that "Because the contents of the encoding declaration are
restricted to ASCII characters, a processor can reliably read the entire
encoding declaration as soon as it has detected which family of encodings
is in use", taking it to mean that 8-bit autodetection promises to be a
function which always produces a result. To prove that autodetection is, in
some circumstance, unreliable it is not enough to show that one algorithm
has a limit; it must be shown that there are ambiguous encodings. And even
in that case (which I doubt exists) the solution is merely that the rarer
of the encodings cannot be used for XML.

So, instead of the UTF-7 comments, it would be better to say in paragraph 1
of Appendix F "each implementation is assumed to support and autodetect
only a finite set of character encodings, and the XML encoding declaration
is restricted in position and content in order to make it feasible to
autodetect the character encoding in use in each entity in normal cases."
(I.e., add "and autodetect".)
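That "finite set" wording is easy to satisfy in code. Here is an equally
rough sketch of a family sniffer over such a finite set, building on the
UTF-7 sketch given earlier; the byte patterns are the familiar Appendix F
ones, while the function shape, the names and the UTF-7 hook are my own
illustration. The point is that it answers either "I know this" or "I don't
know this", and never guesses.

    from typing import Optional

    def sniff_encoding_family(head: bytes) -> Optional[str]:
        """Return the detected encoding family, or None for "I don't know this".
        Byte patterns as in Appendix F; everything else is illustrative."""
        if head.startswith((b"\x00\x00\xfe\xff", b"\xff\xfe\x00\x00")):
            return "UCS-4 (byte order mark)"
        if head.startswith(b"\xfe\xff"):
            return "UTF-16, big-endian (byte order mark)"
        if head.startswith(b"\xff\xfe"):
            return "UTF-16, little-endian (byte order mark)"
        if head.startswith((b"\x00\x00\x00<", b"<\x00\x00\x00",
                            b"\x00\x00<\x00", b"\x00<\x00\x00")):
            return "UCS-4, no byte order mark (read the declaration)"
        if head.startswith(b"\x00<\x00?"):
            return "UTF-16, big-endian, no byte order mark"
        if head.startswith(b"<\x00?\x00"):
            return "UTF-16, little-endian, no byte order mark"
        if head.startswith(b"<?xm"):
            return "UTF-8 or another ASCII-family encoding (read the declaration)"
        if head.startswith(b"\x4c\x6f\xa7\x94"):
            return "EBCDIC family (read the declaration)"
        if detect_utf7_declaration(head):  # the extra stage sketched earlier
            return "UTF-7"
        return None                        # unknown: say so, never guess

An encoding outside the supported set is reported as unknown, never
misreported as something else, which is all the reliability the appendix
needs to claim.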
Rick Jelliffe

Received on Wednesday, 5 April 2000 14:49:35 UTC