- From: Karl Dubost <karl@w3.org>
- Date: Tue, 16 Jan 2001 10:46:11 +0100
- To: xml-editor@w3.org
- Cc: duerst@w3.org, hugo@w3.org
In the process of writing a note, we (Hugo Haas and Karl Dubost) were faced with a problem concerning the behavior of a browser when there is a character encoding error inside an XML document. For HTML the sequence is well defined, but for XML it is unclear.

I am sending you the mail in which we discussed this topic, plus the comments from Martin Duerst, marked [Martin: ....]. I would like to know whether you have any comments on this issue, if it is an issue.

Thanks.

********************************************

1. For HTML, it's quite easy.
   http://www.w3.org/TR/html401/charset.html

   It's very clear:
   http://www.w3.org/TR/html401/charset.html#h-5.2.2
   "How does a user agent know which character encoding has been used?"

   1.1 "The server should provide this information".
       -> HTTP header.
   1.2 else if, "Therefore, user agents must not assume any default value for the "charset" parameter."
   1.3 But "HTML documents may include explicit information about the document's character encoding"
       -> META
   1.4 else if, "For cases where neither the HTTP protocol nor the META element provides information about the character encoding of a document, HTML also provides the charset attribute on several elements. "
   1.5 else if, "the user agent may use heuristics and user settings."
   1.6 else if, "User agents may provide a mechanism that allows users to override incorrect "charset" information."

   As you can read, there is a sequence of tests and conditions in priority order.

2. For XML, it's not the same (when you read the spec). It's difficult to tell whether this is a bug in the spec or a point that is ambiguous.

   Character encoding
   http://www.w3.org/TR/REC-xml#charencoding

   2.1 An XML processor (a user agent could be an XML processor) must read UTF-8 and UTF-16
       -> ok
   2.2 "Although an XML processor is required to read only entities in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read entities that use them."
       -> ok
   2.3 "In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 must begin with a text declaration"
       -> for example <?xml version="1.0" encoding="iso-8859-1"?>
       -> ok
   2.4 "In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is an error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8."
       -> error = "A violation of the rules of this specification; results are undefined. Conforming software may detect and report an error and may recover from it." (http://www.w3.org/TR/REC-xml#dt-error)
       -> results are undefined; the software may recover, but how?

   [Martin: ******** This is indeed a bit strange, in that the definition for 'error' implies that it's an error in the XML document, whereas the text above speaks about an error (i.e. erroneous behaviour) of the XML processor. Maybe it would be better to reword this as 'an XML processor is not allowed to...', because if the XML processor is already committing the error, then there is probably no need for additional recovery on top of that. I think it might be a good idea to send this one to the xml-editor list, can you please do that? ******]

   and a paragraph later

   2.5 "It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains octet sequences that are not legal in that encoding. It is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16."
       -> fatal error = "An error which a conforming XML processor must detect and report to the application. After encountering a fatal error, the processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing (i.e., it must not continue to pass character data and information about the document's logical structure to the application in the normal way)"
       -> ok, the processor *** must not *** continue. But what will the processor do? Flush the application? Warn the user?

   [Martin: ******** I think part of your problem is that we don't have any idea yet about what a fatal error means in the context of an XML browser. I think this could be a very interesting topic for you. The reason the XML spec is in a sense so vague here is that it has to work for all kinds of processors. This is not a problem with character encoding, but much more general. What would you expect a browser to do when it sees something like <i>...<b>...</i>...</b>..., obviously not well-formed XML ? ******]

--
Karl Dubost / W3C - Conformance Manager
http://www.w3.org/

--- Be Strict To Be Cool! ---
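
To make the contrast between points 1 and 2 concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not what any browser or XML processor actually does. The names html_charset, xml_decode and FatalEncodingError are hypothetical. The sketch also assumes that externally supplied encoding information takes priority over the encoding declaration, which is one reading of 2.4, and it simply raises an exception where 2.5 says "fatal error", leaving open the question of what a browser should then show.

```python
import codecs


class FatalEncodingError(Exception):
    """Hypothetical stand-in for a 'fatal error' reported to the application."""


def html_charset(http_charset=None, meta_charset=None,
                 attr_charset=None, heuristic=None):
    """Pick a charset in the priority order listed in point 1.

    Returns (charset, source). When nothing is known, no default is
    assumed (point 1.2), so the charset is None.
    """
    candidates = (
        ("http", http_charset),        # 1.1 HTTP header
        ("meta", meta_charset),        # 1.3 META element
        ("attribute", attr_charset),   # 1.4 charset attribute
        ("heuristic", heuristic),      # 1.5 heuristics and user settings
    )
    for source, value in candidates:
        if value:
            return value.lower(), source
    return None, "unknown"


def xml_decode(data, external_charset=None, declared=None):
    """Decode an XML entity strictly, raising on the cases listed in 2.5.

    Assumption: external information (e.g. HTTP) wins over the
    encoding declaration, and both win over the UTF-8/UTF-16 default.
    """
    encoding = external_charset or declared
    if encoding is None:
        # No declaration: a UTF-16 Byte Order Mark selects UTF-16,
        # anything else is treated as UTF-8.
        if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            encoding = "utf-16"
        else:
            encoding = "utf-8-sig"  # UTF-8, tolerating an optional BOM
    try:
        return data.decode(encoding)
    except LookupError as exc:
        # An encoding the processor is unable to process.
        raise FatalEncodingError(f"unsupported encoding: {encoding}") from exc
    except UnicodeDecodeError as exc:
        # Octet sequences that are not legal in the determined encoding.
        raise FatalEncodingError(f"illegal octets for {encoding}") from exc
```

With this sketch, an ISO-8859-1 entity that carries no declaration and no external information raises FatalEncodingError, while the same bytes decode normally once declared="iso-8859-1" is passed. The HTML side never fails in this way: it just falls through to the next source of information.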
Received on Tuesday, 16 January 2001 04:49:33 UTC