XML Rec and Error behavior from Karl Dubost on 2001-01-16 (xml-editor@w3.org from January to March 2001)

From: Karl Dubost <karl@w3.org>
Date: Tue, 16 Jan 2001 10:46:11 +0100
To: xml-editor@w3.org
Cc: duerst@w3.org, hugo@w3.org
Message-Id: <p05010402b689c6e6d10d@[138.96.249.69]>
In the process of writing a note, we (Hugo Haas and Karl Dubost) were 
faced with a problem concerning the behavior of a browser when there 
is a character encoding error inside an XML document.

For HTML the sequence is well defined but in XML it's unclear.

I am sending you the mail in which we discussed this topic, plus the 
comments from Martin Duerst, marked [Martin: ....]. I would like to 
know if you have any comments on this issue, if it is an issue. Thanks.


********************************************
1. For HTML, it's quite easy.
http://www.w3.org/TR/html401/charset.html

It's very clear http://www.w3.org/TR/html401/charset.html#h-5.2.2

"How does a user agent know which character encoding has been used?"
1.1 "The server should provide this information". -> HTTP Header.
1.2 else if, "Therefore, user agents must not assume any default 
value for the "charset" parameter."
1.3 But "HTML documents may include explicit information about the 
document's character encoding" -> META
1.4 else if, "For cases where neither the HTTP protocol nor the META 
element provides information about the character encoding of a 
document, HTML also provides the charset attribute on several 
elements. "
1.5 else if, "the user agent may use heuristics and user settings."
1.6 else if, "User agents may provide a mechanism that allows users 
to override incorrect "charset" information."

As you can see, there is a sequence of tests and conditions in priority order.
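
The priority order in 1.1 to 1.6 can be sketched roughly as follows. 
This is only an illustration, not text from the HTML spec: the function 
name and its parameters are hypothetical, and letting the user override 
win over everything else is an assumption about how 1.6 is meant to 
interact with the other steps.

```python
from typing import Optional


def resolve_html_charset(
    http_charset: Optional[str] = None,     # 1.1: charset from the HTTP header
    meta_charset: Optional[str] = None,     # 1.3: charset from a META element
    element_charset: Optional[str] = None,  # 1.4: charset attribute on an element
    heuristic_guess: Optional[str] = None,  # 1.5: heuristics and user settings
    user_override: Optional[str] = None,    # 1.6: explicit override by the user
) -> Optional[str]:
    """Return the first available charset in priority order, or None."""
    # Assumption: an explicit user override beats all other sources,
    # since its purpose is to correct wrong "charset" information.
    if user_override:
        return user_override.lower()
    for candidate in (http_charset, meta_charset, element_charset, heuristic_guess):
        if candidate:
            return candidate.lower()
    # 1.2: no default value is assumed when nothing is known.
    return None
```

For example, resolve_html_charset(http_charset="ISO-8859-1", 
meta_charset="utf-8") picks the HTTP header value, and calling it with 
no arguments returns None instead of guessing a default.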

2. For XML, it's not the same (when you read the spec). It's 
difficult to tell whether this is a bug in the spec or just an 
ambiguous point.

Character encoding http://www.w3.org/TR/REC-xml#charencoding

2.1 XML processor (user agent could be an XML processor) must read 
UTF-8 and UTF-16
	-> ok

2.2 "Although an XML processor is required to read only entities in 
the UTF-8 and UTF-16 encodings, it is recognized that other encodings 
are used around the world, and it may be desired for XML processors 
to read entities that use them."
	-> ok

2.3 "In the absence of external character encoding information (such 
as MIME headers), parsed entities which are stored in an encoding 
other than UTF-8 or UTF-16 must begin with a text declaration"
	-> for example <?xml version="1.0" encoding="iso-8859-1"?>
	-> ok

2.4 "In the absence of information provided by an external transport 
protocol (e.g. HTTP or MIME), it is an error for an entity including 
an encoding declaration to be presented to the XML processor in an 
encoding other than that named in the declaration, or for an entity 
which begins with neither a Byte Order Mark nor an encoding 
declaration to use an encoding other than UTF-8."
	-> error = A violation of the rules of this specification; 
results are undefined. Conforming software may detect and report an 
error and may recover from it. (http://www.w3.org/TR/REC-xml#dt-error)
	-> results are undefined; the processor may recover, but how?

[Martin:
********
This is indeed a bit strange, in that the definition for 'error' implies
that it's an error in the XML document, whereas the text above speaks
about an error (i.e. erroneous behaviour) of the XML processor.
Maybe it would be better to reword this as 'an XML processor is not
allowed to...', because if the XML processor is already committing the
error, then there is probably no need for additional recovery on top
of that.

I think it might be a good idea to send this one to the xml-editor list,
can you please do that?
******]
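
To make 2.3 and 2.4 more concrete, here is a minimal sketch of how a 
processor might pick the encoding of an entity. It is an illustration 
under assumptions, not the algorithm of the XML spec: the function name 
is hypothetical, the regular expression only handles declarations 
written in an ASCII-compatible encoding, and it ignores the 
autodetection of UTF-16 without a Byte Order Mark.

```python
import codecs
import re
from typing import Optional

# Simplified pattern for <?xml ... encoding="..."?> at the start of an entity.
_ENCODING_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def guess_xml_encoding(data: bytes, external: Optional[str] = None) -> str:
    """Pick an encoding name: external info, then BOM, then declaration, then UTF-8."""
    # External information (e.g. an HTTP or MIME charset) is used when present.
    if external:
        return external.lower()
    # A Byte Order Mark identifies UTF-8 or UTF-16.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Otherwise look for an encoding declaration.
    match = _ENCODING_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # Neither a BOM nor a declaration: the entity has to be UTF-8.
    return "utf-8"
```

With the example from 2.3, guess_xml_encoding(b'<?xml version="1.0" 
encoding="iso-8859-1"?><a/>') gives "iso-8859-1". The sketch only names 
the encoding; whether the bytes really are in that encoding is the 
question raised by 2.4.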

and a paragraph later

2.5 "It is a fatal error when an XML processor encounters an entity 
with an encoding that it is unable to process. It is a fatal error if 
an XML entity is determined (via default, encoding declaration, or 
higher-level protocol) to be in a certain encoding but contains octet 
sequences that are not legal in that encoding. It is also a fatal 
error if an XML entity contains no encoding declaration and its 
content is not legal UTF-8 or UTF-16."

	-> fatal error = "An error which a conforming XML processor 
must detect and report to the application. After encountering a fatal 
error, the processor may continue processing the data to search for 
further errors and may report such errors to the application. In 
order to support correction of errors, the processor may make 
unprocessed data from the document (with intermingled character data 
and markup) available to the application. Once a fatal error is 
detected, however, the processor must not continue normal processing 
(i.e., it must not continue to pass character data and information 
about the document's logical structure to the application in the 
normal way)"
	-> ok, the processor *** must not *** continue normal 
processing. But what will the processor do? Flush the application? 
Warn the user?
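
The "detect and report" half of 2.5 can be sketched as below. This is 
an assumption-laden illustration: FatalXMLError and decode_entity are 
hypothetical names, and the sketch deliberately says nothing about what 
the application (for example a browser) does after the report, which is 
exactly the open question.

```python
class FatalXMLError(Exception):
    """Hypothetical exception used to report a fatal error to the application."""


def decode_entity(data: bytes, encoding: str = "utf-8") -> str:
    """Decode an entity strictly; any encoding problem becomes a fatal error."""
    try:
        # errors="strict" refuses octet sequences that are illegal in the encoding.
        return data.decode(encoding, errors="strict")
    except LookupError as exc:
        # The processor does not know this encoding at all.
        raise FatalXMLError(f"unsupported encoding: {encoding}") from exc
    except UnicodeDecodeError as exc:
        # The entity claims one encoding but contains bytes illegal in it.
        raise FatalXMLError(
            f"illegal octet sequence for {encoding} at byte {exc.start}"
        ) from exc
```

For instance, decode_entity(b"caf\xe9", "iso-8859-1") succeeds, while 
the same bytes treated as UTF-8 raise FatalXMLError, because the 
processor stops normal processing instead of guessing.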


[Martin:
********
I think part of your problem is that we don't have any idea yet about
what a fatal error means in the context of an XML browser. I think
this could be a very interesting topic for you. The reason the
XML spec is in a sense so vague here is that it has to work for
all kinds of processors.

This is not a problem with character encoding, but much more
general. What would you expect a browser to do when it
sees something like
<i>...<b>...</i>...</b>..., obviously not well-formed XML?
******]
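
As a small illustration of the overlapping-tags case above, here is how 
one existing XML processor (the ElementTree parser in the Python 
standard library) reacts. The helper name check_well_formed is 
hypothetical, and this shows only one processor's behavior, not what a 
browser should do.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple


def check_well_formed(text: str) -> Tuple[bool, Optional[str]]:
    """Return (True, None) if text parses, else (False, parser message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        # The parser stops at the first well-formedness error and reports it.
        return False, str(exc)
    return True, None


# Overlapping elements, as in the example above.
ok, message = check_well_formed("<i>a<b>b</i>c</b>")
print(ok, message)
```

The parser refuses the document and hands an error message to the 
caller; deciding what to show the user is left to the application.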
-- 
Karl Dubost / W3C - Conformance Manager
           http://www.w3.org/

      --- Be Strict To Be Cool! ---
Received on Tuesday, 16 January 2001 04:49:33 UTC