- From: Karl Dubost <karl@w3.org>
- Date: Tue, 16 Jan 2001 10:46:11 +0100
- To: xml-editor@w3.org
- Cc: duerst@w3.org, hugo@w3.org
In the process of writing a note, we (Hugo Haas and Karl Dubost) ran
into a problem concerning the behavior of a browser when there is a
character encoding error inside an XML document.
For HTML the sequence is well defined, but for XML it is unclear.
I am sending you the mail in which we discussed this topic, plus the
comments from Martin Duerst in [Martin: ....]. I would like to know if
you have any comments on this issue, if it is an issue. Thanks.
********************************************
1. For HTML, it's quite easy.
http://www.w3.org/TR/html401/charset.html
It's very clear http://www.w3.org/TR/html401/charset.html#h-5.2.2
"How does a user agent know which character encoding has been used?"
1.1 "The server should provide this information". -> HTTP Header.
1.2 else if, "Therefore, user agents must not assume any default
value for the "charset" parameter."
1.3 But "HTML documents may include explicit information about the
document's character encoding" -> META
1.4 else if, "For cases where neither the HTTP protocol nor the META
element provides information about the character encoding of a
document, HTML also provides the charset attribute on several
elements. "
1.5 else if, "the user agent may use heuristics and user settings."
1.6 else if, "User agents may provide a mechanism that allows users
to override incorrect "charset" information."
As you can see, there is a series of tests and conditions in priority order.
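The sequence in 1.1 to 1.6 can be sketched as a simple fallback chain. This is only an illustration of the priorities as read above, not something taken from the HTML 4.01 text: the function name, its parameters and the returned labels are all hypothetical, and letting the user override win over everything else is an assumption about how 1.6 is meant to apply.

```python
from typing import Optional, Tuple


def resolve_html_charset(
    http_charset: Optional[str] = None,     # 1.1: charset parameter of the HTTP header
    meta_charset: Optional[str] = None,     # 1.3: META declaration in the document
    attr_charset: Optional[str] = None,     # 1.4: charset attribute on the referring element
    heuristic_guess: Optional[str] = None,  # 1.5: heuristics and user settings
    user_override: Optional[str] = None,    # 1.6: explicit override by the user
) -> Tuple[Optional[str], str]:
    """Return (charset, source) following the priority order sketched in 1.1-1.6."""
    # Assumption: an explicit user override supersedes any declared information.
    candidates = [
        (user_override, "user-override"),
        (http_charset, "http"),
        (meta_charset, "meta"),
        (attr_charset, "charset-attribute"),
        (heuristic_guess, "heuristics"),
    ]
    for charset, source in candidates:
        if charset:
            return charset.lower(), source
    # 1.2: no default value may be assumed, so report "unknown" instead of guessing.
    return None, "unknown"


if __name__ == "__main__":
    print(resolve_html_charset(http_charset="UTF-8", meta_charset="iso-8859-1"))
    print(resolve_html_charset(meta_charset="ISO-8859-1"))
    print(resolve_html_charset())
```

The point of the sketch is that every step has a defined successor, which is exactly what seems to be missing on the XML side.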
2. For XML, it's not the same (when you read the spec). It's
difficult to tell whether this is a bug in the spec or merely a point
that is ambiguous.
Character encoding http://www.w3.org/TR/REC-xml#charencoding
2.1 An XML processor (a user agent could be an XML processor) must read
UTF-8 and UTF-16
-> ok
2.2 "Although an XML processor is required to read only entities in
the UTF-8 and UTF-16 encodings, it is recognized that other encodings
are used around the world, and it may be desired for XML processors
to read entities that use them."
-> ok
2.3 "In the absence of external character encoding information (such
as MIME headers), parsed entities which are stored in an encoding
other than UTF-8 or UTF-16 must begin with a text declaration"
-> for example <?xml version="1.0" encoding="iso-8859-1"?>
-> ok
2.4 "In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is an error for an entity including
an encoding declaration to be presented to the XML processor in an
encoding other than that named in the declaration, or for an entity
which begins with neither a Byte Order Mark nor an encoding
declaration to use an encoding other than UTF-8."
-> error = A violation of the rules of this specification;
results are undefined. Conforming software may detect and report an
error and may recover from it. (http://www.w3.org/TR/REC-xml#dt-error)
-> results are undefined; the processor may recover, but how?
[Martin:
********
This is indeed a bit strange, in that the definition for 'error' implies
that it's an error in the XML document, whereas the text above speaks
about an error (i.e. erroneous behaviour) of the XML processor.
Maybe it would be better to reword this as 'an XML processor is not
allowed to...', because if the XML processor is already committing the
error, then there is probably no need for additional recovery on top
of that.
I think it might be a good idea to send this one to the xml-editor list,
can you please do that?
******]
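To make 2.3 and 2.4 concrete, here is a minimal sketch of what a processor can do when there is no external encoding information: look for a Byte Order Mark, then for an encoding declaration, and otherwise fall back to UTF-8. The function names are hypothetical, and the sketch is deliberately simplified: it does not handle UTF-16 without a BOM, UTF-32, or EBCDIC. It also shows the limit of detection: a mismatch between the declared and the actual encoding can only be noticed when the octets happen to be illegal in the declared encoding.

```python
import codecs
import re
from typing import Optional

# Encoding declaration at the very start of an ASCII-compatible entity.
_DECL = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an entity when no transport protocol supplies one."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # Neither a BOM nor an encoding declaration: the entity has to be UTF-8.
    return "utf-8"


def check_entity(data: bytes) -> Optional[str]:
    """Return None if the octets are legal in the sniffed encoding, else a message."""
    encoding = sniff_xml_encoding(data)
    try:
        data.decode(encoding)
    except (UnicodeDecodeError, LookupError) as exc:
        return f"{encoding}: {exc}"
    return None


if __name__ == "__main__":
    declared = b'<?xml version="1.0" encoding="iso-8859-1"?><p>caf\xe9</p>'
    undeclared = b"<p>caf\xe9</p>"
    print(sniff_xml_encoding(declared), check_entity(declared))
    print(sniff_xml_encoding(undeclared), check_entity(undeclared))
```

The second entity is the case described in 2.4: no BOM, no declaration, and octets that are not UTF-8.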
and a paragraph later
2.5 "It is a fatal error when an XML processor encounters an entity
with an encoding that it is unable to process. It is a fatal error if
an XML entity is determined (via default, encoding declaration, or
higher-level protocol) to be in a certain encoding but contains octet
sequences that are not legal in that encoding. It is also a fatal
error if an XML entity contains no encoding declaration and its
content is not legal UTF-8 or UTF-16."
-> fatal error = "An error which a conforming XML processor
must detect and report to the application. After encountering a fatal
error, the processor may continue processing the data to search for
further errors and may report such errors to the application. In
order to support correction of errors, the processor may make
unprocessed data from the document (with intermingled character data
and markup) available to the application. Once a fatal error is
detected, however, the processor must not continue normal processing
(i.e., it must not continue to pass character data and information
about the document's logical structure to the application in the
normal way)"
-> ok, the processor *** must not *** continue. But what will the
processor do? Flush the application? Notify the user?
[Martin:
********
I think part of your problem is that we don't have any idea yet about
what a fatal error means in the context of an XML browser. I think
this could be a very interesting topic for you. The reason the
XML spec is in a sense so vague here is that it has to work for
all kinds of processors.
This is not a problem with character encoding, but much more
general. What would you expect a browser to do when it
sees something like
<i>...<b>...</i>...</b>..., obviously not well-formed XML ?
******]
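As one data point, this is how a non-browser processor surfaces both kinds of fatal error. The sketch uses Python's standard xml.etree.ElementTree module, which is backed by the expat parser; the helper name try_parse is hypothetical. It only shows what this particular processor does (raise an exception and hand no document tree to the application) and says nothing about what a browser ought to do.

```python
import xml.etree.ElementTree as ET


def try_parse(data: bytes) -> str:
    """Return 'ok: <root tag>' on success, or 'fatal: <message>' if parsing stops."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError as exc:
        # The parser reports the error and delivers no tree to the caller.
        return f"fatal: {exc}"
    return f"ok: {root.tag}"


samples = {
    # Declared as iso-8859-1, and the octet 0xE9 is legal there.
    "declared latin-1": b'<?xml version="1.0" encoding="iso-8859-1"?><p>caf\xe9</p>',
    # No declaration, so UTF-8 is assumed; a lone 0xE9 is not legal UTF-8.
    "undeclared latin-1": b"<p>caf\xe9</p>",
    # Martin's example: overlapping elements, not well-formed.
    "overlapping tags": b"<p><i>a<b>b</i>c</b></p>",
}

if __name__ == "__main__":
    for name, data in samples.items():
        print(f"{name}: {try_parse(data)}")
```

In both failing cases the application gets an error and nothing else, which is consistent with "must not continue normal processing" but leaves the user-facing behavior entirely to the application.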
--
Karl Dubost / W3C - Conformance Manager
http://www.w3.org/
--- Be Strict To Be Cool! ---
Received on Tuesday, 16 January 2001 04:49:33 UTC