PE134 from François Yergeau on 2004-10-20 (public-xml-core-wg@w3.org from October 2004)

From: François Yergeau <francois@yergeau.com>
Date: Wed, 20 Oct 2004 15:05:12 -0400
To: public-xml-core-wg@w3.org
Message-id: <4176B6E8.4040601@yergeau.com>
Paul Grosso wrote:
> CONSENSUS to explain that the paragraph about "characters #x85 
> and #x2028" is only about the XML declaration, and the problem
> is that one hasn't yet processed the encoding declaration, so
> we don't want to complicate issues by allowing these characters 
> here.  CONSENSUS to make no change to the spec.
> 
> ACTION to Francois:  Write a proposed response and put into countdown.

Done.  Here's the proposed resolution:

No change to the spec. Respond to the commenter's 3 questions as follows:

Q: Why is it (in theory) not possible to recognize these characters 
reliably or (in theory) with less reliability than recognizing any other 
character such as U+0020?

A: Appendix E explains how to determine the encoding in two steps: first 
use the first few bytes of an entity to determine an encoding family 
(e.g. ASCII-based, EBCDIC-based, 16- or 32-bit Unicode), then use that 
information to analyse the encoding declaration and determine the 
encoding fully. It turns out that knowing the encoding family is 
sufficient to reliably recognize U+0020 SPACE as well as most ASCII 
characters, but is not sufficient to do the same for NEL (U+0085) and 
U+2028. To ensure that the encoding declaration can be analysed and the 
encoding reliably determined, U+0085 and U+2028 were therefore forbidden 
from appearing in the XML/text declaration.
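
As a rough illustration of the point above (not part of the proposed 
response), the following Python sketch compares how three ASCII-based 
encodings represent U+0020, U+0085 and U+2028. The helper name 
encoded_forms and the choice of encodings are assumptions made for the 
example; only standard codecs are used.

```python
# Illustrative sketch: within one encoding family (ASCII-based), SPACE has
# a single byte form, while NEL and U+2028 do not.
SPACE, NEL, LS = "\u0020", "\u0085", "\u2028"


def encoded_forms(ch, encodings=("utf-8", "iso-8859-1", "cp1252")):
    """Map each encoding name to the bytes for ch, or None if unencodable.

    Hypothetical helper, named for this example only.
    """
    forms = {}
    for name in encodings:
        try:
            forms[name] = ch.encode(name)
        except UnicodeEncodeError:
            forms[name] = None
    return forms


for label, ch in (("U+0020", SPACE), ("U+0085", NEL), ("U+2028", LS)):
    print(label, encoded_forms(ch))
```

SPACE is the byte 0x20 in all three encodings, so it can be recognized 
from the family alone. NEL is 0xC2 0x85 in UTF-8 and 0x85 in ISO-8859-1, 
while in windows-1252 the byte 0x85 is a different character (U+2026), so 
a lone 0x85 cannot be interpreted before the encoding is fully known.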

Q: How can a processor detect this error if it is not possible to 
recognize the offending characters reliably?

A: It is possible to recognize those characters reliably after 
determining the encoding. If the encoding is determined from information 
provided by an external transport protocol, then parsing can proceed and 
the presence of U+0085 or U+2028 can immediately be detected and flagged 
as a fatal error. If the processor is relying on analysis of the 
encoding declaration to determine the encoding, then this analysis will 
likely fail in the presence of U+0085 or U+2028, preventing the 
processor from continuing.

Q: How can a processor detect this error if it is not possible that 
these characters are present when parsing the XML declaration due to 
line break normalization?

A: The statement in 2.11 "...it is a fatal error to use [U+0085 or 
U+2028] within the XML declaration or text declaration" creates an 
obligation for the processor to detect them before performing line break 
normalization. Yes, this is somewhat ugly since a layer of processing 
that could be entirely context-free now depends on the state of the 
parser (whether or not it is within an XML/text declaration).
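
The ordering described above can be sketched in Python as follows. This 
is a minimal illustration under stated assumptions, not how any 
particular processor does it: the input is assumed to be already decoded 
text, the declaration is located with a simple prefix test rather than a 
real parser, and the name check_then_normalize is invented for the 
example. The translation rules are those of XML 1.1 section 2.11.

```python
import re

FORBIDDEN = ("\u0085", "\u2028")

# XML 1.1 line-end sequences, all translated to a single #xA.
# Two-character sequences are listed first so they match before a lone #xD.
_LINE_END = re.compile("\r\n|\r\u0085|\r|\u0085|\u2028")


def check_then_normalize(text):
    """Reject U+0085/U+2028 in the XML/text declaration, then normalize.

    Hypothetical helper: the check must run on the raw text, because after
    normalization the offending characters have already become #xA.
    """
    # Simplified test for a declaration: '<?xml' followed by a separator.
    # This deliberately does not match targets such as '<?xml-stylesheet'.
    if text.startswith("<?xml") and text[5:6] in (" ", "\t", "\r", "\n") + FORBIDDEN:
        end = text.find("?>")
        decl = text if end == -1 else text[: end + 2]
        for ch in FORBIDDEN:
            if ch in decl:
                raise ValueError(
                    "fatal error: U+%04X in XML/text declaration" % ord(ch)
                )
    return _LINE_END.sub("\n", text)
```

For example, check_then_normalize('<?xml version="1.1"\u0085?><a/>') 
raises ValueError, whereas the same character after the closing "?>" is 
simply translated to a line feed.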


>>[7]
>>http://www.w3.org/XML/2004/02/proposed-xml10-3e-and-xml11-errata.html

-- 
François
Received on Wednesday, 20 October 2004 19:05:58 UTC