- From: François Yergeau <francois@yergeau.com>
- Date: Wed, 20 Oct 2004 15:05:12 -0400
- To: public-xml-core-wg@w3.org
Paul Grosso wrote:
> CONSENSUS to explain that the paragraph about "characters #x85
> and #x2028" is only about the XML declaration, and the problem
> is that one hasn't yet processed the encoding declaration, so
> we don't want to complicate issues by allowing these characters
> here. CONSENSUS to make no change to the spec.
>
> ACTION to Francois: Write a proposed response and put into countdown.

Done. Here's the proposed resolution:

No change to the spec. Respond to the commenter's 3 questions as follows:

Q: Why is it (in theory) not possible to recognize these characters reliably or (in theory) with less reliability than recognizing any other character such as U+0020?

A: Appendix E explains how to determine the encoding by first using the first few bytes of an entity to determine an encoding family (e.g. ASCII-based, EBCDIC-based, 16- or 32-bit Unicode) and then using that information to analyse the encoding declaration and determine the encoding fully. It turns out that knowing the encoding family is sufficient to reliably recognize U+0020 SPACE as well as most ASCII characters, but is not sufficient to do the same for NEL (U+0085) and U+2028. To ensure that the encoding declaration can be analysed and the encoding reliably determined, U+0085 and U+2028 were therefore forbidden from appearing in the XML/text declaration.

Q: How can a processor detect this error if it is not possible to recognize the offending characters reliably?

A: It is possible to recognize those characters reliably after determining the encoding. If the encoding is determined from information provided by an external transport protocol, then parsing can proceed and the presence of U+0085 or U+2028 can immediately be detected and flagged as a fatal error. If the processor is relying on analysis of the encoding declaration to determine the encoding, then this analysis will likely fail in the presence of U+0085 or U+2028, preventing the processor from continuing.
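To make the first two answers concrete, here is a small Python sketch of why the encoding family alone does not pin down NEL. The list of "ASCII-based" encodings and the helper name `decodings` are assumptions chosen for illustration; they do not come from the spec.

```python
# Illustrative only: a small, assumed sample of ASCII-based encodings.
ASCII_FAMILY = ["utf-8", "iso-8859-1", "cp1252"]

def decodings(raw: bytes) -> dict:
    """Decode the same bytes under each sample encoding (hypothetical helper)."""
    result = {}
    for name in ASCII_FAMILY:
        try:
            result[name] = raw.decode(name)
        except UnicodeDecodeError:
            result[name] = None  # these bytes are not valid in this encoding
    return result

# Byte 0x20 means SPACE in every member of the family.
space = decodings(b"\x20")

# Byte 0x85 is NEL in ISO-8859-1, a different character in windows-1252,
# and not a valid sequence on its own in UTF-8.
nel_byte = decodings(b"\x85")

# Going the other way, NEL and U+2028 need multi-byte sequences in UTF-8.
nel_utf8 = "\u0085".encode("utf-8")
ls_utf8 = "\u2028".encode("utf-8")
```

With only the family known, a processor can trust that 0x20 is a space, but it cannot tell which byte sequence, if any, stands for U+0085 or U+2028 until the encoding declaration itself has been read.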
Q: How can a processor detect this error if it is not possible that these characters are present when parsing the XML declaration due to line break normalization?

A: The statement in 2.11 "...it is a fatal error to use [U+0085 or U+2028] within the XML declaration or text declaration" creates an obligation for the processor to detect them before performing line break normalization. Yes, this is somewhat ugly, since a layer of processing that could be entirely context-free now depends on the state of the parser (whether or not it is within an XML/text declaration).

>>[7]
>>http://www.w3.org/XML/2004/02/proposed-xml10-3e-and-xml11-errata.html

--
François
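The ordering described in the last answer (check the declaration first, normalize afterwards) could look roughly like the Python sketch below. It assumes the entity has already been decoded to a string, and the way it locates the declaration is deliberately naive; `normalize_entity` and the other names are hypothetical.

```python
import re

FORBIDDEN_IN_DECL = ("\u0085", "\u2028")

# XML 1.1 line ends that are translated to a single #xA.
_LINE_END = re.compile("\r\n|\r\u0085|\u0085|\u2028|\r")

def normalize_entity(text: str) -> str:
    """Reject NEL/LS inside the declaration, then normalize line ends."""
    decl_end = 0
    # Naive assumption: a leading "<?xml" starts the XML/text declaration
    # and the first "?>" ends it.
    if text.startswith("<?xml"):
        close = text.find("?>")
        if close != -1:
            decl_end = close + 2
    declaration = text[:decl_end]
    # The check must run on the raw text; after normalization the
    # offending characters would already have been turned into #xA.
    if any(ch in declaration for ch in FORBIDDEN_IN_DECL):
        raise ValueError("U+0085 or U+2028 inside XML/text declaration")
    return _LINE_END.sub("\n", text)
```

For example, `normalize_entity('<?xml version="1.1"?>\u0085<a/>')` returns the text with NEL replaced by a newline, while a NEL placed before the closing `?>` raises `ValueError`. The state dependence is visible in the code: the normalizer has to know where the declaration ends before it can treat the rest of the input uniformly.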
Received on Wednesday, 20 October 2004 19:05:58 UTC