- From: Joel A. Nava <jnava@Adobe.COM>
- Date: Fri, 14 May 1999 19:06:38 -0400 (EDT)
- To: <www-dom@w3.org>
- Cc: "Joel A. Nava" <jnava@Adobe.COM>
Arkin,

I read with great interest the RFC that you have authored on "White Space Handling In XML Parsing." It has taken me a little while to get this response back to you, but I wanted to get everything right. If I have misunderstood anything in my analysis, please let me know.

As one of the members of the working group which produced the XML 1.0 Recommendation, I can recall many discussions we had on white space handling, and believe that I correctly remember the intent behind the specification of white space handling in the XML 1.0 REC.

Here is my analysis of:

  http://www.openxml.org/dev/rfc-wshp.html

Note: to start with, the RFC, in Section 3, defines an XML Parser as something that takes an XML document and delivers a DOM Tree. It defines an XML Application as something which manipulates the DOM Tree. Since I am so used to how these terms are used in the XML REC, when I quote from the RFC, I have taken the liberty of re-writing those portions in the XML 1.0 terminology. I hope that this does not make my review too hard to follow.

Section 3 accurately describes the terms from the XML REC perspective, except for this: depending on your viewpoint, an XML processor that builds a tree can be seen as an XML parser, or as an XML parser + a tree building application. I do not believe that this difference in allowable perspective changes the problem or the solution in any appreciable way.

So, the abstract would then read:

  White space handling is an unresolved issue in the present definition of XML parsers and DOM tree builders, falling outside the scope of both the DOM specification and the SAX API. This is a recommendation for the behavior of XML parsers and DOM Tree building applications in regards to white space appearing in the DOM Tree, and what portions are to be delivered to an application accessing the DOM tree.

I agree that whitespace handling issues do indeed fall outside of the DOM REC and the SAX API, since these documents do not describe white space handling.
They defer to what the XML REC has to say.

My summary of the problem described is: detecting the difference between significant white space and insignificant white space, i.e. white space used just for pretty XML in a text editor.

Scope: Whitespace handling in element or mixed content only, not markup or attribute values. The spec is not useful for applications that want to process redundant white space, or for XSL, XQL, or other processing languages.

I agree that the problem that the RFC describes is a problem, and it does fall within the scope as stated.

Assumption: the document itself is capable of distinguishing between relevant and redundant white space. You have to make this assumption, or the problem is insoluble.

Goal: Consistent white space in the DOM calls on the same document across different parser + DOM Tree builder applications.

Since the W3C DOM does not make any changes to the whitespace handling rules, we can think of an XML processor that parses and builds a W3C DOM tree as an XML processor. The same can be said of any Document Model that does not change the whitespace handling rules. When a Document Model does define different white space handling rules, then we must view the parser as the XML processor, and the tree builder as an application, in XML REC terms.

The document next defines default behavior, and then alternate behavior when the xml:space attribute is in use.

For reference, the XML 1.0 Spec says on white space handling in the second paragraph of section 2.10:

  An XML processor must always pass all characters in a document that are not markup through to the application. A validating XML processor must also inform the application which of these characters constitute white space appearing in element content.

Also, Section 3.2.1 can be paraphrased to say: valid elements with element content can have optional white space between pairs of child elements.
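To make the section 2.10 behavior concrete, here is a minimal sketch using Python's standard xml.dom.minidom, which sits on a non-validating parser. The sample document and the helper name child_summary are my own illustrative assumptions; they do not come from the RFC or the XML REC. The sketch only shows that white space between child elements is passed through as text nodes, and that a non-validating processor gives the application no signal as to whether that white space sits in element content.

```python
from xml.dom import minidom

# Hypothetical sample: "pretty printed" markup with no DTD, so the
# parser cannot know whether <list> has element content or mixed content.
DOC = "<list>\n  <item>a</item>\n  <item>b</item>\n</list>"


def child_summary(xml_text):
    """Return (kind, value) pairs for the children of the root element."""
    root = minidom.parseString(xml_text).documentElement
    summary = []
    for node in root.childNodes:
        if node.nodeType == node.TEXT_NODE:
            summary.append(("text", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            summary.append(("element", node.tagName))
    return summary


children = child_summary(DOC)

# Text children that consist only of white space. The application has to
# decide for itself whether these are significant.
whitespace_only = [
    data for kind, data in children if kind == "text" and not data.strip()
]
```

In this sketch every text child of the root is white space only, yet all of them reach the tree. Deciding that they are insignificant is left to the application, which is the gap the RFC is trying to close.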
The Document's Default rules follow with my comments in []:

1) The first sequence of white space immediately after the opening tag and the last sequence of white space immediately before the closing tag are ignored.

[This may violate what the user expects of their white space in mixed content, though this rule could be part of the application's default behavior, such as when the application is a tree builder that defines new white space handling rules and is not W3C DOM compliant.]

2) All non-space characters (tab and new-line) are translated into a space character, and all multiple space characters are consolidated into a single space.

[Same as 1.]

3) Sequence of white space occurring between any two markups (elements, comments, processing instructions, CDATA) except when appearing between two elements, is ignored.

[Strictly speaking, this is not in harmony with the XML REC, but this could be defined by an application, as described in section 1. On the other hand this does not violate the spirit of the XML REC, and a conformant parser and W3C DOM tree builder may in fact behave this way. So this is not a problem.]

4) Sequence of white space occurring between two elements is ignored if the element is defined to have element content. If the element is defined to have mixed content, such white space is treated according to the first two rules.

[Same as 1 and 2 in the case of mixed content. Also, this leads to the requirement that Well Formedness processors must also report insignificant white space in element content, which they are not currently required to do but can if they want to, just like Validating parsers.]

5) White space introduced through expansion of character references (e.g.  ) or entity references is preserved, and not considered white space per the above rules. However, white space appearing in the entity declaration is subject to the parsing rules at the time of parsing the entity declaration.
[Nothing in the XML REC indicates that a processor would signal that it has gotten white space from an NCR or a parsed entity. So, this is an additional requirement for XML parsers. It also does not seem to me to be in line with the spirit of the XML REC. The white space passed on from NCR or entity expansion falls under the same rules as if the contents of the NCRs or entities had just been written in place. I am guessing that the second sentence is indicating conformance with section 4.5 "Construction of Internal Entity Replacement Text" and appendix D "Expansion of Entity and Character References" in the XML REC. A conforming XML processor must currently follow this rule.]

6) CDATA sections preserve all white space occurring between the opening <![CDATA[ and closing ]]>.

[This is what the XML REC requires.]

Now we get to the rules to follow if xml:space is in use. For reference, the relevant part of the XML REC:

  A special attribute named xml:space may be attached to an element to signal an intention that in that element, white space should be preserved by applications. In valid documents, this attribute, like any other, must be declared if it is used. When declared, it must be given as an enumerated type whose only possible values are "default" and "preserve".

  The value "default" signals that applications' default white-space processing modes are acceptable for this element; the value "preserve" indicates the intent that applications preserve all the white space. This declared intent is considered to apply to all elements within the content of the element where it is specified, unless overridden with another instance of the xml:space attribute.

  The root element of any document is considered to have signaled no intentions as regards application space handling, unless it provides a value for this attribute or the attribute is declared with a default value.
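The scoping described in that passage can be sketched in a few lines of Python with the standard xml.dom.minidom. The function name effective_space and the sample document are hypothetical, introduced only for illustration. Note one assumption in particular: the XML REC says a root element without the attribute has "signaled no intentions", and this sketch chooses to report that case as "default".

```python
from xml.dom import minidom

# Namespace URI that the reserved "xml" prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def effective_space(element):
    """Return the xml:space value in scope for element.

    Walks from the element up through its ancestors and returns the
    nearest explicit 'default' or 'preserve'.
    """
    node = element
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        value = node.getAttributeNS(XML_NS, "space")
        if value in ("default", "preserve"):
            return value
        node = node.parentNode
    # Assumption of this sketch: no declared intent is treated as 'default'.
    return "default"


# Hypothetical sample: <pre> asks for preservation, <note> overrides it.
DOC = (
    '<doc><pre xml:space="preserve"><code> x </code>'
    '<note xml:space="default"><b>y</b></note></pre><p>z</p></doc>'
)
dom = minidom.parseString(DOC)
scopes = {
    name: effective_space(dom.getElementsByTagName(name)[0])
    for name in ("doc", "pre", "code", "note", "b", "p")
}
```

Here code inherits "preserve" from pre, while b picks up the "default" that note uses to override it. Since minidom does not read an external DTD, attribute defaults declared there would not be visible to this sketch; only attributes written in the instance (or defaulted in the internal subset, if the parser supplies them) are considered.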
The alternate rules that are in use when xml:space is in use follow, again with my comments in []:

1) An element requests that white space be preserved by specifying the attribute 'xml:space' and using the value 'preserve'. The element may specify this attribute explicitly or inherit it from the document type definition. It is recommended that elements specify this attribute explicitly.

[This is what the XML REC requires.]

2) Preserving implies that white space is passed as is to the application, without any transformation or loss, with the exception that, if the first character after the opening tag is a new-line or the last character before the closing tag is a new-line, they are ignored.

[The RFC previously acknowledged the need to follow the line end normalization process as specified in the XML REC. So, all of (2) is what the XML REC requires.]

3) Elements that do not specify a value for the 'xml:space' attribute inherit that value from the element in which they are contained up to the root element. If the root element does not specify a value for the 'xml:space' attribute, the value 'default' is assumed.

[This is what the XML REC requires.]

4) It is possible to instruct the XML parser to supply the root element with the 'preserve' value for the 'xml:space' attribute, if no value is explicitly specified for it. (The exact mechanism to TBD)

[This is an additional requirement on an XML parser not contained in the XML 1.0 REC. If the parser was wrapped in an application, though, this could be legal: the application could go in and make sure that xml:space='preserve' was applicable to the root element, whether by explicitly putting this on the root element, or by adding a default ATTLIST declaration for the root element.]

5) When expanding an entity reference, the value of the 'xml:space' attribute of the element in which the entity is expanded has no effect on the expansion of the entity.

[Huh? xml:space values are just passed on by the parser to the application.
They can have nothing to do with entity expansion, unless this is saying that the contents of the entity are not subject to the xml:space attribute in scope at the reference point. That would be in violation of the XML REC.]

The last paragraph points out a problem alluded to earlier:

  This approach is clear and consistent, with the exception that validating and non-validating parsers will parse the same document differently.

My take on this problem: I think the difference is the required reporting and possible non-reporting of insignificant white space due to element content by validating and non-validating parsers respectively.

I believe that it is a mistake that the XML 1.0 REC requires

1) Only a Validating processor to indicate the insignificant white space.

While...

2) Acknowledging that the declaration of an element type with element content, where white space occurs directly within any instance of that element, changes the Information Set. A user can correctly say standalone=yes, and still get a different Information Set from the 2 classes of processors.

Because of this, and other document Information Set differences that can occur between a minimal Well Formedness processor and a Validating processor, I have made the following proposal for future work on XML. Since it was my proposal only, I can share it here on a public list. This does not imply anything about whether this proposal will be adopted.

==================================================================
An XML Full Information Set Processor

Proposed: Define a new class of XML processor that exists in the currently optional area in XML 1.0 between Validating XML processors and minimally conforming XML processors. This processor will be required to use all the data made available to it to build the complete Information Set of documents that it reads.
That means that it has to read and expand all external entities, read and use an external subset if declared, and expand all external parameter entities for markup.

Creators of XML that wish to use large external DTDs will not have to shove a load of markup into the internal subset of every document that they transmit so that the information set received by a minimally conforming XML processor will be complete.

It may be pointed out that a validating XML processor can provide the same information set as the proposed processor. Counter arguments are:

1) The author may not care about validation, just the information set.

2) The document may not be valid, especially when documents are authored that mix namespaces, and especially since validation has not been or may not be defined for mixed namespaces.

3) Validation may be much more costly when done using XML Schemas: in processing time, in processor footprint, and in the work needed to create a validating processor.

There are no conflicts with existing XML documents, and the proposal should be very easy to adapt to the XML Schema work, when it is done. The proposal does not change XML's conformance to ISO 8879.
==================================================================

CONCLUSION

The RFC is free to define what an XML application does with information that an XML processor passes to it. But it is not a good idea to violate the spirit of the XML REC. This would be confusing to the marketplace.

The RFC shows a very real problem in the XML 1.0 REC, and begs a fix that would require XML parsers to always report white space in element content.

In the meantime, before this is fixed or something like my proposal above is adopted, I think it would be good for the RFC to require that XML parsers that are in conformance with it report white space in element content, whether validating or not.
Most non-validating parsers written these days tend to do more than just the minimum required, and quite a few pass all of the Information Set of the document on, even when not validating.

I hope that this review has been of some value.

--
Joel A. Nava (408)536-6209
Adobe Systems, Inc. jnava@adobe.com
Received on Monday, 17 May 1999 08:16:32 UTC