Re: RFC: White Space Handling In XML Parsing

Joel,

Thanks for the review; it is proving very useful.

I will go over it in detail and either respond to specific issues, or
amend the RFC. My goal is to come up with an RFC that is compliant with
the XML REC and does not confuse users even further.

Before responding to you, I will take the time to post a needed
explanation on the www-dom mailing list of the user demands that
compelled me to write this RFC. I am trying to distinguish between the
purpose and use of this RFC in certain applications -- where I believe
it is long overdue -- and its compliance with the XML REC, which I am
trying to ensure.

Arkin


"Joel A. Nava" wrote:
> 
> Arkin,
> 
> I read with great interest the RFC that you
> have authored on "White Space Handling In
> XML Parsing." It has taken me a little while
> to get this response back to you, but I wanted
> to get everything right. If I have misunderstood
> anything in my analysis, please let me know. As
> one of the members of the working group which
> produced the XML 1.0 Recommendation, I can recall
> many discussions we had on white space handling,
> and believe that I correctly remember the intent
> behind the specification of white space handling
> in the XML 1.0 REC.
> 
> Here is my analysis of:
> http://www.openxml.org/dev/rfc-wshp.html
> 
> Note: to start with, the RFC, in Section 3, defines an
> XML Parser as something that takes an XML document,
> and delivers a DOM Tree. It defines an XML Application
> as something which manipulates the DOM Tree.  Since I
> am so used to how these terms are used in the XML REC,
> when I quote from the RFC, I have taken the liberty of
> re-writing those portions in the XML 1.0 terminology.
> I hope that this does not make my review too hard to
> follow. Section 3 accurately describes the terms from
> the XML REC perspective, except for this: Depending
> on your viewpoint an XML processor that builds a tree
> can be seen as an XML parser, or as an XML parser +
> a tree building application. I do not believe that this
> difference in allowable perspective changes the problem
> or the solution in any appreciable way.
> 
> So, the abstract would then read:
> 
>    White space handling is an unresolved issue
>    in the present definition of XML parsers and
>    DOM tree builders, falling outside the scope
>    of both the DOM specification and the SAX API.
>    This is a recommendation for the behavior of XML
>    parsers and DOM Tree building applications in
>    regards to white space appearing in the DOM Tree,
>    and what portions are to be delivered to an
>    application accessing the DOM tree.
> 
> I agree that whitespace handling issues do indeed
> fall outside of the DOM REC and the SAX API, since
> these documents do not describe white space handling.
> They defer to what the XML REC has to say.
> 
> My summary of the problem described is: detecting the
> difference between significant white space and
> insignificant white space, that is, white space used
> just to make the XML pretty in a text editor.
> 
> Scope: Whitespace handling in element or mixed
> content only, not markup or attribute values.
> The spec is not useful for applications that
> want to process redundant white space, or XSL,
> XQL, or other processing languages.
> 
> I agree that the problem that the RFC describes is
> a problem, and it does fall within the scope as
> stated.
> 
> Assumption: the document itself is capable of
> distinguishing between relevant and redundant
> white space.
> 
> You have to make this assumption, or the problem
> is insoluble.
> 
> Goal: Consistent white space returned by DOM calls
> on the same document across different parser + DOM
> Tree builder applications.
> 
> Since the W3C DOM does not make any changes to
> the whitespace handling rules, we can think
> of an XML processor that parses and builds a
> W3C DOM tree as an XML processor. The same can
> be said of any Document Model that does not
> change the whitespace handling rules. When a
> Document Model does define different white
> space handling rules, then we must view the
> parser as the XML processor, and the tree
> builder as an application, in XML REC terms.
> 
> The document next defines default behavior, and
> then alternate behavior when the xml:space attribute
> is in use.
> 
> For reference, the XML 1.0 Spec says on White space
> handling in the second paragraph of section 2.10:
> 
>    An XML processor must always pass all characters
>    in a document that are not markup through to the
>    application. A validating XML processor must also
>    inform the application which of these characters
>    constitute white space appearing in element content.
> 
> Also Section 3.2.1 can be paraphrased to say:
> 
>    Valid Element Content elements can have optional
>    white space between pairs of child elements.
> 
> The Document's Default rules follow with my comments in []:
> 
> 1) The first sequence of white space immediately after
>    the opening tag and the last sequence of white space
>    immediately before the closing tag are ignored.
> 
> [ This may violate what the user expects of their white
> space in mixed content, though this rule could be part
> of the application's default behavior, such as when the
> application is a tree builder that defines new white
> space handling rules and is not W3C DOM compliant.]
> 
> 2) All non-space white space characters (tab and new-line) are
>    translated into a space character, and all multiple
>    space characters are consolidated into a single space.
> 
> [ Same as 1.]
> 
> 3) Sequence of white space occurring between any two
>    markups (elements, comments, processing instructions,
>    CDATA) except when appearing between two elements, is
>    ignored.
> 
> [Strictly speaking, this is not in harmony with the XML
> REC, but this could be defined by an application, as described
> in section 1. On the other hand this does not violate the
> spirit of the XML REC, and a conformant parser and W3C DOM
> tree builder may in fact behave this way. So this is not a
> problem.]
> 
> 4) Sequence of white space occurring between two elements
>    is ignored if the element is defined to have element
>    content. If the element is defined to have mixed content,
>    such white space is treated according to the first two rules.
> 
> [Same as 1 and 2 in the case of mixed content. Also, this leads
> to the requirement that Well Formedness processors must also
> report insignificant white space in element content, which they
> are not currently required to do but can if they want to, just
> like Validating parsers.]
> 
> 5) White space introduced through expansion of character
>    references (e.g.  ) or entity references is preserved,
>    and not considered white space per the above rules. However,
>    white space appearing in the entity declaration is subject
>    to the parsing rules at the time of parsing the entity
>    declaration.
> 
> [Nothing in the XML REC indicates that a processor would signal
> that it has gotten white space from an NCR or a parsed entity.
> So, this is an additional requirement for XML parsers. It also
> does not seem to me to be in line with the spirit of the XML REC.
> The white space passed on from NCR or entity expansion falls
> under the same rules as if the contents of the NCRs or entities
> had just been written in place.
>  I am guessing that the second sentence is indicating conformance
> with section 4.5 "Construction of Internal Entity Replacement
> Text" and appendix D "Expansion of Entity and Character
> References" in the XML REC. A conforming XML processor must
> currently follow this rule.]
> 
> 6) CDATA sections preserve all white space occurring between the
>    opening <![CDATA[ and closing ]]>.
> 
> [This is what the XML REC requires.]
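
To make default rules 1 and 2 concrete, here is a minimal Python sketch. It is an illustration only, not code from the RFC: the function name normalize_default is hypothetical, and it assumes the element's content is a single run of character data with no child markup, so rules 3 to 6 do not come into play.

```python
import re

# XML 1.0 white space characters: space, tab, carriage return, line feed.
_XML_WS = " \t\r\n"
_WS_RUN = re.compile(r"[ \t\r\n]+")


def normalize_default(text: str) -> str:
    """Apply default rules 1 and 2 to the character data of one element.

    Hypothetical helper: assumes the element holds text only.
    """
    # Rule 1: drop white space right after the opening tag
    # and right before the closing tag.
    trimmed = text.strip(_XML_WS)
    # Rule 2: turn tabs and new-lines into spaces and
    # collapse each run of white space into a single space.
    return _WS_RUN.sub(" ", trimmed)


print(repr(normalize_default("\n   Hello,\t\tworld \n  again\n")))
```

The point of the sketch is that both rules are applied by whatever builds the tree, after the parser has already passed every character through, which is why the review treats them as application behavior rather than parser behavior.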
> 
> Now we get to the rules to follow if xml:space is in use:
> 
> For reference the relevant part of the XML REC:
> 
>    A special attribute named xml:space may be attached to
>    an element to signal an intention that in that element,
>    white space should be preserved by applications. In valid
>    documents, this attribute, like any other, must be declared
>    if it is used. When declared, it must be given as an
>    enumerated type whose only possible values are "default"
>    and "preserve".
> 
>    The value "default" signals that applications' default
>    white-space processing modes are acceptable for this element;
>    the value "preserve" indicates the intent that applications
>    preserve all the white space. This declared intent is
>    considered to apply to all elements within the content of
>    the element where it is specified, unless overridden with
>    another instance of the xml:space attribute.
> 
>    The root element of any document is considered to have
>    signaled no intentions as regards application space handling,
>    unless it provides a value for this attribute or the
>    attribute is declared with a default value.
> 
> The alternate rules that are in use when xml:space is in use
> follow, again with my comments in []:
> 
> 1) An element requests that white space be preserved by
>    specifying the attribute 'xml:space' and using the value
>    'preserve'. The element may specify this attribute explicitly
>    or inherit it from the document type definition. It is
>    recommended that elements specify this attribute explicitly.
> 
> [This is what the XML REC requires.]
> 
> 2) Preserving implies that white space is passed as is to the
>    application, without any transformation or loss, with the
>    exception that, if the first character after the opening
>    tag is a new-line or the last character before the closing
>    tag is a new-line, they are ignored.
> 
> [The RFC previously acknowledged the need to follow the line
> end normalization process as specified in the XML REC. So, all
> of (2) is what the XML REC requires.]
> 
> 3) Elements that do not specify a value for the 'xml:space'
>    attribute inherit that value from the element in which
>    they are contained up to the root element. If the root
>    element does not specify a value for the 'xml:space'
>    attribute, the value 'default' is assumed.
> 
> [This is what the XML REC requires.]
> 
> 4) It is possible to instruct the XML parser to supply the
>    root element with the 'preserve' value for the 'xml:space'
>    attribute, if no value is explicitly specified for it.
>    (The exact mechanism is TBD)
> 
> [This is an additional requirement on an XML parser not
> contained in the XML 1.0 REC. If the parser was wrapped in an
> application though, this could be legal, the application
> could go in and make sure that xml:space='preserve' was
> applicable to the root element, whether explicitly putting
> this on the root element, or adding a default ATTLIST
> declaration for the root element.]
> 
> 5) When expanding an entity reference, the value of the
>    'xml:space' attribute of the element in which the entity
>    is expanded has no effect on the expansion of the entity.
> 
> [Huh? xml:space values are just passed on by the parser to
> the application. They can have nothing to do with entity
> expansion. Unless this is saying that the contents of the
> entity are not subject to the xml:space attribute in scope
> at the reference point. That would be in violation of the
> XML REC.]
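
The inheritance described in rule 3 can be sketched with Python's standard xml.etree.ElementTree. This is an illustration under stated assumptions: effective_space is a hypothetical helper, the sample document is made up, tag names are assumed unique so they can serve as dictionary keys, and only attributes present on the parsed elements are considered.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the reserved xml: prefix under this namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def effective_space(root: ET.Element) -> dict:
    """Map each element's tag to its effective xml:space value."""
    result = {}

    def walk(elem, inherited):
        # An explicit xml:space overrides whatever was inherited.
        value = elem.get(XML_SPACE, inherited)
        result[elem.tag] = value
        for child in elem:
            walk(child, value)

    # Rule 3: a root element with no xml:space is treated as 'default'.
    walk(root, "default")
    return result


doc = ET.fromstring(
    '<doc><pre xml:space="preserve"><code/><p xml:space="default"/></pre><note/></doc>'
)
print(effective_space(doc))
```

Rule 4 would amount to starting the walk with "preserve" instead of "default", which is something an application wrapped around the parser can do on its own.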
> 
> The last paragraph points out a problem alluded to earlier:
> 
>   This approach is clear and consistent, with the exception
>   that validating and non-validating parsers will parse
>   the same document differently.
> 
> My take on this problem:
> 
> I think the difference is the required reporting and possible
> non-reporting of insignificant white space due to element
> content by validating and non-validating parsers respectively.
> 
> I believe that it is a mistake that the XML 1.0 REC requires
> 
> 1) Only a Validating processor to indicate the insignificant
> white space.
> 
> While...
> 
> 2) Acknowledging that the declaration of an element type
> with element content, where white space occurs directly within
> any instance of that element, changes the Information Set.
> 
> A user can correctly say standalone=yes, and still get a
> different Information Set from the 2 classes of processors.
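
The non-validating side of this difference can be observed with Python's xml.dom.minidom, which does not validate. The sketch below is only an illustration: the sample document and the helper name whitespace_only_children are made up. Since no element declarations are consulted, the parser cannot tell the application that the white space between the item elements sits in element content, so it arrives as ordinary text nodes.

```python
from xml.dom import minidom

# Made-up sample: indentation used only to make the XML readable.
SOURCE = "<list>\n  <item>one</item>\n  <item>two</item>\n</list>"


def whitespace_only_children(xml_text: str) -> int:
    """Count child text nodes of the root that hold nothing but white space."""
    root = minidom.parseString(xml_text).documentElement
    root.normalize()  # merge adjacent text nodes so each run is counted once
    return sum(
        1
        for node in root.childNodes
        if node.nodeType == node.TEXT_NODE and node.data.strip() == ""
    )


print(whitespace_only_children(SOURCE))
```

A validating processor reading the same document against a declaration giving list element content would be required to flag those same characters as white space in element content, which is the discrepancy discussed above.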
> 
> Because of this, and other document Information Set differences
> that can occur between a minimal Well Formedness processor,
> and a Validating processor, I have made the following proposal
> for future work on XML. Since it was my proposal only, I can
> share it here on a public list. This does not imply anything
> about whether this proposal will be adopted.
> 
> ==================================================================
> An XML Full Information Set Processor
> 
> Proposed: Define a new class of XML processor that exists
> in the currently optional area in XML 1.0 between Validating
> XML processors, and minimally conforming XML processors. This
> processor will be required to use all the data made available
> to it to build the complete Information Set of documents that
> it reads. That means that it has to read and expand all external
> entities, read and use an external subset if declared, and expand
> all external parameter entities for markup.
> 
> Creators of XML that wish to use large external DTDs will not
> have to shove a load of markup into the internal subset of
> every document that they transmit so that the information
> set received by a minimally conforming XML processor will
> be complete.
> 
> It may be pointed out that a validating XML processor can
> provide the same information set as the proposed processor.
> Counter arguments are:
> 
> 1) The author may not care about validation, just the
> information set.
> 
> 2) The document may not be valid, especially when documents
> are authored that mix namespaces, and especially since
> validation has not been or may not be defined for mixed
> namespaces.
> 
> 3) Validation may be much more costly when done using XML
> Schemas: in processing time, in processor footprint, and
> in the work needed to create a validating processor.
> 
> There are no conflicts with existing XML documents, and the
> proposal should be very easy to adapt to the XML Schema
> work, when it is done. The proposal does not change XML's
> conformance to ISO 8879.
> ==================================================================
> 
> CONCLUSION
> 
> The RFC is free to define what an XML application does with
> information that an XML processor passes to it. But it is not
> a good idea to violate the spirit of the XML REC. This would
> be confusing to the marketplace. The RFC shows a very real
> problem in the XML 1.0 REC, and begs a fix that would require
> XML parsers to always report white space in element content.
> In the meantime, before this is fixed, or something like my
> proposal above is adopted, I think it would be good for the
> RFC to require that XML parsers that are in conformance with
> it report white space in element content, whether validating
> or not. Most non-validating parsers written these days tend
> to do more than just the minimum required, and quite a few
> pass all of the Information Set of the document on, even when
> not validating.
> 
> I hope that this review has been of some value.
> 
> --
> Joel A. Nava                  (408)536-6209
> Adobe Systems, Inc.         jnava@adobe.com

Received on Monday, 17 May 1999 10:07:11 UTC