Re: RFC: White Space Handling In XML Parsing

Arkin,

I read with great interest the RFC that you
have authored on "White Space Handling In
XML Parsing." It has taken me a little while
to get this response back to you, but I wanted
to get everything right. If I have misunderstood
anything in my analysis, please let me know. As
one of the members of the working group which
produced the XML 1.0 Recommendation, I can recall
many discussions we had on white space handling,
and believe that I correctly remember the intent
behind the specification of white space handling
in the XML 1.0 REC.

Here is my analysis of:
http://www.openxml.org/dev/rfc-wshp.html

Note: to start with, the RFC, in Section 3, defines an
XML Parser as something that takes an XML document,
and delivers a DOM Tree. It defines an XML Application
as something which manipulates the DOM Tree.  Since I
am so used to how these terms are used in the XML REC,
when I quote from the RFC, I have taken the liberty of
re-writing those portions in the XML 1.0 terminology.
I hope that this does not make my review too hard to
follow. Section 3 accurately describes the terms from
the XML REC perspective, except for this: Depending
on your viewpoint an XML processor that builds a tree
can be seen as an XML parser, or as an XML parser +
a tree building application. I do not believe that this
difference in allowable perspective changes the problem
or the solution in any appreciable way.

So, the abstract would then read:

   White space handling is an unresolved issue
   in the present definition of XML parsers and
   DOM tree builders, falling outside the scope
   of both the DOM specification and the SAX API.
   This is a recommendation for the behavior of XML
   parsers and DOM Tree building applications in
   regards to white space appearing in the DOM Tree,
   and what portions are to be delivered to an
   application accessing the DOM tree.

I agree that whitespace handling issues do indeed
fall outside of the DOM REC and the SAX API, since
these documents do not describe white space handling.
They defer to what the XML REC has to say.

My summary of the problem described is: detecting the
difference between significant white space and
insignificant white space, that is, white space used
only to make the XML look pretty in a text editor.

Scope: Whitespace handling in element or mixed
content only, not markup or attribute values.
The spec is not useful for applications that
want to process redundant white space, or XSL,
XQL, or other processing languages.

I agree that the problem that the RFC describes is
a problem, and it does fall within the scope as
stated.

Assumption: the document itself is capable of
distinguishing between relevant and redundant
white space.

You have to make this assumption, or the problem
is insoluble.

Goal: Consistent white space in the DOM calls
on the same document by different parser + DOM
Tree builder applications.

Since the W3C DOM does not make any changes to
the whitespace handling rules, we can think
of an XML processor that parses and builds a
W3C DOM tree as an XML processor. The same can
be said of any Document Model that does not
change the whitespace handling rules. When a
Document Model does define different white
space handling rules, then we must view the
parser as the XML processor, and the tree
builder as an application, in XML REC terms.

The document next defines default behavior, and
then alternate behavior when the xml:space attribute
is in use.

For reference, the XML 1.0 Spec says on White space
handling in the second paragraph of section 2.10:

   An XML processor must always pass all characters
   in a document that are not markup through to the
   application. A validating XML processor must also
   inform the application which of these characters
   constitute white space appearing in element content.

Also Section 3.2.1 can be paraphrased to say:

   Valid Element Content elements can have optional
   white space between pairs of child elements.
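
To make the quoted requirement concrete, here is a minimal sketch.
It assumes Python and its standard-library xml.dom.minidom (a
non-validating processor plus tree builder); neither the language
nor the sample document comes from the RFC or the XML REC. It lists
the children that the tree builder creates for a pretty-printed
element, so that the white space passed through between child
elements is visible.

```python
from xml.dom import minidom

# Hypothetical sample document, indented the way a text editor would.
SAMPLE = "<list>\n  <item>one</item>\n  <item>two</item>\n</list>"

def describe_children(xml_text):
    """Return a (kind, value) pair for each child node of the root element."""
    root = minidom.parseString(xml_text).documentElement
    out = []
    for node in root.childNodes:
        if node.nodeType == node.TEXT_NODE:
            out.append(("text", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            out.append(("element", node.tagName))
    return out

children = describe_children(SAMPLE)
for kind, value in children:
    print(kind, repr(value))
```

The indentation shows up as text nodes around each item element.
Nothing in the tree marks those nodes as white space in element
content; that marking is what a validating processor must add.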

The Document's Default rules follow with my comments in []:

1) The first sequence of white space immediately after
   the opening tag and the last sequence of white space
   immediately before the closing tag are ignored.

[ This may violate what the user expects of their white
space in mixed content, though this rule could be part
of the application's default behavior, such as when the
application is a tree builder that defines new white
space handling rules and is not W3C DOM compliant.]

2) All non-space characters (tab and new-line) are
   translated into a space character, and all multiple
   space characters are consolidated into a single space.

[ Same as 1.]
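
As an illustration only, rules 1 and 2 can be sketched for the
simple case of an element whose content is a single run of text.
The function name apply_default_rules is hypothetical, and the
choice of Python is an assumption, not something the RFC specifies.

```python
import re

# The four XML 1.0 white space characters: space, tab, CR, LF.
_WS_RUN = re.compile(r"[ \t\r\n]+")

def apply_default_rules(text):
    """Collapse each white space run to one space (rule 2), then drop
    the leading and trailing white space (rule 1)."""
    return _WS_RUN.sub(" ", text).strip(" ")

result = apply_default_rules("\n   Hello,\t\tworld \n")
print(repr(result))
```

In mixed content, rule 1 would apply only to the text directly after
the opening tag and directly before the closing tag, so a tree
builder would need to treat interior text nodes differently.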

3) Sequence of white space occurring between any two
   markups (elements, comments, processing instructions,
   CDATA) except when appearing between two elements, is
   ignored.

[Strictly speaking, this is not in harmony with the XML
REC, but this could be defined by an application, as described
in section 1. On the other hand this does not violate the
spirit of the XML REC, and a conformant parser and W3C DOM
tree builder may in fact behave this way. So this is not a
problem.]

4) Sequence of white space occurring between two elements
   is ignored if the element is defined to have element
   content. If the element is defined to have mixed content,
   such white space is treated according to the first two rules.

[Same as 1 and 2 in the case of mixed content. Also, this leads
to the requirement that Well Formedness processors must also
report insignificant white space in element content, which they
are not currently required to do but can if they want to, just
like Validating parsers.]
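
A rough sketch of rule 4 as an application-level step follows. It
assumes Python's xml.dom.minidom and assumes the application is told
which element types are declared with element content. The set
element_content_names is a hypothetical stand-in for what a
processor would learn from the DTD.

```python
from xml.dom import minidom

def drop_element_content_whitespace(element, element_content_names):
    """Remove white-space-only text children from every element whose tag
    name is in element_content_names, walking the whole subtree."""
    for child in list(element.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            drop_element_content_whitespace(child, element_content_names)
        elif (child.nodeType == child.TEXT_NODE
              and element.tagName in element_content_names
              and child.data.strip(" \t\r\n") == ""):
            element.removeChild(child)

doc = minidom.parseString("<list>\n  <item>a <b>b</b> </item>\n</list>")
# Assumption: <list> is declared with element content, <item> with mixed content.
drop_element_content_whitespace(doc.documentElement, {"list"})
serialized = doc.documentElement.toxml()
print(serialized)
```

Only the indentation directly inside list is removed. The white
space inside item is left for the mixed-content rules to handle.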

5) White space introduced through expansion of character
   references (e.g.  ) or entity references is preserved,
   and not considered white space per the above rules. However,
   white space appearing in the entity declaration is subject
   to the parsing rules at the time of parsing the entity
   declaration.

[Nothing in the XML REC indicates that a processor would signal
that it has gotten white space from an NCR or a parsed entity.
So, this is an additional requirement for XML parsers. It also
does not seem to me to be in line with the spirit of the XML REC.
The white space passed on from NCR or entity expansion falls
under the same rules as if the contents of the NCRs or entities
had just been written in place.
I am guessing that the second sentence of the rule is indicating conformance
with section 4.5 "Construction of Internal Entity Replacement
Text" and appendix D "Expansion of Entity and Character
References" in the XML REC. A conforming XML processor must
currently follow this rule.]
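
A small check of this point, assuming Python's standard-library
xml.etree.ElementTree (a non-validating, expat-based processor): a
literal space, a space written as a numeric character reference, and
a space coming from an internal entity all reach the application as
the same text. The entity name sp is made up for the example.

```python
import xml.etree.ElementTree as ET

# The same content written three ways; "sp" is a hypothetical entity name.
literal = ET.fromstring("<a>x y</a>").text
via_ncr = ET.fromstring("<a>x&#32;y</a>").text
via_entity = ET.fromstring(
    '<!DOCTYPE a [<!ENTITY sp " ">]><a>x&sp;y</a>'
).text

print(repr(literal), repr(via_ncr), repr(via_entity))
print(literal == via_ncr == via_entity)
```

With this processor the application has no way to tell where the
space came from, so rule 5 would need extra reporting from the
parser.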

6) CDATA sections preserve all white space occurring between the
   opening <![CDATA[ and closing ]]>.

[This is what the XML REC requires.]
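
For illustration, assuming Python's xml.etree.ElementTree again, the
text inside a CDATA section reaches the application with its white
space untouched. The element name pre is arbitrary.

```python
import xml.etree.ElementTree as ET

# Leading spaces, a new-line, a tab, and trailing spaces inside CDATA.
cdata_text = ET.fromstring("<pre><![CDATA[  line 1\n\tline 2  ]]></pre>").text
print(repr(cdata_text))
```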


Now we get to the rules to follow if xml:space is in use:

For reference the relevant part of the XML REC:

   A special attribute named xml:space may be attached to
   an element to signal an intention that in that element,
   white space should be preserved by applications. In valid
   documents, this attribute, like any other, must be declared
   if it is used. When declared, it must be given as an
   enumerated type whose only possible values are "default"
   and "preserve".

   The value "default" signals that applications' default
   white-space processing modes are acceptable for this element;
   the value "preserve" indicates the intent that applications
   preserve all the white space. This declared intent is
   considered to apply to all elements within the content of
   the element where it is specified, unless overridden with
   another instance of the xml:space attribute.

   The root element of any document is considered to have
   signaled no intentions as regards application space handling,
   unless it provides a value for this attribute or the
   attribute is declared with a default value.

The alternate rules that are in use when xml:space is in use
follow, again with my comments in []:

1) An element requests that white space be preserved by
   specifying the attribute 'xml:space' and using the value
   'preserve'. The element may specify this attribute explicitly
   or inherit it from the document type definition. It is
   recommended that elements specify this attribute explicitly.

[This is what the XML REC requires.]

2) Preserving implies that white space is passed as is to the
   application, without any transformation or loss, with the
   exception that, if the first character after the opening
   tag is a new-line or the last character before the closing
   tag is a new-line, they are ignored.

[The RFC previously acknowledged the need to follow the line
end normalization process as specified in the XML REC. So, all
of (2) is what the XML REC requires.]

3) Elements that do not specify a value for the 'xml:space'
   attribute inherit that value from the element in which
   they are contained up to the root element. If the root
   element does not specify a value for the 'xml:space'
   attribute, the value 'default' is assumed.

[This is what the XML REC requires.]
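
The inheritance in rule 3 can be sketched as a walk up the tree.
This assumes Python's xml.dom.minidom. The helper name
effective_xml_space and the sample document are hypothetical, and
falling back to 'default' at the root follows the RFC's wording.

```python
from xml.dom import minidom

def effective_xml_space(element):
    """Return the xml:space value in scope for element: its own value if
    present, otherwise the nearest ancestor's, otherwise 'default'."""
    node = element
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        value = node.getAttribute("xml:space")  # "" when the attribute is absent
        if value in ("default", "preserve"):
            return value
        node = node.parentNode
    return "default"

doc = minidom.parseString(
    '<doc><poem xml:space="preserve"><line>  a  </line>'
    '<note xml:space="default"><p/></note></poem><other/></doc>'
)
for name in ("line", "p", "other"):
    element = doc.getElementsByTagName(name)[0]
    print(name, effective_xml_space(element))
```

Here line inherits 'preserve' from poem, p picks up the overriding
'default' on note, and other falls through to the root and gets
'default'.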

4) It is possible to instruct the XML parser to supply the
   root element with the 'preserve' value for the 'xml:space'
   attribute, if no value is explicitly specified for it.
   (The exact mechanism to TBD)

[This is an additional requirement on an XML parser not
contained in the XML 1.0 REC. If the parser were wrapped in an
application, though, this could be legal: the application
could go in and make sure that xml:space='preserve' was
applicable to the root element, whether explicitly putting
this on the root element, or adding a default ATTLIST
declaration for the root element.]

5) When expanding an entity reference, the value of the
   'xml:space' attribute of the element in which the entity
   is expanded has no effect on the expansion of the entity.

[Huh? xml:space values are just passed on by the parser to
the application. They can have nothing to do with entity
expansion. Unless this is saying that the contents of the
entity are not subject to the xml:space attribute in scope
at the reference point. That would be in violation of the
XML REC.]

The last paragraph points out a problem alluded to earlier:

  This approach is clear and consistent, with the exception
  that validating and non-validating parsers will parse
  the same document differently.

My take on this problem:

I think the difference is the required reporting and possible
non-reporting of insignificant white space due to element
content by validating and non-validating parsers respectively.

I believe that it is a mistake that the XML 1.0 REC requires

1) Only a Validating processor to indicate the insignificant
white space.

While...

2) Acknowledging that the declaration of an element type
with element content, where white space occurs directly within
any instance of that element, changes the Information Set.

A user can correctly say standalone=yes, and still get a
different Information Set from the 2 classes of processors.

Because of this, and other document Information Set differences
that can occur between a minimal Well Formedness processor,
and a Validating processor, I have made the following proposal
for future work on XML. Since it was my proposal only, I can
share it here on a public list. This does not imply anything
about whether this proposal will be adopted.

==================================================================
An XML Full Information Set Processor

Proposed: Define a new class of XML processor that exists
in the currently optional area in XML 1.0 between Validating
XML processors, and minimally conforming XML processors. This
processor will be required to use all the data made available
to it to build the complete Information Set of documents that
it reads. That means that it has to read and expand all external
entities, read and use an external subset if declared, and expand
all external parameter entities for markup.

Creators of XML that wish to use large external DTDs will not
have to shove a load of markup into the internal subset of
every document that they transmit so that the information
set received by a minimally conforming XML processor will
be complete.

It may be pointed out that a validating XML processor can
provide the same information set as the proposed processor.
Counter arguments are:

1) The author may not care about validation, just the
information set.

2) The document may not be valid, especially when documents
are authored that mix namespaces, and especially since
validation has not been or may not be defined for mixed
namespaces.

3) Validation may be much more costly when done using XML
Schemas: in processing time, in processor footprint, and
in the work needed to create a validating processor.

There are no conflicts with existing XML documents, and the
proposal should be very easy to adapt to the XML Schema
work, when it is done. The proposal does not change XML's
conformance to ISO 8879.
==================================================================

CONCLUSION

The RFC is free to define what an XML application does with
information that an XML processor passes to it. But it is not
a good idea to violate the spirit of the XML REC. This would
be confusing to the marketplace. The RFC shows a very real
problem in the XML 1.0 REC, and begs a fix that would require
XML parsers to always report white space in element content.
In the meantime, before this is fixed or something like my
proposal above is adopted, I think it would be good for the
RFC to require that XML parsers that are in conformance with
it report white space in element content, whether validating
or not. Most non-validating parsers written these days tend
to do more than just the minimum required, and quite a few
pass all of the Information Set of the document on, even when
not validating.

I hope that this review has been of some value.

--
Joel A. Nava                  (408)536-6209
Adobe Systems, Inc.         jnava@adobe.com

Received on Monday, 17 May 1999 08:16:32 UTC