Clarification: RFC: White Space Handling In XML Parsing

I would like to clarify some of the rational behind the white space
handling RFC. While developing the OpenXML parser architecture I was
requested to supply two white space handling models for two distinct
groups of developers.

One group was working on XML technologies, such as XSL and XQL
processing, and was concerned that by handling white space the XML
parser might "hide" some important details from it. This class of
developers can get by with using the XML parser in a mode that does not
perform any white space handling. Many of the people posting on this
mailing list fall under this category.

The second and larger group were developers using XML to achieve their
application goals. They do not understand XML to its finest details and
do not care so much about the fine details of XML processing. For them,
the role of an XML parser is to provide a consistent information model
based on the contents of a textual file. And for them, I wrote an RFC
that would allow them to get this form of functionality with the least
of hassle.

Now, if you look at white space handling from the business application
perspective you realize that:

* 90% of the time white space is used to format the source document so
people can read it -- it's redundant

* 90% of the elements either do not contain text at all, or contain text
that is subject to white space stripping

* The remainder can be clearly expressed with the use of a DTD or
xml:space attribute

* Application developers and XML document writers tend to have some
experience with HTML, which clearly defines white space behavior (if
nothing else)

* A generic application is only interested in the actual information
model; if that information is encumbered with white space, the
application will have to spend some of its time getting rid of these
excessive white space

Application developers can write white space stripping code for each
application based on specific DTD; this is generally known as poor
engineering; or they can write generic stripping code; or they can get
that functionality from the XML parser.

As the current situation is, when I use an XML parser to read a book
catalog file, I get different document trees depending on the parser I
use. Sun ProjectX will tend to not normalize text nodes, IBM XML4J will
tend to produce excessive white space nodes, and OpenXML will tend to
kill too much white space. When I write an application with one parser,
I can be positive it will not work with the other. I then have to force
it into cooperation by routinely calling normalize(), defining my own
white space stripping method for text contents, and skipping text nodes
that I know should not exist.

In the words of a famous poet, "this sucks".

The intent of the white space handling RFC is to get around all these
issues by defining a consistent model that will work the same across all
parsers. Now, within the world of XML solution providers this cannot be
done easily, as many people have pointed out, due to some conflicts in
XML usage. That's fine.

But in the world of applications that are not XML oriented, merely XML
enabled, this can be achieved quite easily. The general assumption is
that application developers are only interested in the informative
content of the XML document, usually through a DOM of similar document
tree. Consistency is easy to achieve by indicating specific XML usage
rules (invoking a DOM parser), and specific document format that,
hopefully, is natural to learn and use (and I believe it is).

This RFC does not present a solution for an XML-centric application that
transforms XML into PDF, or attempts to validate an XML document. But it
does represent a solution for the XML-enabled applications out there
that use XML for property and configuration files, for persistent
storage, for business to business communication, etc. And that is
practically the majority of XML usage we'll see out there, and something
we should be taken care of.

Arkin

Received on Wednesday, 19 May 1999 12:53:09 UTC