Re: RFC: White Space Handling In XML Parsing from Arkin on 1999-05-21 (www-dom@w3.org from April to June 1999)

From: Arkin <arkin@trendline.co.il>
Date: Fri, 21 May 1999 15:05:05 -0400
To: Paul Grosso <pgrosso@arbortext.com>
CC: www-dom@w3.org
Message-ID: <3745AE61.58120E0F@trendline.co.il>
> I'm afraid I remain unclear on why you think an RFC about whitespace
> in XML parsing is necessary or even a good idea.  What about the XML
> spec are you trying to change (and why)?  Or, if you're not trying to
> change something, what's the point of the RFC?

I am not trying to change anything about the XML specs, they are fine as
they are.

There are XML developers that work on XML technologies, for example an
XML editor or XSL processor. They see XML documents at a very technical
level. This RFC is not intended for them. So all arguments about how XML
editors will work with it are rather moot.

But there is a growing majority of software developers that use XML as
part of something else. They are mostly interested in getting the
information model from an XML document, or communicting between two
applications. For them, XML is usually represented in a DOM structure or
similar content model. They do not care much for how it looks on paper,
except when they have to manually write XML documents or debug them.

The problem that I see is that XML documents need to contain redundant
white spaces to make them readable and structured to the human eye when
used with Notepad. The application, on the other end, does not require
these spaces, it only requires the meaningful pieces of information that
are conveyed in the document. Again, I am talking about applications
that use XML, not XML-specific programs.

When an application such as that recieves redundant white spaces, it has
to discard them or ignore them. That requires extra code for getting
around white space. It complicates the application, thus increasing the
likelihood of bugs and making it more expensive to maintain.

Imagine for a second that an application recieves the spaces between
attributes. So, when it needs a specific attribute value, it has to wade
its way through the spaces to extract the attribute value. Instead of
one getAttribute() you have conditions to make sure you recieve the
right value. The application becomes more complex, harder to maintain
and more likely to contain a bug.

Here's a very good example of what I mean. Suppose you build an
application that extracts a book list from a book catalog. It does so by
getting the first item in the node list. The input is:

  <book-list><book>Moby Dick</book><book>Ulysess</book></book-list>

The application does getChildNodes().item(0) and gets the Moby Disk book
element. Now, suppose I format the same document to look different (but
still convey the exact same information):

  <book-list>
    <book>Moby Dick</book>
    <book>Ulysess</book>
  </book-list>

The application does getChildNodes().item(0) and gets an empty text
node. Not a book. It has to check for the empty text node and skip to
the next book. To what purpose?

Again, assuming that this is not an XML editor or XSL processor, it is
clear that the white spaces do not convey any meaningful information to
the application, or serve any purpose other than formatting the source
document. And, since this application is not an XML editor, it does not
care about preserving that original format.

Arkin
Received on Friday, 21 May 1999 15:15:37 UTC