Re: RFC: White Space Handling In XML Parsing

I see Arkin's point.

We use XML for serializing/deserializing Java objects. We currently
use a parser that does not preserve 'formatting' whitespace. This makes
walking the DOM tree somewhat faster, since uninteresting whitespace
text nodes don't have to be skipped, and the storage costs are much lower,
which is extremely important when you have a document cache of several
million nodes.

For this kind of application a non-validating, non-whitespace-preserving
parser is very handy, since all XML is generated and read by machine only
and no useful information is carried in the extraneous whitespace.
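
For anyone stuck with a whitespace-preserving parser, roughly the same
effect can be had by pruning the tree after parsing. The following is
only a sketch using the standard JAXP and W3C DOM interfaces; the class
and method names (WhitespaceStripper, strip, parse) are made up for
illustration and this is not the parser we actually use. It assumes, as
above, that whitespace-only text nodes never carry information, which is
not true of mixed-content documents.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: removes whitespace-only text nodes from a parsed DOM.
class WhitespaceStripper {

    /** Recursively removes text nodes that contain only whitespace. */
    static void strip(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Fetch the sibling first, because removal detaches the child.
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                strip(child);
            }
            child = next;
        }
    }

    /** Parses an XML string with the default (non-validating) JAXP parser. */
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<book-list>\n  <book>Moby Dick</book>\n"
                + "  <book>Ulysses</book>\n</book-list>");
        strip(doc);
        // Only the two <book> elements remain as children of <book-list>.
        System.out.println(
                doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

This still pays the cost of building the whitespace nodes before
discarding them, so a parser that never creates them is cheaper.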

Regardless of whether whitespace handling by parsers belongs in the spec
or in an RFC, I think it behooves parser writers to provide both options
(whitespace-preserving for editors etc., and non-whitespace-preserving).
It would be nice if the mechanism used to switch modes were a
public standard, though.

There are a lot of server-side XML applications that read/write XML
that does not need to be human-readable. If, for some reason,
it needs to be read by a person, you can run it through an XML
pretty-printer.

Claude Zervas
Uniplanet LLC
claude@uniplanet.com

At 04:43 PM 5/21/99 , Paul Grosso wrote:
>At 15:05 1999 05 21 -0400, Arkin wrote:
>>> I'm afraid I remain unclear on why you think an RFC about whitespace
>>> in XML parsing is necessary or even a good idea.  What about the XML
>>> spec are you trying to change (and why)?  Or, if you're not trying to
>>> change something, what's the point of the RFC?
>>
>>I am not trying to change anything about the XML specs, they are fine as
>>they are.
>
>>Here's a very good example of what I mean. Suppose you build an
>>application that extracts a book list from a book catalog. It does so by
>>getting the first item in the node list. The input is:
>>
>>  <book-list><book>Moby Dick</book><book>Ulysess</book></book-list>
>>
>>The application does getChildNodes().item(0) and gets the Moby Dick book
>>element. Now, suppose I format the same document to look different (but
>>still convey the exact same information):
>>
>>  <book-list>
>>    <book>Moby Dick</book>
>>    <book>Ulysess</book>
>>  </book-list>
>
>I still think your use of the word "format" to refer to the source 
>document is confusing--even to yourself.  Because it's making you 
>think that those spaces, in some sense, "don't count" because they
>are "only there for formatting" and "formatting" isn't really part of
>the document content.
>
>You're wrong about that.  The input is the input, spaces in data content
>of a document have nothing to do with "formatting," and those spaces are 
>really there.
>
>>The application does getChildNodes().item(0) and gets an empty text
>>node. Not a book. It has to check for the empty text node and skip to
>>the next book. To what purpose?
>
>The solution is to use some kind of "filter" in the DOM to ask for the
>next element node if that's what you want.  Just pretending the spaces
>aren't there--even if that made sense--wouldn't solve your problem given
>that things like comments and PIs could also be "in the way" between the
>elements you wish to see (to say nothing of the mess you'd have if some
>of your <book>...</book> elements really got there by being the replacement
>text of some entities).
> 
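
The kind of "filter" Paul describes can be sketched in a few lines
against the standard W3C DOM interfaces. This is only an illustration;
ElementFilter and its method names are hypothetical, not part of the DOM
or of any parser. It skips text nodes, comments and PIs alike, and it
assumes entity references have already been expanded by the parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: element-only navigation over a DOM tree.
class ElementFilter {

    /** Returns the first element at or after start among its siblings, or null. */
    static Element firstElementFrom(Node start) {
        for (Node n = start; n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                return (Element) n;
            }
        }
        return null;
    }

    /** First child of parent that is an element, or null if there is none. */
    static Element firstElementChild(Node parent) {
        return firstElementFrom(parent.getFirstChild());
    }

    /** Next sibling of node that is an element, or null if there is none. */
    static Element nextElementSibling(Node node) {
        return firstElementFrom(node.getNextSibling());
    }

    public static void main(String[] args) throws Exception {
        String xml = "<book-list>\n  <!-- classics -->\n"
                + "  <book>Moby Dick</book>\n"
                + "  <book>Ulysses</book>\n</book-list>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element first = firstElementChild(doc.getDocumentElement());
        // The whitespace and the comment before the first <book> are skipped.
        System.out.println(first.getTextContent());
    }
}
```

With a helper like this the application gets the same book element
whether or not the source document is indented.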

Received on Saturday, 22 May 1999 14:57:13 UTC