- From: Arkin <arkin@trendline.co.il>
- Date: Wed, 19 May 1999 12:42:52 -0400
- CC: www-dom@w3.org
I would like to clarify some of the rational behind the white space handling RFC. While developing the OpenXML parser architecture I was requested to supply two white space handling models for two distinct groups of developers. One group was working on XML technologies, such as XSL and XQL processing, and was concerned that by handling white space the XML parser might "hide" some important details from it. This class of developers can get by with using the XML parser in a mode that does not perform any white space handling. Many of the people posting on this mailing list fall under this category. The second and larger group were developers using XML to achieve their application goals. They do not understand XML to its finest details and do not care so much about the fine details of XML processing. For them, the role of an XML parser is to provide a consistent information model based on the contents of a textual file. And for them, I wrote an RFC that would allow them to get this form of functionality with the least of hassle. Now, if you look at white space handling from the business application perspective you realize that: * 90% of the time white space is used to format the source document so people can read it -- it's redundant * 90% of the elements either do not contain text at all, or contain text that is subject to white space stripping * The remainder can be clearly expressed with the use of a DTD or xml:space attribute * Application developers and XML document writers tend to have some experience with HTML, which clearly defines white space behavior (if nothing else) * A generic application is only interested in the actual information model; if that information is encumbered with white space, the application will have to spend some of its time getting rid of these excessive white space Application developers can write white space stripping code for each application based on specific DTD; this is generally known as poor engineering; or they can write generic stripping code; or they can get that functionality from the XML parser. As the current situation is, when I use an XML parser to read a book catalog file, I get different document trees depending on the parser I use. Sun ProjectX will tend to not normalize text nodes, IBM XML4J will tend to produce excessive white space nodes, and OpenXML will tend to kill too much white space. When I write an application with one parser, I can be positive it will not work with the other. I then have to force it into cooperation by routinely calling normalize(), defining my own white space stripping method for text contents, and skipping text nodes that I know should not exist. In the words of a famous poet, "this sucks". The intent of the white space handling RFC is to get around all these issues by defining a consistent model that will work the same across all parsers. Now, within the world of XML solution providers this cannot be done easily, as many people have pointed out, due to some conflicts in XML usage. That's fine. But in the world of applications that are not XML oriented, merely XML enabled, this can be achieved quite easily. The general assumption is that application developers are only interested in the informative content of the XML document, usually through a DOM of similar document tree. Consistency is easy to achieve by indicating specific XML usage rules (invoking a DOM parser), and specific document format that, hopefully, is natural to learn and use (and I believe it is). This RFC does not present a solution for an XML-centric application that transforms XML into PDF, or attempts to validate an XML document. But it does represent a solution for the XML-enabled applications out there that use XML for property and configuration files, for persistent storage, for business to business communication, etc. And that is practically the majority of XML usage we'll see out there, and something we should be taken care of. Arkin
Received on Wednesday, 19 May 1999 12:53:09 UTC