RFC: White Space Handling In XML Parsing

White Space Handling In XML Parsing
-----------------------------------


Status: RFC first draft
Editor: arkin (arkin@openxml.org)
Original copy: http://www.openxml.org/dev/rfc-wshp.html


1. Abstract

White space handling is an unresolved issue in the present definition of
XML parsers, falling outside the scope of both the DOM specification and
the SAX API. This is a recommendation for the behavior of XML parsers
with regard to white space appearing in the source document, and what
portions are to be delivered to the application.

This RFC is published and made available for public review in an open
process. We encourage parser developers to take part in formulating the
final specification and to abide by it, in an effort to provide a
uniform behavioral model that will allow applications and documents to
be portable across a variety of parsers.


2. The Problem

White space is defined by XML as any character of the set space, tab and
new-line. Carriage-return is always conveyed to the parser as a
new-line. (See sections 2.3 and 2.11 in the XML specification.)

White space serves two distinct purposes. The first is to introduce
spaces and line breaks into the element content in a manner that has
semantic significance for the XML application, whether this is to
separate words and textual parts, to describe visual formatting, or
otherwise.

The second is the use of white space to visually format the document in
its source form, e.g. when using a text editor to edit an XML file. Such
use of white space is purely to assist the reader or editor of the
document. This white space is not part of the information conveyed by
the document and bears no semantic significance for the XML application.

An XML parser that regards all white space as part of the element
content and as conveying information might, with liberally formatted
documents, deliver redundant spaces to the application, affecting
performance and memory consumption. In addition, the application must
employ special code to remove such white space.
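
The effect can be observed with an existing parser. The following is a
minimal sketch, assuming Python's standard xml.dom.minidom module as the
parser; the sample document and the variable names are illustrative and
not part of this RFC. It counts the text nodes that consist only of
white space introduced by source formatting.

```python
from xml.dom import minidom

# An illustrative document, indented for readability in a text editor.
SOURCE = """<list>
  <item>one</item>
  <item>two</item>
</list>"""

doc = minidom.parseString(SOURCE)
root = doc.documentElement

# Child elements carry the information the application cares about.
elements = [n for n in root.childNodes if n.nodeType == n.ELEMENT_NODE]

# Text nodes made up only of white space come from source formatting.
whitespace_only = [
    n for n in root.childNodes
    if n.nodeType == n.TEXT_NODE and n.data.strip() == ""
]

print(len(elements), len(whitespace_only))
```

With this parser the two item elements are accompanied by three
white-space-only text nodes, which the application has to filter out
itself.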


3. Notation

XML Application
     An application that manipulates XML information delivered to it as
     a document model. The XML application is interested in the document
     model and the information contained in it, but not in the source
     document proper. This definition is different from the use in the
     XML specification, where application is used to describe part of
     the XML parser that builds the document model.

XML Parser
     An integral software component that, given an XML source document,
     returns a document model representing the information and structure
     conveyed by that source document. The XML parser is a superset of
     the XML processor described in the XML specification.

Document Model
     The document model returned by the XML parser is equivalent to one
     created in a programmatic fashion. The DOM document tree is one
     such document model. The events triggered by a SAX parser are not
     considered a document model as they demand further processing.
     However, a document handler may process them and fire different
     events that can be considered a document model.


4. Scope And Effect

This specification defines a contract between the XML source document
and the XML parser. The contract clearly defines what portions of the
white space appearing in the source document are a meaningful part of
the document content and must be delivered to the application, and what
portions of the white space serve only to format the document source and
should be ignored.

Given a source document that contains both types of white space, the XML
parser aims to produce a document model that contains neither less nor
more than the meaningful information expressed in the source document.
That document model should be equivalent to one generated in a
programmatic fashion.

This specification is limited to white space appearing in mixed and
element content, that is, all characters appearing between the opening
and closing tags of an element that are not part of any markup. White
space that appears in attribute values, or as part of markup, is outside
the scope of this specification.

This specification assumes that the application is not interested in
processing redundant white space unless specifically expressed by the
application, and that the document itself is capable of distinguishing
between relevant and redundant white space. As such, this specification
has no implications for the handling of white space as defined in XSL,
XQL and other processing languages.

The behavior of the parser with regard to white space is to be defined
in a clear, consistent and conclusive manner so as to allow applications
and documents to be used consistently with different parsers. The same
consistency is to be applied to the manner in which the application and
document exert control over white space handling.


5. Proposed Handling Behavior

The proposed white space handling behavior is expressed as two rule
sets. The first rule set consists of implicit rules that apply if no
white space handling behavior is explicitly specified. The second rule
set defines such explicit behavior and how to bring it into effect.


5.1. Default Behavior

   * The first sequence of white space immediately after the opening tag
     and the last sequence of white space immediately before the closing
     tag are ignored.

   * All white space characters other than space (tab and new-line) are
     translated into a space character, and all sequences of multiple
     space characters are consolidated into a single space.

   * A sequence of white space occurring between any two markup
     constructs (elements, comments, processing instructions, CDATA),
     except when appearing between two elements, is ignored.

   * A sequence of white space occurring between two elements is ignored
     if the containing element is defined to have element content. If
     the containing element is defined to have mixed content, such white
     space is treated according to the first two rules.

   * White space introduced through expansion of character references
     (e.g. &#32;) or entity references is preserved, and not considered
     white space per the above rules. However, white space appearing in
     the entity declaration is subject to the parsing rules at the time
     of parsing the entity declaration.

   * CDATA sections preserve all white space occurring between the
     opening <![CDATA[ and closing ]]>.

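The first two default rules can be sketched as follows. This is a
minimal illustration in Python, not a normative implementation: the
function name normalize_default and its two flags are assumptions made
for this example, and the sketch handles only a single run of character
data. It does not model character references, entity expansion or CDATA
sections, which a real parser would have to exempt from these rules.

```python
import re

# White space as the parser sees it: space, tab, new-line. Carriage
# return is included only as a safeguard for unnormalized input.
_WS_RUN = re.compile(r"[ \t\n\r]+")


def normalize_default(text, after_open_tag=False, before_close_tag=False):
    """Apply the default rules to one run of character data.

    after_open_tag:   the run immediately follows the opening tag
    before_close_tag: the run immediately precedes the closing tag
    """
    # Rule 2: translate tab/new-line to space and consolidate each run
    # of white space into a single space.
    collapsed = _WS_RUN.sub(" ", text)
    # Rule 1: drop the leading sequence after the opening tag and the
    # trailing sequence before the closing tag.
    if after_open_tag:
        collapsed = collapsed.lstrip(" ")
    if before_close_tag:
        collapsed = collapsed.rstrip(" ")
    return collapsed


print(repr(normalize_default("\n   Hello,\n\tworld  \n", True, True)))
print(repr(normalize_default(" and ")))
```

A run in the middle of mixed content keeps a single space on either
side, so that words separated by inline elements do not run together.
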

5.2. Specified Behavior

   * An element requests that white space be preserved by specifying the
     attribute 'xml:space' and using the value 'preserve'. The element
     may specify this attribute explicitly or inherit it from the
     document type definition. It is recommended that elements specify
     this attribute explicitly.

   * Preserving implies that white space is passed as is to the
     application, without any transformation or loss, with the exception
     that, if the first character after the opening tag is a new-line or
     the last character before the closing tag is a new-line, they are
     ignored.

   * Elements that do not specify a value for the 'xml:space' attribute
     inherit that value from the element in which they are contained, up
     to the root element. If the root element does not specify a value
     for the 'xml:space' attribute, the value 'default' is assumed.

   * It is possible to instruct the XML parser to supply the root
     element with the 'preserve' value for the 'xml:space' attribute, if
     no value is explicitly specified for it. (The exact mechanism is
     TBD.)

   * When expanding an entity reference, the value of the 'xml:space'
     attribute of the element in which the entity is expanded has no
     effect on the expansion of the entity.
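
The inheritance and preserve rules above can be sketched as follows,
assuming Python's standard xml.etree.ElementTree module as the document
model. The helper names resolve_space and apply_preserve are
hypothetical, as is the use of the 'inherited' argument to stand in for
the parser-level root setting whose mechanism this RFC leaves open.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the reserved XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def resolve_space(element, inherited="default", out=None):
    """Return (tag, effective xml:space) pairs in document order.

    An element without its own xml:space takes the value of its
    container; 'inherited' is the value assumed above the root.
    """
    if out is None:
        out = []
    mode = element.get(XML_SPACE, inherited)
    out.append((element.tag, mode))
    for child in element:
        resolve_space(child, mode, out)
    return out


def apply_preserve(text):
    """Pass text through unchanged, except for one new-line directly
    after the opening tag and one directly before the closing tag."""
    if text.startswith("\n"):
        text = text[1:]
    if text.endswith("\n"):
        text = text[:-1]
    return text


doc = ET.fromstring(
    '<doc><pre xml:space="preserve"><b/><c xml:space="default"/></pre>'
    '<p/></doc>'
)
modes = resolve_space(doc)
print(modes)
print(repr(apply_preserve("\n  a\n  b\n")))
```

Calling resolve_space(doc, inherited="preserve") models the case where
the parser is instructed to supply 'preserve' for a root element that
specifies no value of its own.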


6. Mixed Content vs. Element Content

XML element content is either made up only of elements (element
content), or consists of both elements and text (mixed or any content).
In the former case, all white space occurring before, after and between
elements in the element content is ignored, and all other characters are
reported as validation errors. In the latter case, white space occurring
between elements is subject to the preserving or consolidation rules.
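
The distinction can be sketched as a single decision on each run of
characters found between two child elements. This is an illustration
only: the function name between_elements and the string labels
"element" and "mixed" are assumptions for this example, and the sketch
assumes the content model is already known from the DTD and that
'xml:space' is 'default'.

```python
import re

_WS_RUN = re.compile(r"[ \t\n\r]+")


def between_elements(text, content_model):
    """Decide what to deliver for a run of characters between elements.

    content_model is "element" for element content, or "mixed" for
    mixed or any content. Returns None when nothing is delivered.
    """
    if content_model == "element":
        if _WS_RUN.fullmatch(text) or text == "":
            return None  # formatting white space, ignored
        # Anything else is not allowed in element content.
        raise ValueError("validation error: character data in element content")
    # Mixed or any content: consolidate per the default rules.
    return _WS_RUN.sub(" ", text)


print(between_elements("\n    ", "element"))
print(repr(between_elements("\n  and\t", "mixed")))
```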

This approach is clear and consistent, with the exception that
validating and non-validating parsers will parse the same document
differently. In some instances it is beneficial to parse documents
without the use of a DTD. In such instances it is recommended that the
document be made available without redundant spaces that would cause
excessive text nodes to be generated.


A. References

   * Extensible Markup Language (XML) 1.0 W3C Recommendation 10-Feb-98
     http://www.w3.org/TR/1998/REC-xml-19980210

   * SAX 1.0: The Simple API for XML
     http://www.megginson.com/SAX/

Received on Tuesday, 6 April 1999 07:44:59 UTC