RFC: White Space Handling In XML Parsing

From: Arkin <arkin@trendline.co.il>
Date: Tue, 06 Apr 1999 07:43:58 -0400
Message-ID: <3709F37E.1A7E0968@trendline.co.il>
To: "www-dom@w3.org" <www-dom@w3.org>

White Space Handling In XML Parsing

Status: RFC first draft
Editor: arkin (arkin@openxml.org)
Original copy: http://www.openxml.org/dev/rfc-wshp.html

1. Abstract

White space handling is an unresolved issue in the present definition of
XML parsers, falling outside the scope of both the DOM specification and
the SAX API. This is a recommendation for the behavior of XML parsers
with regard to white space appearing in the source document, and which
parts of it are to be delivered to the application.

This RFC is published and made available for public review in an open
process. We encourage parser developers to take part in formulating the
final specification and to abide by it, in an effort to provide a
behavioral model that will allow applications and documents to be used
across a variety of parsers.

2. The Problem

White space is defined by XML as any character of the set space, tab and
new-line. Carriage-return is always conveyed to the parser as a
new-line. (See sections 2.3 and 2.11 in the XML specification.)

White space serves two distinct purposes. The first is to introduce
spaces and line breaks into the element content in a manner that has
semantic significance for the XML application, whether this is to
separate words and textual parts, to describe visual formatting, or
otherwise.

The second is the use of white space to visually format the document in
source form, e.g. when using a text editor to edit an XML file. Such use
of white space is purely to assist the reader or editor of the document.
Such white space is not part of the information conveyed by the document
and bears no semantic significance for the XML application.

An XML parser that regards all white space as part of the element
content and as conveying information might, with liberally formatted
documents, deliver redundant spaces to the application, affecting
performance and memory consumption. In addition, the application must
employ special processing to remove such white space.
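
The effect can be observed with an existing parser. The following is a
minimal sketch, assuming Python's standard xml.dom.minidom as the parser;
the sample document and the helper name whitespace_only_text_nodes are
illustrative and not part of this RFC. It lists the text nodes under the
root element that consist only of source formatting.

```python
from xml.dom import minidom

# An illustrative, liberally formatted source document (not from the RFC).
SOURCE = """<list>
  <item>one</item>
  <item>two</item>
</list>"""


def whitespace_only_text_nodes(xml_text):
    """Return the white-space-only text nodes directly under the root element."""
    root = minidom.parseString(xml_text).documentElement
    return [
        node.data
        for node in root.childNodes
        if node.nodeType == node.TEXT_NODE and node.data.strip() == ""
    ]


redundant = whitespace_only_text_nodes(SOURCE)
print(len(redundant), "white-space-only text nodes delivered")
```

Each of these nodes exists only because the source was indented; the
same document written on a single line yields none of them.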

3. Notation

XML Application
     An application that manipulates XML information delivered to it as
     a document model. The XML application is interested in the document
     model and the information contained in it, but not in the source
     document proper. This definition is different from the use in the
     XML specification, where application is used to describe the part
     of the parser that builds the document model.

XML Parser
     An integral software component that, given an XML source document,
     returns a document model representing the information and structure
     conveyed by that source document. The XML parser is a superset of
     the XML processor described in the XML specification.

Document Model
     The document model returned by the XML parser is equivalent to one
     created in a programmatic fashion. The DOM document tree is one
     such document model. The events triggered by a SAX parser are not
     considered a document model as they demand further processing.
     However, a document handler may process them and fire different
     events that can be considered a document model.

4. Scope And Effect

This specification defines a contract between the XML source document
and the XML parser. The contract clearly defines what portions of the
white space appearing in the source document are a meaningful part of
the document content and must be delivered to the application, and what
portions of the white space only serve to format the document source and
should not be delivered.

Given a source document that contains both types of white space, the XML
parser aims to produce a document model that contains neither less nor
more than the meaningful information expressed in the source document,
so that the document model is equivalent to one generated in a
programmatic fashion.

This specification is limited to white space appearing in mixed and
element content, that is, all characters appearing between the opening
and closing tags of an element that are not part of any markup. White
space that appears in attribute values, or as part of markup, is outside
the scope of this specification.

This specification assumes that the application is not interested in
processing redundant white space unless specifically expressed by the
application, and that the document itself is capable of distinguishing
between relevant and redundant white space. As such, this specification
has no implication on the handling of white space as defined in XSL, XQL
and other processing languages.

The behavior of the parser with regard to white space is to be defined
in a clear, consistent and conclusive manner so as to allow applications
and documents to be used consistently with different parsers. The same
consistency is to be applied to the manner in which the application and
the document exert control over white space handling.

5. Proposed Handling Behavior

The proposed white space handling behavior is expressed as two rule
sets. The first rule set consists of implicit rules that apply if no
white space handling behavior is explicitly specified. The second rule
set defines explicit behavior and how to bring it into effect.

5.1. Default Behavior

   * The first sequence of white space immediately after the opening tag
     and the last sequence of white space immediately before the closing
     tag are ignored.

   * All white space characters other than space (tab and new-line) are
     translated into a space character, and each run of multiple space
     characters is consolidated into a single space.

   * A sequence of white space occurring between any two markups
     (elements, comments, processing instructions, CDATA sections),
     except when appearing between two elements, is ignored.

   * A sequence of white space occurring between two elements is ignored
     if the element is defined to have element content. If the element
     is defined to have mixed content, such white space is treated
     according to the first two rules.

   * White space introduced through expansion of character references
     (e.g. &#32;) or entity references is preserved, and not considered
     white space per the above rules. However, white space appearing in
     the entity declaration is subject to the parsing rules at the time
     of parsing the entity declaration.

   * CDATA sections preserve all white space occurring between the
     opening <![CDATA[ and closing ]]>.
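
The first two default rules can be sketched as a short Python function.
This is an illustration only, under stated assumptions: the function
name normalize_default is hypothetical, the input is the raw text of an
element that contains no child markup, and white space introduced by
character or entity references (which the rules above preserve) is not
distinguished from literal white space.

```python
import re

# The white space set named in section 2. Carriage-return is assumed to
# have already been conveyed as new-line.
XML_WHITE_SPACE = " \t\n"


def normalize_default(text):
    """Apply the first two default rules to the text content of one element."""
    # Rule 1: ignore the white space run immediately after the opening tag
    # and the run immediately before the closing tag.
    trimmed = text.strip(XML_WHITE_SPACE)
    # Rule 2: translate tab and new-line into space, and consolidate each
    # run of white space into a single space.
    return re.sub(r"[ \t\n]+", " ", trimmed)


print(repr(normalize_default("\n   Hello,\n\tworld  \n")))
```

A parser following these rules would apply the same steps while
building the document model, so the application never sees the
formatting white space.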

5.2. Specified Behavior

   * An element requests that white space be preserved by specifying the
     attribute 'xml:space' and using the value 'preserve'. The element
     may specify this attribute explicitly or inherit it from the
     document type definition. It is recommended that elements specify
     this attribute explicitly.

   * Preserving implies that white space is passed as is to the
     application, without any transformation or loss, with the exception
     that, if the first character after the opening tag is a new-line or
     the last character before the closing tag is a new-line, they are
     discarded.

   * Elements that do not specify a value for the 'xml:space' attribute
     inherit that value from the element in which they are contained, up
     to the root element. If the root element does not specify a value
     for the 'xml:space' attribute, the value 'default' is assumed.

   * It is possible to instruct the XML parser to supply the root
     element with the 'preserve' value for the 'xml:space' attribute, if
     no value is explicitly specified for it. (The exact mechanism is
     TBD.)

   * When expanding an entity reference, the value of the 'xml:space'
     attribute of the element in which the entity is expanded has no
     effect on the expansion of the entity.
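
The inheritance rule can be sketched as follows, using Python's standard
xml.etree.ElementTree. The names effective_space, root_default and
preserve_text are hypothetical. The root_default parameter stands in for
the parser option whose mechanism is still to be defined, and
preserve_text assumes that the exception for preserved content means a
single leading or trailing new-line is dropped.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the reserved xml: prefix under its namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def effective_space(root, root_default="default"):
    """Map each element to the 'xml:space' value in effect for it."""
    result = {}

    def visit(elem, inherited):
        # An explicit attribute wins; otherwise inherit from the container.
        value = elem.get(XML_SPACE, inherited)
        result[elem] = value
        for child in elem:
            visit(child, value)

    visit(root, root_default)
    return result


def preserve_text(text):
    """Pass text through unchanged, except for one new-line at either end."""
    if text.startswith("\n"):
        text = text[1:]
    if text.endswith("\n"):
        text = text[:-1]
    return text


doc = ET.fromstring('<doc><pre xml:space="preserve"><b/></pre><p/></doc>')
for elem, mode in effective_space(doc).items():
    print(elem.tag, mode)
```

Passing root_default="preserve" models a parser that has been
instructed to treat the root element as preserving when the document
does not say otherwise.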

6. Mixed Content vs. Element Content

XML element content is either made up only of elements (element
content), or consists of both elements and text (mixed or any content).
In the former case, all white space occurring before, after and between
elements in element content is ignored, and all other characters are
reported as validation errors. In the latter case, white space occurring
between elements is subject to the preserving or consolidation rules.

This approach is clear and consistent, with the exception that
validating and non-validating parsers will parse the same document
differently. In some instances it is beneficial to parse documents
without the use of a DTD. In such instances it is recommended that the
document be available without redundant spaces that will cause excessive
text nodes to be created.
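
The distinction can be sketched in Python with xml.etree.ElementTree,
which does not read content models from a DTD. In this illustration the
set element_content_tags is a hypothetical stand-in for the information
a validating parser would take from the DTD, and the function name is
likewise an assumption. The sketch does not report other characters in
element content as validation errors.

```python
import xml.etree.ElementTree as ET


def drop_element_content_space(root, element_content_tags):
    """Remove white-space-only text inside elements that have element content."""
    for elem in root.iter():
        if elem.tag not in element_content_tags:
            # Mixed content: left for the preserving or consolidation rules.
            continue
        # White space before the first child element.
        if elem.text is not None and elem.text.strip(" \t\n") == "":
            elem.text = None
        # White space after each child element.
        for child in elem:
            if child.tail is not None and child.tail.strip(" \t\n") == "":
                child.tail = None
    return root


doc = ET.fromstring("<list>\n  <item>a <b>x</b> y</item>\n  <item>z</item>\n</list>")
drop_element_content_space(doc, {"list"})
print(ET.tostring(doc, encoding="unicode"))
```

With an empty set, which corresponds to parsing without a DTD, nothing
is removed and every formatting run between elements remains in the
document model.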

A. References

   * Extensible Markup Language (XML) 1.0 W3C Recommendation 10-Feb-98

   * SAX 1.0: The Simple API for XML

Received on Tuesday, 6 April 1999 07:44:59 UTC
