Text(): Unifying DOM/Infoset/XPath/XSL from Paul Prescod on 1999-08-11 (www-xpath-comments@w3.org from July to September 1999)

From: Paul Prescod <paul@prescod.net>
Date: Wed, 11 Aug 1999 09:20:34 -0500
To: www-xml-infoset-comments@w3.org, www-dom@w3.org, www-xpath-comments@w3.org, "w3c-xml-linking-ig@w3.org" <w3c-xml-linking-ig@w3.org>
Message-ID: <37B186B2.F4C6CE93@prescod.net>

I think that we are on the verge of creating a terminological
incompatibility that will plague us for years to come. The term text()
is used in XPath to mean "the longest set of contiguous character
objects". In the DOM, however, text nodes may be adjacent. The infoset
has a concept of "characters."

The result of this divergence is already evident in Microsoft's
implementation of XPath (nee XQL) on the DOM. To get the XPath-correct
behavior, you must call the normalize hack (er, method) before you do an
XPath lookup -- no matter how expensive normalize may be in your
implementation.
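
A minimal sketch of the mismatch, using Python's xml.dom.minidom as a
stand-in DOM (an assumption for illustration; it is not the Microsoft
implementation mentioned above). Appending a text node leaves two
adjacent text nodes, and only normalize() merges them into the single
node that XPath's text() would expect:

```python
from xml.dom.minidom import parseString

def text_node_values(element):
    """Return the data of each child node, one entry per DOM node."""
    return [node.data for node in element.childNodes]

doc = parseString("<p>Hello</p>")
p = doc.documentElement

# The DOM allows adjacent text nodes: this adds a second one after "Hello".
p.appendChild(doc.createTextNode(", world"))
before = text_node_values(p)

# normalize() merges adjacent text nodes, at whatever cost the
# implementation incurs for walking and rewriting the subtree.
p.normalize()
after = text_node_values(p)
```

Here `before` holds two entries and `after` holds one, although the
document content is the same in both cases.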

We can do better. We can, um, normalize this terminology and make the
DOM easier to use at the same time.

I propose the following model:

A character is a character. No matter how it was encoded it is a
character. It is represented by a character node/info item. Its value
is a character value. APIs and query languages can hang extra
information on the character node if they want (the infoset does). DOM
users would almost never work with characters so they would be created
"on the fly" when they are asked for. Or the DOM could remove the
concept entirely...the model works without it. XSL does not support
character nodes.

A text node is a grouping of characters. The length of the node is
implementation-defined. You can have text nodes right beside text nodes
if you want. Text nodes cannot be reliably enumerated across
implementations.

This doesn't break XSL as much as you might think. It is rare to index
into XSL text nodes by number ("/text()[2]"). Typically you ask for all
of the text nodes and concatenate. The value of a text node is the
string that results from concatenating its characters.
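
As a rough sketch of why indexing is fragile while concatenation is
safe, again assuming Python's xml.dom.minidom as the DOM: the helper
names build, text_nodes and concatenated_text are hypothetical and
exist only for this example. Two "implementations" split the same
content into a different number of text nodes, yet the concatenated
string is identical:

```python
from xml.dom.minidom import Document, Node

def build(chunks):
    """Hypothetical helper: build a <p> element with one text node per chunk."""
    doc = Document()
    p = doc.createElement("p")
    doc.appendChild(p)
    for chunk in chunks:
        p.appendChild(doc.createTextNode(chunk))
    return p

def text_nodes(element):
    """Return the element's child text nodes in document order."""
    return [n for n in element.childNodes if n.nodeType == Node.TEXT_NODE]

def concatenated_text(element):
    """Concatenate the values of all child text nodes."""
    return "".join(n.data for n in text_nodes(element))

# Same content, different implementation-defined grouping of characters.
a = build(["Hello, world"])
b = build(["Hello", ", ", "world"])
```

Asking for the second text node succeeds on `b` and fails on `a`, but
`concatenated_text` returns the same string for both.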

A CDATA section node is a type of text node. In XSL these nodes may be
merged with adjacent text nodes (or may not). In the DOM these nodes are
a subclass of text nodes and so can be treated as text nodes. The value
of a CDATA section node is the string that results from concatenating
its characters.

A character data node is a grouping of text (and CDATA section)
nodes. Character data nodes are never adjacent because they always
extend as far as possible in both directions. Character data nodes can
be reliably counted and enumerated across implementations. Character
data nodes are optional (a "feature") in the DOM but any DOM
implementation that supports XPath will probably support them. The value
of a character data node is the string that results from concatenating
the values of its contained text nodes.
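
One way the character data layer could be computed on top of an
existing DOM, as a sketch only: the function name character_data_values
is hypothetical, and Python's xml.dom.minidom is assumed as the DOM.
Each maximal run of adjacent text and CDATA section children becomes
one character data value:

```python
from itertools import groupby
from xml.dom.minidom import Document, Node

# Node types that count as "text" for the purposes of grouping.
TEXTUAL = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)

def character_data_values(element):
    """Return one string per maximal run of adjacent text/CDATA children."""
    runs = groupby(element.childNodes, key=lambda n: n.nodeType in TEXTUAL)
    return ["".join(n.data for n in run) for is_text, run in runs if is_text]

doc = Document()
p = doc.createElement("p")
doc.appendChild(p)
p.appendChild(doc.createTextNode("one"))
p.appendChild(doc.createCDATASection(" & two"))  # adjacent to the text node
p.appendChild(doc.createElement("b"))            # an element breaks the run
p.appendChild(doc.createTextNode("three"))
p.appendChild(doc.createTextNode(" four"))       # adjacent text nodes

values = character_data_values(p)
```

The five children collapse to two character data values, one on each
side of the <b/> element, no matter how the text on either side
happened to be split.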

This model is a little bit more complicated than I would like (and a
little more complicated than what we have in the grove world) but we can
either have consistent complexity or overall complexity through
inconsistency. I think that consistency is the most important thing.

For want of a better place, I'm going to ask that discussion be
redirected to xml-linking unless I get a better suggestion.

 Paul Prescod

Received on Wednesday, 11 August 1999 11:08:34 UTC