- From: Paul Prescod <paul@prescod.net>
- Date: Wed, 11 Aug 1999 09:20:34 -0500
- To: www-xml-infoset-comments@w3.org, www-dom@w3.org, www-xpath-comments@w3.org, "w3c-xml-linking-ig@w3.org" <w3c-xml-linking-ig@w3.org>
I think that we are on the verge of creating a terminological incompatibility that will plague us for years to come. The term text() is used in XPath to mean "the longest set of contiguous character objects". In the DOM, however, text nodes may be adjacent. The infoset has a concept of "characters."

The result of this divergence is already evident in Microsoft's implementation of XPath-(nee XQL)-on-the-DOM. To get the XPath-correct behavior, you must call the normalize hack (er, method) before you do an XPath lookup -- no matter how expensive normalize may be in your implementation.

We can do better. We can, um, normalize this terminology and make the DOM easier to use at the same time. I propose the following model:

A character is a character. No matter how it was encoded, it is a character. It is represented by a character node/info item. Its value is a character value. APIs and query languages can hang extra information on the character node if they want (the infoset does). DOM users would almost never work with characters, so they would be created "on the fly" when they are asked for. Or the DOM could remove the concept entirely...the model works without it. XSL does not support character nodes.

A text node is a grouping of characters. The length of the node is implementation-defined. You can have text nodes right beside text nodes if you want. Text nodes cannot be reliably enumerated across implementations. This doesn't break XSL as much as you might think. It is rare to index into XSL text nodes by number ("/text()[2]"). Typically you ask for all of the text nodes and concatenate. The value of a text node is the string that results from concatenating its characters.

A CDATA section node is a type of text node. In XSL these nodes may be merged with adjacent text nodes (or may not). In the DOM these nodes are a subclass of text nodes and so can be treated as text nodes.
The value of a CDATA section node is the string that results from concatenating its characters.

A character data node is a grouping of text nodes (and CDATA section nodes). Character data nodes are never adjacent because they always extend as far as possible in both directions. Character data nodes can be reliably counted and enumerated across implementations. Character data nodes are optional (a "feature") in the DOM, but any DOM implementation that supports XPath will probably support them. The value of a character data node is the string that results from concatenating the values of its contained text nodes.

This model is a little bit more complicated than I would like (and a little more complicated than what we have in the grove world), but we can either have consistent complexity or overall complexity through inconsistency. I think that consistency is the most important thing.

For want of a better place, I'm going to ask that discussion be redirected to xml-linking unless I get a better suggestion.

Paul Prescod
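
A rough sketch of the distinction drawn above between text nodes and character data nodes, using Python's xml.dom.minidom as a stand-in DOM. The helper name character_data_runs and the sample document are hypothetical, made up for illustration; the helper is only one way a "character data node" could be approximated on top of an existing DOM, not part of any specification or implementation.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

TEXT_LIKE = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)


def character_data_runs(element):
    """Return one string per maximal run of adjacent text/CDATA children.

    Each string approximates the value of one "character data node":
    the run extends as far as possible in both directions, so two runs
    are never adjacent.
    """
    runs = []
    current = None  # list of pieces in the run being built, or None
    for child in element.childNodes:
        if child.nodeType in TEXT_LIKE:
            if current is None:
                current = []
            current.append(child.data)
        elif current is not None:
            runs.append("".join(current))
            current = None
    if current is not None:
        runs.append("".join(current))
    return runs


def count_text_like(element):
    """Count text and CDATA section children (implementation-dependent)."""
    return sum(1 for c in element.childNodes if c.nodeType in TEXT_LIKE)


doc = minidom.parseString("<p>one<b/>two</p>")
p = doc.documentElement

# Create adjacent text nodes by hand, as any DOM client is allowed to do.
p.appendChild(doc.createTextNode(" three"))
p.appendChild(doc.createCDATASection(" <four>"))

count_before = count_text_like(p)
runs_before = character_data_runs(p)

p.normalize()  # merges adjacent text nodes in place

count_after = count_text_like(p)
runs_after = character_data_runs(p)

print(count_before, count_after)  # text node counts differ
print(runs_before == runs_after)  # character data runs do not
```

The point of the sketch is that the number of text nodes depends on how the tree was built and whether normalize was called, while the character data runs can be counted the same way either way.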
Received on Wednesday, 11 August 1999 11:08:35 UTC