change proposals for section 5 of the XPath 1.0 specification

In separate mail [1] I have noted some opportunities for improving the
definition of the data model in the XPath 1.0 spec.  This note
presents a set of concrete wording proposals to seize some of those
opportunities.  They are gathered together in groups according to
purpose.

[1] http://lists.w3.org/Archives/Public/www-xpath-comments/2010AprJun/0000.html

(1) One change is a simple typo correction.

In 5.7 Text Nodes, in para 2, for

     Thus, <![CDATA[<]]> in the source document will treated the same
     as &lt;.

insert "be" before "treated", to read

     Thus, <![CDATA[<]]> in the source document will be treated the
     same as &lt;.


(2) Several changes are intended to create a clean separation between
discussion of XML-to-datamodel mapping issues on the one hand, and on
the other the definition of the data model in the abstract.  They all
involve changes to passages in the current text of the form "There is
an X node for every X in the XML document."

The problem is that sentences like

     There is an element node for every element in the document.

have no useful specification-defined meaning.  It may or may not have
been a good decision, but the decision not to attempt an explicit
definition of the nature of things like "document", "element",
etc. was certainly a conscious decision on the part of at least some
members of the responsible working group.  The XML spec does not
prescribe an answer to the question how many 'b' elements occur in a
document which, after entity expansion, reads

     <a><b/><c><b/><b/></c></a>

For this case, the answers 1, 2, and 3 are all compatible with the XML
specification and with coherent positions on the nature of
element-hood.  The rest of the XPath 1.0 spec relies on there being
three element nodes for 'b' elements in the data model instance
corresponding to this document, so the definition of the data model
needs to guarantee that result.  At present, it does not.
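
As an illustration only (it is not part of the proposed wording), the
following sketch shows the one-node-per-occurrence behavior that the
rest of XPath 1.0 relies on.  It assumes Python's standard
xml.etree.ElementTree module as a stand-in for a tree builder; the
variable names are mine.

```python
import xml.etree.ElementTree as ET

# The sample document from the text, after entity expansion.
doc = "<a><b/><c><b/><b/></c></a>"
root = ET.fromstring(doc)

# Collect every element node named 'b' anywhere in the tree.
b_nodes = list(root.iter("b"))
count = len(b_nodes)

# Each occurrence of a string matching the "element" production yields
# its own node object, so the three nodes are pairwise distinct.
distinct = len({id(n) for n in b_nodes})

print(count, distinct)
```

The point of the proposals below is that the data model definition
itself, and not the habits of particular parsers, should guarantee
this count.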

All of the change proposals below are thus careful to define the XML
serial-form analogues of the XDM nodes as occurrences of strings in
the XML character sequence, and not to appeal to the XML spec for
non-existent rules concerning identity criteria for elements,
processing instructions, and comments.

(2a) In 5.2 Element Nodes, in the first paragraph, replace

     There is an element node for every element in the document.

with

     When a tree is constructed from an XML document, the tree contains
     one element node for every occurrence in the document of any
     string of characters matching the "element" production of [XML].

(2b) In 5.5 Processing Instruction Nodes, replace the first paragraph

     There is a processing instruction node for every processing
     instruction, except for any processing instruction that occurs
     within the document type declaration.

with

     When a tree is constructed from an XML document, the tree contains
     one processing instruction node for every occurrence in the
     document of any string of characters matching the "PI" (processing
     instruction) production of [XML], except for those that occur
     within the document type declaration.

(2c) In 5.6 Comment Nodes, for paragraph 1

     There is a comment node for every comment, except for any comment
     that occurs within the document type declaration.

substitute

     When a tree is constructed from an XML document, the tree contains
     one comment node for every occurrence in the document of any
     string of characters matching the "Comment" production of [XML],
     except for those occurring within the document type declaration.


(3) A few changes specify explicit rules for data model instances
which guarantee that they have legal XML serializations.

(3a) In 5.1 Root Node, for the first paragraph

     The root node is the root of the tree. A root node does not occur
     except as the root of the tree. The element node for the document
     element is a child of the root node. The root node also has as
     children processing instruction and comment nodes for processing
     instructions and comments that occur in the prolog and after the
     end of the document element.

substitute

     The root node is the root of the tree; alone among the nodes of
     the tree it has no parent. A root node does not occur except as
     the root of the tree. The element node for the document element is
     a child of the root node; it is the only element node among the
     root node's children. The root node also has as children
     processing instruction and comment nodes for processing
     instructions and comments that occur in the prolog and after the
     end of the document element.

That is: insert "; alone among the nodes of the tree it has no parent"
at the end of the first sentence, and "; it is the only element node
among the root node's children" at the end of the third sentence.  The
first makes explicit a necessary property of the parent relation in
XDM instances; the second guarantees that the serialization of the
document node will match production [1] of the XML specification.

(3b) In 5.3 Attribute Nodes, in paragraph 5

     An attribute node has an expanded-name and a string-value. The
     expanded-name is computed by expanding the QName specified in the
     tag in the XML document in accordance with the XML Namespaces
     Recommendation [XML Names]. The namespace URI of the attribute's
     name will be null if the QName of the attribute does not have a
     prefix.

insert at the end of the paragraph

     Any two distinct attribute nodes of the same parent element must
     have different expanded names.

(3c) In 5.4 Namespace Nodes, in paragraph 2

     A namespace node has an expanded-name: the local part is the
     namespace prefix (this is empty if the namespace node is for the
     default namespace); the namespace URI is always null.

append

     Any two distinct namespace nodes of the same parent element must
     have different expanded names.

Changes 3b and 3c together guarantee that the serialization of an
element will not violate the WF constraint "Unique Att Spec".
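
A minimal sketch of the uniqueness check these two rules imply,
assuming (my representation, not the spec's) that an expanded-name is
modeled as a (namespace URI, local part) pair with None standing for a
null namespace URI.  The function name is hypothetical.

```python
from collections import Counter

def duplicate_expanded_names(names):
    """Return the expanded names that occur more than once in `names`.

    Each name is a (namespace_uri, local_part) pair; None models a
    null namespace URI.  An empty result means the constraint holds.
    """
    counts = Counter(names)
    return [name for name, n in counts.items() if n > 1]

# Attribute nodes of one element: same local part, different namespace
# URIs, so the expanded names differ and the constraint is satisfied.
attrs_ok = [(None, "id"), ("http://example.org/ns", "id")]

# Two attribute nodes with the same expanded name: not allowed.
attrs_bad = [("http://example.org/ns", "id"), ("http://example.org/ns", "id")]

# Namespace nodes: the local part is the prefix ("" for the default
# namespace) and the namespace URI of the name is always null.
ns_ok = [(None, ""), (None, "p")]

print(duplicate_expanded_names(attrs_ok))
print(duplicate_expanded_names(attrs_bad))
print(duplicate_expanded_names(ns_ok))
```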

(4) One change is intended to set the stage rhetorically for the
slightly more formal definition of the data model presented in change
proposal 5 below.

In section 5, change the first paragraph, which currently reads

     XPath operates on an XML document as a tree. This section
     describes how XPath models an XML document as a tree. This model
     is conceptual only and does not mandate any particular
     implementation. The relationship of this model to the XML
     Information Set [XML Infoset] is described in [B XML Information
     Set Mapping].

to read

     XPath operates on an XML document as a tree. This section
     describes how XPath models an XML document as a tree ^by defining
     a set of constraints on the nodes of the tree and on the relations
     holding between nodes^. This model is conceptual only and does not
     mandate any particular implementation. The relationship of this
     model to the XML Information Set [XML Infoset] is described in [B
     XML Information Set Mapping].

The ^...^ mark the only change, an insertion.

(5) Finally, one larger change is intended to make the definition of
the data model be wholly independent of the XML specification, and to
ensure that the definition of the data model guarantees that all
instances of the data model have the properties relied upon elsewhere
in the XPath 1.0 spec.

In section 5, delete the two paragraphs

     There is an ordering, document order, defined on all the nodes in
     the document corresponding to the order in which the first
     character of the XML representation of each node occurs in the XML
     representation of the document after expansion of general
     entities. Thus, the root node will be the first node. Element
     nodes occur before their children. Thus, document order orders
     element nodes in order of the occurrence of their start-tag in the
     XML (after expansion of entities). The attribute nodes and
     namespace nodes of an element occur before the children of the
     element. The namespace nodes are defined to occur before the
     attribute nodes. The relative order of namespace nodes is
     implementation-dependent. The relative order of attribute nodes is
     implementation-dependent. Reverse document order is the reverse of
     document order.

     Root nodes and element nodes have an ordered list of child
     nodes. Nodes never share children: if one node is not the same
     node as another node, then none of the children of the one node
     will be the same node as any of the children of another
     node. Every node other than the root node has exactly one parent,
     which is either an element node or the root node. A root node or
     an element node is the parent of each of its child nodes. The
     descendants of a node are the children of the node and the
     descendants of the children of the node.

and insert the following:

     Each tree consists of a set of nodes and two binary relations on
     those nodes, named parent and next-sibling, which satisfy the
     following constraints:

       - The parent relation is a function from nodes to element
         nodes or root nodes: that is, each node has at most one
         parent, which is either an element node or the root node.

         The parent of an attribute node or a namespace node is an
         element node (not a root node).

       - The next-sibling relation is likewise a function, from (and
         to) nodes other than attribute nodes and namespace nodes.
         Each node which is neither an attribute node nor a namespace
         node has at most one next sibling, which is also neither an
         attribute node nor a namespace node.

         If (and only if) any two nodes are related by the next-sibling
         relation (that is, if it is possible to start at one node and
         reach the other by traversing the next-sibling relation one or
         more times), then the two nodes are *siblings*.

       - The two relations are acyclic: it is never possible, by
         following the parent or next-sibling relation repeatedly from
         node to node, to return to a node already visited.

       - Each tree contains exactly one root node.

       - Each node other than the root node has a parent; the root
         node has none.

             Note: it follows that from any node in the tree it is
             possible to reach the root by traversing the parent
             relation repeatedly (zero or more times).

       - If any two nodes are siblings, then those two nodes have the
         same parent.

       - Conversely, if any two nodes other than attribute or
         namespace nodes have the same parent, then they are siblings.

     Several other terms and relations on the nodes of a tree can be
     defined in terms of the parent and next-sibling relations.

     The attribute nodes and namespace nodes which have an element node
     as their parent are called the attribute nodes, or the namespace
     nodes, of that element node.  All other nodes which have a node as
     their parent are called the children of that node.  Formally,
     these facts can be represented by relations called attributes-of,
     namespace-nodes-of, and child.  The union of the attributes-of,
     namespace-nodes-of, and child relations is the inverse of the
     parent relation.

     The positive transitive closure of the parent relation is the
     ancestor relation.  The positive transitive closure of the child
     relation is the descendant relation.  That is, for any nodes A and
     B, if the pair (A -> B) is a member of the ancestor relation,
     then B is an ancestor of A.  Similarly, if the descendant
     relation includes (A -> B), then B is a descendant of A.

     The inverse of the next-sibling relation is the previous-sibling
     relation.  The positive transitive closures of the next-sibling
     and previous-sibling relations are the following-sibling and
     preceding-sibling relations, respectively.

         Note: it follows from the definitions given that the
         next-sibling relation defines a total order on the children of
         any node.

     A total order, called *document order*, is defined on all nodes of
     the tree, as described below.  It is convenient to write A << B
     for two nodes A and B, if A precedes B in document order, or
     A >> B if B precedes A in document order.

       1 Parents precede their namespace nodes, attributes, and
       children in document order.

       2 For any two nodes A and B, where B is the next-sibling of A, A
       itself, every descendant of A, and every attribute node or
       namespace node of A or of any descendant of A precedes B in
       document order.

       3 The attribute nodes and namespace nodes of an element precede
       the children of that element in document order.  The namespace
       nodes of an element precede the attribute nodes of that element
       in document order; otherwise, the relative order of the
       attribute nodes and namespace nodes of a given element is
       implementation-dependent.

     Reverse document order is the inverse of document order.

       Note: from the definition of document order as a total order, it
       follows that document order is:

         transitive: for any nodes A, B, and C, if A << B and B << C,
            then A << C.

         irreflexive: for no node A is it true that A << A.

         antisymmetric: for any nodes A and B, if A << B then it is not
           true that B << A.

         complete: for any two nodes A and B, either A << B or B << A.

     Any tree defined by the relations thus described has several
     properties which may be usefully mentioned here:

       - The root node is the first node in document order and
         precedes all other nodes.

       - The relative order of nodes other than attribute nodes and
         namespace nodes corresponds to the relative order in which the
         first character of the XML representation of each node occurs
         in the XML representation of the document.

             Note: This assumes a one-to-one correspondence between the
             nodes of the tree and the individual occurrences, in an
             XML document in which all general entity references have
             been expanded, of strings matching the corresponding
             grammatical rules of [XML] -- one element node for each
             occurrence of a string matching the 'element' production,
             and vice versa, etc.

       - Nodes never share children, attributes, or namespace nodes.
         For any nodes A, B, and C, if A is both a child of B and a
         child of C, then B and C are the same node.  And similarly for
         attribute nodes and namespace nodes.

       - No node is its own ancestor, descendant, or sibling.

       - The root node has no siblings.

     Some additional constraints on instances of the data model are
     given in the following sub-sections.
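
For readers who find it easier to check such definitions against code,
here is a rough sketch (illustration only, not proposed spec text) of
how children, ancestors, and document order can be derived from the
parent and next-sibling relations alone.  The Node class, the function
names, and the sample tree are all my own assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(eq=False)  # compare nodes by identity, not by field values
class Node:
    kind: str                              # "root", "element", "attribute", "namespace", ...
    name: str = ""
    parent: Optional["Node"] = None        # the parent relation
    next_sibling: Optional["Node"] = None  # the next-sibling relation

def children(node, nodes):
    """Children of `node`, in the order given by next-sibling."""
    kids = [n for n in nodes
            if n.parent is node and n.kind not in ("attribute", "namespace")]
    if not kids:
        return []
    # The first child is the one that is nobody's next sibling.
    targets = {id(k.next_sibling) for k in kids if k.next_sibling is not None}
    current = next(k for k in kids if id(k) not in targets)
    ordered = []
    while current is not None:
        ordered.append(current)
        current = current.next_sibling
    return ordered

def ancestors(node):
    """Positive transitive closure of parent, nearest ancestor first."""
    found = []
    p = node.parent
    while p is not None:
        found.append(p)
        p = p.parent
    return found

def document_order(node, nodes):
    """Subtree of `node` in document order: the node itself, then its
    namespace nodes, then its attribute nodes, then each child subtree."""
    own = [n for n in nodes if n.parent is node]
    result = [node]
    result += [n for n in own if n.kind == "namespace"]
    result += [n for n in own if n.kind == "attribute"]
    for child in children(node, nodes):
        result.extend(document_order(child, nodes))
    return result

# Tree for <a><b/><c id="..."><b/><b/></c></a>
root = Node("root")
a = Node("element", "a", parent=root)
b1 = Node("element", "b", parent=a)
c = Node("element", "c", parent=a)
b1.next_sibling = c
attr = Node("attribute", "id", parent=c)
b2 = Node("element", "b", parent=c)
b3 = Node("element", "b", parent=c)
b2.next_sibling = b3
nodes = [b3, attr, c, root, b2, a, b1]  # deliberately unordered

order = document_order(root, nodes)
labels = [f"{n.kind}:{n.name}" for n in order]
print(labels)
```

Note that nothing here consults the XML serialization: the order falls
out of the two relations, which is the independence change 5 aims at.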

-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com
* http://cmsmcq.com/mib
* http://balisage.net
****************************************************************

Received on Friday, 23 April 2010 02:14:47 UTC