comments on section 5 of the XPath 1.0 specification

The definition of the XPath 1.0 data model in section 5 of the
specification at

   http://www.w3.org/TR/xpath/#data-model
   http://www.w3.org/TR/1999/REC-xpath-19991116/#data-model

seems to me to offer some small opportunities for improvement.  This
mail identifies some of them.  Tools now available make it somewhat
easier today than it was in 1999 to check the logical consequences of
sets of definitions and axioms, and in applying one of those tools to
the XPath 1.0 data model it becomes clear that the current definition
of the data model in section 5 has some gaps in places where it would
probably be better not to have gaps.

(1) It would be desirable, I think, for the definition of the data
model to be formulated in terms of nodes and their relations, without
reference to the XML spec.  It should be easy to see that every
instance of the data model corresponds to an XML document, and every
XML document to an instance of the data model, but the two should be
defined independently of each other.

Large parts of the current text appear to aspire to this goal of
defining the data model independently of the XML spec, but the goal is
not fully achieved.

(2) The current text refers to properties of XML serializations in
describing document order, e.g. in

   There is an ordering, document order, defined on all the nodes in
   the document corresponding to the order in which the first character
   of the XML representation of each node occurs in the XML
   representation of the document after expansion of general entities.

The reference may be taken in either of two ways: (a) as a
non-normative observation about the properties of document order as
defined by normative rules elsewhere, or (b) as a normative appeal to
properties of the XML serialization.

Neither seems a wholly satisfactory reading.  In the former case (a),
the non-normative observation is false: the properties in question are
not in fact guaranteed by the normative rules elsewhere in section 5.  In
the latter case (b), the normative appeal to XML undercuts the
independence of the definition of the data model, and it also requires
a clear mapping from data model constructs to parse trees for XML
documents.  The current text of the spec does not provide such an
explicit mapping.  In case (b), also, some rules explicitly stated in
section 5 would become redundant.  (If document order is normatively
defined as ordering elements in the order of their start-tags, for
example, then it is unnecessary to say, as the same paragraph does,
that parents precede their children.)

Whatever the originally intended rhetorical function of the sentence
quoted, I think it would be better to distinguish clearly between
essential normative statements and informative observations about the
logical consequences of those normative statements, and either to
define document order for data model instances completely without
appeal to the properties of an XML serialization (or source), with a
clearly non-normative statement about the relation between document
order and serial-XML order, or else to define document order entirely
in terms of XML order, and make clear that remarks about ancestor nodes
preceding descendant nodes are not themselves normative but are
informative statements describing some individual consequences of the
normative definition.  (I would favor the former, not the latter,
approach, since it makes clearer that XSLT can work on any instance of
the data model, not only on those generated by parsing an XML
character stream.)

(3) The rest of the XPath 1.0 spec relies on the proposition that
document order is a total order, not a partial order, on the nodes of
the document. It would be helpful if that proposition followed
logically from the definition of the data model; in the current text
it does not.
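
To make the distinction concrete, here is a minimal Python sketch.
It assumes a hypothetical toy document (a root with two child
elements named E and F) and represents "precedes" as a set of pairs;
none of these names come from the spec.  If the only constraint
imposed is that parents precede their children, the relation that
results is not total, because the two siblings are incomparable.

```python
from itertools import combinations

def is_total(nodes, precedes):
    """Return True if every pair of distinct nodes is comparable."""
    return all((x, y) in precedes or (y, x) in precedes
               for x, y in combinations(nodes, 2))

# Hypothetical toy document: a root with two child elements, E and F.
nodes = ["root", "E", "F"]

# The only constraint imposed here: a parent precedes its children.
parents_first = {("root", "E"), ("root", "F")}

# The same relation with the relative order of E and F also fixed.
with_sibling_order = parents_first | {("E", "F")}

print(is_total(nodes, parents_first))       # False: E and F are incomparable
print(is_total(nodes, with_sibling_order))  # True
```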

(4) This reader's intuitive understanding of the XPath 1.0 spec as a
whole (and, I believe, most readers' intuitive understandings) is that
the "ordered list of child nodes" possessed by the root node and by
element nodes is related to document order in that for any elements E
and F, if E precedes F in some ordered list of child nodes, then E
precedes F in document order.  It would be helpful if this relation
were specified in the definition of the data model.
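
The relation in question can be sketched as a simple check.  The
following Python is only an illustration under assumed
representations: child lists as a dictionary from parent to ordered
list, document order as a list of all nodes, and a hypothetical toy
document with elements E and F.

```python
def children_agree_with_document_order(child_lists, document_order):
    """Return True if every ordered list of children is in document order.

    child_lists maps a parent to its ordered list of children;
    document_order lists all nodes, earliest first.
    """
    position = {node: i for i, node in enumerate(document_order)}
    for kids in child_lists.values():
        # Comparing neighbours suffices: list positions are totally ordered.
        for earlier, later in zip(kids, kids[1:]):
            if position[earlier] >= position[later]:
                return False
    return True

child_lists = {"root": ["E", "F"]}   # hypothetical toy document

print(children_agree_with_document_order(child_lists, ["root", "E", "F"]))  # True
print(children_agree_with_document_order(child_lists, ["root", "F", "E"]))  # False
```

Note that in both candidate orders the parent precedes its children,
so that rule alone does not distinguish them.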

(5) The expected behavior of the axes (as I understand them) relies on
the proposition that no node's ordered list of children contains
duplicates.  Several equivalent formulations of this rule are
possible:

   Each node has at most one immediately following sibling and at most
   one immediately preceding sibling.

   The binary node -> node relations underlying the following-sibling
   and preceding-sibling axes are acyclic.

   No node is its own sibling.

   The number of a parent's children is equal to the length of that
   parent's ordered list of child nodes.

It would be good, I think, if this proposition were clearly entailed
by the definition of the data model.  In the current state of the
spec, this is not the case.
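
The last formulation lends itself to a mechanical check.  What
follows is a minimal sketch in Python, assuming a hypothetical Node
class in which nodes are compared by identity rather than by content;
it is not drawn from the spec or from any implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(eq=False)   # nodes compare by identity, not by content
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)

def child_list_has_duplicates(node):
    """Return True if some node occurs more than once among node.children."""
    distinct = {id(child) for child in node.children}
    return len(distinct) != len(node.children)

a, b = Node("a"), Node("b")
ordinary = Node("p", [a, b])
odd = Node("q", [a, b, a])                     # the same node listed twice
same_names = Node("r", [Node("a"), Node("a")])  # two distinct nodes, one name

print(child_list_has_duplicates(ordinary))    # False
print(child_list_has_duplicates(odd))         # True
print(child_list_has_duplicates(same_names))  # False
```

The third case is the ordinary one of two sibling elements that
happen to look alike; only the second violates the proposition.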

Some readers point to section 5.2 and the sentence "There is an
element node for every element in the document" as entailing the
proposition stated above, but either this is an appeal to the XML
specification or it is not.  If it is not, then it appears to be a
circular argument.  If it is, then it is ineffective because the XML
specification does not define 'element' in a way that allows one to
say with certainty whether different occurrences of the same sequence
of character types count as different elements or as different
occurrences of the same element.

(6) Sentences of the form "There is one node of type N for every N
construct" appear not only where N is "element" but also for other
constructs (e.g. processing instructions and comments).  I take these
statements as a partial description of a mapping from XML documents or
information sets to XPath 1.0 data model instances.  It would be
desirable, I think, to separate discussion of XML-to-datamodel mapping
from definition of the data model in the abstract.

Also, as noted above, these statements appeal to the XML spec for a
conception of identity of elements, processing instructions, comments,
etc. which the XML spec does not in fact provide.  I believe the
principle underlying these statements is, roughly, that for element
nodes, PI nodes, and comment nodes, there should be one node of
appropriate type in the data model instance for each occurrence in the
XML document of any string of characters matching the corresponding
production in the XML grammar.  That principle can, and probably
should, be stated without having to assume some particular view on
whether elements, processing instructions, and comments are by nature
sequences of character types, sequences of character tokens,
occurrences of sequences of character types, or something else.

(7) The parent relation is, I think, generally thought to be the inverse
of the union of the child, attribute, and namespace relations; it
would be good, I think, if the definition of the data model said this
explicitly.
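
Stated as a computation over relations, the idea looks like this.
The Python below is a sketch only; the node names and the
representation of each relation as a set of (source, target) pairs
are assumptions made for illustration.

```python
def inverse(relation):
    """Swap the two members of every pair in the relation."""
    return {(y, x) for (x, y) in relation}

# Hypothetical toy instance: the root has one element child e, and e
# carries one attribute node a and one namespace node n.
child = {("root", "e")}
attribute = {("e", "a")}
namespace = {("e", "n")}

parent = inverse(child | attribute | namespace)

print(sorted(parent))
# [('a', 'e'), ('e', 'root'), ('n', 'e')]
```

Here e is the parent of a and of n even though neither is a child of
e, which is why the union is needed and the child relation alone does
not suffice.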

(8) Some statements in section 5 which appear to be normative are
redundant: they are consequences of other normative statements.  In
some cases, of course, the converse is also true.  For example, the
proposition that nodes never share children follows logically from the
proposition that every node other than the root has exactly one
parent.  Similarly for the proposition that elements never share
attributes (or namespace nodes).

It would be slightly better, perhaps, if it were feasible to make a
clear distinction between normative statements and non-normative
mentions of the logical consequences of the normative statements.

(9) In the spirit of keeping data model instances close to XML and
ensuring that each legal instance is serializable in XML, it would
probably be a good idea to specify in the definition of the data model
that no two attributes on the same element share an expanded name.
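
Such a constraint is easy to check.  A minimal Python sketch,
assuming that an expanded name is represented as a pair of namespace
URI (or None) and local part, and that an element's attributes are
given as a list of such pairs:

```python
from collections import Counter

def repeated_attribute_names(expanded_names):
    """Return the expanded names that occur more than once."""
    counts = Counter(expanded_names)
    return [name for name, n in counts.items() if n > 1]

XLINK = "http://www.w3.org/1999/xlink"

# Same local part, different namespace URIs: no clash.
distinct = [(None, "href"), (XLINK, "href")]

# The same expanded name twice: not serializable as one XML start-tag.
clashing = [(None, "id"), (None, "id")]

print(repeated_attribute_names(distinct))  # []
print(repeated_attribute_names(clashing))  # [(None, 'id')]
```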

(10) In a tree, the parent relation is acyclic.  It would probably be
a good idea if the definition of the data model said explicitly that
the parent relation in the data model is acyclic.  (It may seem to
follow from the analogy to human family relations that parenthood and
ancestry are naturally acyclic, but some readers, at least, of the
XPath specification will be familiar with the humorous song "I'm my
own grandpa", which exhibits a counter-example to that general rule.)
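
Acyclicity is straightforward to test.  The sketch below is
illustrative Python, assuming the parent relation is given as a
dictionary from each non-root node to its single parent; the node
names are hypothetical.

```python
def parent_relation_is_acyclic(parent):
    """Return True if following parent links never revisits a node."""
    for start in parent:
        seen = {start}
        node = parent.get(start)
        while node is not None:
            if node in seen:
                return False
            seen.add(node)
            node = parent.get(node)   # None once the root is reached
    return True

tree = {"e": "root", "f": "e"}   # root > e > f
loop = {"a": "b", "b": "a"}      # each node is the other's parent

print(parent_relation_is_acyclic(tree))  # True
print(parent_relation_is_acyclic(loop))  # False
```

In the second instance every node has exactly one parent, so a rule
about the number of parents does not by itself exclude the cycle.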


Interested or skeptical readers may wish to consult the following for
further discussion; the first few are postings in my blog and the last
two are Alloy models formalizing various aspects of the XPath 1.0 data
model.

   "An XPath 1.0 Puzzle" http://cmsmcq.com/mib/?p=947

   "Tell me, Captain Yossarian, how many elements do you see?"
   http://cmsmcq.com/mib/?p=955

   "Two, four, three, who's counting?" http://cmsmcq.com/mib/?p=966

   "How formal can you get?" http://cmsmcq.com/mib/?p=980

   http://www.blackmesatech.com/2010/01/xpath10.als

   http://www.blackmesatech.com/2010/03/otrees.als

None of the opportunities for improvement mentioned above is
particularly difficult to seize and exploit.  In separate email I will
propose some specific changes to the text of XPath 1.0 which
illustrate that the changes needed are mostly rather minor.

-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com
* http://cmsmcq.com/mib
* http://balisage.net
****************************************************************

Received on Friday, 23 April 2010 02:11:11 UTC