Re: Comments on the XPath data model, from a DOM perspective.

Sections marked ">" are from Ray Whitmer:

>* It seems clear that the XPath 2.0 specification has no type comparable to
>the node set or other built-in types of XPath 1.0.  The concept of a
>typeless sequence does not seem to work as effectively.  In many languages,
>arrays of objects are typed.

In the published December drafts, the type system is not very well
developed. A lot of work has been done on this in the last few months, some
of which is visible in the recent Formal Semantics draft. It has always been
intended that XQuery should offer strong typing. In practice it will usually
be possible to detect statically that a sequence is of a particular type,
e.g. a sequence of nodes or a sequence of integers, though arbitrary
heterogeneous sequences are permitted as the most general case.

>* XPath 1.0 was based on explicitly unordered sets of nodes that could be
>accessed in order.  XPath 2.0 claims that every sequence is ordered, but
>there is not sufficient discussion of what that means, which has caused
>significant confusion.  The logical conclusion could be drawn that it is
>referring to document order, which is the only order it seems to define
>and was the order of XPath 1.0, but this makes no sense when considering
>non-node items now possible in the result sets.  Also, the incompatible
>treatment of duplicates is confusing, if the sets are now ordered, rather
>than unordered, it seems pointless to not eliminate the duplicates, but
>there is probably something lost between the different versions of the
>specification.

Essentially, those expressions which in XPath 1.0 returned a "node-set" have
been redefined in XPath 2.0 to return an "ordered sequence of nodes in
document order without duplicates". Since there is a one-to-one
correspondence between unordered node-sets and ordered node-sequences in
document order, compatibility is preserved. However, XPath 2.0 can also
return sequences in an order other than document order (important when the
user of a Query wants to specify an application-oriented ordering of the
results).

Basically, a sequence can contain items in any order. The order of the
result is determined by the semantics of the expression that created the
sequence. Path expressions produce results in document order, but other
expressions may produce results in a different order.
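
As a rough illustration of the distinction, here is a minimal Python sketch.
It is not taken from any specification: the Node class, its position field
and the path_result function are hypothetical names, and node identity is
approximated by Python object identity. It shows a general sequence keeping
its own order and duplicates, while a path-expression-style result is put
into document order with duplicates removed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    position: int  # assumption: index of the node in document order


def path_result(nodes):
    """Return nodes in document order, dropping duplicates by identity."""
    seen = set()
    unique = []
    for node in sorted(nodes, key=lambda n: n.position):
        if id(node) not in seen:
            seen.add(id(node))
            unique.append(node)
    return unique


a, b, c = Node("a", 0), Node("b", 1), Node("c", 2)

# A general sequence: any order, duplicates allowed.
arbitrary = [c, a, c, b]

# What a path expression would hand back for the same nodes.
result = path_result(arbitrary)
```

Here arbitrary still holds c, a, c, b in that order, while result holds a, b,
c once each.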

>Based upon recent discussions, it seems that the XPath 2.0 specification
>may not be comparable or compatible with the XPath 1.0 specification in its
>use of these terms, but the specification needs better treatment of the
>concepts, and explanation of the impact on backwards compatibility.
>Elimination of duplicates also seems like a significant compatibility
>problem since 1.0 implementations went to great lengths to accomplish
>this.

We think we have solved all the important backwards compatibility issues,
but you are right that there is a significant change in terminology and that
we could do a lot more to explain the relationship between XPath 2.0 terms
and their XPath 1.0 equivalents.

>* The copy semantics of node constructors seems wrong even if it was the
>only way to model the lisp semantics that the authors of XPath 2.0 seem
>to be using throughout the specification.  It would seem that a constructed
>node should not lose its identity when inserted into a hierarchy, but
>XPath 2.0 seems to mandate that.

In XSLT, we never make a node available for manipulation until it is
inserted into its hierarchy, so this problem does not appear. It is
potentially a problem for XQuery, where I think the semantics of element
construction still require some further work. The reason it is specified the
way it is, I think, is to ensure that nodes are immutable: you can't have
the parent() accessor on the same node giving different results at different
times.
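
The immutability argument can be sketched in Python as follows. This is only
an illustration under stated assumptions, not the data model's actual
constructor interface: Node and attach are hypothetical names, and Python
object identity stands in for node identity. Because a node's parent never
changes, attaching a parentless node has to produce a new node.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True, eq=False)
class Node:
    name: str
    parent: Optional["Node"] = None  # fixed for the lifetime of the node


def attach(child, parent):
    """Return a copy of child whose parent is set; child is left untouched."""
    return replace(child, parent=parent)


root = Node("root")
orphan = Node("item")
attached = attach(orphan, root)
```

The original orphan still reports no parent, and attached is a distinct node,
which is exactly the loss of identity that the comment objects to.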

>* section 4.1, collapse-text-node: what is the parent of the text node
>resulting from the collapse operation?

No comment; you may be right that there are problems here.

>* "Descendant nodes" is used but not defined.  Due to the confused use of
>parent relationships of XPath contradicting infoset and other models such
>as DOM, this is important and it can be unclear whether it includes
>attributes, namespaces, etc. where it is used.

Good comment - we shouldn't use a term in the data model if it's not defined
there. The descendant axis will of course be defined in the language
specification.

>* That there should be document order between documents seems strange.
>This makes the ordering of namespace nodes all-the-more bizarre because
>they belong to no document and presumably may be shared between documents,
>so coming at the start of a document or (I can't say I follow the logic
>in this one) ordering after every other node in the document both seem
>impossible and broken.

The model on namespace nodes is certainly broken in the current draft. We
are still debating how best to fix it. We know that we want to relax the
XPath 1.0 rules to allow namespace nodes to be shared between elements, and
we know this has inevitable side-effects on the parentage and ordering of
namespace nodes. But we haven't yet decided exactly what the new rules
should be. All the proposals currently on the table still have namespace
nodes belonging exclusively to a single document.

>Requiring document order between
>documents to be stable requires much better document identification than
>we have today, because if a document is persisted and brought back into
>memory, which can happen at any time during processing, you need to
>be able to go back to something to reestablish the sort in the same way.

The stability of ordering across documents is only required within the scope
of a single query or transformation (though I don't know if we currently say
this very well). Given that document node identity must also be stable
within this scope, I don't think it's difficult to devise implementation
strategies that work, e.g. basing document order on the order of the
internal identifiers of the document nodes.
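
One such strategy might look like the following Python sketch. It is an
assumption-laden illustration, not a description of any existing
implementation: Document, doc_id and order_key are hypothetical names, and
identifiers are simply handed out in load order within one query.

```python
import itertools

# Assumption: identifiers are allocated once per query or transformation.
_next_id = itertools.count(1)


class Document:
    def __init__(self, uri):
        self.uri = uri
        self.doc_id = next(_next_id)  # stable for the scope of the query


def order_key(doc, position):
    """Sort first by document identifier, then by position in the document."""
    return (doc.doc_id, position)


first = Document("a.xml")
second = Document("b.xml")

# Each node is represented as (document, position within that document).
nodes = [(second, 0), (first, 5), (first, 2)]
ordered = sorted(nodes, key=lambda item: order_key(*item))
```

All nodes of a.xml sort before those of b.xml, and the result stays the same
for as long as the identifiers do.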

>* The model claims: "The data model does not support XML documents that are
>not supported by the XML Information Set, for example, non-well-formed
>documents and documents that don't conform to XML Namespaces."  But the
>constructors seem perfectly able to construct objects which are not well-
>formed, for example, by putting "--" into the text of a comment node or
>other illegal characters generally anywhere.

I suspect you are right: there are probably quite a few error conditions
that still need to be documented. The intention is to disallow operations
that create an inconsistent structure, e.g. multiple attributes with the
same name.
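
The kind of check intended could be sketched in Python like this. The
function names and the ConstructionError class are hypothetical, and the
sketch covers only two cases: the XML rule that a comment may neither contain
"--" nor end with "-", and the rule against duplicate attribute names.

```python
class ConstructionError(ValueError):
    """Raised when a constructor would produce an inconsistent structure."""


def make_comment(text):
    # XML comments may not contain "--" and may not end with "-".
    if "--" in text or text.endswith("-"):
        raise ConstructionError("illegal comment content: %r" % text)
    return ("comment", text)


def make_element(name, attributes):
    """attributes is a list of (name, value) pairs."""
    seen = set()
    for attr_name, _value in attributes:
        if attr_name in seen:
            raise ConstructionError("duplicate attribute: %s" % attr_name)
        seen.add(attr_name)
    return ("element", name, dict(attributes))


ok = make_element("para", [("id", "p1"), ("class", "note")])
```

Calling make_comment("a -- b") or make_element("para", [("id", "1"), ("id",
"2")]) raises ConstructionError instead of building the node.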

>* The model appears to make it possible to construct text nodes that have
>empty strings, elements with multiple adjacent text nodes, and other non-
>normalized result trees.  

Same comment applies.

>* The model appears to make it possible to construct hierarchies which are
>not namespace-well-formed, but makes no mention of how processing will
>occur in those cases.  At the very least, an attribute fragment is not
>namespace-well-formed if it uses namespaces.  And the whole concept of how
>to construct elements properly with namespace nodes seems quite muddy,
>because it would seem to require complete knowledge of all of the
>ancestors to specify a list of namespaces that is consistent with all of
>its ancestors, since it would seem to be an error to ever pass a child to
>the constructor of a parent that does not already contain all the namespace
>nodes of the parent, since XML has no ability to undefine namespaces and
>this would represent an impossible infoset.

At present we have a set of rules for this in the XSLT specification, and we
have a documented issue that we would like to move these rules into the Data
Model instead. The XSLT rules go under the name of "namespace fixup", and
are described essentially as a set of rules to be followed on element
construction to make sure that a valid infoset results.
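
A much simplified Python sketch of the idea follows. It is not the XSLT
algorithm: fixup_namespaces is a hypothetical name, names are modelled as
(prefix, uri) pairs, and a prefix that is already bound to a different URI is
reported as an error here rather than being renamed.

```python
def fixup_namespaces(element_name, attribute_names, declared):
    """Return the namespace bindings an element needs to be consistent.

    element_name and each entry of attribute_names are (prefix, uri) pairs;
    declared maps prefixes to URIs already present on the element.
    """
    result = dict(declared)
    for prefix, uri in [element_name, *attribute_names]:
        if not uri:
            continue  # name is in no namespace, nothing to declare
        bound = result.get(prefix)
        if bound is None:
            result[prefix] = uri  # add the missing binding
        elif bound != uri:
            # Simplification: a real fixup would pick a fresh prefix instead.
            raise ValueError("prefix %r bound to two URIs" % prefix)
    return result


fixed = fixup_namespaces(
    ("x", "urn:one"),
    [("y", "urn:two"), ("", "")],
    {"x": "urn:one"},
)
```

The result keeps the existing binding for x and adds the one for y that the
attribute name requires.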

>I might suggest that you thoroughly study
>the DOM specification and you will find many more border cases you have
>missed.  Construction of a hierarchy using an API is the same problem that
>DOM solves.

I would hope that our problem is simpler, because the set of update
operations is much smaller. But I fear you may have put your finger on a
problem, namely that the set of operations provided by the data model
actually permits sequences of operations that neither XSLT nor XQuery
intends to use, and we need to either explicitly disallow such sequences of
operations, or define their effect precisely. Personally, I've never been
all that happy with the construction side of the data model, because it has
a very procedural feel to it, which seems wrong as it is designed to
underpin a declarative language. XPath 1.0 got round this, of course, by not
describing data model construction at all, describing only the valid states
of the model.

Michael Kay 

Received on Thursday, 4 April 2002 11:51:53 UTC