Re: Comments on the XPath data model, from a DOM perspective. from Ray Whitmer on 2002-04-29 (www-xml-query-comments@w3.org from April 2002)

From: Ray Whitmer <rayw@netscape.com>
Date: Mon, 29 Apr 2002 13:42:39 -0700
To: www-xml-query-comments@w3.org
Message-ID: <3CCDB03F.8020304@netscape.com>
Sorry, I don't know how this posting wound up here, when I thought I posted 
to www-xpath-comments.  Somehow I expected to be copied on a response.  Or am
I expected to subscribe to this comments list?

>Sections marked ">" are from Ray Whitmer:
>
>>* It seems clear that the XPath 2.0 specification has no type comparable to
>>the node set or other built-in types of XPath 1.0.  The concept of a
>>typeless sequence does not seem to work as effectively.  In many languages,
>>arrays of objects are typed.
>
>In the published December drafts, the type system is not very well
>developed. A lot of work has been done on this in the last few months, some
>of which is visible in the recent Formal Semantics draft. It has always been
>intended that XQuery should offer strong typing. In practice it will usually
>be possible to detect statically that a sequence is of a particular type,
>e.g. a sequence of nodes or a sequence of integers, though arbitrary
>heterogeneous sequences are permitted as the most general case.
>
We are more worried about the common XPath 1.0 case of a set of nodes,
which appears to require an incompatible degradation of the API to support
XPath 2.0.  That may be acceptable in Lisp, which has no static typing and
which, judging by the choices made, must have heavily influenced XPath 2.0.
In other languages, however, lists have types, and they are not equally
useful when the typing is disabled, as is done for sequences in XPath 2.0.

This answer does not seem to answer the question.  An API can claim to
never break anyone by just using the most abstract object type everywhere,
but that is simply not useful, which is why most programming languages
use types, and why a node list is more useful than a list.  There are many
things, including ordering, that apply to nodes that do not apply to untyped
objects.  Just saying the new spec uses untyped everywhere does not solve
compatibility with the old.
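
To make the typing concern concrete, here is a minimal sketch in Python,
using xml.dom.minidom only as a stand-in for a DOM binding.  The names Item,
first_node_name and first_node_name_untyped are hypothetical and come from
neither specification.  With a typed node list the caller can rely on every
member being a node; with an untyped sequence each access needs a runtime
check.

```python
from typing import List, Sequence, Union
import xml.dom.minidom as minidom

# Hypothetical "any item" type standing in for an untyped sequence member.
Item = Union[minidom.Node, int, float, str, bool]


def first_node_name(nodes: List[minidom.Node]) -> str:
    # A typed node list: every member is known to be a node.
    return nodes[0].nodeName


def first_node_name_untyped(items: Sequence[Item]) -> str:
    # An untyped sequence: the check has to happen at run time.
    first = items[0]
    if not isinstance(first, minidom.Node):
        raise TypeError("expected a node, got %s" % type(first).__name__)
    return first.nodeName


doc = minidom.parseString("<a><b/></a>")
typed = [doc.documentElement]
mixed = [1, doc.documentElement]
```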

>>* XPath 1.0 was based on explicitly unordered sets of nodes that could be
>>accessed in order.  XPath 2.0 claims that every sequence is ordered, but
>>there is not sufficient discussion of what that means, which has caused
>>significant confusion.  The logical conclusion could be drawn that it is
>>referring to document order, which is the only order it seems to define
>>and was the order of XPath 1.0, but this makes no sense when considering
>>non-node items now possible in the result sets.  Also, the incompatible
>>treatment of duplicates is confusing, if the sets are now ordered, rather
>>than unordered, it seems pointless to not eliminate the duplicates, but
>>there is probably something lost between the different versions of the
>>specification.
>
>Essentially, those expressions which in XPath 1.0 returned a "node-set" have
>been redefined in XPath 2.0 to return an "ordered sequence of nodes in
>document order without duplicates". Since there is a one-to-one
>correspondence between unordered node-sets and ordered node-sequences in
>document order, compatibility is preserved. However, XPath 2.0 can also
>return sequences in an order other than document order (important when the
>user of a Query wants to specify an application-oriented ordering of the
>results).
>
I thought that these, and all, return a sequence, not of nodes, but untyped
objects.  While the writer of the expression may believe that the return 
only contains nodes, that does not help at all in a formal type system, and
it greatly confuses the concept of ordering.

This is not compatible at all, unless Lisp is your language and you always
disregarded types anyway.

>Basically, a sequence can contain items in any order. The order of the
>result is determined by the semantics of the expression that created the
>sequence. Path expressions produce results in document order, but other
>expressions may produce results in a different order.
>
But in XPath 1.0 a node set could always be accessed in document order and
with guaranteed uniqueness of results.  In XPath 2.0, document order makes
less sense, because your items may not even be nodes.  This seems to require
different semantics than accessing the items of a result in document order.
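
A rough sketch of the difference, in Python with xml.etree.ElementTree as a
stand-in for any tree API.  The helper in_document_order is hypothetical.  It
gives node-set style access (unique, in document order) for nodes of one tree,
and it has no meaningful answer once the sequence contains a non-node item.

```python
import xml.etree.ElementTree as ET


def in_document_order(root, nodes):
    # Position of every element in a pre-order walk of the tree.
    position = {id(n): i for i, n in enumerate(root.iter())}
    seen = set()
    unique = []
    for n in nodes:
        # Duplicates are dropped by node identity, as for a node set.
        if id(n) not in seen:
            seen.add(id(n))
            unique.append(n)
    # Raises KeyError for anything that is not a node of this tree.
    return sorted(unique, key=lambda n: position[id(n)])


root = ET.fromstring("<r><a/><b/><c/></r>")
a, b, c = list(root)
node_set = in_document_order(root, [c, a, c, b])
```

Passing a sequence such as [c, 42, a] to the same helper fails, because the
integer has no position in the document.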

>>Based upon recent discussions, it seems that the XPath 2.0 specification
>>may not be comparable or compatible with the XPath 1.0 specification in its
>>use of these terms, but the specification needs better treatment of the
>>concepts, and explanation of the impact on backwards compatibility.
>>Elimination of duplicates also seems like a significant compatibility
>>problem since 1.0 implementations went to great lengths to accomplish
>>this.
>
>We think we have solved all the important backwards compatibility issues,
>but you are right that there is a significant change in terminology and that
>we could do a lot more to explain the relationship between XPath 2.0 terms
>and their XPath 1.0 equivalents.
>
And I am still looking for evidence of a solution to the compatibility issues
that were raised at the beginning such as the ordering, typing, and returns
which seem to be incompatible, except for Lisp programmers in some cases,
let alone all the compatibility issues with the extended Lisp DOM APIs being
created by the XPath group.

>>* The copy semantics of node constructors seems wrong even if it was the
>>only way to model the Lisp semantics that the authors of XPath 2.0 seem
>>to be using throughout the specification.  It would seem that a constructed
>>node should not lose its identity when inserted into a hierarchy, but
>>XPath 2.0 seems to mandate that.
>
>In XSLT, we never make a node available for manipulation until it is
>inserted into its hierarchy, so this problem does not appear. It is
>potentially a problem for XQuery, where I think the semantics of element
>construction still require some further work. The reason it is specified the
>way it is, I think, is to ensure that nodes are immutable: you can't have
>the parent() accessor on the same node giving different results at different
>times.
>
Then how are the constructor arguments passed if there is no reference to them?

I think there is a reference to them before they are passed to the constructor,
so the id of the copy will be different from the id of the passed object.
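
A small sketch of the identity point, in Python with xml.dom.minidom as a
stand-in for DOM.  The "copy on construction" half is only my reading of the
draft, modelled here with cloneNode, and is not taken from the specification
text.

```python
import xml.dom.minidom as minidom

doc = minidom.Document()
child = doc.createElement("child")

# DOM-style insertion: the very same object becomes part of the tree.
parent = doc.createElement("parent")
parent.appendChild(child)
dom_keeps_identity = parent.firstChild is child

# Copy-style construction (assumed reading of the draft): the constructor
# receives the node but places a copy of it in the new hierarchy.
parent2 = doc.createElement("parent")
parent2.appendChild(child.cloneNode(True))
copy_keeps_identity = parent2.firstChild is child
```

In the first case a reference held before insertion still names the inserted
node; in the second it names a different node from the one in the tree.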

>The model on namespace nodes is certainly broken in the current draft. We
>are still debating how best to fix it. We know that we want to relax the
>XPath 1.0 rules to allow namespace nodes to be shared between elements, and
>we know this has inevitable side-effects on the parentage and ordering of
>namespace nodes. But we haven't yet decided exactly what the new rules
>should be. All the proposals currently on the table still have namespace
>nodes belonging exclusively to a single document.
>
If they belong to a document, then you will have to add an ownerDocument 
attribute, which the infoset does not have, to allow that ordering and identity
checking to occur.

It is hard to evaluate the model without a resolution on this issue.

>>Requiring document order between
>>documents to be stable requires much better document identification than
>>we have today, because if a document is persisted and brought back into
>>memory, which can happen at any time during processing, you need to
>>be able to go back to something to reestablish the sort in the same way.
>
>The stability of ordering across documents is only required within the scope
>of a single query or transformation (though I don't know if we currently say
>this very well). Given that document node identity must also be stable
>within this scope, I don't think it's difficult to devise implementation
>strategies that work, e.g. basing document order on the order of the
>internal identifiers of the document nodes.
>
That works only if you require DOM implementations to add internal
identifiers.  In a Java implementation, for example, there is no id
available that is guaranteed to be unique for any object.

And while you may be able to wave away the issue of lifetimes, those
working with a model such as DOM may not be able to.
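
As a sketch of what such an implementation strategy would involve, here is a
hypothetical per-query registry in Python (DocumentOrder is an invented name,
and xml.dom.minidom is only a stand-in for DOM).  It hands out sequence
numbers to documents on first use and keeps them alive so the numbers stay
stable within that scope.

```python
import itertools
import xml.dom.minidom as minidom


class DocumentOrder:
    """Hypothetical registry giving documents a stable order in one scope."""

    def __init__(self):
        self._next = itertools.count()
        self._entries = {}  # id(document) -> (sequence number, document)

    def key(self, document):
        entry = self._entries.get(id(document))
        if entry is None:
            # The strong reference keeps the document alive, so id() cannot
            # be reused by another object while this registry exists.
            entry = (next(self._next), document)
            self._entries[id(document)] = entry
        return entry[0]


order = DocumentOrder()
d1 = minidom.parseString("<a/>")
d2 = minidom.parseString("<b/>")
k1, k2 = order.key(d1), order.key(d2)

# The same content loaded again is a new object and gets a new position.
reloaded = minidom.parseString("<a/>")
k_reloaded = order.key(reloaded)
```

The registry is extra state that the DOM itself does not carry, and a
document that is persisted and reloaded does not get its old position back.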

>>* The model claims: "The data model does not support XML documents that are
>>not supported by the XML Information Set, for example, non-well-formed
>>documents and documents that don't conform to XML Namespaces."  But the
>>constructors seem perfectly able to construct objects which are not well-
>>formed, for example, by putting "--" into the text of a comment node or
>>other illegal characters generally anywhere.
>
>I suspect you are right: there are probably quite a few error conditions
>that still need to be documented. The intention is to disallow operations
>that create an inconsistent structure, e.g. multiple attributes with the
>same name.
>
But what if these conditions do not match between DOM and XPath and you then
try to build XPath on top of DOM?
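
For the comment case, a minimal sketch in Python.  The function
comment_is_well_formed is hypothetical; it encodes the XML 1.0 rule that
comment text may not contain "--" and may not end with "-".  The minidom call
shows a DOM-style factory accepting text that could never be serialized as a
well-formed comment.

```python
import xml.dom.minidom as minidom


def comment_is_well_formed(text: str) -> bool:
    # XML 1.0: no "--" inside a comment, and no trailing "-".
    return "--" not in text and not text.endswith("-")


doc = minidom.Document()
# The factory method performs no such check at construction time.
comment = doc.createComment("a -- b")
accepted_by_dom = comment.data == "a -- b"
valid_per_xml = comment_is_well_formed(comment.data)
```

A layer that checks at construction and a layer that does not will disagree
about which trees can exist.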

>>* The model appears to make it possible to construct text nodes that have
>>empty strings, elements with multiple adjacent text nodes, and other non-
>>normalized result trees.  
>
>Same comment applies.
>
But what if these conditions do not match between DOM and XPath and you then
try to build XPath on top of DOM?
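
A short sketch of the non-normalized case, in Python with xml.dom.minidom
standing in for DOM.  DOM permits empty and adjacent text nodes and leaves
cleanup to an explicit normalize() call, so a tree built through DOM can be
in a state that a stricter data model would have to reject or repair.

```python
import xml.dom.minidom as minidom

doc = minidom.Document()
elem = doc.createElement("p")
# Three text children: "a", an empty string, and "b".
for s in ("a", "", "b"):
    elem.appendChild(doc.createTextNode(s))
count_before = len(elem.childNodes)

# normalize() drops the empty text node and merges the adjacent ones.
elem.normalize()
count_after = len(elem.childNodes)
merged_text = elem.firstChild.data
```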

>At present we have a set of rules for this in the XSLT specification, and we
>have a documented issue that we would like to move these rules into the Data
>Model instead. The XSLT rules go under the name of "namespace fixup", and
>are described essentially as a set of rules to be followed on element
>construction to make sure that a valid infoset results.
>
If it were to rely on a fixup, then why pass namespace nodes to the element 
constructor at all?  Also, how do copy semantics work with the namespace nodes
if there is only one per document of a particular type?

Also, there is likely to be confusion with the DOM notion of namespace fixup,
which is apparently not very similar in what it will fix and what it will not.

When reading these sections, I have a lot of questions created by the over-
simple description, naturally because you are redefining a document object 
model.  I guess I just need to create a much longer issue list.

>>I might suggest that you thoroughly study
>>the DOM specification and you will find many more border cases you have
>>missed.  Construction of a hierarchy using an API is the same problem that
>>DOM solves.
>
>I would hope that our problem is simpler, because the set of update
>operations is much smaller. But I fear you may have put your finger on a
>problem, namely that the set of operations provided by the data model
>actually permits sequences of operations that neither XSLT nor XQuery
>intends to use, and we need to either explicitly disallow such sequences of
>operations, or define their effect precisely. Personally, I've never been
>all that happy with the construction side of the data model, because it has
>a very procedural feel to it, which seems wrong as it is designed to
>underpin a declarative language. XPath 1.0 got round this, of course, by not
>describing data model construction at all, describing only the valid states
>of the model.
>
I doubt that the constructors are simpler.  XPath constructors seem quite a bit
more complex due to copy constraints.  Because XPath does all of its
construction through arguments, it requires lots of arguments, copying, etc.,
and so it has lots of failures at construction time that in DOM would occur
later, during manipulation.

You are recreating DOM in many ways, but incompatibly.  In many cases, DOM has 
solved the issues and XPath 2.0 has not. NIH should have no place at W3C.  We
need resolution of the many issues now, as compatibly as possible with DOM, 
or they will be issues for last call and beyond.

Ray Whitmer
rayw@netscape.com
Received on Monday, 29 April 2002 17:10:29 UTC