Re: XPath host language for querying several XML fragments from Javier Godoy on 2008-01-22 (public-qt-comments@w3.org from January 2008)

From: Javier Godoy <rjgodoy@fich.unl.edu.ar>
Date: Tue, 22 Jan 2008 03:11:18 -0200
To: "Michael Kay" <mhk@mhk.me.uk>, <public-qt-comments@w3.org>
Cc: "'Sharon Adler'" <sca@us.ibm.com>, "'Andrew Eisenberg'" <andrew.eisenberg@us.ibm.com>, "'Jim Melton'" <jim.melton@acm.org>, "Hugo Minni" <hminni4k@yahoo.com.ar>
Message-ID: <019701c85cb5$3b9f9810$017ba8c0@Javier>
Thank you very much for your opinions.


Michael Kay wrote,
[http://lists.w3.org/Archives/Public/www-xpath-comments/2008JanMar/0001.html]

> A couple of editorial points first:
>
> (a) you should surely be referring to the XPath 2.0 Recommendation of 23
> January 2007 rather than the Proposed Recommendation of 21 November 2006.
> (I would also suggest that you avoid referring to a specifically dated
> version, so that you refer the reader to the latest edition at any given
> time, which may incorporate errata.)

Thanks for pointing out this error. Indeed, I was working with
REC-xpath20-20070123, but the bibliographical database I used pointed to an
older version. I had not noticed that.

---------


> (c) since the namespace prefix "xs" is often used to refer to the XML
> Schema namespace, it might be clearer to your readers if you chose
> a prefix other than "XS" - perhaps "WXS"?

Good point. I thought there would be no confusion since "xs" is not
normatively bound to XML Schema, but now I realize it would be clearer if I
used a different prefix. I will change it. WXS is a good alternative.

---------


> Now a general policy point:
>
> (d) there are many people who seem to perceive a need for subsetting
> XPath, with a variety of objectives that usually include (i) reducing the
> cost of implementation, and (ii) making it harder for users to specify
> expressions that will be expensive to evaluate.

Objective (i) is not actually our goal (it could be a consequence of some
restrictions as I had understood them but, as you stated, modifying an
existing XPath implementation to avoid expensive operations would
*increase* the implementation costs). The phrase "reduce the cost of
implementing this specification" will be removed because it is misleading.

Objective (ii) is closer to ours, but it is not intended as a way of
protecting the system (i.e. "avoid expensive queries as a security
measure"), since there are many other valid (and required, even after
subsetting) expressions which are too expensive. Implementors will still
have to deal with these expressions (and reject them if appropriate).

Instead, our interest is to provide servers with a polite way of rejecting
expressions which are not useful (see point (g) below).

> The designers of such subsets seem to
> come up with a wide variety of different solutions to this problem. This
> variety can only confuse users.

The Query Schema Description (Section 5) provides (or tries to provide) a
way of advertising this variety. Additional elements might be included in
the query schema description if that helps with this purpose.

> It also makes it less likely that an
> implementor can take an existing XPath implementation and reuse it, which
> by the law of unintended consequences actually increases costs for
> implementors. Despite the difficulty of finding a rational basis for
> deciding which features to include in a subset and which to exclude, I
> think there is something to be said for having an XPath 2.0 subset
> defined by the responsible W3C working groups (XSL and XQuery)
> and then strongly discouraging other groups from defining their own
> subsets.

All features from XPath 2.0 MAY be supported, since none of them is actually
forbidden. If implementors think that reusing a full-featured XPath
component fits their requirements, they are able to do so. On the other
hand, if they consider that such a component might involve expensive
operations or storage, they are allowed to drop some optional features.

The subsetting conforms to Appendix F of XPath 2.0, since the syntactic and
semantic definitions of XPath are not modified (e.g., I say that queries MAY
fail if some numeric predicates are specified, because they are out of the
minimal subset; however, this does not alter element ordering, and does not
modify the semantics of numeric predicates).

OPTIONAL features (as defined in draft-godoy-webdav-xmlsearch) are to be
understood as in RFC 2119, Section 5 (definition of keywords "MAY" and
"OPTIONAL"):
"An implementation which does not include a particular option MUST be
   prepared to interoperate with another implementation which does
   include the option, though perhaps with reduced functionality. In the
   same vein an implementation which does include a particular option
   MUST be prepared to interoperate with another implementation which
   does not include the option (except, of course, for the feature the
   option provides.) "

Such interoperation may be insufficient as proposed in the current version of
my draft. Again, the Query Schema Description should be augmented.

---------


> Now some detailed technical points:
>
> (e) an implementation that does not support descendant,
> descendant-or-self, or "//" is going to be pretty unusable.
> Searching for elements at arbitrary depth is a great user convenience,
> and is essential in the case of recursive document structures. If you're
> going to make some of the axes optional, I suggest you choose the same
> subset as XQuery chose.

The difference with respect to XQuery's required axes is that XQuery includes
the descendant and descendant-or-self axes (namespace is deprecated in
XQuery, while I'm not sure about inheriting such deprecation).

In my draft, the rationale for making descendant and descendant-or-self
optional was that they don't add expressive power if elements occur at a
well-known position within the tree.
The "//" abbreviation was made optional because it relies on the
descendant-or-self axis.

Thinking carefully about this point, it seems that either:
 - there will be elements at arbitrary depth (quite possible), in which case
descendant and descendant-or-self would be convenient for selecting them; or
 - there will be no elements at arbitrary depth (e.g., a structure defined by
a very simple schema or DTD), in which case it would be easy for implementors
to optimize the query by other means.

For instance, if we have:
<!ELEMENT metadata (title, author+, comment*) >
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (firstname, lastname)>
<!ELEMENT firstname (#PCDATA) >
<!ELEMENT lastname (#PCDATA) >

the expression "//lastname" would only refer to "/metadata/author/lastname"
and could be optimized in that way.
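
As a rough sketch of that equivalence (the sample document below is invented
for illustration, and Python's xml.etree.ElementTree implements only a small
XPath subset, so this is not an XPath 2.0 engine), both queries can be seen to
select the same nodes:

```python
import xml.etree.ElementTree as ET

# Hypothetical instance of the DTD above (invented sample data).
SAMPLE = """
<metadata>
  <title>Some title</title>
  <author><firstname>Ada</firstname><lastname>Lovelace</lastname></author>
  <author><firstname>Alan</firstname><lastname>Turing</lastname></author>
</metadata>
"""

root = ET.fromstring(SAMPLE)  # root is the <metadata> element

# Counterpart of "//lastname": search at arbitrary depth below the root.
by_descendant = root.findall(".//lastname")

# Counterpart of "/metadata/author/lastname": the only path the DTD allows.
by_fixed_path = root.findall("./author/lastname")

# Compare the two results element by element (identity, not just equality).
same_nodes = [a is b for a, b in zip(by_descendant, by_fixed_path)]
print(len(by_descendant), len(by_fixed_path), all(same_nodes))
```

A server that knows the DTD could therefore rewrite the first form into the
second before evaluating it.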

It seems "descendant" and "descendant-or-self" should be REQUIRED, since
there is no advantage in making them OPTIONAL. The "//" abbreviation would
be allowed too.

---------


> (f) you define the minimum set of functions that an implementation must
> supply as being empty (no functions). There are some functions such as
> not() and count() that I would consider absolutely indispensible.

Agreed. The minimum function signatures should be revised.

---------


> (g) I don't think the restrictions you propose for numeric predicates
> assist with either of your design objectives (reduced implementation
> cost, throttled performance). They just make the language less
> orthogonal and less interoperable.

The idea behind this requirement was to make it easier to store content in an
"optimized" form (maybe a relational database or any other
implementation-dependent solution).

IMHO, the impact of supporting numeric predicates depends on the kind of
sequence they apply to. For instance, supporting numeric predicates in an
AxisExpr selecting the child axis only requires some indexes, whose size is
proportional to the number of children. On the other hand, supporting these
predicates in an AxisExpr selecting the descendant axis would require either
calculating the element position on the fly, or storing an index whose size
is proportional to the number of (ancestor, descendant) pairs. If the
numeric predicate applies to a FilterExpr, then indexes may not help, since
many different FilterExprs are allowed and it would be overwhelming to index
all of them.
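
To make the size argument concrete, here is a small sketch that counts how
many index entries each approach would need. The Node class, the function
names and the "one entry per pair" storage model are assumptions made for
illustration only; they are not taken from the draft.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def subtree_size(node: Node) -> int:
    """Number of elements in the subtree rooted at node (node included)."""
    return 1 + sum(subtree_size(c) for c in node.children)


def child_index_entries(node: Node) -> int:
    """One entry per (parent, child) pair in the whole tree."""
    return len(node.children) + sum(child_index_entries(c) for c in node.children)


def descendant_index_entries(node: Node) -> int:
    """One entry per (ancestor, descendant) pair in the whole tree."""
    own_pairs = subtree_size(node) - 1  # every descendant of this node
    return own_pairs + sum(descendant_index_entries(c) for c in node.children)


def chain(depth: int) -> Node:
    """Build a degenerate tree: a single path with `depth` nested elements."""
    root = Node("e0")
    current = root
    for i in range(1, depth):
        nxt = Node(f"e{i}")
        current.children.append(nxt)
        current = nxt
    return root


deep = chain(100)
print(child_index_entries(deep), descendant_index_entries(deep))
```

For a deeply nested structure of n elements the child index needs n - 1
entries, while the (ancestor, descendant) index needs n(n - 1)/2; for shallow,
wide trees the two figures are much closer.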

Maybe the restriction is too drastic, but... what is the meaning of the
i-th element within a sequence which is not semantically ordered? (Apart
from the fact that such an element is well-defined, because element ordering
is well-defined.)

For instance, if we have
<!ELEMENT metadata (title, author+, comment*) >
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (firstname, lastname)>
<!--author elements are ordered, more relevant author first-->
<!ELEMENT comment (#PCDATA)>
<!--ordering of comment elements is not significant-->

"/metadata/author[1]" is meaningful, while "/metadata/comment[1]" is not
(although both are valid XPath expressions).
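
A small sketch of the distinction, using Python's xml.etree.ElementTree (which
supports positional predicates such as "author[1]"). The sample data and the
build() helper are invented for illustration; "reloaded" stands in for a
hypothetical store that did not preserve the order of comment elements.

```python
import xml.etree.ElementTree as ET


def build(comments):
    """Build a <metadata> document whose comments appear in the given order."""
    body = "".join(f"<comment>{c}</comment>" for c in comments)
    return ET.fromstring(
        "<metadata><title>T</title>"
        "<author><firstname>A</firstname><lastname>One</lastname></author>"
        "<author><firstname>B</firstname><lastname>Two</lastname></author>"
        f"{body}</metadata>"
    )


stored = build(["nice", "too long"])
reloaded = build(["too long", "nice"])  # same comments, different order

# The first author is the same in both copies (author order was kept).
first_authors = (
    stored.find("./author[1]/lastname").text,
    reloaded.find("./author[1]/lastname").text,
)

# The "first" comment depends only on how the store happened to order them.
first_comments = (
    stored.find("./comment[1]").text,
    reloaded.find("./comment[1]").text,
)
print(first_authors, first_comments)
```

Both expressions evaluate without error in each copy, but only the author
query returns an answer that carries meaning for the application.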

I MAY forget everything about the comment order when storing information in
my "optimized storage", because it is not required by the application
context. Why should I be forbidden to do so, only because it is required by
XPath?
(I do not mean that this is a fault in XPath; it is only that full-featured
XPath is too expressive for my hypothetical simple schema.)


Regards,

Javier
Received on Tuesday, 22 January 2008 05:12:13 UTC