XPath and find/findAll methods from Liam R E Quin on 2011-11-29 (public-webapps@w3.org from October to December 2011)

From: Liam R E Quin <liam@w3.org>
Date: Tue, 29 Nov 2011 00:33:16 -0500
To: public-webapps@w3.org
Message-ID: <1322544796.22171.163.camel@desktop.barefootcomputing.com>
Wearing my XML Activity Lead hat, I want to give some information that
may help people decide here.  The actual answer isn't my concern, but
only that it's based on clear information.

(0) XPath

XPath is a language for selecting from XML (or HTML or SGML) document
trees.  It is used by some other specs, including XML Schema and XSLT,
and it's extended by XQuery.  XPath is very widely used in the XML
world, e.g. in servers and on desktops and in shoes :-)  I've lost track
of the number of implementations of XPath, even though there are just a
few dozen major implementations of course.

XPath is popular because it has a regular syntax that's easy to learn
and a good fit for XML.


(1) XPath 1, 2 and 3 compatibility

XPath 2 is backwards compatible with XPath 1; there _are_ some very
minor differences, most of which would not affect Web browsers at all
because they depend on DTDs or Schemas.

Similarly, an XSLT 2 engine will interpret XSLT 1 transformations.
There are some exceptions listed in the Backward Compatibility section
of the XSLT 2 spec, but they are very minor.


(2) Not a dead end

XSLT 1 and XPath 1 are not "evolutionary dead ends" although it's true
that neither the xt nor the libxml2 library supports XSLT 2 and XPath 2.
There's some support (along with XQuery) in the Qt libraries, and also
in C++ with XQilla and Zorba.  There are maybe 50 implementations of
XPath 2 and/or XQuery 1 that I've encountered.  XQuery 3.0 and XPath 3.0
are about to go to Last Call, we hope, and XSLT 3.0 to follow next year.
The work is very much active and alive.


(3) XPath and efficiency

XPath can be implemented very efficiently.

In most cases in practice, O(1) or O(log n) can be achieved. Some of the
techniques modern XPath libraries use are also used by Web browsers for
CSS selectors - e.g. keep an index of elements, and evaluate from the
right-hand end (most specific) or start with whichever element occurs
the fewest times.
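
To make the indexing idea concrete, here is a minimal JavaScript sketch.
It is not how any particular XPath library or browser works; the toy tree
shape (plain objects with name, children and parent) and the helper names
node, buildIndex and matchChildPath are all made up for illustration.  It
handles only a simple child-step path such as //nav/a, and evaluates it
from the right-hand end by looking up the last step in an index and then
checking parents.

```javascript
// Hypothetical toy tree: each node is { name, children, parent }.
function node(name, children = []) {
  const n = { name, children, parent: null };
  for (const c of children) c.parent = n;
  return n;
}

// Build an index from element name to the elements with that name,
// visiting the tree once in document order.
function buildIndex(root) {
  const index = new Map();
  (function visit(n) {
    if (!index.has(n.name)) index.set(n.name, []);
    index.get(n.name).push(n);
    n.children.forEach((c) => visit(c));
  })(root);
  return index;
}

// Evaluate a path like //nav/a, given as ["nav", "a"], from the right:
// fetch the candidates for the last step from the index, then walk up
// through the parents to check the earlier steps.
function matchChildPath(index, steps) {
  const candidates = index.get(steps[steps.length - 1]) || [];
  return candidates.filter((n) => {
    let current = n.parent;
    for (let i = steps.length - 2; i >= 0; i--) {
      if (current === null || current.name !== steps[i]) return false;
      current = current.parent;
    }
    return true;
  });
}

const a1 = node("a");
const a2 = node("a");
const a3 = node("a");
const tree = node("html", [
  node("body", [node("nav", [a1, a2]), node("p", [a3])]),
]);

const hits = matchChildPath(buildIndex(tree), ["nav", "a"]);
// hits contains a1 and a2 only; a3 is rejected because its parent is p.
```

The lookup for the last step costs one map access, and the work after
that depends on the number of candidates and the length of the path
rather than on the size of the whole document.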

There are implementations of XQuery (an extension of XPath) being used
with petabytes of XML data.  That is not to say you couldn't also use
CSS selectors on petabytes of data -- it's not an either-or or a battle
between the two languages.  XQuery response times are generally measured
in milliseconds, although, as with SQL or JavaScript or Java or C, you
can write infinite loops :-)

The trick is to notice that there are idioms in XPath that can be
optimised much more easily than the corresponding JavaScript code.
There have been some papers at VLDB on XPath and XQuery optimization.

XPath was written with efficiency and optimization in mind, and drew on
implementation experience.


(4) XMLness

XPath 2 is actually defined in terms of a data model, and can work over
non-XML sources - e.g. not just HTML and XML, but also relational data
and anything else that can be represented usefully with a similar data
model.  It lacks arrays and hashes/maps, which makes JSON support
somewhat inconvenient, but there are people working on extending XPath to
handle JSON more gracefully (e.g. via "JSONiq").


(5) Orthogonality

I think this was mentioned in discussions but may not've been clear.

In general, XPath is an expression language - anywhere you can have an
expression, you can have any expression, and the expressions all work
together.  For example, predicates can contain any XPath expression,
recursively:
    /html/body/div/p[@id = /html/head/link[@rel = 'me']/@src]/strong

This is all "strong" elements in p elements that are direct children of
div elements that are direct children of the body element, and whose p
parent has an "id" attribute that has the same value as the src attribute
of a link element in the head which has rel="me".  (This is a
microformat-style query on a document, of course.)
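
For readers who think more easily in JavaScript, here is a rough sketch of
what that one expression asks for, written out by hand.  It is only an
illustration: the toy tree of plain objects and the helpers el,
childrenNamed and selectStrong are invented for this example and are not
part of any DOM or XPath API.  The nested predicate is evaluated first to
get a set of src values, and the outer path then keeps the p elements
whose id is in that set.

```javascript
// Hypothetical toy element: { name, attrs, children }.
function el(name, attrs = {}, children = []) {
  return { name, attrs, children };
}

// One XPath child step: all children with the given name, for each node.
function childrenNamed(nodes, name) {
  return nodes.flatMap((n) => n.children.filter((c) => c.name === name));
}

// Hand-written equivalent of
//   /html/body/div/p[@id = /html/head/link[@rel = 'me']/@src]/strong
// where the argument is the html element itself.
function selectStrong(html) {
  // Inner expression: /html/head/link[@rel = 'me']/@src
  const srcValues = new Set(
    childrenNamed(childrenNamed([html], "head"), "link")
      .filter((link) => link.attrs.rel === "me" && "src" in link.attrs)
      .map((link) => link.attrs.src)
  );
  // Outer path: /html/body/div/p[...]; "=" against a node-set is true
  // if any node in the set has the same string value.
  const divs = childrenNamed(childrenNamed([html], "body"), "div");
  const ps = childrenNamed(divs, "p").filter(
    (p) => "id" in p.attrs && srcValues.has(p.attrs.id)
  );
  return childrenNamed(ps, "strong");
}

const page = el("html", {}, [
  el("head", {}, [el("link", { rel: "me", src: "liam" })]),
  el("body", {}, [
    el("div", {}, [
      el("p", { id: "liam" }, [el("strong", { class: "hit" })]),
      el("p", { id: "other" }, [el("strong", { class: "miss" })]),
    ]),
    // Same id, but this p is not a child of a div, so it is skipped.
    el("p", { id: "liam" }, [el("strong", { class: "miss" })]),
  ]),
]);

const found = selectStrong(page);
// found holds the single strong element marked class "hit".
```

The XPath version says the same thing in one line, which is the point
about expressions composing.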

XPath selectors give a different way of looking at finding things than
CSS selectors and probably appeal in differing amounts to different
people.


(6) Note on History

Not really important today, but someone mentioned it, so I'll note that
XPath came out of SGML and (later) HyTime work dating back long before
the World Wide Web and CSS; that work really ended with the publication
of DSSSL and HyTime, but many of the same people were (and in some cases
are) involved with XML and XPath.

XPath has different goals from CSS selectors, and there's not actually a
battle between them.

XSLT and XQuery are widely used on the "back end" of Web apps, and less
often in the browser, but in some environments the browser-based support
can be very useful, depending on the division of labour.

(7) XPath Selectors and CSS Selectors

There's a huge overlap of functionality. Some claims were made based on
misunderstandings (in both directions probably, but I can only correct
the ones about XPath)...

"XPath can't handle things like :hover or :first-line" -- not true.
XPath has a mechanism by which a browser would support them, using a
functional notation:
    //a[hover()]
It's perfectly standard for an implementation to add functions. You can
add them in your own namespace if you're prepared to accept more syntax:
    //a[css:hover()]
for example.  This wouldn't make sense in most XPath environments (e.g.
inside SQL or Java) so it's left for an extension function, but that
seems reasonable to me.

In the other direction, there's no reason in principle why you couldn't
have a CSS selector that looked something like
    xpath("some xpath expression here") {
       CSS properties here
    }
if you wanted to.  Maybe this would relieve CSS selectors from the
burden of solving some relatively complex use cases.


(8) Note on the suggested API

Martin Kadlec proposed
    findAll(query, use_xpath):
    CSS: findAll("nav a:first-child");
    XPATH: findAll("//nav/a[1]", true);

I'd suggest rather,
   findAll(expression [, language [, version]])
so as to be able to support languages other than CSS selectors or
XPath in the future - e.g. XQuery or linq or dart or whatever - and to
be able to say which version of those languages was the minimum needed.

Better (and more JavaScriptIsh) might be a factory that returned a
function that would evaluate XPath queries against a given DOM.
   xqe = document.makeDocumentQueryEngine("XPath", "1.0");
   xqe.query("//nav/a[1]");

as then the returned function could also take optional arguments, e.g. a
prefix/URI object to handle namespaces in an XML DOM.
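
A minimal sketch of how such a factory might look on top of what browsers
already ship, assuming a standalone helper I'm calling makeQueryEngine
(a made-up name; it is not a real DOM method, and neither is
makeDocumentQueryEngine above).  It leans on the existing
document.evaluate() for XPath 1.0 and querySelectorAll() for CSS
selectors, and takes the document as a parameter so it can be tried
outside a browser with a stand-in object.

```javascript
// Hypothetical factory: returns an object with a query() method for the
// requested language.  Only "CSS" and "XPath" 1.0 are handled here.
function makeQueryEngine(doc, language, version) {
  if (language === "CSS") {
    return {
      query(expression, context = doc) {
        return Array.from(context.querySelectorAll(expression));
      },
    };
  }
  if (language === "XPath") {
    if (version !== "1.0") {
      throw new Error("this sketch only knows XPath 1.0");
    }
    return {
      // namespaces is an optional prefix-to-URI object for XML DOMs.
      query(expression, context = doc, namespaces = {}) {
        const resolver = (prefix) =>
          Object.prototype.hasOwnProperty.call(namespaces, prefix)
            ? namespaces[prefix]
            : null;
        // 7 is XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.
        const snapshot = doc.evaluate(expression, context, resolver, 7, null);
        const nodes = [];
        for (let i = 0; i < snapshot.snapshotLength; i++) {
          nodes.push(snapshot.snapshotItem(i));
        }
        return nodes;
      },
    };
  }
  throw new Error("unsupported query language: " + language);
}

// In a browser this could be used as:
//   const xqe = makeQueryEngine(document, "XPath", "1.0");
//   const links = xqe.query("//nav/a[1]");
```

Returning plain arrays from both engines means the caller doesn't need to
care which language produced the result.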

It seems useful to me especially since the XPath engine is already in
the browsers, even if it's a pretty old version of XPath.

But that's my opinion. Really, I just want people to have some more
information about XPath.

Sorry for a long message - I hope this is helpful.

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Received on Tuesday, 29 November 2011 05:34:48 UTC