CSS Selectors to XPath expressions from Bjoern Hoehrmann on 2007-09-27 (www-archive@w3.org from September 2007)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Thu, 27 Sep 2007 06:13:00 +0200
To: www-archive@w3.org
Message-ID: <49amf3dopll9mpnd2fhuk83jsud05otsv6@hive.bjoern.hoehrmann.de>
Hi,

  (I wrote this many moons ago, it's incomplete and there are a few
errors below, but I've been asked to post this somewhere; maybe some-
one corrects and completes it. It's the most complete analysis of this
kind that I know of.)

  In the following I will describe the transformation of a CSS Selector
into an equivalent XPath expression. This is a top-down process, and it
cannot be applied to all Selectors. The exceptions will be pointed out
below. Note further that no optimization is attempted here, neither of
the input selector nor of the resulting XPath expression. A range of
possible optimizations can be applied to the result; research material
on this matter is readily available.

It is assumed that the Selector only uses characters that can occur in
the position they are being used in the context of XML documents. It is,
for example, possible to construct a selector that matches on elements
that include the character U+0001 in their name. This is not allowed in
XML documents and as such not in XPath expressions. Even though the
transformation described herein does not generate node tests that would
be affected by this, the character U+0001 cannot occur in literals in
XPath expressions either.

In XPath, strings are matched case-sensitively; in Selectors, in some
cases, strings are matched case-insensitively. Telling the difference
requires static knowledge of the artifact that is being matched. It is
assumed that such knowledge is not available.

Similarly, selectors that rely on information not included in the XPath
data model cannot be transformed. This applies to the pseudo-classes
:checked, :enabled, :disabled, :target, :focus, :hover, :active, :link,
and :visited. It would be possible to define extension functions that
allow these to be represented, but this is out of scope of this document
and the transformation would be straightforward in any case.

The :...-of-type pseudo-classes cannot be transformed if they are bound
to a subject for which the local-name or namespace-name is not known,
for example, *|*:nth-of-type(3) cannot be transformed. Naturally, if an
implementation evaluates the expression for each node in the tree, it
could generate an expression for each particular node, but this is not
the design goal in this document.

The XPath id() function is used to transform the #id selector; the two
specifications have incompatible requirements regarding duplicate IDs.
While it would be possible to use other means than id() if IDness could
be externally determined, this is considered out of scope.

Class selectors are language-specific, in many cases the ...

Pseudo-elements do not occur in the XPath data model and as such are
ignored by the transformation. A selector with a pseudo-element is
transformed into an expression that corresponds to the selector without
the pseudo-element.

Strings in selectors are assumed to exclude the ' character. The '
character is used to delimit strings in the XPath expressions, and such
a string cannot contain the ' character since, unlike Selectors, XPath
does not support character escapes. It is, however, possible to
transform any given string into a concat(...) expression that represents
the original string, so [foo="\"'"] -> [ @foo = concat('"', "'") ].
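
A minimal sketch of that concat(...) rewriting, in Python. The function
name xpath_literal is made up for illustration; the only assumption is
that XPath 1.0 literals may be delimited by either ' or ".

```python
def xpath_literal(s: str) -> str:
    """Return an XPath 1.0 expression that evaluates to the string s."""
    if "'" not in s:
        return "'" + s + "'"
    if '"' not in s:
        return '"' + s + '"'
    # Both quote characters occur: split on ' and join the pieces again
    # with a double-quoted ' between them, all wrapped in concat(...).
    parts = []
    for i, piece in enumerate(s.split("'")):
        if i > 0:
            parts.append('"\'"')
        if piece:
            parts.append("'" + piece + "'")
    return "concat(" + ", ".join(parts) + ")"


if __name__ == "__main__":
    # The attribute value from the example above: a " followed by a '.
    print(xpath_literal('"\''))
```

Strings without ' are passed through unchanged, so the concat(...) form
only appears where it is actually needed.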

First, we transform some selectors into equivalent selectors. These
reduce the number of transformations to be performed later. It should
be noted that this pre-processing generally increases the complexity
of the resulting XPath expressions. For example, :only-child is just
"count(../*)=1" normally, and would yield a very long expression with
this pre-processing applied. As explained above, the processes defined
in this document are not optimized.

  :only-child
    -> :first-child:last-child

  :first-child
    -> :nth-child(1)

  :last-child
    -> :nth-last-child(1)

  :only-of-type
    -> :first-of-type:last-of-type

  :first-of-type
    -> :nth-of-type(1)

  :last-of-type
    -> :nth-last-of-type(1)
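
The rewriting above could be sketched as follows. This is only an
illustration: it assumes the selector has already been split into
pseudo-class tokens by some parser, and the names REWRITES and
preprocess are hypothetical.

```python
# Rewrite rules from the list above; each shorthand maps to the
# sequence of pseudo-classes that replaces it.
REWRITES = {
    ":only-child": [":first-child", ":last-child"],
    ":first-child": [":nth-child(1)"],
    ":last-child": [":nth-last-child(1)"],
    ":only-of-type": [":first-of-type", ":last-of-type"],
    ":first-of-type": [":nth-of-type(1)"],
    ":last-of-type": [":nth-last-of-type(1)"],
}


def preprocess(pseudos):
    """Expand shorthand pseudo-classes until only :nth-* forms remain."""
    out = []
    stack = list(reversed(pseudos))
    while stack:
        p = stack.pop()
        if p in REWRITES:
            # Push the replacement back so it is expanded in turn.
            stack.extend(reversed(REWRITES[p]))
        else:
            out.append(p)
    return out


if __name__ == "__main__":
    print(preprocess([":only-child"]))
```

Tokens that have no rewrite rule are passed through untouched.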

We start with "//*" which matches any element and define predicates.

...

  :root

    not(parent::*)

  :nth-child(an+b)
  :nth-last-child(an+b)
  x|y:nth-of-type(an+b)
  x|y:nth-last-of-type(an+b)

    These are transformed using the following template:

      ((_a = 0 and (count(_DIR-sibling::*[_Y]) + 1) = _b) or
      (_a > 0 and not((count(_DIR-sibling::*[_Y]) + 1) < _b) and
        (((count(_DIR-sibling::*[_Y]) + 1) - _b) mod _a) = 0) or
      (_a < 0 and not((count(_DIR-sibling::*[_Y]) + 1) > _b) and
        ((_b - (count(_DIR-sibling::*[_Y]) + 1)) mod (-1 * _a)) = 0))
      and parent::*

    _a and _b are derived from the an+b expression in the selector.
    _Y is the predicate derived from the type selector x|y in case
    of :nth-of-type and :nth-last-of-type, and "true()" otherwise.
    _DIR is "preceding" for :nth-child and :nth-of-type, and
    "following" for :nth-last-child and :nth-last-of-type.

    Note that in case of :...-of-type the local name and the name-
    space name have to be defined.
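
    The arithmetic behind the template can be checked with a small
    Python sketch. nth_matches mirrors the three branches on a plain
    1-based position; nth_predicate instantiates the template for
    concrete _a and _b and, as a simplification that is not part of
    the template itself, keeps only the branch that applies to the
    sign of _a. Both names are made up for this example.

```python
def nth_matches(a: int, b: int, pos: int) -> bool:
    """True if the 1-based position pos equals a*n + b for some n >= 0."""
    if a == 0:
        return pos == b
    if a > 0:
        return pos >= b and (pos - b) % a == 0
    return pos <= b and (b - pos) % (-a) == 0


def nth_predicate(a: int, b: int, direction: str, y: str = "true()") -> str:
    """Build the XPath predicate body for concrete a and b.

    direction is "preceding" or "following"; y is the type predicate
    for the :...-of-type variants and "true()" otherwise.
    """
    pos = f"(count({direction}-sibling::*[{y}]) + 1)"
    if a == 0:
        test = f"{pos} = {b}"
    elif a > 0:
        test = f"not({pos} < {b}) and (({pos} - {b}) mod {a}) = 0"
    else:
        test = f"not({pos} > {b}) and (({b} - {pos}) mod {-a}) = 0"
    return f"({test}) and parent::*"


if __name__ == "__main__":
    # :nth-child(2n+1), i.e. the odd children.
    print(nth_predicate(2, 1, "preceding"))
    print([p for p in range(1, 8) if nth_matches(2, 1, p)])
```

    For :nth-last-child and :nth-last-of-type the same code applies
    with "following" as the direction.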

  :empty

    not(* or text())

  :not(s)

    not(self::*[_s])

    _s is the predicate derived from s.

    Care must be taken when parsing s and handling namespaces: the
    default namespace does not apply to s unless s is a type or
    universal selector.

  [x|y]
  [|y]
  [y]

    @*[ namespace-uri() = _x ][ local-name() = _y ]

    _x is the namespace name associated with the prefix x. If the
    attribute in the selector is in no namespace, this is ''. _y
    is the local name y.

  [*|y]

    @*[ local-name() = _y ]

    _y is as specified above. The following transformations only add
    predicates; namespace handling is therefore ignored, as are
    other syntactic details like use of strings versus identifiers.

  [x=y]

    . = _y

  [x~=y]

    not(contains(normalize-space(_y), ' ')) and
    ( . = _y or
      starts-with(normalize-space(.), concat(_y, ' ')) or
      contains(normalize-space(.), concat(' ', _y, ' ')) or
      substring(normalize-space(.),
                string-length(normalize-space(.)) + 1 -
                  string-length(concat(' ', _y))) = concat(' ', _y))

    @@ This actually does not seem right. Perhaps it should be

      contains(concat(' ', normalize-space(.), ' '),
               concat(' ', _y, ' '))

    for any _y that is non-empty and contains no white space.
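
    A sketch of a padded containment test in Python, where both the
    normalized value and the word are wrapped in spaces so that only
    whole words match. normalize_space emulates the XPath function
    for the four XML white space characters; all names here are
    hypothetical, and word_literal is assumed to be an already quoted
    XPath string literal.

```python
import re

XML_WS = " \t\r\n"


def normalize_space(s: str) -> str:
    """Emulate XPath normalize-space() for XML white space."""
    return " ".join(t for t in re.split(f"[{XML_WS}]+", s) if t)


def includes_word(value: str, word: str) -> bool:
    """Reference behaviour of [x~=y] on plain strings."""
    if word == "" or any(c in XML_WS for c in word):
        return False  # such a word can never be a list item
    return f" {word} " in f" {normalize_space(value)} "


def includes_expr(word_literal: str) -> str:
    """XPath predicate body for the padded test."""
    return ("contains(concat(' ', normalize-space(.), ' '), "
            f"concat(' ', {word_literal}, ' '))")


if __name__ == "__main__":
    print(includes_word("foo bar", "foo"), includes_word("foobar", "foo"))
    print(includes_expr("'foo'"))
```

    Without the padding around the word, a value like "foobar" would
    be reported as containing the word "foo".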

  [x|=y]

    . = _y or starts-with(., concat(_y, '-'))

  [x^=y]

    starts-with(., _y)

  [x$=y]

    substring(., string-length(.) - string-length(_y) + 1) = _y
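
    XPath 1.0 has no ends-with() function, hence the substring()
    arithmetic. A sketch in Python that emulates substring() for
    integer start positions (1-based; positions below 1 yield the
    whole string) and checks the expression against a plain suffix
    test. The function names are made up for this example.

```python
def xpath_substring(s: str, start: int) -> str:
    """XPath 1.0 substring(s, start) for an integer start position."""
    return s[max(start - 1, 0):]


def ends_with(value: str, suffix: str) -> bool:
    """Mirror of the [x$=y] expression above."""
    start = len(value) - len(suffix) + 1
    return xpath_substring(value, start) == suffix


if __name__ == "__main__":
    for value, suffix in [("foobar", "bar"), ("ar", "bar"), ("bar", "bar")]:
        print(value, suffix, ends_with(value, suffix))
```

    When _y is longer than the attribute value the start position is
    zero or negative, the whole value is returned, and the comparison
    fails as it should.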

  [x*=y] 

    contains(., _y)

  #y

    count(. | id(_y)) = count(id(_y))

    This tests node identity; ". = id(_y)" would compare string-values
    instead.

  x|y
  * 

    Type selectors and universal selectors are transformed into 

      [ namespace-uri() = _x ][ local-name() = _y ]

    where _x and _y are the namespace name and local name respectively,
    and, where undefined, one or both of the predicates are omitted.
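
    A sketch of how these predicates could be assembled in Python.
    The function name is hypothetical; None stands for "undefined",
    an empty namespace string stands for "no namespace", and names
    are assumed to be free of the ' character as discussed earlier.

```python
def type_predicates(namespace=None, local_name=None) -> str:
    """Build the predicates for a type or universal selector."""
    out = ""
    for func, value in (("namespace-uri()", namespace),
                        ("local-name()", local_name)):
        if value is None:
            continue  # undefined: omit this predicate
        if "'" in value:
            raise ValueError("names are assumed not to contain '")
        out += f"[ {func} = '{value}' ]"
    return out


if __name__ == "__main__":
    # html|p with the prefix bound to the XHTML namespace.
    print("//*" + type_predicates("http://www.w3.org/1999/xhtml", "p"))
    # *|* leaves "//*" without any predicate.
    print("//*" + type_predicates())
```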

Even longer ago I wrote a partial implementation of the translation, 

  * http://perl-css.cvs.sourceforge.net/perl-css/CSS-SAC/lib/CSS/SAC/Selector/ToXPath.pm

It handles some of the combinators not yet discussed here.

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Thursday, 27 September 2007 04:13:10 UTC