Constructing Sequences in XSLT

Hi,

Following is a proposal for constructing sequences in XSLT rather than
XPath, for your consideration. Let me know if anything is unclear.

[It differs slightly from the draft posted on XSL-List, mainly in
 sections 6, 7, 8 and 9.]

Cheers,

Jeni

----

Executive summary
-----------------

Rather than XPath being continuously extended to allow it to do what
XSLT can already do, XSLT should be modified to support the thing that
it can't already do: sequence construction. This could be achieved by
amending the definition of content constructors in XSLT 2.0 and
introducing a new xsl:item instruction. This change would make XSLT
more consistent and more usable.


Contents
--------

1.  Requirement
2.  Sequence constructors
3.  Producing simple typed values and existing nodes
4.  Impact on XPath
5.  Impact on function definitions
6.  Impact on variable bindings
7.  Parentless (documentless) nodes
8.  Impact on result tree generation
9.  Creating node trees
10. Conclusions
11. References


Requirement
-----------

Recently, David C. posted a message to www-xpath-comments@w3.org that
described how XPath is restricted by the lack of a general
variable-binding expression (let clause) [1].

I think that the lack of a let clause restricts what's practical in
XPath (even if it doesn't affect what's theoretically possible). With
the for expression, you have to reconstruct any sequence that you
create within the for expression each time you use it, which probably
isn't particularly efficient and leads to maintenance headaches. For
example:

  for $o in $orders
  return if (count($o/item[(@price * @quantity) > 100]) > 5)
         then do:something($o/item[(@price * @quantity) > 100])
         else do:something-else($o/item[(@price * @quantity) > 100])

The way around this is with functions, because then you can use
xsl:variable to assign the variable:

  for $o in $orders
  return do:process-items($o)

and:

<xsl:function name="do:process-items">
  <xsl:param name="order" />
  <xsl:variable name="items"
                select="$order/item[(@price * @quantity) > 100]" />
  <xsl:result select="if (count($items) > 5)
                      then do:something($items)
                      else do:something-else($items)" />
</xsl:function>

but it's hardly ideal.

The same kind of problem occurs within an if expression within a for
expression, when certain variables are relevant within one branch of
the if and not in the other. For example:

  if ($string and $keyword)
  then if ((starts-with($string, $keyword) or
            ends-with(substring-before($string, $keyword), ' ')) and
           (not(substring-after($string, $keyword)) or
            starts-with(substring-after($string, $keyword), ' ')))
       then (substring-before($string, $keyword),
             $keyword,
             substring-after($string, $keyword))
       else $string
  else ()

which could be managed with:

  if ($string and $keyword)
  then (for $before in substring-before($string, $keyword),
            $after  in substring-after($string, $keyword)
        return if ((not($before) or ends-with($before, ' ')) and
                   (not($after) or starts-with($after, ' ')))
               then ($before, $keyword, $after)
               else $string)
  else ()

but which would be much clearer (and more accurate, since you're not
really iterating) as:

  if ($string and $keyword)
  then (let $before := substring-before($string, $keyword),
            $after  := substring-after($string, $keyword)
        return if ((not($before) or ends-with($before, ' ')) and
                   (not($after) or starts-with($after, ' ')))
               then ($before, $keyword, $after)
               else $string)
  else ()

Again, you could create a function to do the testing, but if we have
to generate new functions every time we want to bind variables, we're
going to have them coming out of our ears.

It's certainly true that you could add a let clause to XPath; you
could also add a where clause... and a sortby clause... and
typeswitches... and even element constructors... but what you end up
with is a replication of all the facilities of XSLT, but using a
non-XML syntax, and stuffed inside XML attributes.


Sequence constructors
---------------------

So I'd like to suggest an alternative. Instead of modifying XPath so
that it can do all the things that XSLT can do plus construct
sequences, why not modify XSLT so that it can construct general
sequences rather than just node sequences?

Doing this is (I *think*) simpler than it sounds. In XSLT 2.0,
"content constructors" are defined as [2]:

  "a sequence of nodes in the stylesheet that, when evaluated,
   constructs and returns a sequence of new nodes suitable for adding
   to the result tree. This sequence is referred to below as the
   result sequence."

We could modify that definition so that "content constructors" don't
necessarily return *nodes* (they should probably then be called
"sequence constructors"):

   a sequence of nodes in the stylesheet that, when evaluated,
   constructs and returns a sequence. This sequence is referred to
   below as the result sequence.

We can amend the description of XSLT instructions in line with this:

XSLT instructions then produce a sequence of zero, one, or more items
as their result. These items are added to the result sequence. Some
instructions, such as xsl:element, return a newly-constructed node
(which may have its own attributes, namespaces, children, and other
descendants); others, such as xsl:if, return items produced by their
own nested sequence constructors.
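
As a rough sketch of what this means (the element names and the $order
variable are made up for illustration), the sequence constructor:

  <xsl:element name="summary" />
  <xsl:if test="$order/item">
    <xsl:element name="items" />
  </xsl:if>

would produce a result sequence of one or two items: the new summary
element node, followed by whatever the xsl:if's nested sequence
constructor returns (the new items element node if the test is true,
nothing otherwise).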

[There are a couple of incompatibility problems here that I think can
 be handled; I'll come on to those later.]


Producing simple typed values and existing nodes
------------------------------------------------
 
All we need now is an element that can add a simple typed value or an
existing node to the result sequence. This could be achieved with an
xsl:item element:

  <!-- Category: instruction -->
  <xsl:item
    select = expression
    type = datatype>
    <!-- Content: sequence-constructor -->
  </xsl:item>

The xsl:item element works similarly to variable-binding elements: it
produces a sequence of items from either its select attribute or its
content. This enables you to add simple typed values or existing nodes
to a sequence.

For example, the equivalent to the for expression that we looked at
earlier would be:

  <xsl:variable name="new-orders" type="item*">
    <xsl:for-each select="$orders">
      <xsl:variable name="items"
                    select="item[(@price * @quantity) > 100]" />
      <xsl:item select="if (count($items) > 5)
                        then do:something($items)
                        else do:something-else($items)" />
    </xsl:for-each>
  </xsl:variable>

The $new-orders variable would have a value of a sequence of items.
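
A smaller sketch of the same instruction, assuming the xsl:item syntax
above (the variable name and values are invented for illustration),
shows simple typed values and an existing node sitting side by side:

  <xsl:variable name="summary" type="item*">
    <xsl:item select="'orders'" />
    <xsl:item select="count($orders)" />
    <xsl:item select="$orders[1]" />
  </xsl:variable>

Here $summary would be a three-item sequence (assuming $orders is not
empty): a string, a number, and the first order node itself rather
than a new node.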


Impact on XPath
---------------

Enabling XSLT to generate sequences will remove the requirement for
XPath to support expressions that involve range variables. For
example:

  <xsl:variable name="join" type="xs:integer*"
                select="for $i in (1, 2),
                            $j in (3, 4)
                        return ($i, $j)" />

could be done with:

  <xsl:variable name="join" type="xs:integer*">
    <xsl:for-each select="(1, 2)">
      <xsl:variable name="i" select="." />
      <xsl:for-each select="(3, 4)">
        <xsl:variable name="j" select="." />
        <xsl:item select="($i, $j)" />
      </xsl:for-each>
    </xsl:for-each>
  </xsl:variable>

[Of course a syntax for simple mapping would still be useful for
 when you just need to convert one sequence into another.]
  
This change would also remove the need for the sort() function in XSLT
(and indeed for named sort specifications altogether), and for
adopting the sortby clause from XQuery, since the existing xsl:sort
can be used.

For example, instead of:

  <xsl:sort-key name="subtotal-sort">
    <xsl:sort select="@price * @quantity" data-type="number"
              order="descending" />
    <xsl:sort select="@part-id" order="ascending" />
  </xsl:sort-key>
  <xsl:variable name="sorted-items"
                select="sort($items, 'subtotal-sort')" />

you could do:

  <xsl:variable name="sorted-items" type="item*">
    <xsl:for-each select="$items">
      <xsl:sort select="@price * @quantity" data-type="number"
                order="descending" />
      <xsl:sort select="@part-id" order="ascending" />
      <xsl:item select="." />
    </xsl:for-each>
  </xsl:variable>


Impact on function definitions
------------------------------

Adding the xsl:item element allows us to get rid of the xsl:result
element when defining functions. The xsl:function element's new syntax
would be:

<xsl:function
  name = qname
  type = datatype>
  <!-- Content: (xsl:param*, sequence-constructor) -->
</xsl:function>

The xsl:function element would simply return the sequence produced by
its sequence constructor.

For example:

  <xsl:function name="my:split-string" type="xs:string*">
    <xsl:param name="string" type="xs:string" />
    <xsl:param name="keyword" type="xs:string" />
    <xsl:if test="$string and $keyword">
      <xsl:variable name="before"
                    select="substring-before($string, $keyword)" />
      <xsl:variable name="after"
                    select="substring-after($string, $keyword)" />
      <xsl:item select="if ((not($before) or ends-with($before, ' ')) and
                            (not($after) or starts-with($after, ' ')))
                        then ($before, $keyword, $after)
                        else $string" />
    </xsl:if>
  </xsl:function>


Impact on variable bindings
---------------------------

The current XSLT 2.0 WD states:

  "[ERR030] Elements such as xsl:variable, xsl:param, xsl:message,
   and xsl:result-document construct a new document node, and use the
   result sequence returned by the content constructor to form the
   children of this document node. In this case it is an dynamic error
   if the result sequence contains namespace or attribute nodes. The
   processor must either signal the error, or must recover by ignoring
   the offending nodes. The elements, comments, processing
   instructions, and text nodes in the node sequence form the children
   of the newly constructed document node."

I'll concentrate on variable-binding elements here (xsl:message and
xsl:result-document are discussed in the next section).

Supporting the creation of sequences means that rather than create a
new document node, variable-binding elements must bind the variable to
the result sequence produced by their sequence constructor. This
sequence must be able to contain all kinds of nodes.

There is a backwards incompatibility here - if a variable is assigned
a value through the content of the variable-binding element, then
rather than conceptually holding the "root node of the result tree
fragment" as in XSLT 1.0, the variable holds a sequence of items
(nodes, assuming you're using the variable as in XSLT 1.0).

Currently, when users get the string value of a result tree fragment,
they get the string value of the *root node* of the result tree
fragment - the concatenation of the string values of the text node
descendants in the result tree fragment.

On the other hand, when users get the string value of a sequence, they
get the string value of the first item in the sequence.

Therefore if you have:

  <xsl:variable name="foo">
    <element>A</element>
    <element>B</element>
  </xsl:variable>

then string($foo) will give "AB" in XSLT 1.0 and just "A" in XSLT 2.0
(if sequence constructors were supported).

[I don't think that people get the string values of result tree
 fragments that contain elements very often but it's sometimes useful.]

Another difference applies if people are used to using node-set()
extension functions to convert variables to node sets. As there is no
document node, addressing the items in the sequence does not involve
stepping down to them.

For example, given the above definition of $foo, the equivalent of the
following in XSLT 1.0:

  <xsl:for-each select="exsl:node-set($foo)/element">
    ...
  </xsl:for-each>

is simply:

  <xsl:for-each select="$foo">
    ...
  </xsl:for-each>

[There's an argument that XSLT 2.0 shouldn't have to worry about
 backwards compatibility with extension functions, but the node-set()
 extension function is very widely used and is based on the
 description of result tree fragments from XSLT 1.0.]
 
These backwards compatibility issues could be resolved by having the
type attribute on the variable-binding element determine the behaviour
of the variable-binding element. If the type attribute is not present,
or if the type attribute indicates that the variable should contain a
single document node, then the variable-binding element creates a
result tree (as described later), and the variable is bound to a new
document node; otherwise, the variable is bound to the sequence.

[This is similar to the role played by the separator attribute on
 xsl:value-of.]
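
To sketch how that would look, reusing the content of $foo from above
(the variable names are invented, and the type value follows the
datatype syntax used elsewhere in this proposal, which is an
assumption rather than settled syntax):

  <!-- no type attribute: bound to a new document node, as in 1.0 -->
  <xsl:variable name="foo-doc">
    <element>A</element>
    <element>B</element>
  </xsl:variable>

  <!-- type attribute present: bound to the sequence of two elements -->
  <xsl:variable name="foo-seq" type="item*">
    <element>A</element>
    <element>B</element>
  </xsl:variable>

With these, string($foo-doc) would still give "AB" and the elements
would be reached with $foo-doc/element, whereas string($foo-seq) would
give "A" and the elements would be the items of $foo-seq itself.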


Parentless (documentless) nodes
-------------------------------

Section 3.1 of the XSLT 2.0 WD [3] states:

  "The data model defined in [Data Model] allows a node to be part of
   a tree whose root is a node other than a document node.

  "Although such nodes may exist transiently during the course of XSLT
   processing, every node that is processed by an XSLT stylesheet
   (that is, a node that may be returned in the result of an
   expression) will belong to a tree whose root is a document node."

Under the scheme described above, this would no longer be true. It
would be possible to create sequences containing nodes that do not have
a parent.

I think that it would sometimes be handy to allow documentless nodes
to be generated by a sequence constructor, for example to dynamically
create a set of attributes that can then be added to several different
elements.

[Currently in XSLT you have to do this by creating the attributes on a
 dummy element; attribute sets don't help if an attribute should only
 be present under certain circumstances.]
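
For instance, something along these lines might become possible (a
sketch only: the type value, the $status variable and the element and
attribute names are assumptions for illustration):

  <xsl:variable name="common-atts" type="item*">
    <xsl:attribute name="class">note</xsl:attribute>
    <xsl:if test="$status = 'draft'">
      <xsl:attribute name="status">draft</xsl:attribute>
    </xsl:if>
  </xsl:variable>

  <p><xsl:copy-of select="$common-atts" />...</p>
  <div><xsl:copy-of select="$common-atts" />...</div>

The status attribute is only in the sequence when the test is true,
which is the case that attribute sets can't handle.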

However, allowing documentless nodes may create problems with
parentless attribute nodes, since they cannot gain access to namespace
nodes through their parent element. I think that this is sufficiently
rare that it's not particularly worrisome; in the worst case, it could
be an error to have a sequence contain parentless attribute nodes.

Note that if the suggestion for retaining backwards compatibility with
variable-binding elements is used, then if XSLT 2.0 is used like XSLT
1.0 (i.e. without type attributes on variable-binding elements, and
without user-defined functions) it is still true that every node that
may be returned in the result of an expression will belong to a tree
whose root is a document node.


Impact on result tree generation
--------------------------------

Document nodes are generated automatically in four places in XSLT 2.0
as defined:

  - within variable-binding elements
  - within xsl:message
  - within xsl:result-document
  - within the stylesheet as a whole

The sequence generated from the content constructor forms the children
of the document node. With xsl:result-document, the href attribute
gives instructions about where that document should go. The
destination and format of the document node generated by the
stylesheet as a whole can be indicated by xsl:destination, or
implicit. The other document nodes (from variable-binding elements and
xsl:message) don't have an explicit destination - I'll call these
anonymous documents.

If we generalise to sequence constructors, the role of
xsl:result-document is similar to that of xsl:element - it creates a
node and uses its sequence constructor to create the content of that
node. If you view it like this, I think that xsl:document is the more
appropriate name (because it ties in with xsl:element etc.). I also
think that you should be able to explicitly create anonymous
documents.

Assuming that anonymous documents could be created explicitly using
xsl:document, the handling of an anonymous document created in this
way depends on where the sequence containing the anonymous document is
produced:

  - if it's produced from the content of a variable-binding element,
    then the variable is bound to that document node (actually the
    sequence that includes that document node, since feasibly other
    document nodes could be generated as well)

  - if it's produced from the content of an xsl:message, then the
    document is written to an implementation-defined destination for
    error messages (e.g. stderr)

  - if it's produced from the stylesheet as a whole, then the document
    is written to an implementation-defined destination for the result
    of the transformation (e.g. stdout) or the destination indicated
    by the xsl:destination element.

Note that it should always be a dynamic error if there's more than one
anonymous document in a sequence.
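
As a sketch of the first case, assuming an xsl:document instruction
that takes a sequence constructor as its content and needs no
attributes when the document is anonymous (the variable and element
names are invented):

  <xsl:variable name="lookup" type="item*">
    <xsl:document>
      <codes>
        <code id="a">Alpha</code>
        <code id="b">Beta</code>
      </codes>
    </xsl:document>
  </xsl:variable>

Here $lookup would be bound to a sequence holding the one explicitly
created document node, so $lookup/codes/code would address the code
elements.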

For backwards compatibility with XSLT 1.0 (and for convenience), if
the result sequence consists of documentless nodes, an anonymous
document should be implicitly created in certain circumstances:

  - by variable-binding elements, if they don't have a type attribute
    or have a type attribute with the value "document" (or whatever
    DataType expression is used to indicate a document node)

  - by xsl:message

  - by the stylesheet as a whole

Allowing xsl:document would enable you to create sequences that
contained several document nodes (with an error if any of those
document nodes had the same destination). It would also potentially
allow you to create sequences that mixed new document nodes and other
items.

This could be an error, such that sequences should either consist
entirely of document nodes (with different destinations), or consist
entirely of documentless nodes. If it was an error, and you wanted to
generate multiple documents, you'd need to use xsl:document to create
the main document as well as the secondary ones.

Also, you wouldn't be able to construct a document node while you were
in the middle of constructing another document. This is a very
different model from the 'tree of documents' approach of the current
XSLT 2.0 WD, the XSLT 1.1 WD and most extension elements. I'm not sure
whether this restriction makes it impractical (or any more impractical
than the current restriction that you can't create a secondary result
document within a variable).

It could also mean additional processing because, for example, you
couldn't run through a bunch of nodes, creating a secondary result
document for each node at the same time as creating a link in the main
result document. You'd have to run through the same set of nodes twice
in order to create the two different bits of content.
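
A sketch of what that two-pass stylesheet might look like, assuming
that xsl:document takes over the href attribute from
xsl:result-document, and with $chapters, @id and title as purely
illustrative names:

  <!-- first pass: the main document, with a link per chapter -->
  <xsl:document>
    <ul>
      <xsl:for-each select="$chapters">
        <li><a href="{@id}.html"><xsl:value-of select="title" /></a></li>
      </xsl:for-each>
    </ul>
  </xsl:document>
  <!-- second pass: one secondary document per chapter -->
  <xsl:for-each select="$chapters">
    <xsl:document href="{@id}.html">
      <xsl:apply-templates select="." />
    </xsl:document>
  </xsl:for-each>

The result sequence then consists entirely of document nodes: one
anonymous main document followed by one document per chapter, each
with its own destination.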

On the other hand, that restriction (you can only do one thing at a
time) is true elsewhere in XSLT, so why shouldn't it be true when it
comes to creating documents?

Alternatively, it could be permitted to mix document nodes and other
nodes (after all, it should be allowed to mix document nodes from the
source documents with other nodes in node sequences). This would make
node-tree construction (see below) a little more complex, but I think
it could be handled.


Creating node trees
-------------------

This final issue is about how to create content to be added to other
nodes from a sequence. This applies to the construction of the content
of element nodes and document nodes (as described above). It also
applies, slightly differently, to the construction of comment,
attribute, processing instruction, text and namespace nodes (which
I'll call simple nodes so that I don't have to repeat their names
constantly).

Currently, content constructors construct a sequence of nodes, and
this sequence of nodes can be made into a node tree by adding a parent
node, or converted to a string to be used as the value of a simple
node. Under certain circumstances, the presence of certain types of
nodes in the node sequence is a recoverable dynamic error (e.g.
attribute nodes when creating a document; element nodes when getting
the string value for an attribute).

If we had the more general sequence constructors, result trees would
need to be constructed from sequences containing any mixture of simple
typed values and nodes (both newly created (rootless) and pre-existing
(rooted)), rather than those containing just newly created nodes.

In fact, this is exactly the same issue as that faced by xsl:copy-of
(which also has to cope with sequences containing a mixture of types
of items in order to create a sequence of (new) nodes). The only
difference is that under the proposals above, the sequence could
contain documentless nodes and (potentially) document nodes.

In some cases, documentless nodes may be added to the node tree simply
by giving them a parent. However, this cannot be done all the time
since a variable may still hold a reference to the node; giving it a
parent would change the result of counting its ancestors, for example.
In addition, the documentless node might be added to two different
parents, which would cause problems.

The options, I think, are:

  - copy documentless nodes (as you do with nodes that have documents)

  - make it an error for a variable to hold a sequence of documentless
    nodes (in most cases such sequences will be automatically
    converted to a document node whose content is that sequence)

Since I think that sequences of documentless nodes could be useful, I
favour the first option.

Document nodes are more tricky. If they are allowed in these
situations at all, then I think there needs to be some way of
'bubbling up' document nodes so that in the end you get a sequence of
document nodes.  For example, the result of:

  <xsl:element name="foo">
    <xsl:document><xsl:call-template name="bar" /></xsl:document>
  </xsl:element>

would actually be a sequence containing the foo element node followed
by the document node, the equivalent of:

  <xsl:element name="foo" />
  <xsl:document><xsl:call-template name="bar" /></xsl:document>


Conclusions
-----------

If XPath were extended to be a usable method of generating sequences,
it would end up replicating the variable assignment and flow control
features that are already available within XSLT. While there is an
argument for constructing a language that performs transformations
without using XML syntax, that niche is already filled by XQuery. In
addition, because XPaths are used within attributes in XSLT, XSLT with
extended XPath will become a lot harder to read, write, and maintain
than the equivalent XSLT instructions.

Extending the concept of 'content constructors' to more general
'sequence constructors' and introducing an xsl:item element to add
simple typed values and pre-existing nodes to this sequence gives XSLT
the power to construct sequences of all descriptions. Rather than
learning one language for constructing sequences of nodes and a
different language with similar constructs for constructing other
sequences, users will only have to learn one, unified, language.


References
----------

[1] http://lists.w3.org/Archives/Public/www-xpath-comments/2002JanMar/0026.html
[2] http://www.w3.org/TR/xslt20/#dt-content-constructor
[3] http://www.w3.org/TR/xslt20/#rootless-nodes

Received on Friday, 11 January 2002 04:30:48 UTC