- From: Jeni Tennison <jeni@jenitennison.com>
- Date: Fri, 11 Jan 2002 09:30:44 +0000
- To: xsl-editors@w3.org
Hi,

Following is a proposal for constructing sequences in XSLT rather than XPath, for your consideration. Let me know if anything is unclear. [It differs slightly from the draft posted on XSL-List, mainly in sections 6, 7, 8 and 9.]

Cheers,

Jeni

----
Executive summary
-----------------

Rather than XPath being continuously extended to allow it to do what XSLT can already do, XSLT should be modified to support the thing that it can't already do: sequence construction. This could be achieved by amending the definition of content constructors in XSLT 2.0 and introducing a new xsl:item instruction. This change would make XSLT more consistent and more usable.

Contents
--------

  1. Requirement
  2. Sequence constructors
  3. Producing simple typed values and existing nodes
  4. Impact on XPath
  5. Impact on function definitions
  6. Impact on variable bindings
  7. Parentless (documentless) nodes
  8. Impact on result tree generation
  9. Creating node trees
  10. Conclusions
  11. References

Requirement
-----------

Recently, David C. posted a message to www-xpath-comments@w3.org that described how XPath is restricted by the lack of a general variable-binding expression (let clause) [1].

I think that the lack of a let clause restricts what's practical in XPath (even if it doesn't affect what's theoretically possible). For example, with the for expression, you have to reconstruct any sequence that you create within the for expression each time you use it, which probably isn't particularly efficient and leads to maintenance headaches.
For example:

  for $o in $orders
  return if (count($o/item[(@price * @quantity) > 100]) > 5)
         then do:something($o/item[(@price * @quantity) > 100])
         else do:something-else($o/item[(@price * @quantity) > 100])

The way around this is with functions, because then you can use xsl:variable to assign the variable:

  for $o in $orders
  return do:process-items($o)

and:

  <xsl:function name="do:process-items">
    <xsl:param name="order" />
    <xsl:variable name="items"
                  select="$order/item[(@price * @quantity) > 100]" />
    <xsl:result select="if (count($items) > 5)
                        then do:something($items)
                        else do:something-else($items)" />
  </xsl:function>

but it's hardly ideal.

The same kind of problem occurs within an if expression within a for expression, when certain variables are relevant within one branch of the if and not in the other. For example:

  if ($string and $keyword)
  then if ((starts-with($string, $keyword) or
            ends-with(substring-before($string, $keyword), ' ')) and
           (not(substring-after($string, $keyword)) or
            starts-with(substring-after($string, $keyword), ' ')))
       then (substring-before($string, $keyword),
             $keyword,
             substring-after($string, $keyword))
       else $string
  else ()

which could be managed with:

  if ($string and $keyword)
  then (for $before in substring-before($string, $keyword),
            $after in substring-after($string, $keyword)
        return if ((not($before) or ends-with($before, ' ')) and
                   (not($after) or starts-with($after, ' ')))
               then ($before, $keyword, $after)
               else $string)
  else ()

but which would be much clearer (and more accurate, since you're not really iterating) as:

  if ($string and $keyword)
  then (let $before := substring-before($string, $keyword),
            $after := substring-after($string, $keyword)
        if ((not($before) or ends-with($before, ' ')) and
            (not($after) or starts-with($after, ' ')))
        then ($before, $keyword, $after)
        else $string)
  else ()

Again, you could create a function to do the testing, but if we have to generate new functions every time we want to bind variables, we're going to have them
coming out of our ears.

It's certainly true that you could add a let clause to XPath; you could also add a where clause... and a sortby clause... and typeswitches... and even element constructors... but what you end up with is a replication of all the facilities of XSLT, but using a non-XML syntax, and stuffed inside XML attributes.

Sequence constructors
---------------------

So I'd like to suggest an alternative. Instead of modifying XPath so that it can do all the things that XSLT can do plus construct sequences, why not modify XSLT so that it can construct general sequences rather than just node sequences?

Doing this is (I *think*) simpler than it sounds. In XSLT 2.0, "content constructors" are defined as [2]:

  "a sequence of nodes in the stylesheet that, when evaluated, constructs and returns a sequence of new nodes suitable for adding to the result tree. This sequence is referred to below as the result sequence."

If we modify that definition, so that "content constructors" don't necessarily return *nodes* (they should probably then be called "sequence constructors"):

  a sequence of nodes in the stylesheet that, when evaluated, constructs and returns a sequence. This sequence is referred to below as the result sequence.

We can amend the description of XSLT instructions in line with this:

  XSLT instructions then produce a sequence of zero, one, or more items as their result. These items are added to the result sequence. Some instructions, such as xsl:element, return a newly-constructed node (which may have its own attributes, namespaces, children, and other descendants); others, such as xsl:if, return items produced by their own nested sequence constructors.

[There are a couple of incompatibility problems here that I think can be handled; I'll come on to those later.]
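
To illustrate the amended description, here is a minimal sketch of a sequence constructor. The $show-notes variable and the element names are made up for illustration; nothing in it is taken from the WD.

  <!-- hypothetical names: $show-notes, total, note -->
  <xsl:element name="total">42</xsl:element>
  <xsl:if test="$show-notes">
    <note>first</note>
    <note>second</note>
  </xsl:if>

The xsl:element instruction contributes one newly-constructed total element to the result sequence. The xsl:if instruction contributes whatever its nested sequence constructor returns: two note elements when $show-notes is true, and nothing otherwise. So the result sequence would hold either three items or one.
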
Producing simple typed values and existing nodes
------------------------------------------------

All we need now is an element that can add a simple typed value or an existing node to the result sequence. This could be achieved with an xsl:item element:

  <!-- Category: instruction -->
  <xsl:item
    select = expression
    type = datatype>
    <!-- Content: sequence-constructor -->
  </xsl:item>

The xsl:item element works similarly to variable-binding elements: it produces a sequence of items from either its select attribute or its content. This enables you to add simple typed values or existing nodes to a sequence.

For example, the equivalent to the for expression that we looked at earlier would be:

  <xsl:variable name="new-orders" type="item*">
    <xsl:for-each select="$orders">
      <xsl:variable name="items"
                    select="item[(@price * @quantity) > 100]" />
      <xsl:item select="if (count($items) > 5)
                        then do:something($items)
                        else do:something-else($items)" />
    </xsl:for-each>
  </xsl:variable>

The $new-orders variable would have a value of a sequence of items.

Impact on XPath
---------------

Enabling XSLT to generate sequences will remove the requirement for XPath to support expressions that involve range variables. For example:

  <xsl:variable name="join" type="xs:integer*"
                select="for $i in (1, 2),
                            $j in (3, 4)
                        return ($i, $j)" />

could be done with:

  <xsl:variable name="join" type="xs:integer*">
    <xsl:for-each select="(1, 2)">
      <xsl:variable name="i" select="." />
      <xsl:for-each select="(3, 4)">
        <xsl:variable name="j" select="." />
        <xsl:item select="($i, $j)" />
      </xsl:for-each>
    </xsl:for-each>
  </xsl:variable>

[Of course a syntax for simple mapping would still be useful for when you just need to convert one sequence into another.]

This change would also remove the requirement for the sort() function (from XSLT, and indeed named sort specifications altogether) or the adoption of the sortby clause from XQuery, since the existing xsl:sort can be used.
For example, instead of:

  <xsl:sort-key name="subtotal-sort">
    <xsl:sort select="@price * @quantity" data-type="number"
              order="descending" />
    <xsl:sort select="@part-id" order="ascending" />
  </xsl:sort-key>

  <xsl:variable name="sorted-items"
                select="sort($items, 'subtotal-sort')" />

you could do:

  <xsl:variable name="sorted-items">
    <xsl:for-each select="$items">
      <xsl:sort select="@price * @quantity" data-type="number"
                order="descending" />
      <xsl:sort select="@part-id" order="ascending" />
      <xsl:item select="." />
    </xsl:for-each>
  </xsl:variable>

Impact on function definitions
------------------------------

Adding the xsl:item element allows us to get rid of the xsl:result element when defining functions. The xsl:function element's new syntax would be:

  <xsl:function
    name = qname
    type = datatype>
    <!-- Content: (xsl:param*, sequence-constructor) -->
  </xsl:function>

The xsl:function element would simply return the sequence produced by its sequence constructor. For example:

  <xsl:function name="my:split-string" type="xs:string*">
    <xsl:param name="string" type="xs:string" />
    <xsl:param name="keyword" type="xs:string" />
    <xsl:if test="$string and $keyword">
      <xsl:variable name="before"
                    select="substring-before($string, $keyword)" />
      <xsl:variable name="after"
                    select="substring-after($string, $keyword)" />
      <xsl:item select="if ((not($before) or ends-with($before, ' ')) and
                            (not($after) or starts-with($after, ' ')))
                        then ($before, $keyword, $after)
                        else $string" />
    </xsl:if>
  </xsl:function>

Impact on variable bindings
---------------------------

The current XSLT 2.0 WD states:

  "[ERR030] Elements such as xsl:variable, xsl:param, xsl:message, and xsl:result-document construct a new document node, and use the result sequence returned by the content constructor to form the children of this document node. In this case it is an dynamic error if the result sequence contains namespace or attribute nodes.
  The processor must either signal the error, or must recover by ignoring the offending nodes. The elements, comments, processing instructions, and text nodes in the node sequence form the children of the newly constructed document node."

I'll concentrate on variable-binding elements here (xsl:message and xsl:result-document are discussed in the next section).

Supporting the creation of sequences means that rather than create a new document node, variable-binding elements must bind the variable to the result sequence produced by their sequence constructor. This sequence must be able to contain all kinds of nodes.

There is a backwards incompatibility here - if a variable is assigned a value through the content of the variable-binding element, then rather than conceptually holding the "root node of the result tree fragment" as in XSLT 1.0, the variable holds a sequence of items (nodes, assuming you're using the variable as in XSLT 1.0).

Currently, when users get the string value of a result tree fragment, they get the string value of the *root node* of the result tree fragment - the concatenation of the string values of the text node descendants in the result tree fragment. On the other hand, when users get the string value of a sequence, they get the string value of the first item in the sequence. Therefore if you have:

  <xsl:variable name="foo">
    <element>A</element>
    <element>B</element>
  </xsl:variable>

then string($foo) will give "AB" in XSLT 1.0 and just "A" in XSLT 2.0 (if sequence constructors were supported). [I don't think that people get the string values of result tree fragments that contain elements very often but it's sometimes useful.]

Another difference applies if people are used to using node-set() extension functions to convert variables to node sets. As there is no document node, addressing the items in the sequence does not involve stepping down to them.
For example, given the above definition of $foo, the equivalent of the following in XSLT 1.0:

  <xsl:for-each select="exsl:node-set($foo)/element">
    ...
  </xsl:for-each>

is simply:

  <xsl:for-each select="$foo">
    ...
  </xsl:for-each>

[There's an argument that XSLT 2.0 shouldn't have to worry about backwards compatibility with extension functions, but the node-set() extension function is very widely used and is based on the description of result tree fragments from XSLT 1.0.]

These backwards compatibility issues could be resolved by having the type attribute on the variable-binding element determine the behaviour of the variable-binding element. If the type attribute is not present, or if the type attribute indicates that the variable should contain a single document node, then the variable-binding element creates a result tree (as described later), and the variable is bound to a new document node; otherwise, the variable is bound to the sequence. [This is similar to the role played by the separator attribute on xsl:value-of.]

Parentless (documentless) nodes
-------------------------------

Section 3.1 of the XSLT 2.0 WD [3] states:

  "The data model defined in [Data Model] allows a node to be part of a tree whose root is a node other than a document node.

  "Although such nodes may exist transiently during the course of XSLT processing, every node that is processed by an XSLT stylesheet (that is, a node that may be returned in the result of an expression) will belong to a tree whose root is a document node."

Under the scheme described above, this would no longer be true. It would be possible to create sequences containing nodes that do not have a parent.

I think that it would sometimes be handy to allow documentless nodes to be generated by a sequence constructor, for example to dynamically create a set of attributes that can then be added to several different elements.
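
As a sketch of what that might look like under this proposal (the type value "attribute*" and all the names below are assumptions for illustration, not syntax taken from the WD):

  <!-- hypothetical: a variable bound to a sequence of parentless
       attribute nodes; "attribute*" is an assumed type expression -->
  <xsl:variable name="common-atts" type="attribute*">
    <xsl:attribute name="class">note</xsl:attribute>
    <!-- this attribute is only present under certain circumstances -->
    <xsl:if test="@lang">
      <xsl:attribute name="lang">
        <xsl:value-of select="@lang" />
      </xsl:attribute>
    </xsl:if>
  </xsl:variable>

  <!-- the same attributes are then copied onto two different elements -->
  <div>
    <xsl:copy-of select="$common-atts" />
    <xsl:apply-templates />
  </div>
  <p>
    <xsl:copy-of select="$common-atts" />
  </p>
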
[Currently in XSLT you have to do this by creating the attributes on a dummy element; attribute sets don't help if an attribute should only be present under certain circumstances.]

However, it may create problems with parentless attribute nodes, since they cannot gain access to namespace nodes through their parent element. I think that this is sufficiently rare that it's not particularly worrisome; in the worst case, it could be an error to have a sequence contain parentless attribute nodes.

Note that if the suggestion for retaining backwards compatibility with variable-binding elements is used, then if XSLT 2.0 is used like XSLT 1.0 (i.e. without type attributes on variable-binding elements, and without user-defined functions) it is still true that every node that may be returned in the result of an expression will belong to a tree whose root is a document node.

Impact on result tree generation
--------------------------------

Document nodes are generated automatically in four places in XSLT 2.0 as defined:

  - within variable-binding elements
  - within xsl:message
  - within xsl:result-document
  - within the stylesheet as a whole

The sequence generated from the content constructor forms the children of the document node.

With xsl:result-document, the href attribute gives instructions about where that document should go. The destination and format of the document node generated by the stylesheet as a whole can be indicated by xsl:destination, or left implicit. The other document nodes (from variable-binding elements and xsl:message) don't have an explicit destination - I'll call these anonymous documents.

If we generalise to sequence constructors, the role of xsl:result-document is similar to that of xsl:element - it creates a node and uses its sequence constructor to create the content of that node. If you view it like this, I think that xsl:document is the more appropriate name (because it ties in with xsl:element etc.).
I also think that you should be able to explicitly create anonymous documents.

Assuming that anonymous documents could be created explicitly using xsl:document, the handling of an anonymous document created in this way depends on where the sequence containing the anonymous document is produced:

  - if it's produced from the content of a variable-binding element, then the variable is bound to that document node (actually the sequence that includes that document node, since feasibly other document nodes could be generated as well)

  - if it's produced from the content of an xsl:message, then the document is written to an implementation-defined destination for error messages (e.g. stderr)

  - if it's produced from the stylesheet as a whole, then the document is written to an implementation-defined destination for the result of the transformation (e.g. stdout) or the destination indicated by the xsl:destination element.

Note that it should always be a dynamic error if there's more than one anonymous document in a sequence.

For backwards compatibility with XSLT 1.0 (and for convenience), if the result sequence consists of documentless nodes, an anonymous document should be implicitly created in certain circumstances:

  - by variable-binding elements, if they don't have a type attribute or have a type attribute with the value "document" (or whatever DataType expression is used to indicate a document node)

  - by xsl:message

  - by the stylesheet as a whole

Allowing xsl:document would enable you to create sequences that contained several document nodes (with an error if any of those document nodes had the same destination). It would also potentially allow you to create sequences that mixed new document nodes and other items.

This could be an error, such that sequences should either consist entirely of document nodes (with different destinations), or consist entirely of documentless nodes.
If it was an error, and you wanted to generate multiple documents, you'd need to use xsl:document to create the main document as well as the secondary ones. Also, you wouldn't be able to construct a document node while you were in the middle of constructing another document.

This is a very different model from the 'tree of documents' approach of the current XSLT 2.0 WD, the XSLT 1.1 WD and most extension elements. I'm not sure whether this restriction makes it impractical (or any more impractical than the current restriction that you can't create a secondary result document within a variable). It could also mean additional processing because, for example, you couldn't run through a bunch of nodes, creating a secondary result document for each node at the same time as creating a link in the main result document. You'd have to run through the same set of nodes twice in order to create the two different bits of content. On the other hand, that restriction (you can only do one thing at a time) is true elsewhere in XSLT, so why shouldn't it be true when it comes to creating documents?

Alternatively, it could be permitted to mix document nodes and other nodes (after all, it should be allowed to mix document nodes from the source documents with other nodes in node sequences). This would make node-tree construction (see below) a little more complex, but I think it could be handled.

Creating node trees
-------------------

This final issue is about how to create content to be added to other nodes from a sequence. This applies to the construction of the content of element nodes and document nodes (as described above). It also applies, slightly differently, to the construction of comment, attribute, processing instruction, text and namespace nodes (which I'll call simple nodes so that I don't have to repeat their names constantly).

Currently, content constructors construct a sequence of nodes, and this sequence of nodes can be made into a node tree by adding a parent node, or converted to a string to be used as the value of a simple node. Under certain circumstances, the presence of certain types of nodes in the node sequence is a recoverable dynamic error (e.g. attribute nodes when creating a document; element nodes when getting the string value for an attribute).

If we had the more general sequence constructors, result trees would need to be constructed from sequences containing any mixture of simple typed values and nodes (both newly created (rootless) and pre-existing (rooted)), rather than those containing just newly created nodes.

In fact, this is exactly the same issue as that faced by xsl:copy-of (which also has to cope with sequences containing a mixture of types of items in order to create a sequence of (new) nodes). The only difference is that under the proposals above, the sequence could contain documentless nodes and (potentially) document nodes.

In some cases, documentless nodes may be added to the node tree simply by giving them a parent. However, this cannot be done all the time since a variable may still hold a reference to the node; giving it a parent would change the result of counting its ancestors, for example. In addition, the documentless node might be added to two different parents, which would cause problems. The options, I think, are:

  - copy documentless nodes (as you do with nodes that have documents)

  - make it an error for a variable to hold a sequence of documentless nodes (in most cases such sequences will be automatically converted to a document node whose content is that sequence)

Since I think that sequences of documentless nodes could be useful, I favour the first option.

Document nodes are more tricky.
If they are allowed in these situations at all, then I think there needs to be some way of 'bubbling up' document nodes so that in the end you get a sequence of document nodes. For example, the result of:

  <xsl:element name="foo">
    <xsl:document><xsl:call-template name="bar" /></xsl:document>
  </xsl:element>

would actually be a sequence containing the foo element node followed by the document node, the equivalent of:

  <xsl:element name="foo" />
  <xsl:document><xsl:call-template name="bar" /></xsl:document>

Conclusions
-----------

If XPath were extended to be a usable method of generating sequences, it would end up replicating the variable assignment and flow control features that are already available within XSLT. While there is an argument for constructing a language that performs transformations without using XML syntax, that niche is already filled by XQuery. In addition, because XPaths are used within attributes in XSLT, XSLT with extended XPath will become a lot harder to read, write, and maintain than the equivalent XSLT instructions.

Extending the concept of 'content constructors' to more general 'sequence constructors' and introducing an xsl:item element to add simple typed values and pre-existing nodes to this sequence gives XSLT the power to construct sequences of all descriptions. Rather than learning one language for constructing sequences of nodes and a different language with similar constructs for constructing other sequences, users will only have to learn one, unified, language.

References
----------

[1] http://lists.w3.org/Archives/Public/www-xpath-comments/2002JanMar/0026.html
[2] http://www.w3.org/TR/xslt20/#dt-content-constructor
[3] http://www.w3.org/TR/xslt20/#rootless-nodes

Received on Friday, 11 January 2002 04:30:48 UTC