
XParse Module

EXPath Candidate Module 8 December 2014

This version:
http://expath.org/spec/xparse/20141006
Latest version:
http://expath.org/spec/xparse
Previous version:
Editor:
John Lumley <john@saxonica.com>

This document is also available in these non-normative formats: XML and Revision Markup.


Abstract

This proposal provides an API for parsing XPath and XQuery expressions to a standardised parse tree, and the reversal of the process. It has been designed to be compatible with XQuery 3.0+ and XSLT 3.0+, as well as any other XPath 3.0 usage.

The module homepage, with more information, is on the EXPath website at http://expath.org/modules/xparse/.

Table of Contents

1 Status of this document
2 Introduction
    2.1 Supported languages
    2.2 Namespace conventions
    2.3 Error management
    2.4 Parse trees
    2.5 Test suite
3 Use cases
    3.1 Example – Refactoring a variable name
4 Options
5 Flattening the parse tree
6 Extensibility & Enquiry
7 Parsing expressions
    7.1 xp:parse
8 Generating expressions
    8.1 xp:to-expression

Appendices

A References
B Summary of error conditions


1 Status of this document

This document is in an early draft stage. Comments are welcome on the public-expath@w3.org mailing list (archive).

2 Introduction

This module is designed to specify a small library of functions for parsing expressions in XPath ([XPath3.0]) or XQuery ([XQuery3.0]) into XML trees, and in the reverse direction serialising these trees into equivalent string expressions.

2.1 Supported languages

There are two senses in which languages are supported by this API definition:

  • The languages for which expression data is supported. In this case it is expected to be defined for various (all?) versions of the following language families: XPath ([XPath1.0],[XPath2.0],[XPath3.0]) and XQuery ([XQuery1.0],[XQuery3.0]).

    It is anticipated that implementors will use tools to generate portions of code semi-automatically from EBNF descriptions of the language grammars, and thus the suite of supported languages, especially in terms of succeeding versions, will be extensible. For further details see 6 Extensibility & Enquiry.

  • The languages within which the functions can be executed. At present the model of describing options uses a map, which restricts its use to XPath 3.0+, XQuery 3.0+ and XSLT 3.0+.

2.2 Namespace conventions

The module defined by this document defines several functions, all contained in the namespace http://expath.org/ns/xparse. In this document, the xp prefix, when used, is bound to this namespace URI.

Error codes are defined in the same namespace (http://expath.org/ns/xparse), and in this document are displayed with the same prefix, xp.

2.3 Error management

There are two principal types of error: those resulting from erroneous arguments or usage, and those resulting from invalid primary inputs (e.g. an invalid XPath expression). Both types of failure generate an error identified by a code (a QName). When such an error condition is reached in the evaluation of an expression, a dynamic error is thrown with the corresponding error code (as if the standard XPath function error() had been called). An implementation may provide additional information about the exact nature of the error, which can for example be processed in an XSLT 3.0 try/catch construct.

2.4 Parse trees

Parse trees within this module are represented as XML trees, with element names corresponding to the EBNF grammar productions for the version of XPath/XQuery being parsed. For example:

<AdditiveExpr>
   <IntegerLiteral>1</IntegerLiteral>
   <TOKEN>+</TOKEN>
   <IntegerLiteral>5</IntegerLiteral>
</AdditiveExpr>

is a subtree corresponding to the string section 1 + 5.

Currently these trees are in the 'null' namespace, but this might be set parametrically.

2.5 Test suite

A suite of test-cases for all the functions defined in this module, in [QT3] format, is defined at [Test-suite].

3 Use cases

Development of this specification was driven by requirements which some XML developers encounter in examining, modifying or generating XSLT or XQuery programs, when XPath and XQuery expressions need to be parsed or generated. Some typical use cases include:

3.1 Example – Refactoring a variable name

Within XPath and XQuery variable references are denoted through interpolations of the $name form. In simple cases, without local variable definitions or the character $ present in string literals, regular expressions can be used to extract and/or modify such variable references relatively cheaply, but in general this must be performed in a suitable parse tree.

Suppose an external variable $anyOldName is being renamed to $someNewVariableName. If we assume that this name is not overridden in the XPath, then the following XSLT code should suffice:

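<!-- Copy every unmatched node unchanged, so that only VarRef/@name is rewritten -->
<xsl:mode on-no-match="shallow-copy"/>

<xsl:template match="VarRef/@name">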
  <xsl:param name="old" as="xs:string" tunnel="yes"/>
  <xsl:param name="new" as="xs:string" tunnel="yes"/>
  <xsl:attribute name="name" select="if(. = $old) then $new else ."/>
</xsl:template>

<xsl:variable name="expr">$anyOldName to string-length('$anyOldName is a variable')</xsl:variable>
<xsl:variable name="parse" select="xp:parse($expr, map{})"/>
 ==>
   <RangeExpr>
      <VarRef name="anyOldName"/>
      <FunctionCall name="string-length">
        <StringLiteral>$anyOldName is a variable</StringLiteral>
      </FunctionCall>
   </RangeExpr>
   
<xsl:variable name="replace">
  <xsl:apply-templates select="$parse">
    <xsl:with-param name="old" as="xs:string" tunnel="yes">anyOldName</xsl:with-param>
    <xsl:with-param name="new" as="xs:string" tunnel="yes">someNewVariableName</xsl:with-param>
  </xsl:apply-templates> 
</xsl:variable>
 ==>
   <RangeExpr>
      <VarRef name="someNewVariableName"/>
      <FunctionCall name="string-length">
        <StringLiteral>$anyOldName is a variable</StringLiteral>
      </FunctionCall>
   </RangeExpr>
 
<xsl:variable name="new.expr" select="xp:to-expression($replace/*)"/>
 ==> "$someNewVariableName to string-length('$anyOldName is a variable')"

If local variable bindings are present (using let or for or quantified expressions) then suitable updating of the replacement parameters within push templates can support correct use of variable reference scoping.
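The same renaming can be sketched outside XSLT on the reduced parse tree shown above. The following Python is only an illustration under stated assumptions: the tree is pasted in by hand rather than produced by xp:parse, the helper name rename_variable is hypothetical, and variable scoping (let, for, quantified expressions) is ignored.

```python
import xml.etree.ElementTree as ET

# Reduced parse tree copied from the example above. In practice it would be
# the result of xp:parse, which this sketch does not implement.
PARSE_TREE = """
<RangeExpr>
  <VarRef name="anyOldName"/>
  <FunctionCall name="string-length">
    <StringLiteral>$anyOldName is a variable</StringLiteral>
  </FunctionCall>
</RangeExpr>
"""

def rename_variable(root, old, new):
    """Rename every VarRef named `old` to `new`; return how many were changed."""
    changed = 0
    for var_ref in root.iter("VarRef"):
        if var_ref.get("name") == old:
            var_ref.set("name", new)
            changed += 1
    return changed

tree = ET.fromstring(PARSE_TREE)
count = rename_variable(tree, "anyOldName", "someNewVariableName")
literal = tree.find("FunctionCall/StringLiteral").text

print(count, tree.find("VarRef").get("name"))
print(literal)
```

Because the edit is made on VarRef elements only, the text $anyOldName inside the string literal is left untouched, which is the case a regular expression over the source string would get wrong.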

4 Options

The functions are controlled by parametric options in a map, with the following permitted entries:

Option   Type        Values        Default  Notes
lang     xs:string   XPath|XQuery  XPath    The language to be parsed (case insensitive)
version  xs:double   2.0|3.0       3.0      The version of the language to be parsed
flatten  xs:boolean                true()   Flatten the parse tree - see 5 Flattening the parse tree
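How an implementation might merge a supplied options map with the defaults can be sketched as follows. This is a Python illustration, not part of the module: resolve_options and InvalidOption are hypothetical names, InvalidOption merely stands in for the xp:invalid-option error, and true() is assumed to be the default for flatten.

```python
# Defaults taken from the options listed above (flatten default is an assumption).
DEFAULTS = {"lang": "XPath", "version": 3.0, "flatten": True}
LANGS = {"xpath": "XPath", "xquery": "XQuery"}  # lang is matched case-insensitively
VERSIONS = (2.0, 3.0)

class InvalidOption(ValueError):
    """Stands in for the xp:invalid-option error of this module."""

def resolve_options(options=None):
    """Return the defaults overridden by `options`, rejecting invalid entries."""
    resolved = dict(DEFAULTS)
    for key, value in (options or {}).items():
        if key not in DEFAULTS:
            raise InvalidOption(f"unrecognised option: {key}")
        if key == "lang":
            if not isinstance(value, str) or value.lower() not in LANGS:
                raise InvalidOption(f"invalid lang: {value!r}")
            resolved[key] = LANGS[value.lower()]
        elif key == "version":
            # bool is a subclass of int in Python, so exclude it explicitly.
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or float(value) not in VERSIONS:
                raise InvalidOption(f"invalid version: {value!r}")
            resolved[key] = float(value)
        else:  # flatten
            if not isinstance(value, bool):
                raise InvalidOption(f"invalid flatten: {value!r}")
            resolved[key] = value
    return resolved

print(resolve_options())
print(resolve_options({"lang": "xquery", "version": 2.0}))
```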

5 Flattening the parse tree

The full parse trees for even the simplest expressions are very deep, consisting of all the nested productions (and the recognised tokens, including whitespace) that have to match to complete the parse. For example, the XPath range expression '1 to 8' produces a complete parse tree with 47 elements of the following form:

<Expr>
  <ExprSingle>
    <OrExpr>
      <AndExpr>
        <ComparisonExpr>
          <RangeExpr>
            <AdditiveExpr>
              <MultiplicativeExpr>
                <UnionExpr>
                  <IntersectExceptExpr>
                    <InstanceofExpr>
                      <TreatExpr>
                        <CastableExpr>
                          <CastExpr>
                            <UnaryExpr>
                              <ValueExpr>
                                <PathExpr>
                                  <RelativePathExpr>
                                    <StepExpr>
                                      <FilterExpr>
                                        <PrimaryExpr>
                                          <Literal>
                                            <NumericLiteral>
                                              <IntegerLiteral>1</IntegerLiteral>
                                            </NumericLiteral>
                                          </Literal>
                                        </PrimaryExpr>
                          ..... close tags ....
               </MultiplicativeExpr>
             </AdditiveExpr>
             <TOKEN>to</TOKEN>
             <AdditiveExpr>
               <MultiplicativeExpr>
                          ... similar ...
                                        <PrimaryExpr>
                                          <Literal>
                                            <NumericLiteral>
                                              <IntegerLiteral>8</IntegerLiteral>
                                            </NumericLiteral>
                                          </Literal>
                                        </PrimaryExpr>
                          ..... close tags ....
               </MultiplicativeExpr>
             </AdditiveExpr>
           </RangeExpr>
         </ComparisonExpr>
       </AndExpr>
     </OrExpr>
   </ExprSingle>
 </Expr>

Whilst these full trees contain all the information and can be manipulated perfectly well 'as is', they i) contain much effective redundancy and ii) are obviously expensive to copy. They also make human inspection difficult, with the essential components (such as the RangeExpr and its effective arguments) buried deep in long narrow trees. To be useful, less verbose forms are needed, provided they still retain all the information that makes the parses of two different expressions distinguishable.

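There are a number of ways of controlling this verbosity without losing information:

  • Collapse chains of nested productions that each have only a single child, keeping only the innermost production.

  • Remove tokens that are required by the enclosing production and therefore carry no additional information.

  • Subsume information-bearing tokens into attributes of the enclosing production.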

As an example, using the first action would reduce the parse expression to four elements:

<RangeExpr>
  <IntegerLiteral>1</IntegerLiteral>
  <TOKEN>to</TOKEN>
  <IntegerLiteral>8</IntegerLiteral>
</RangeExpr>

Applying the second action, since the token 'to' is required for all RangeExpr, would reduce this to the three irreducible elements:

<RangeExpr>
  <IntegerLiteral>1</IntegerLiteral>
  <IntegerLiteral>8</IntegerLiteral>
</RangeExpr>

In both these cases inversion to a text string will yield (subject to whitespace-normalization) the same string as the input.
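The first reduction, collapsing chains of productions that have a single child, can be sketched in Python as below. This is an illustration under assumptions rather than the module's algorithm: collapse and wrap are hypothetical helpers, and the full tree is abbreviated to a handful of the nested productions instead of all 47 elements.

```python
import xml.etree.ElementTree as ET

def collapse(element):
    """Return a reduced copy of `element` with single-child wrapper chains removed."""
    # Skip down through pure wrappers: exactly one child element, no attributes,
    # and no text of their own apart from indentation.
    while len(element) == 1 and not element.attrib and not (element.text or "").strip():
        element = element[0]
    reduced = ET.Element(element.tag, dict(element.attrib))
    if len(element) == 0:
        reduced.text = element.text  # leaf: keep the literal or token text
    for child in element:
        reduced.append(collapse(child))
    return reduced

def wrap(inner, *productions):
    """Nest the XML text `inner` inside the given production names, outermost first."""
    for name in reversed(productions):
        inner = f"<{name}>{inner}</{name}>"
    return inner

def operand(value):
    """An abbreviated version of the chain of productions around one integer."""
    return wrap(f"<IntegerLiteral>{value}</IntegerLiteral>",
                "AdditiveExpr", "MultiplicativeExpr", "UnionExpr",
                "PrimaryExpr", "Literal", "NumericLiteral")

full = wrap(f"<RangeExpr>{operand(1)}<TOKEN>to</TOKEN>{operand(8)}</RangeExpr>",
            "Expr", "ExprSingle", "OrExpr", "AndExpr", "ComparisonExpr")
reduced = collapse(ET.fromstring(full))
print(ET.tostring(reduced, encoding="unicode"))
```

The sketch refuses to collapse an element that carries attributes, since such an element (for example one with a name attribute) holds information even when it has only one child.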

Whilst some tokens are implicit within the production, others bear significant information, especially in cases where the productions contain token alternatives. For example, a reduced parse of 1+5 to 8 could be:

<RangeExpr>
  <AdditiveExpr>
     <IntegerLiteral>1</IntegerLiteral>
     <TOKEN>+</TOKEN>
     <IntegerLiteral>5</IntegerLiteral>
  </AdditiveExpr>
  <TOKEN>to</TOKEN>
  <IntegerLiteral>8</IntegerLiteral>
</RangeExpr>

The token to is redundant (apart from being a required token for recognition of a RangeExpr), but the + certainly is not, as it differentiates between addition and subtraction expressions. The third type of reduction can subsume such information-bearing token values into attributes, typically op="value". For our example the minimal form would be:

<RangeExpr>
  <AdditiveExpr op="+">
     <IntegerLiteral>1</IntegerLiteral>           
     <IntegerLiteral>5</IntegerLiteral>
  </AdditiveExpr>          
  <IntegerLiteral>8</IntegerLiteral>
</RangeExpr>
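The two token reductions (dropping implied tokens and moving information-bearing tokens into an op attribute) can be sketched in Python as follows. The tables IMPLIED_TOKENS and OPERATOR_PRODUCTIONS and the function reduce_tokens are hypothetical, chosen only to cover this example; a real implementation would presumably derive them from the grammar. The sketch only promotes a token when a production has exactly one, so chains such as 1 + 2 - 3 are left alone.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: tokens that a production always requires.
IMPLIED_TOKENS = {"RangeExpr": {"to"}}
# Hypothetical table: productions whose token selects between alternatives.
OPERATOR_PRODUCTIONS = {"AdditiveExpr", "MultiplicativeExpr"}

def reduce_tokens(element):
    """Drop implied TOKEN children and move a single operator token into @op."""
    implied = IMPLIED_TOKENS.get(element.tag, set())
    for token in [c for c in element if c.tag == "TOKEN"]:
        if (token.text or "").strip() in implied:
            element.remove(token)
    remaining = [c for c in element if c.tag == "TOKEN"]
    if element.tag in OPERATOR_PRODUCTIONS and len(remaining) == 1:
        element.set("op", (remaining[0].text or "").strip())
        element.remove(remaining[0])
    for child in element:
        if child.tag != "TOKEN":
            reduce_tokens(child)
    return element

tree = ET.fromstring(
    "<RangeExpr>"
    "<AdditiveExpr><IntegerLiteral>1</IntegerLiteral><TOKEN>+</TOKEN>"
    "<IntegerLiteral>5</IntegerLiteral></AdditiveExpr>"
    "<TOKEN>to</TOKEN><IntegerLiteral>8</IntegerLiteral>"
    "</RangeExpr>"
)
minimal = ET.tostring(reduce_tokens(tree), encoding="unicode")
print(minimal)
```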

Within XPath, Literal productions come in several flavours, each of which ultimately ends with a string serialisation of the literal value, which would be represented as a text node within the XML. An alternative is to collapse the Literal production into a constant form, whose value type is described as an attribute and whose (serialized) value is held in an appropriate attribute value. Hence a double literal can be represented by any of the following three forms:

<Literal>
  <NumericLiteral>
    <DoubleLiteral>3.14159</DoubleLiteral>
  </NumericLiteral>
</Literal>

<DoubleLiteral>3.14159</DoubleLiteral>

<Literal type="xs:double" value="3.14159"/>

The consistent identification of a Literal, as a self-contained single element without children, can make certain forms of manipulation much more concise, avoiding the use of unions.
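A sketch of normalising all three forms to the constant form follows, in Python and for numeric literals only. The function name to_constant is hypothetical; the type names for integer and double literals follow the examples in this document, and xs:decimal for DecimalLiteral follows the XPath definition of that production.

```python
import xml.etree.ElementTree as ET

# Production name -> atomic type of the literal it denotes.
LITERAL_TYPES = {
    "IntegerLiteral": "xs:integer",
    "DecimalLiteral": "xs:decimal",
    "DoubleLiteral": "xs:double",
}

def to_constant(element):
    """Normalise a numeric literal to <Literal type=".." value=".."/>, else None."""
    node = element
    # Nested form: unwrap the Literal and NumericLiteral wrappers.
    while node.tag in ("Literal", "NumericLiteral") and len(node) == 1:
        node = node[0]
    # Constant form: already collapsed, return a copy.
    if node.tag == "Literal" and "type" in node.attrib and "value" in node.attrib:
        return ET.Element("Literal", dict(node.attrib))
    # Bare form: a single typed literal element holding the value as text.
    xs_type = LITERAL_TYPES.get(node.tag)
    if xs_type is None:
        return None
    return ET.Element("Literal", {"type": xs_type, "value": (node.text or "").strip()})

forms = [
    "<Literal><NumericLiteral><DoubleLiteral>3.14159</DoubleLiteral></NumericLiteral></Literal>",
    "<DoubleLiteral>3.14159</DoubleLiteral>",
    '<Literal type="xs:double" value="3.14159"/>',
]
constants = [to_constant(ET.fromstring(form)) for form in forms]
for constant in constants:
    print(constant.tag, constant.attrib)
```

With every literal in this one shape, code that manipulates the tree can match a single element name instead of a union of literal productions.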

6 Extensibility & Enquiry

The languages being parsed are not static constructs and are being updated and extended, albeit in a highly controlled, rigorous and defined manner. As such we should anticipate support for as-yet-unknown versions of the languages to be required.

There will need to be some mechanisms to be able to query what languages and versions can be parsed from a given implementation.

Equally well the framework might be capable of supporting parsing for other (user-supplied) languages, whose grammars are defined in the EBNF form used within the XML standards world. This needs a great deal of thought and probably will require the generation of parsing functions from the grammar for efficiency.

7 Parsing expressions

The most common use is generating a parse tree from an expression. The following function supports this:

7.1 xp:parse

Summary

The xp:parse function returns a parse tree for an expression.

Signatures

xp:parse($in as xs:string) as element()
xp:parse($in as xs:string, $options as map(*)) as element()

Rules

Returns a parse-tree for the XPath/XQuery expression given in $in.

$options is a map of control options as described in 4 Options.

In the absence of any options, the default values are used.

Error Conditions

[xp:parsing-error] is raised if $in cannot be parsed successfully against the given grammar. It is implementation (and language) dependent as to what and how further information is made available about this failure.

[xp:invalid-option] is raised if one or more of the option values within $options is invalid or unrecognised.

Notes

Some notes

Examples

Parsing a simple range expression, with options to flatten single trees and remove unnecessary tokens:

xp:parse('1 to 8',map{'flatten':true(),'version':2.0})
==>
 <RangeExpr>
   <Literal type="xs:integer" value="1"/>
   <Literal type="xs:integer" value="8"/>
 </RangeExpr>

8 Generating expressions

The reverse operation generates an expression from a (possibly modified) parse tree. The following function supports this:

8.1 xp:to-expression

Summary

The xp:to-expression function returns the expression string corresponding to a given parse tree.

Signatures

xp:to-expression($in as element()) as xs:string
xp:to-expression($in as element(), $options as map(*)) as xs:string

Rules

Returns a string for the XPath/XQuery expression tree given in $in.

$options is a map of control options as described in 4 Options.

In the absence of any options, the default values are used.

Error Conditions

[xp:invalid-tree] is raised if $in is not a valid expression tree, or suitable reduced subtree, for the grammar requested. It is implementation (and language) dependent as to what and how further information is made available about this failure.

[xp:invalid-option] is raised if one or more of the option values within $options is invalid or unrecognised.

Notes

Some notes

Examples

Serialising a reduced parse tree for a range expression back to a string:

<xsl:variable name="in">
  <RangeExpr>
    <AdditiveExpr op="-">
      <Literal type="xs:integer" value="4"/>            
      <Literal type="xs:integer" value="2"/>
    </AdditiveExpr>
    <Literal type="xs:integer" value="8"/>
  </RangeExpr>
</xsl:variable>
<xsl:value-of select="xp:to-expression($in/*)"/>
 ==> "4 - 2 to 8"
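The direction of this function can be illustrated with a small Python sketch that serialises the reduced tree from the example. It is deliberately limited and makes several assumptions: it handles only RangeExpr, AdditiveExpr and numeric Literal elements in their minimal form, it inserts no parentheses, and InvalidTree is a hypothetical stand-in for the xp:invalid-tree error.

```python
import xml.etree.ElementTree as ET

class InvalidTree(ValueError):
    """Stands in for the xp:invalid-tree error of this module."""

NUMERIC_TYPES = {"xs:integer", "xs:decimal", "xs:double"}

def to_expression(node):
    """Serialise a minimal-form parse tree (a small subset only) to a string."""
    if node.tag == "Literal":
        if node.get("type") not in NUMERIC_TYPES or node.get("value") is None:
            raise InvalidTree("unsupported literal")
        return node.get("value")
    operands = [to_expression(child) for child in node]
    if node.tag == "RangeExpr" and len(operands) == 2:
        # The 'to' token was dropped as implied, so it is restored here.
        return " to ".join(operands)
    if node.tag == "AdditiveExpr" and node.get("op") in ("+", "-") and len(operands) == 2:
        return f" {node.get('op')} ".join(operands)
    raise InvalidTree(f"cannot serialise <{node.tag}>")

TREE = """
<RangeExpr>
  <AdditiveExpr op="-">
    <Literal type="xs:integer" value="4"/>
    <Literal type="xs:integer" value="2"/>
  </AdditiveExpr>
  <Literal type="xs:integer" value="8"/>
</RangeExpr>
"""
expression = to_expression(ET.fromstring(TREE))
print(expression)  # prints: 4 - 2 to 8
```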

A References

XPath1.0
XML Path Language (XPath) Version 1.0. James Clark, Steve DeRose, Editors. World Wide Web Consortium, 16 November 1999.
XPath2.0
XML Path Language (XPath) 2.0 (Second Edition). Don Chamberlin, Jonathan Robie, Anders Berglund, Scott Boag, et al., Editors. World Wide Web Consortium, 14 December 2010.
XPath3.0
XML Path Language (XPath) 3.0. Jonathan Robie, Don Chamberlin, Michael Dyck, John Snelson, Editors. World Wide Web Consortium, 08 April 2014.
XQuery1.0
XQuery 1.0: An XML Query Language (Second Edition). Don Chamberlin, Jonathan Robie, Anders Berglund, Scott Boag, et al., Editors. World Wide Web Consortium, 14 December 2010.
XQuery3.0
XQuery 3.0: An XML Query Language. Jonathan Robie, Don Chamberlin, Michael Dyck, John Snelson, Editors. World Wide Web Consortium, 08 April 2014.
IEEE 754-1985
IEEE Standard for Binary Floating-Point Arithmetic. See http://standards.ieee.org/reading/ieee/std_public/description/busarch/754-1985_desc.html
QT3
XML Query Test Suite. W3C 21 June 2013.
Test-suite
The test suite for this module, using QT3 format, is in the EXPath repository http://github.com/expath/expath-cg in the directory tests/qt3/xparse/
XML Schema 1.1 Part 2
W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. David Peterson et al, editors. W3C Recommendation 5 April 2012.

B Summary of error conditions

xp:invalid-option
A value for one of the supplied options is invalid.
xp:invalid-tree
The input expression tree is invalid according to the supplied options.
xp:parsing-error
There was an error in parsing the expression from a string.