XParse Module

1 Status of this document

This document is in an early draft stage. Comments are welcome on the public-expath@w3.org mailing list.

2 Introduction

This module is designed to specify a small library of functions for parsing expressions in XPath ([XPath3.0]) or XQuery ([XQuery3.0]) into XML trees, and in the reverse direction serialising these trees into equivalent string expressions.

2.1 Supported languages

Languages are supported by this API definition in two senses:

The languages whose expressions can be parsed and serialised. Support is expected to be defined for various (all?) versions of the following language families: XPath ([XPath1.0], [XPath2.0], [XPath3.0]) and XQuery ([XQuery1.0], [XQuery3.0]).
It is anticipated that implementors will use tools to generate portions of code semi-automatically from EBNF descriptions of the language grammars, and thus the suite of supported languages, especially in terms of succeeding versions, will be extensible. For further details see 6 Extensibility & Enquiry.
The languages within which the functions can be executed. At present options are described using a map, which restricts use to XPath 3.0+, XQuery 3.0+ and XSLT 3.0+.

2.2 Namespace conventions

The module defined by this document defines several functions, all contained in the namespace http://expath.org/ns/xparse. In this document, the xp prefix, when used, is bound to this namespace URI.

Error codes are defined in the same namespace (http://expath.org/ns/xparse), and in this document are displayed with the same prefix, xp.

2.3 Error management

There are two principal types of error: those resulting from erroneous arguments or usage, and those resulting from invalid primary inputs (e.g. an invalid XPath expression). Both types of failure generate an error identified by a code (a QName). When such an error condition is reached in the evaluation of an expression, a dynamic error is thrown with the corresponding error code (as if the standard XPath function error() had been called). An implementation may provide additional information about the exact nature of the error, which can, for example, be processed in an XSLT 3.0 try/catch construct.

2.4 Parse trees

Parse trees within this module are represented as XML trees, with element names corresponding to the EBNF grammar productions for the version of XPath/XQuery being parsed. For example:

<AdditiveExpr>
   <IntegerLiteral>1</IntegerLiteral>
   <TOKEN>+</TOKEN>
   <IntegerLiteral>5</IntegerLiteral>
</AdditiveExpr>

is a subtree corresponding to the expression fragment 1 + 5.

Currently these trees are in the 'null' namespace, but this might be set parametrically.

2.5 Test suite

A suite of test-cases for all the functions defined in this module, in [QT3] format, is defined at [Test-suite].

3 Use cases

Development of this specification was driven by requirements that some XML developers encounter when examining, modifying or generating XSLT or XQuery programs, where XPath and XQuery expressions need to be parsed or generated. Some typical use cases include:

Refactoring variable names whose references are buried in deep and complex XPath expressions where the use of regular expressions can be hazardous.
Analysing complex properties, such as static type or streamability, independently of a specific compiler.

3.1 Example – Refactoring a variable name

Within XPath and XQuery, variable references take the form $name. In simple cases, where there are no local variable definitions and the character $ does not appear in string literals, regular expressions can extract or modify such references relatively cheaply, but in general the change must be made on a suitable parse tree.

Suppose an external variable $anyOldName is being renamed to $someNewVariableName. If we assume that this name is not overridden in the XPath, then the following XSLT code should suffice:

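<!-- Copy all other nodes unchanged (XSLT 3.0 identity behaviour). -->
<xsl:mode on-no-match="shallow-copy"/>

<xsl:template match="VarRef/@name">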
  <xsl:param name="old" as="xs:string" tunnel="yes"/>
  <xsl:param name="new" as="xs:string" tunnel="yes"/>
  <xsl:attribute name="name" select="if(. = $old) then $new else ."/>
</xsl:template>

<xsl:variable name="expr">$anyOldName to string-length('$anyOldName is a variable')</xsl:variable>
<xsl:variable name="parse" select="xp:parse($expr, map{})"/>
 ==>
   <RangeExpr>
      <VarRef name="anyOldName"/>
      <FunctionCall name="string-length">
        <StringLiteral>$anyOldName is a variable</StringLiteral>
      </FunctionCall>
   </RangeExpr>
   
<xsl:variable name="replace">
  <xsl:apply-templates select="$parse">
    <xsl:with-param name="old" as="xs:string" tunnel="yes">anyOldName</xsl:with-param>
    <xsl:with-param name="new" as="xs:string" tunnel="yes">someNewVariableName</xsl:with-param>
  </xsl:apply-templates> 
</xsl:variable>
 ==>
   <RangeExpr>
      <VarRef name="someNewVariableName"/>
      <FunctionCall name="string-length">
        <StringLiteral>$anyOldName is a variable</StringLiteral>
      </FunctionCall>
   </RangeExpr>
 
<xsl:variable name="new.expr" select="xp:to-expression($replace/*)"/>
 ==> "$someNewVariableName to string-length('$anyOldName is a variable')"

If local variable bindings are present (introduced by let, for or quantified expressions), the tunnel parameters can be updated within the templates that process those binding constructs, so that variable reference scoping is respected.
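The renaming step itself is independent of XSLT. As a minimal sketch, assuming the reduced parse tree shown above is already available as XML and that no local binding shadows the name, the same edit can be expressed with Python's standard xml.etree.ElementTree. The helper name rename_variable is hypothetical and not part of this module.

```python
import copy
import xml.etree.ElementTree as ET

def rename_variable(tree, old, new):
    """Return a copy of `tree` with every VarRef named `old` renamed to `new`.

    Assumption: no let/for/quantified binding re-declares `old`,
    so every matching VarRef refers to the external variable.
    """
    result = copy.deepcopy(tree)  # leave the caller's tree untouched
    for ref in result.iter("VarRef"):
        if ref.get("name") == old:
            ref.set("name", new)
    return result

# The reduced parse tree from the example above.
parse = ET.fromstring(
    "<RangeExpr>"
    "<VarRef name='anyOldName'/>"
    "<FunctionCall name='string-length'>"
    "<StringLiteral>$anyOldName is a variable</StringLiteral>"
    "</FunctionCall>"
    "</RangeExpr>"
)
renamed = rename_variable(parse, "anyOldName", "someNewVariableName")
print(ET.tostring(renamed, encoding="unicode"))
```

Because the edit targets VarRef elements only, the text inside the StringLiteral is left alone, which is exactly the case a regular expression would get wrong.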

4 Options

The functions are controlled by parametric options in a map, with the following permitted entries:

Option	Type	Values	Default	Notes
lang	xs:string	XPath|XQuery	XPath	The language to be parsed (case insensitive)
version	xs:double	2.0|3.0	3.0	The version of the language to be parsed
flatten	xs:boolean	true()|false()	true()	Flatten the parse tree - see 5 Flattening the parse tree
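How an implementation might merge a caller's map with these defaults can be sketched as follows. This is only an illustration under stated assumptions: the function resolve_options and the exception class InvalidOption are hypothetical names, with the exception standing in for the xp:invalid-option error, and Python values stand in for the XPath types.

```python
# Defaults and permitted values taken from the options table above.
DEFAULTS = {"lang": "XPath", "version": 3.0, "flatten": True}
LANGS = {"xpath": "XPath", "xquery": "XQuery"}  # lang is case insensitive
VERSIONS = {2.0, 3.0}

class InvalidOption(ValueError):
    """Hypothetical stand-in for the xp:invalid-option error."""

def resolve_options(options=None):
    """Merge caller options over the defaults, rejecting bad entries."""
    resolved = dict(DEFAULTS)
    for key, value in (options or {}).items():
        if key not in DEFAULTS:
            raise InvalidOption(f"unrecognised option: {key!r}")
        if key == "lang":
            if not isinstance(value, str) or value.lower() not in LANGS:
                raise InvalidOption(f"invalid lang: {value!r}")
            value = LANGS[value.lower()]  # normalise the spelling
        elif key == "version":
            numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not numeric or value not in VERSIONS:
                raise InvalidOption(f"invalid version: {value!r}")
            value = float(value)
        elif key == "flatten":
            if not isinstance(value, bool):
                raise InvalidOption(f"invalid flatten: {value!r}")
        resolved[key] = value
    return resolved

print(resolve_options({"lang": "xquery", "version": 2.0}))
```

Normalising lang once at this point means later code can compare against the canonical spellings only.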

5 Flattening the parse tree

The full parse trees for even the simplest expressions are very deep, consisting of all the nested productions (and the recognised tokens, including whitespace) that have to match to complete the parse. For example, the XPath range expression '1 to 8' produces a complete parse tree with 47 elements of the following form:

<Expr>
  <ExprSingle>
    <OrExpr>
      <AndExpr>
        <ComparisonExpr>
          <RangeExpr>
            <AdditiveExpr>
              <MultiplicativeExpr>
                <UnionExpr>
                  <IntersectExceptExpr>
                    <InstanceofExpr>
                      <TreatExpr>
                        <CastableExpr>
                          <CastExpr>
                            <UnaryExpr>
                              <ValueExpr>
                                <PathExpr>
                                  <RelativePathExpr>
                                    <StepExpr>
                                      <FilterExpr>
                                        <PrimaryExpr>
                                          <Literal>
                                            <NumericLiteral>
                                              <IntegerLiteral>1</IntegerLiteral>
                                            </NumericLiteral>
                                          </Literal>
                                        </PrimaryExpr>
                          ..... close tags ....
               </MultiplicativeExpr>
             </AdditiveExpr>
             <TOKEN>to</TOKEN>
             <AdditiveExpr>
               <MultiplicativeExpr>
                          ... similar ...
                                        <PrimaryExpr>
                                          <Literal>
                                            <NumericLiteral>
                                              <IntegerLiteral>8</IntegerLiteral>
                                            </NumericLiteral>
                                          </Literal>
                                        </PrimaryExpr>
                          ..... close tags ....
               </MultiplicativeExpr>
             </AdditiveExpr>
           </RangeExpr>
         </ComparisonExpr>
       </AndExpr>
     </OrExpr>
   </ExprSingle>
 </Expr>

Whilst these full trees contain all the information and can be manipulated perfectly well 'as is', they i) contain much redundancy and ii) are expensive to copy. They also make human inspection difficult, with the essential components (such as the RangeExpr and its effective arguments) buried deep in long narrow trees. Less verbose forms would be more useful, provided they retain all the information needed to distinguish the parses of two different expressions.

There are a number of ways of controlling this verbosity without losing information:

Replacing each element that has just a single element child with that child (or with the result of reducing that child)
Removing redundant tokens, i.e. tokens whose presence is implicit in the production
Collapsing significant tokens into attributes
Specialist treatment of literals

As an example, using the first action would reduce the parse tree to four elements:

<RangeExpr>
  <IntegerLiteral>1</IntegerLiteral>
  <TOKEN>to</TOKEN>
  <IntegerLiteral>8</IntegerLiteral>
</RangeExpr>

Applying the second action, since the token 'to' is required for all RangeExpr, would reduce this to the three irreducible elements:

<RangeExpr>
  <IntegerLiteral>1</IntegerLiteral>
  <IntegerLiteral>8</IntegerLiteral>
</RangeExpr>

In both these cases inversion to a text string will yield (subject to whitespace-normalization) the same string as the input.
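The first two actions can be sketched in a few lines of Python. This is a minimal illustration, not a definitive algorithm: the function reduce_tree and the table IMPLICIT_TOKENS are hypothetical, the table lists only the one production discussed here, and the input is an abbreviated stand-in for the full 47-element tree.

```python
import xml.etree.ElementTree as ET

# Assumed table: tokens that are implied by the production itself.
IMPLICIT_TOKENS = {"RangeExpr": {"to"}}

def reduce_tree(elem, drop_implicit=True):
    """Collapse single-child chains; optionally drop implicit tokens."""
    children = [reduce_tree(child, drop_implicit) for child in elem]
    if len(children) == 1:
        return children[0]            # action 1: replace element by its only child
    out = ET.Element(elem.tag, dict(elem.attrib))
    if not children:
        out.text = elem.text          # leaf: keep the literal or token text
        return out
    implicit = IMPLICIT_TOKENS.get(elem.tag, set()) if drop_implicit else set()
    for child in children:
        if child.tag == "TOKEN" and child.text in implicit:
            continue                  # action 2: remove a redundant token
        out.append(child)
    return out

# Abbreviated full tree for '1 to 8' (most intermediate productions omitted).
full = ET.fromstring(
    "<Expr><ExprSingle><RangeExpr>"
    "<AdditiveExpr><MultiplicativeExpr><IntegerLiteral>1</IntegerLiteral>"
    "</MultiplicativeExpr></AdditiveExpr>"
    "<TOKEN>to</TOKEN>"
    "<AdditiveExpr><MultiplicativeExpr><IntegerLiteral>8</IntegerLiteral>"
    "</MultiplicativeExpr></AdditiveExpr>"
    "</RangeExpr></ExprSingle></Expr>"
)
four = reduce_tree(full, drop_implicit=False)   # first action only
three = reduce_tree(full)                       # first and second actions
print(ET.tostring(four, encoding="unicode"))
print(ET.tostring(three, encoding="unicode"))
```

The sketch builds new elements rather than editing in place, so the full tree remains available for comparison.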

Whilst some tokens are implicit within the production, others bear significant information, especially in cases where the productions contain token alternatives. For example, a reduced parse of 1+5 to 8 could be:

<RangeExpr>
  <AdditiveExpr>
     <IntegerLiteral>1</IntegerLiteral>
     <TOKEN>+</TOKEN>
     <IntegerLiteral>5</IntegerLiteral>
  </AdditiveExpr>
  <TOKEN>to</TOKEN>
  <IntegerLiteral>8</IntegerLiteral>
</RangeExpr>

The token to is redundant (apart from being a required token for recognition of a RangeExpr), but the + certainly is not, as it differentiates between addition and subtraction expressions. The third type of reduction can subsume such information-bearing token values into attributes, typically op="value". For our example the minimal form would be:

<RangeExpr>
  <AdditiveExpr op="+">
     <IntegerLiteral>1</IntegerLiteral>           
     <IntegerLiteral>5</IntegerLiteral>
  </AdditiveExpr>          
  <IntegerLiteral>8</IntegerLiteral>
</RangeExpr>
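The third action can be sketched in the same style. The function fold_tokens and both lookup tables are hypothetical, and the sketch assumes each operator production carries a single operator token; a chain such as 1 + 2 - 3 would need a nested or repeated representation that this draft does not yet define.

```python
import xml.etree.ElementTree as ET

# Assumed tables, limited to the productions used in this section.
OPERATOR_PRODUCTIONS = {"AdditiveExpr", "MultiplicativeExpr"}
IMPLICIT_TOKENS = {"RangeExpr": {"to"}}

def fold_tokens(elem):
    """Move information-bearing tokens into an op attribute."""
    out = ET.Element(elem.tag, dict(elem.attrib))
    if len(elem) == 0:
        out.text = elem.text                  # leaf: keep its text
    for child in elem:
        if child.tag == "TOKEN":
            if elem.tag in OPERATOR_PRODUCTIONS:
                out.set("op", child.text)     # significant token becomes op="..."
                continue
            if child.text in IMPLICIT_TOKENS.get(elem.tag, ()):
                continue                      # implicit token is dropped
        out.append(fold_tokens(child))
    return out

# The reduced parse of 1+5 to 8 shown above.
reduced = ET.fromstring(
    "<RangeExpr>"
    "<AdditiveExpr><IntegerLiteral>1</IntegerLiteral><TOKEN>+</TOKEN>"
    "<IntegerLiteral>5</IntegerLiteral></AdditiveExpr>"
    "<TOKEN>to</TOKEN>"
    "<IntegerLiteral>8</IntegerLiteral>"
    "</RangeExpr>"
)
minimal = fold_tokens(reduced)
print(ET.tostring(minimal, encoding="unicode"))
```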

Within XPath, Literal productions come in several flavours, each of which ultimately ends with a string serialisation of the literal value, which would be represented as a text node within the XML. An alternative is to collapse the Literal production into a constant form, whose value type is described by one attribute and whose (serialised) value is held in another. Hence a double literal can be represented by any of the following three forms:

<Literal>
  <NumericLiteral>
    <DoubleLiteral>3.14159</DoubleLiteral>
  </NumericLiteral>
</Literal>

<DoubleLiteral>3.14159</DoubleLiteral>

<Literal type="xs:double" value="3.14159"/>

The consistent identification of a Literal, as a self-contained single element without children, can make certain forms of manipulation much more concise, avoiding the use of unions.
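A sketch of the literal collapse follows, again in Python and again under assumptions: collapse_literals is a hypothetical helper, and the mapping from production names to type names covers only the literal productions named in the XPath grammar.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from literal productions to the type attribute value.
LITERAL_TYPES = {
    "IntegerLiteral": "xs:integer",
    "DecimalLiteral": "xs:decimal",
    "DoubleLiteral": "xs:double",
    "StringLiteral": "xs:string",
}

def collapse_literals(elem):
    """Rewrite every literal as a childless <Literal type=... value=.../>."""
    if elem.tag in LITERAL_TYPES:
        attrs = {"type": LITERAL_TYPES[elem.tag], "value": elem.text or ""}
        return ET.Element("Literal", attrs)
    if elem.tag in ("Literal", "NumericLiteral") and len(elem) == 1:
        return collapse_literals(elem[0])     # unwrap the wrapper productions
    out = ET.Element(elem.tag, dict(elem.attrib))
    if len(elem) == 0:
        out.text = elem.text
    for child in elem:
        out.append(collapse_literals(child))
    return out

tree = ET.fromstring(
    "<RangeExpr>"
    "<Literal><NumericLiteral><IntegerLiteral>1</IntegerLiteral>"
    "</NumericLiteral></Literal>"
    "<IntegerLiteral>8</IntegerLiteral>"
    "</RangeExpr>"
)
collapsed = collapse_literals(tree)
print(ET.tostring(collapsed, encoding="unicode"))
```

After the collapse, code that needs "any literal" can match the single element name Literal instead of a union of production names.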

6 Extensibility & Enquiry

The languages being parsed are not static constructs and are being updated and extended, albeit in a highly controlled, rigorous and defined manner. As such we should anticipate support for as-yet-unknown versions of the languages to be required.

A mechanism will be needed to query which languages and versions a given implementation can parse.

Equally well the framework might be capable of supporting parsing for other (user-supplied) languages, whose grammars are defined in the EBNF form used within the XML standards world. This needs a great deal of thought and probably will require the generation of parsing functions from the grammar for efficiency.

7 Parsing expressions

The most common use will be generating a parse tree from an expression. The following functions support this:

7.1 xp:parse

Summary

The xp:parse function returns a parse tree for an expression.

Signatures

xp:parse($in as xs:string) as element()

xp:parse($in as xs:string, $options as map(*)) as element()

Rules

Returns a parse-tree for the XPath/XQuery expression given in $in.

$options is a map of control options as described in 4 Options.

In the absence of any options, the default values are used.

Error Conditions

[xp:parsing-error] is raised if $in cannot be parsed successfully against the given grammar. It is implementation (and language) dependent as to what and how further information is made available about this failure.

[xp:invalid-option] is raised if one or more of the option values within $options is invalid or unrecognised.

Notes

Some notes

Examples

Parsing a simple range expression, with options to flatten single trees and remove unnecessary tokens:

xp:parse('1 to 8',map{'flatten':true(),'version':2.0})
==>
 <RangeExpr>
   <Literal type="xs:integer" value="1"/>
   <Literal type="xs:integer" value="8"/>
 </RangeExpr>
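To make the shape of the result concrete, the following toy parser produces the same reduced tree for a tiny subset of XPath (integer literals, + and -, and to). It is a hand-written sketch for illustration only: parse_sketch and ParsingError are hypothetical names, the latter standing in for xp:parsing-error, and a real implementation would be generated from the full EBNF grammar.

```python
import re
import xml.etree.ElementTree as ET

TOKEN_RE = re.compile(r"\s*(\d+|to\b|[+-])")

class ParsingError(ValueError):
    """Hypothetical stand-in for the xp:parsing-error error."""

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ParsingError(f"unexpected input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def literal(tokens):
    if not tokens or not tokens[0].isdigit():
        raise ParsingError("integer literal expected")
    return ET.Element("Literal", {"type": "xs:integer", "value": tokens.pop(0)})

def additive(tokens):
    left = literal(tokens)
    while tokens and tokens[0] in ("+", "-"):
        node = ET.Element("AdditiveExpr", {"op": tokens.pop(0)})
        node.extend([left, literal(tokens)])   # left-associative nesting
        left = node
    return left

def parse_sketch(text):
    """Parse the toy subset straight into the flattened tree form."""
    tokens = tokenize(text)
    tree = additive(tokens)
    if tokens and tokens[0] == "to":
        tokens.pop(0)
        node = ET.Element("RangeExpr")
        node.extend([tree, additive(tokens)])
        tree = node
    if tokens:
        raise ParsingError(f"unexpected token {tokens[0]!r}")
    return tree

print(ET.tostring(parse_sketch("1 to 8"), encoding="unicode"))
```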

8 Generating expressions

The reverse operation generates an expression from a (possibly modified) parse tree. The following functions support this:

8.1 xp:to-expression

Summary

The xp:to-expression function returns an expression string corresponding to a given parse tree.

Signatures

xp:to-expression($in as element()) as xs:string

xp:to-expression($in as element(), $options as map(*)) as xs:string

Rules

Returns a string for the XPath/XQuery expression tree given in $in.

$options is a map of control options as described in 4 Options.

In the absence of any options, the default values are used.

Error Conditions

[xp:invalid-tree] is raised if $in is not a valid expression tree, or a suitably reduced subtree, for the grammar requested. It is implementation (and language) dependent as to what and how further information is made available about this failure.

[xp:invalid-option] is raised if one or more of the option values within $options is invalid or unrecognised.

Notes

Some notes

Examples

Serialising a reduced parse tree for a range expression whose lower bound is a subtraction:

<xsl:variable name="in">
  <RangeExpr>
    <AdditiveExpr op="-">
      <Literal type="xs:integer" value="4"/>            
      <Literal type="xs:integer" value="2"/>
    </AdditiveExpr>
    <Literal type="xs:integer" value="8"/>
  </RangeExpr>
</xsl:variable>
<xsl:value-of select="xp:to-expression($in/*)"/>
 ==> "4 - 2 to 8"
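The inverse direction can be sketched for the same handful of productions. The function to_expression_sketch and the exception InvalidTree are hypothetical, the latter standing in for xp:invalid-tree. The sketch relies on the fact that an AdditiveExpr binds more tightly than a RangeExpr, so no parentheses are needed for the trees it accepts.

```python
import xml.etree.ElementTree as ET

class InvalidTree(ValueError):
    """Hypothetical stand-in for the xp:invalid-tree error."""

def to_expression_sketch(elem):
    """Serialise a reduced tree built from the productions used in this section."""
    if elem.tag == "Literal":
        value = elem.get("value")
        if value is None:
            raise InvalidTree("Literal without a value attribute")
        if elem.get("type") == "xs:string":
            return "'" + value.replace("'", "''") + "'"   # XPath quote escaping
        return value
    if elem.tag == "VarRef":
        return "$" + elem.get("name", "")
    operands = [to_expression_sketch(child) for child in elem]
    if elem.tag == "RangeExpr" and len(operands) == 2:
        return " to ".join(operands)          # restore the implicit token
    if elem.tag == "AdditiveExpr" and len(operands) == 2 and elem.get("op") in ("+", "-"):
        return f" {elem.get('op')} ".join(operands)
    raise InvalidTree(f"unsupported or malformed element: {elem.tag}")

tree = ET.fromstring(
    "<RangeExpr>"
    "<AdditiveExpr op='-'>"
    "<Literal type='xs:integer' value='4'/>"
    "<Literal type='xs:integer' value='2'/>"
    "</AdditiveExpr>"
    "<Literal type='xs:integer' value='8'/>"
    "</RangeExpr>"
)
print(to_expression_sketch(tree))  # prints: 4 - 2 to 8
```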

EXPath Candidate Module 8 December 2014
