This document is also available in these non-normative formats: XML and Revision Markup.
Copyright © 2014 John Lumley, published by the EXPath Community Group under the W3C Community Final Specification Agreement (FSA). A human-readable summary is available.
This specification was published by the EXPath Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Final Specification Agreement (FSA) other conditions apply. Learn more about W3C Community and Business Groups.
This proposal provides an API for parsing XPath and XQuery expressions to a standardised parse tree, and the reversal of the process. It has been designed to be compatible with XQuery 3.0+ and XSLT 3.0+, as well as any other XPath 3.0 usage.
The module homepage, with more information, is on the EXPath website at http://expath.org/modules/xparse/.
1 Status of this document
2 Introduction
2.1 Supported languages
2.2 Namespace conventions
2.3 Error management
2.4 Parse trees
2.5 Test suite
3 Use cases
3.1 Example – Refactoring a variable name
4 Options
5 Flattening the parse tree
6 Extensibility & Enquiry
7 Parsing expressions
7.1 xp:parse
8 Generating expressions
8.1 xp:to-expression
This document is in an early draft stage. Comments are welcome on the public-expath@w3.org mailing list (archive).
This module is designed to specify a small library of functions for parsing expressions in XPath ([XPath3.0]) or XQuery ([XQuery3.0]) into XML trees, and in the reverse direction serialising these trees into equivalent string expressions.
There are two senses in which languages are supported by this API definition:
The languages for which expression data is supported. In this case it is expected to be defined for various (all?) versions of the following language families: XPath ([XPath1.0],[XPath2.0],[XPath3.0]) and XQuery ([XQuery1.0],[XQuery3.0]).
It is anticipated that implementors will use tools to generate portions of code semi-automatically from EBNF descriptions of the language grammars, and thus the suite of supported languages, especially in terms of succeeding versions, will be extensible. For further details see 6 Extensibility & Enquiry.
The languages within which the functions can be executed. At present the model of describing options uses a map, which restricts its use to XPath 3.0+, XQuery 3.0+ and XSLT 3.0+.
The module defined by this document defines several functions, all contained in the namespace http://expath.org/ns/xparse. In this document, the xp prefix, when used, is bound to this namespace URI.
Error codes are defined in the same namespace (http://expath.org/ns/xparse), and in this document are displayed with the same prefix, xp.
There are two principal types of error: those resulting from erroneous arguments or usage, and those resulting from invalid primary inputs (e.g. an invalid XPath expression). Both types of failure will generate an error identified by a code (a QName).
When such an error condition is reached in the evaluation of an expression, a dynamic error is thrown with the corresponding error code (as if the standard XPath function error() had been called). An implementation may provide additional information about the exact nature of the error, which can for example be processed in an XSLT 3.0 try/catch construct.
Parse trees within this module are represented as XML trees, with element names corresponding to the EBNF grammar productions for the version of XPath/XQuery being parsed. For example:
    <AdditiveExpr>
      <IntegerLiteral>1</IntegerLiteral>
      <TOKEN>+</TOKEN>
      <IntegerLiteral>5</IntegerLiteral>
    </AdditiveExpr>

is a subtree corresponding to the string section 1 + 5.
Currently these trees are in the 'null' namespace, but this might be set parametrically.
A suite of test-cases for all the functions defined in this module, in [QT3] format, is defined at [Test-suite].
Development of this specification was driven by requirements which some XML developers encounter in examining, modifying or generating XSLT or XQuery programs, when XPath and XQuery expressions need to be parsed or generated. Some typical use cases include:
Refactoring variable names whose references are buried in deep and complex XPath expressions where the use of regular expressions can be hazardous.
Analysing complex properties, such as static type or streamability, independently of a specific compiler.
Within XPath and XQuery, variable references are denoted through interpolations of the $name form. In simple cases, without local variable definitions or the character $ present in string literals, regular expressions can be used to extract and/or modify such variable references relatively cheaply, but in general this must be performed on a suitable parse tree.
Suppose an external variable $anyOldName is being renamed to $someNewVariableName. If we assume that this name is not overridden in the XPath, then the following XSLT code should suffice:
    <xsl:template match="VarRef/@name">
      <xsl:param name="old" as="xs:string" tunnel="yes"/>
      <xsl:param name="new" as="xs:string" tunnel="yes"/>
      <xsl:attribute name="name" select="if (. = $old) then $new else ."/>
    </xsl:template>

    <xsl:variable name="expr">$anyOldName to string-length('$anyOldName is a variable')</xsl:variable>
    <xsl:variable name="parse" select="xp:parse($expr, map{})"/>
    ==>
    <RangeExpr>
      <VarRef name="anyOldName"/>
      <FunctionCall name="string-length">
        <StringLiteral>$anyOldName is a variable</StringLiteral>
      </FunctionCall>
    </RangeExpr>

    <xsl:variable name="replace">
      <xsl:apply-templates select="$parse">
        <xsl:with-param name="old" as="xs:string" tunnel="yes">anyOldName</xsl:with-param>
        <xsl:with-param name="new" as="xs:string" tunnel="yes">someNewVariableName</xsl:with-param>
      </xsl:apply-templates>
    </xsl:variable>
    ==>
    <RangeExpr>
      <VarRef name="someNewVariableName"/>
      <FunctionCall name="string-length">
        <StringLiteral>$anyOldName is a variable</StringLiteral>
      </FunctionCall>
    </RangeExpr>

    <xsl:variable name="new.expr" select="xp:to-expression($replace)"/>
    ==>
    "$someNewVariableName to string-length('$anyOldName is a variable')"
If local variable bindings are present (using let or for or quantified expressions) then suitable updating of the replacement parameters within push templates can support correct use of variable reference scoping.
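The renaming rule does not depend on XSLT. As a rough sketch only, the following Python (standard library xml.etree.ElementTree) applies the same rule to the reduced parse tree from the example. The helper name rename_variable is hypothetical and not part of this module, and like the simple template it assumes the name is not rebound locally.

```python
import xml.etree.ElementTree as ET


def rename_variable(tree: ET.Element, old: str, new: str) -> ET.Element:
    """Rename every VarRef whose name attribute equals `old`.

    Hypothetical helper: no scoping analysis is attempted, so local
    bindings (let, for, quantified expressions) are not respected.
    The tree is modified in place and returned for convenience.
    """
    for ref in tree.iter("VarRef"):
        if ref.get("name") == old:
            ref.set("name", new)
    return tree


# Reduced parse tree for:
#   $anyOldName to string-length('$anyOldName is a variable')
parse = ET.fromstring(
    '<RangeExpr>'
    '<VarRef name="anyOldName"/>'
    '<FunctionCall name="string-length">'
    '<StringLiteral>$anyOldName is a variable</StringLiteral>'
    '</FunctionCall>'
    '</RangeExpr>'
)

renamed = rename_variable(parse, "anyOldName", "someNewVariableName")
print(ET.tostring(renamed, encoding="unicode"))
```

Because only VarRef elements are touched, the occurrence of $anyOldName inside the string literal is left alone, which is the case where a regular expression would go wrong.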
The functions are controlled by parametric options in a map, with the following permitted entries:
Option | Type | Values | Default | Notes |
---|---|---|---|---|
lang | xs:string | XPath, XQuery | XPath | The language to be parsed (case insensitive) |
version | xs:double | 2.0, 3.0 | 3.0 | The version of the language to be parsed |
flatten | xs:boolean | | true() | Flatten the parse tree; see 5 Flattening the parse tree |
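One way an implementation might apply these defaults and checks is sketched below in Python. This is an illustration under assumptions, not a definition: the names normalise_options and InvalidOption are hypothetical, the flatten default of true is this sketch's reading of the table, and treating an unknown key as an error is an assumption about how xp:invalid-option would be applied.

```python
# Defaults as read from the options table (flatten=True is an assumption).
DEFAULTS = {"lang": "XPath", "version": 3.0, "flatten": True}
LANGS = {"xpath": "XPath", "xquery": "XQuery"}  # matched case-insensitively
VERSIONS = (2.0, 3.0)


class InvalidOption(ValueError):
    """Stands in for the xp:invalid-option error in this sketch."""


def normalise_options(options: dict) -> dict:
    """Return a complete option map, or raise InvalidOption."""
    result = dict(DEFAULTS)
    for key, value in options.items():
        if key not in DEFAULTS:
            # Assumption: unknown keys are rejected rather than ignored.
            raise InvalidOption(f"unrecognised option: {key}")
        if key == "lang":
            if not isinstance(value, str) or value.lower() not in LANGS:
                raise InvalidOption(f"invalid lang: {value!r}")
            value = LANGS[value.lower()]
        elif key == "version":
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or float(value) not in VERSIONS:
                raise InvalidOption(f"invalid version: {value!r}")
            value = float(value)
        elif key == "flatten":
            if not isinstance(value, bool):
                raise InvalidOption(f"invalid flatten: {value!r}")
        result[key] = value
    return result


print(normalise_options({"lang": "xquery"}))
```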
The full parse trees for even the simplest expressions are very deep, consisting of all the nested productions (and the recognised tokens, including whitespace) that have to match to complete the parse. For example the XPath range expression '1 to 8' produces a complete parse tree with 47 elements of the following form:
    <Expr>
     <ExprSingle>
      <OrExpr>
       <AndExpr>
        <ComparisonExpr>
         <RangeExpr>
          <AdditiveExpr>
           <MultiplicativeExpr>
            <UnionExpr>
             <IntersectExceptExpr>
              <InstanceofExpr>
               <TreatExpr>
                <CastableExpr>
                 <CastExpr>
                  <UnaryExpr>
                   <ValueExpr>
                    <PathExpr>
                     <RelativePathExpr>
                      <StepExpr>
                       <FilterExpr>
                        <PrimaryExpr>
                         <Literal>
                          <NumericLiteral>
                           <IntegerLiteral>1</IntegerLiteral>
                          </NumericLiteral>
                         </Literal>
                        </PrimaryExpr>
                        ..... close tags ....
           </MultiplicativeExpr>
          </AdditiveExpr>
          <TOKEN>to</TOKEN>
          <AdditiveExpr>
           <MultiplicativeExpr>
            ... similar ...
                        <PrimaryExpr>
                         <Literal>
                          <NumericLiteral>
                           <IntegerLiteral>8</IntegerLiteral>
                          </NumericLiteral>
                         </Literal>
                        </PrimaryExpr>
                        ..... close tags ....
           </MultiplicativeExpr>
          </AdditiveExpr>
         </RangeExpr>
        </ComparisonExpr>
       </AndExpr>
      </OrExpr>
     </ExprSingle>
    </Expr>
Whilst these full trees contain all the information and can be manipulated perfectly well 'as is', they i) contain much effective redundancy and ii) are obviously expensive to copy. They also make human inspection difficult, with the essential components (such as the RangeExpr and its effective arguments) buried deep in long narrow trees. Less verbose forms would be more useful, provided they still retain all the information needed to distinguish the parses of two different expressions.
There are a number of possibilities for controlling this verbosity without losing information:
Replacing an element that has just a single element child with that child (or with the result of reducing that child)
Removing redundant tokens, i.e. tokens whose presence is implicit in the production
Collapsing significant tokens into attributes
Specialist treatment of literals
As an example, using the first action would reduce the parse expression to four elements:
    <RangeExpr>
      <IntegerLiteral>1</IntegerLiteral>
      <TOKEN>to</TOKEN>
      <IntegerLiteral>8</IntegerLiteral>
    </RangeExpr>

Applying the second action, since the token 'to' is required for all RangeExpr, would reduce this to the three irreducible elements:

    <RangeExpr>
      <IntegerLiteral>1</IntegerLiteral>
      <IntegerLiteral>8</IntegerLiteral>
    </RangeExpr>
In both these cases inversion to a text string will yield (subject to whitespace-normalization) the same string as the input.
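The first two reductions can be sketched in a few lines of Python using xml.etree.ElementTree. This is an illustration only: the function names and the REDUNDANT_TOKENS table are assumptions made for the example, the input is a shortened version of the full tree for '1 to 8' (only a few of the nested productions are kept), and the sketch assumes that elements in the full tree carry no attributes.

```python
import xml.etree.ElementTree as ET

# Tokens implied by their parent production (illustrative subset only).
REDUNDANT_TOKENS = {"RangeExpr": {"to"}}


def collapse(elem: ET.Element) -> ET.Element:
    """Replace any element that has exactly one element child by that
    child, applied recursively (the first reduction)."""
    children = [collapse(child) for child in elem]
    if len(children) == 1:
        return children[0]
    out = ET.Element(elem.tag, dict(elem.attrib))
    if not children:
        out.text = elem.text  # leaf: keep its text content
    out.extend(children)
    return out


def drop_redundant_tokens(elem: ET.Element) -> ET.Element:
    """Remove TOKEN children that the parent production always requires
    (the second reduction). Modifies the tree in place."""
    redundant = REDUNDANT_TOKENS.get(elem.tag, set())
    for child in list(elem):
        if child.tag == "TOKEN" and child.text in redundant:
            elem.remove(child)
        else:
            drop_redundant_tokens(child)
    return elem


# Shortened stand-in for the full parse tree of '1 to 8'.
full = ET.fromstring(
    "<Expr><ExprSingle><RangeExpr>"
    "<AdditiveExpr><MultiplicativeExpr>"
    "<IntegerLiteral>1</IntegerLiteral>"
    "</MultiplicativeExpr></AdditiveExpr>"
    "<TOKEN>to</TOKEN>"
    "<AdditiveExpr><MultiplicativeExpr>"
    "<IntegerLiteral>8</IntegerLiteral>"
    "</MultiplicativeExpr></AdditiveExpr>"
    "</RangeExpr></ExprSingle></Expr>"
)

step1 = collapse(full)
after_collapse = ET.tostring(step1, encoding="unicode")
step2 = drop_redundant_tokens(step1)
after_drop = ET.tostring(step2, encoding="unicode")
print(after_collapse)
print(after_drop)
```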
Whilst some tokens are implicit within the production, others bear significant information, especially in cases where the productions contain token alternatives. For example a reduced parse of 1+5 to 8 could be:
    <RangeExpr>
      <AdditiveExpr>
        <IntegerLiteral>1</IntegerLiteral>
        <TOKEN>+</TOKEN>
        <IntegerLiteral>5</IntegerLiteral>
      </AdditiveExpr>
      <TOKEN>to</TOKEN>
      <IntegerLiteral>8</IntegerLiteral>
    </RangeExpr>
The token to is redundant (apart from being a required token for recognition of a RangeExpr), but the + certainly is not, as it differentiates between addition and subtraction expressions. The third type of reduction can subsume such information-bearing token values into attributes, typically op="value". For our example the minimal form would be:
    <RangeExpr>
      <AdditiveExpr op="+">
        <IntegerLiteral>1</IntegerLiteral>
        <IntegerLiteral>5</IntegerLiteral>
      </AdditiveExpr>
      <IntegerLiteral>8</IntegerLiteral>
    </RangeExpr>
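A possible shape for this third reduction, again sketched in Python with xml.etree.ElementTree, is shown here. The OPERATORS and REDUNDANT tables and the minimise function are assumptions made for illustration, and the sketch assumes at most one operator token per element (a chain such as 1 + 2 - 3 would need nested elements or a richer encoding).

```python
import xml.etree.ElementTree as ET

# Illustrative subsets only; a real implementation would derive these
# tables from the grammar of the language being parsed.
OPERATORS = {"AdditiveExpr": {"+", "-"}}   # information-bearing tokens
REDUNDANT = {"RangeExpr": {"to"}}          # tokens implied by the production


def minimise(elem: ET.Element) -> ET.Element:
    """Move operator tokens into an op attribute and drop redundant
    tokens. Modifies the tree in place."""
    for child in list(elem):
        if child.tag == "TOKEN":
            if child.text in OPERATORS.get(elem.tag, ()):
                elem.set("op", child.text)
                elem.remove(child)
            elif child.text in REDUNDANT.get(elem.tag, ()):
                elem.remove(child)
        else:
            minimise(child)
    return elem


# Reduced parse of: 1+5 to 8
reduced = ET.fromstring(
    "<RangeExpr>"
    "<AdditiveExpr>"
    "<IntegerLiteral>1</IntegerLiteral>"
    "<TOKEN>+</TOKEN>"
    "<IntegerLiteral>5</IntegerLiteral>"
    "</AdditiveExpr>"
    "<TOKEN>to</TOKEN>"
    "<IntegerLiteral>8</IntegerLiteral>"
    "</RangeExpr>"
)

minimal = ET.tostring(minimise(reduced), encoding="unicode")
print(minimal)
```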
Within XPath, Literal productions come in several flavours, each of which ultimately ends with a string serialisation of the literal value, which would be represented as a text node within the XML. An alternative is to collapse the Literal production into a constant form, whose value type is described by one attribute and whose (serialised) value is held in another. Hence a double literal can be represented by any of the following three forms:
    <Literal>
      <NumericLiteral>
        <DoubleLiteral>3.14159</DoubleLiteral>
      </NumericLiteral>
    </Literal>

    <DoubleLiteral>3.14159</DoubleLiteral>

    <Literal type="xs:double" value="3.14159"/>
The consistent identification of a Literal, as a self-contained single element without children, can make certain forms of manipulation much more concise, avoiding the use of unions.
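As a sketch of how the three forms might be normalised to the constant form, consider the following Python. The function to_constant is hypothetical. The mappings for IntegerLiteral and DoubleLiteral follow the examples in this document; those for DecimalLiteral and StringLiteral are assumptions by analogy.

```python
import xml.etree.ElementTree as ET

# Production name -> type annotation used in the constant form.
LITERAL_TYPES = {
    "IntegerLiteral": "xs:integer",
    "DoubleLiteral": "xs:double",
    "DecimalLiteral": "xs:decimal",  # assumed by analogy
    "StringLiteral": "xs:string",    # assumed by analogy
}


def to_constant(elem: ET.Element) -> ET.Element:
    """Return <Literal type=... value=.../> for any of the equivalent
    representations of a literal."""
    leaf = elem
    while len(leaf) == 1:  # descend through single-child wrappers
        leaf = leaf[0]
    if len(leaf) != 0:
        raise ValueError("not a literal subtree")
    if leaf.tag == "Literal" and {"type", "value"} <= set(leaf.attrib):
        return ET.Element("Literal", dict(leaf.attrib))  # already constant
    if leaf.tag in LITERAL_TYPES:
        return ET.Element(
            "Literal",
            {"type": LITERAL_TYPES[leaf.tag], "value": leaf.text or ""},
        )
    raise ValueError(f"not a literal subtree: {leaf.tag}")


forms = [
    "<Literal><NumericLiteral><DoubleLiteral>3.14159</DoubleLiteral>"
    "</NumericLiteral></Literal>",
    "<DoubleLiteral>3.14159</DoubleLiteral>",
    '<Literal type="xs:double" value="3.14159"/>',
]
constants = [to_constant(ET.fromstring(form)) for form in forms]
for constant in constants:
    print(constant.tag, constant.attrib)
```

With every literal in this shape, a consumer can match on a single element name and read the type attribute, instead of matching a union of literal productions.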
The languages being parsed are not static constructs and are being updated and extended, albeit in a highly controlled, rigorous and defined manner. As such we should anticipate support for as-yet-unknown versions of the languages to be required.
Some mechanism will be needed to query which languages and versions a given implementation can parse.
Equally well the framework might be capable of supporting parsing for other (user-supplied) languages, whose grammars are defined in the EBNF form used within the XML standards world. This needs a great deal of thought and probably will require the generation of parsing functions from the grammar for efficiency.
The most common use will be generating parse trees from an expression. The following functions support this:
The xp:parse function returns a parse tree for an expression.
xp:parse($in as xs:string) as element()
xp:parse($in as xs:string, $options as map(*)) as element()
Returns a parse tree for the XPath/XQuery expression given in $in.
$options is a map of control options as described in 4 Options.
In the absence of any options, the default values are used.
[xp:parsing-error] is raised if $in cannot be parsed successfully against the given grammar. It is implementation (and language) dependent as to what and how further information is made available about this failure.
[xp:invalid-option] is raised if one or more of the option values within $options is invalid or unrecognised.
Some notes
Parsing a simple range expression, with options to flatten single trees and remove unnecessary tokens:
    xp:parse('1 to 8', map{'flatten': true(), 'version': 2.0})
    ==>
    <RangeExpr>
      <Literal type="xs:integer" value="1"/>
      <Literal type="xs:integer" value="8"/>
    </RangeExpr>
The reverse operation generates an expression from a (possibly modified) parse tree. The following functions support this:
The xp:to-expression function returns an expression string corresponding to a given parse tree.
xp:to-expression($in as element()) as xs:string
xp:to-expression($in as element(), $options as map(*)) as xs:string
Returns a string for the XPath/XQuery expression tree given in $in.
$options is a map of control options as described in 4 Options.
In the absence of any options, the default values are used.
[xp:invalid-tree] is raised if $in is not a valid expression tree, or suitable reduced subtree, for the grammar requested. It is implementation (and language) dependent as to what and how further information is made available about this failure.
[xp:invalid-option] is raised if one or more of the option values within $options is invalid or unrecognised.
Some notes
Serialising a reduced parse tree back into an expression string:
    <xsl:variable name="in">
      <RangeExpr>
        <AdditiveExpr op="-">
          <Literal type="xs:integer" value="4"/>
          <Literal type="xs:integer" value="2"/>
        </AdditiveExpr>
        <Literal type="xs:integer" value="8"/>
      </RangeExpr>
    </xsl:variable>
    <xsl:value-of select="xp:to-expression($in)"/>
    ==>
    "4 - 2 to 8"
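To make the direction of this function concrete, here is a minimal Python sketch of a serialiser for the reduced tree shapes used in this document (RangeExpr, AdditiveExpr with an op attribute, and constant-form Literal). It is not the module's implementation: to_expression and InvalidTree are stand-in names, only these three productions are handled, and no parentheses are generated, so it assumes the tree already respects operator precedence.

```python
import xml.etree.ElementTree as ET


class InvalidTree(ValueError):
    """Stands in for the xp:invalid-tree error in this sketch."""


def to_expression(elem: ET.Element) -> str:
    """Serialise a reduced parse tree to an expression string."""
    kids = [to_expression(child) for child in elem]
    if elem.tag == "Literal" and not kids:
        value = elem.get("value")
        if value is None:
            raise InvalidTree("Literal without a value attribute")
        if elem.get("type") == "xs:string":
            # String literals are quoted, doubling any embedded quote.
            return "'" + value.replace("'", "''") + "'"
        return value
    if elem.tag == "RangeExpr" and len(kids) == 2:
        return f"{kids[0]} to {kids[1]}"
    if elem.tag == "AdditiveExpr" and len(kids) == 2 and elem.get("op") in ("+", "-"):
        return f"{kids[0]} {elem.get('op')} {kids[1]}"
    raise InvalidTree(f"unsupported or malformed element: {elem.tag}")


tree = ET.fromstring(
    '<RangeExpr>'
    '<AdditiveExpr op="-">'
    '<Literal type="xs:integer" value="4"/>'
    '<Literal type="xs:integer" value="2"/>'
    '</AdditiveExpr>'
    '<Literal type="xs:integer" value="8"/>'
    '</RangeExpr>'
)
expression = to_expression(tree)
print(expression)
```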