# regexp support

From: David Carlisle <davidc@nag.co.uk>
Date: Tue, 22 Jan 2002 23:58:19 GMT
Message-Id: <200201222358.XAA17355@penguin.nag.co.uk>


This got a bit longer than perhaps is wise as an "initial comment"
but some thoughts on regexp in XSLT...

David

Regular expression support in XPath/XSLT/Xquery
===============================================

1)Abstract
----------

This note proposes the addition of support for regular expressions in XSLT 2
beyond that currently proposed by the functions in the XPath 2 draft.

It is loosely based on discussions on and off xsl-list, principally with
Jeni Tennison, although the details of this proposal are mine and Jeni and
others who took part in the xsl-list discussions should feel free to
comment (and disagree!) with any parts of this.

It does propose some possible syntax for this functionality but this is
just a draft; the main aim of the note is to put forward some use cases and
requirements. The exact syntax would probaby need to be refined.

One of the main issues raised by the use cases to be presented here is the
requirement to build a tree fragment (as opposed to a string) based on the
matching (or not) of regular expressions to an input string. The Functions
and Operators draft currently suggests a regexp-replace function but this
is restricted to constructing strings, so is of limited usefulness in an
XSLT (or Xquery) context.

Whilst the extended function definition possibilities in XPath 2 may, in
principle, mean that it would be possible to add such functionality to
Xpath, I think that it is a useful distinction in XSLT that should be
preserved that the principle node creation mechanism is via XSLT, with
Xpath being primarily a selection and expression language. It is quite
likely that Xquery will require similar functionality, and in Xquery a
function form may be natural, but this note concentrates on XSLT and
suggests an extension of the XSLT template mechanism. This should not
inhibit he addition of similar functionality to Xquery with a different
syntax, as differing syntax for node creation is in any case one of the
major distinguishing features between the two languages.

David Carlisle

2) Contents
----------

1) Abstract
2) Contents
3) Regexp Syntax
4) Use Cases
5) Possible XSLT2 Regexp syntax
6) Variants on the propsosal
7) Suggested solutions to the use cases.

3) Regexp Syntax
----------------

It is clearly desirable that the regexp syntax in XPath is largely
compatible with that of XML Schema, however I feel that the requirements of
searching and replacing substrings within a larger input string (the
typical scenarios presented here) are rather different from the
requirements for specifying  regexp that fully match the (typically smaller)
complete character content of a typed element.
Thus in my examples below I will use extended regexp syntax if this seems
appropriate staying compatible with perl wherever possible, (although
personally, I'm more familiar with the slightly different emacs
conventions).

In particular, regexp used for search and replace will need to be
unanchored and special characters will need to be introduced, for anchoring
to start and end of the string and/or lines (^ $\z and \Z in perl regexp syntax) and possibly other meta characters or character classes will need to be added, depending on perceived requirements. However this note does not concentrate on the regexp syntax but rather on the possibilities for making structured replacements. This is the classic "up translation" task of moving from unstructured (or insufficiently structured) data to structured matkup. 3) Use Cases ------------ RE-1: HTML Line Break In an input string replace all line end characters (which we may assume have already been normalised to #xA) by the element <br/>. This is one of the more common requests on XSL-list. It does not actually require any regular expression support as it is searching for a single character (although one may possibly want to search for other line end characters if the string has not been through an XML line end normalisation, as presumably (?) is the case for the unparsed-text() function.) It does however demonstrate the need to generate element nodes at positions determined by searching a string. RE-2: Natural language date strings. Here the aim is to split up the (complete) string which is known to be a date in (say) English language format such as "17th January, 2002" and produce the string in ISO format 2002-01-17 suitable for coercing to a dateTime, and being used with Xpath date expressions. RE-3: Parsing a CSV file into an XML tree fragment. Given an input string 1,2 3,4 produce the tree fragment equivalent to <table> <row><cell>1</cell><cell>2</cell></row> <row><cell>3</cell><cell>4</cell></row> </table> RE-4: Multiple regexp-replace. The proposed replace function in F&O replaces substrings matching a single regexp but often one wants to replace many strings in parallel. I am assuming here that the normal XSLT creation model is followed that _all_ replacements take place (where possible, with a suitable priority mechanism for controlling clashes) on (substrings of) the original string, and a new node tree is constructed. Even when generating strings (as here) this differs from the result of repeatedly calling the replace function proposed in the F&O draft as that would, most naturally, apply later regexp matching to the _result_ of earlier matches. An example recently mentioned on xml-dev: RE-4a: Going from an XML unicode string to TeX: replace & by \&$     by \$#169 by \copyright #233 by \'{e} < by \lt #322 by \l ... RE-4b: The reverse of this transformation. RE-5: Calculation of "dynamic" regular expressions. Whilst most suggested "template match" extensions to regexp matching have assumed constant match strings, at least some support should be available to build up regular expressions as the result of XPath expressions. For example a stylesheet might accept a word (or list of words) as a parameter and build up the regexp adding word boundary markers (\b in perl) and and alternation (|). This string valued expression should then be usable as a regular expression. (From a user perspective, all uses of regexp could be replaced by string valued expressions (or avt) although efficiency and other considerations may not allow all uses of regexp to have this flexibility). RE-6: Nested structures. Input strings with arbitrary nested structure for example, well formed HTML, TeX \aaa{...} syntax, lisp (...) syntax, are not regular languages and so can not (by definition) be parsed by a single regular expression. However in many cases (including all of the examples cited above) the tokens delimiting the nesting may be matched by regular expressions. This should allow the input string to be tokenised using regexp into a format in which the recursion and/or counting required to handle the nested structure may be handled by standard constructs in the language (XPath or XSLT or Xquery). Some people have suggested that XSLT should be directly extended to support the specification of more general grammars, in the style of lex/yacc. But the proposal here is that regexp support, if sufficiently integrated into the existing functionality of the language, should be able to handle a large range of cases without the complications of explictly adding more general parsing support. RE-6a: TeX (simplified) Convert \$$[a-z])+{....} to <\1>...</\1> but with special case of \\frac{....}{....} to <frac><num>...</num><denom>...</denom></frac> For example \frac{1 + \sin{2}}{3 + \cos{4}} to <frac> <num>1 + <sin>2</sin></num> <denom>3 + <cos>4</cos></denom> </frac> RE-6b: Well formed XML markup Parse a well formed XML instance that has no DOCTYPE, entity or character references or attributes. (These could be added but without adding any major new issues.) For example convert the input xml <entry><![CDATA[<abc>12 <x/> <wxyz>34</wxyz></abc>]]></entry> To <entry><abc>12 <x/> <wxyz>34</wxyz></abc></entry> RE-6c: HTML Markup. As above but with HTML, in particular with implied end tags. In general this requires a DTD and knowledge of SGML omitted tag rules. To handle general HTML as it appears in the wild, arbitrarily complicated "tag soup" parsing heuristics as implemented in the browsers would be needed. However this appears to be a very common requirement often generated by storing HTML fragments as strings in a database. One may hope that specific simple cases may be handled for example: Convert a list <entry><![CDATA[<ol><li>aaa <li>bbb </ol>]]></entry> to <entry><ol><li>aaa </li><li>bbb </li></ol></entry> RE-7: Transliteration Take an input string in AMS cyrillic transliteration scheme and convert to Unicode characters. The exact scheme will be omitted here but the details are available at http://www.tex.org. This differs from the "multiple regexp" example in the way conflicting regexp matches need to be handled. For multiple regexp matching above one needs a priority mechanism so that certain regexp are matched first and lower priority regexp are only applied to remaining strings. Transliteration matches need to be applied by matching the start of the input string with the longest possible match, replacing this by the transliterated sequence, and then finding he longest possible match at the start of the remaining string. Thus if abc transliterates to X and bcd transliterates to Y xab Z c C d D then abcd -> XD xabcd -> ZCD Thus you could not, for example, start by replacing all abc by X. RE-8: Free format text input. This example is based on a (real) question in xsl-list. (using | denote line start, ignoring indentation for this mail) |Some heading, with subphrases: | An item without a bullet. | Name = value pair. | Property: value. | Score = 7 (a = 1, b =3, c = 4). | A full sentence that has so many words that it spans | multiple lines. | Sometimes we can't even trust whether people get the |indention consistent. | |and making it: | |<entry> | <heading> | Some heading | <subheading>with subheading:</subheading> | </heading> | <item> | <heading>An item without a bullet.</heading> | <pair name='name' value='value pair.'/> | <pair name='property' value='value.'/> | <pair name='Score' value='7'> | <pair name='a' value='1'/> | <pair name='b' value='3'/> | <pair name='c' value='4'/> | </pair> | <sentence>A full sentence that has so many words that it spans | multiple lines.</sentence> | <sentence>Sometimes we can't even trust whether people get the |indention consistent.</sentence> | </item> |</entry> | | 5) Possible XSLT2 Regexp syntax ------------------------------- The basic idea outlined in the proposal below is that the main task in all the above use cases is the construction of a result tree given some input. The construction aspects of the new functionality should therefore be designed to match existing construction possibilities, with the only difference being that they are triggered by a substring of an input string matching a regexp rather than a node in an input tree matching an Xpath. A new instruction: <xsl:apply-regexp-templates> Taking same attributes and content as apply-templates. The select attribute should evaluate to a sequence of string-valued items. If more than one string is in the sequence, the result is the sequence produced by concatenating the result of processing each string. In fact all the examples presented will always have a sequence of at most one (that is, a string) and it would be possible to specify that the argument should be a single string if that proves to be a useful simplification. A new top level instruction: <xsl:regexp-template> Taking a mandatory match attribute optional priority attribute. and optional mode attribute. The mode attribute works as for xsl:template, if xsl:apply-regexp templates specifies a mode then only regexp-templates's declared for that mode will be considered. The match attribute takes a regular expression, ie a restricted form of string. It is assumed here that these are essentially fixed strings, if implementation/efficiency concerns allow they could perhaps be attribute value templates to allow more dynamic choice in the regular expressions. Or, equivalently to AVT, but with slightly different syntax, the match attribute could take arbitrary xpath expressions so long as they evaluated to a string that was a legal regexp. (As another variant not further explored here one could consider regexp to be a derived type from string rather than typing regexps as strings and just stating at a meta level that they have to match regexp syntax). So a typical example, meeting the first use case, would be <xsl:template match="xx"/> <div> <xsl:apply-regexp-templates match="."/> </div> </xsl:template> ... <xsl:regexp-template match="&#10;"> <br/> </xsl:regexp-template> The template matching <xx> would then result in the character data of xx being copied into a div element in the output, with all new line characters become html br elements. A new XSLT-specific XPath function current-match(). Within a regexp-template current-match will return a sequence of 1 or more strings the ith item being the substring matching the ith parenthised expression in the regexp of the template. thus given a regexp of (a*)(b*) matching aaabb then within the template, . will be "aaabb" current-match()[1] will be "aaa" and current-match()[2] will be "bb". In the presence of alternation (|) and repeat clauses ({3}) it isn't always immediately clear how subexpressions should be numbered but perl semantics should be followed (as schema explicitly tries to be perl like in its regexp semantics). In detail the execution model would be as follows. a. The select attribute of apply-regexp-templates is evaluated. If it is not a sequence of strings an error is raised. If it is a sequence each is processed separately and the result is the sequence of results. So we need only consider a single string. b. If a mode is supplied all regexp-templates in that mode are now consided, otherwise all the regexp templates in the default mode are considered. c. The templates being considered are then ordered by priority. If two templates of equal priority could potentially match overlapping strings then eitherthis would be an error or a default priority scheme would enforce an ordering (either implementation defined or order in stylesheet) (to be decided). For each template in turn, starting with highest priority, the regexp is matched on the subsequences of the original input string that have yet to be matched. Once a match is found the sequence of substrings is extended by splitting the current substring into three: the substring-before the matched substring and the substring after. This continues, finally using a default template of which matches every character "(.|&#10;)" until the end result is that the original string is now a sequence of substrings each associated with a template whose match regexp produced the substring. Now the focus is set such that this derived sequence is the current sequence, and each of the associated regexp-templates is executed in order with the current item being the matched substring and position() being the position in the derived sequence. This means that the regexp's can access the matched string using . finer control, accessing subexpressions can be achieved using the match-string() function described above. d. The resulting sequence is the concatenation of the results of each of these templates. 6) Variants on the propsosal ---------------------------- This section discusses some variants on the above. The variations are often mutually exclusive, it is not proposed that all these features are added simultaneously, although of course a system that incorporated some features of more than one of these variants would also be possible. V1: Tokenise Function It would be possible to make the implicit splitting up of the input string into a sequence of substrings explicit. <xsl:apply-regexp-templates select="string"/> would be replaced by <xsl:apply-regexp-templates select="regexp-tokenise(string)"/> where regexp-tokenise would a set of regexp's (specified by a method to be determined) and split up the string as above. The difference would be that the sequence resulting would "just" be a sequence of strings, which would make it a first class object in the data model. Of course apply-regexp-templates would take any sequence of strings they would not be forced to be generated by tokenise() (although in practice that would be most common). In this model the semantics of the match expression in <xsl:regexp-template is slightly different. rather than matching on a substring of some input string, it would be an _anchored_ match and the template would fire if the regexp matched an entire string in the input sequence. As in the main proposal . and position() etc would reflect the position of the matched item in the input sequence. The advantage of this model is that the sequence is made explicit as a standard XPath sequence. The disadvantage is that the regexps may have to be specified (and executed) twice. Once as unanchored regexps to tokenise the input string into a sequence of substrings, and then again as anchored regexps to associate templates with each of the matched substrings. V2: Immediate template execution Rather than first building an implicit sequence of substrings the mechanism could be that as each regexp is matched against a substring of the original string, a sequence is built as in the main proposal with the string-before and string-after the match but in this case between these is placed the result of executing the template. This avoids having to build the sequence of strings "associating" each one with a template, but it is harder to suggest good values for . and position() in this case. possibly y position() should be 1 and . should be the original string (in the case that apply-regexp-templates was just given a select expression of a single string. In particular the focus would not change as regexp-templates were applied. (Similar to named templates.) V1 and V2 produce (at least for templates not using an implicit setting of . or position()) the same results as the main proposal. The last variant has a different model of conflict resolution for overlapping matches and will typically produce different a result given similar looking regexp. V3: left-to-right matching. Rather than match regexp in priority-order, an alternative matching scheme would be to start from the start of the string and find the longest possible match from all the templates under consideration. priority specifications would choose between regexps if more than one matched the longest possible initial string. The template for this regexp would fire. Processing could either then immediately proceed to the remaining substring, with the system finding the longest initial match on the remainder, or processing could effectively stop as soon as a match was found, with teh remaining string being available (say with a string-sfter-match() function) and so the template could explicitly invoke apply-regexp-templates select="substring-after-match()" at some suitable point in its execution. In this model . would be the matched substring and position() would be (say) always 1. V4: Support for matching pairs. Many of the examples (as in the RE-6 use cases) require matching a nested structure which is beyond what is possible with a single regexp. Existing XSLT facilities are sufficient to "fill the gap" providing the necessary arithmetic and state to handle the nested parse tree. However one could consider adding faclities to make the required transformations easier. It is essentially a grouping problem although the new Grouping constructs in XSLT2 didn't seem immediately applicable. As an alternative to adding general grouping support for this kind of task it has been suggested that a special template that matches on substrings between matching tokens matched by a "start" and and "end" regexp could be provided. <xsl:regexp-balancing-template match-start="\\func\{" match-end="\}"> This would take two regexp match attributes/ The template would be handed the intervening string as well as the two matching strings. The system would handle the necessary counting to ensure that the start and end expressions were correctly paired. 7) Suggested solutions to the use cases. --------------------------------------- Only solutions using the "main proposal" are given here. RE-1 <xsl:template match="xx"/> <div> <xsl:apply-regexp-templates match="."/> </div> </xsl:template> ... <xsl:regexp-template match="&#10;"> <br/> </xsl:regexp-template> As commented above this doesn't use regexp but is a natural simple example, that is quite hard to do in XSLT1 (or even XSLT2 as currently drafted) If the input string has not been through an XML parser (as will be possible in XSLT2) then even this case might benefit from simple regexp support, changing the match to &#13;&#10;?|&#10; RE-2 <xsl:apply-regexp-templates select="'17th January, 2002'"/> ... <xsl:regexp-template match="^ *([0-9][0-9]) +([A-Za-z]{3})[a-z]* +([0-9]+) *"> <xsl:value-of select="format-number(current-match()[1],'00')"/> <xsl:choose> <xsl:when test="lower(current-match()[1])='jan'">-01-</xsl:when> <xsl:when test="lower(current-match()[1])='feb'">-02-</xsl:when> ... </xsl:choose> <xsl:variable name="y" select="number(current-match()[3])"/> <xsl:value-of select="format-number( <xsl:value-of select="if(x &lt; 10) then (1900 + x) else (if(x &lt; 1000) then( 2000 + x) else x) "/> </xsl:regexp-template> .... RE-3 Assuming the CSV string is the current item. <table> <xsl:apply-regexp-templates mode="row" select="."/> </table> .. <xsl:regexp-template mode="row" match="^(.*)"> <row> <xsl:apply-regexp-templates mode="cell" select="."/> </row> </xsl:template> <xsl:regexp-template mode="cell" match="([^,]*)(,|)"> <cell> <xsl:value-of select="current-match()[1]"/> </cell> </xsl:template> Note this assumes that ^ . and are line based even within a larger string. (This is like emacs regexp, but unlike sed. Perl has a switch that allows ^ and to change between matching ends of lines and matching the ends of the string.) RE-4a <xsl:regexp-template mode="unicodetotex" match="\"> <xsl:text>\</xsl:text> </xsl:regexp-template> ... RE-4b <xsl:regexp-template mode="textounicodetotex" match="\\"> <xsl:text></xsl:text> </xsl:regexp-template> The only interest here would be the use or priority to control tex macro names which would match the same regexp. <xsl:regexp-template mode="textounicodetotex" match="\\lt" priority="2"> <xsl:regexp-template mode="textounicodetotex" match="\\l" priority="1"> although an alternative would be to explicitly end each regexp with a match for a non-letter (as TeX macro names only consist of letters (or a single non-letter, such as \)) RE-5 Given a top level param keyword containing a word to be highlighted one or more construct should allow string expression such as concat('\b',keyword,\b') or an AVT \b{keyword}\b in order to construct the required regexp to match this word. RE-6a <xsl:regexp-template match="\\([a-z]+){" priority="2"> <start name="current-match()[1]"/> </xsl:regexp-template> <xsl:regexp-template match="{" priority="1"> <start name=""/> </xsl:regexp-template> <xsl:regexp-template match="}"> <end/> </xsl:regexp-template> Applying the above regexp templates to the example \frac{1 + \sin{2}}{3 + \cos{4}} would produce the following sequence (of text nodes and elements), putting them in a containing <x> element, so as to test the following stylesheet. <start name="frac"/>1 + <start name="sin"/>2<end/><end/><start name=""/>3 + <start name="cos"/>24<end/><end/> To get to here <frac> <num>1 + <sin>2</sin></num> <denom>3 + <cos>4</cos></denom> </frac> we just need to use standard XSLT constructs, for example this XSLT1 stylesheet does the job. It is however noticeable that to handle the nesting in this sequence the current XPath2 constructs are very limited, primarily due to the lack of higher order functions. The provided "for" operator is only really suitable when each item of a sequence is to be processed independently. A standed functional operator such as fold would allow accumulation of information along the sequence. Here this problem is circumvented by first converting the sequence to a tree so that the sibling-axis gives the required access, but this seems at odds with the apparent desire to make such operations possible at the sequence level. <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"> <xsl:template match="x"> <xsl:text> </xsl:text> <x> <xsl:apply-templates mode="match" select="node()[1]"/> </x> </xsl:template> <xsl:template match="text()" mode="match"> <xsl:param name="n" select="0"/> <xsl:copy-of select="."/> <xsl:apply-templates mode="match" select="following-sibling::node()[1]"> <xsl:with-param name="n" select="n"/> </xsl:apply-templates> </xsl:template> <xsl:template match="start" mode="match"> <xsl:param name="n" select="0"/> <xsl:variable name="end" select="following-sibling::end[count(following-sibling::end)-count(following-sibling::start)=n][1]"/> <xsl:element name="{@name}"> <xsl:apply-templates select="following-sibling::node()[1]" mode="match"> <xsl:with-param name="n" select="n+1"/> </xsl:apply-templates> </xsl:element> <xsl:apply-templates select="end/following-sibling::node()[1]" mode="match"> <xsl:with-param name="n" select="n"/> </xsl:apply-templates> </xsl:template> <xsl:template match="start[@name='frac']" mode="match"> <xsl:param name="n" select="0"/> <frac> <num> <xsl:apply-templates select="following-sibling::node()[1]" mode="match"> <xsl:with-param name="n" select="n+1"/> </xsl:apply-templates> </num> <denom> <xsl:apply-templates select="following-sibling::end[count(following-sibling::end)-count(following-sibling::start)=n][1]/following-sibling::node()[2]" mode="match"> <xsl:with-param name="n" select="n+1"/> </xsl:apply-templates> </denom> </frac> <xsl:apply-templates select="following-sibling::end[count(following-sibling::end)-count(following-sibling::start)=n][2]/following-sibling::node()[1]" mode="match"> <xsl:with-param name="n" select="n"/> </xsl:apply-templates> </xsl:template> <xsl:template match="end" mode="match"/> </xsl:stylesheet> RE-6b. Parsing well formed XML does not really present any difficulties not presented in the TeX case. The complication of macros taking more than one {} group as arguments (as in the frac example above) does not occur, although the regexps would need to be extended to deal with empty element syntax and attributes. Full details are not presented here. RE-6c. As mentioned above, the general case of parsing HTML is out of scope however simple cases of omitted tags could be dealt with using the priority attribute on templates. a high priority template matching "</li>\s-*<li>" would handle the case where the end tag was explicit, and a lower priority template matching "<li>" would match in other cases, handling the implied closing of the previous element. RE-7 This case differs from RE-4 as there is a strong left-to-right (or at least reading direction) bias. Replacements should happen at the start of the string. One possibe solution is to simply prefix all regexp by a ^ character to denote the start of the string. One slight subtlety is that this use of ^ relies on the the fact that teh string is "split up" as each string is found, and later regexp apply to the sequence of remaining unmatched portions. If ^ always denotes the start of the original string then prefixing all the transliteration replacements by ^ would clearly not have the desired effect and only the initial characters in teh original string would be replaced. RE-8 <xsl:regexp-template match="^ +([^:= ]+)\s+[:=]\s+(.*)" ...some standard template contains ... <entry> <xsl:apply-regexp-templates select="." /> </entry> <!-- matches headings --> <xsl:regexp-template match="^([^,]+), +\([^:]+$$:$">
<xsl:value-of select="current-match()[1]" />
<xsl:value-of select="current-match()[2]" />
</xsl:regexp-template>

<!-- matches items -->
<!-- note by having an explicit regexp to pick up the item text you can
easily extend to the case where you want to match as far as the next
here I'm using the regexp \' which is emacs-regexp for end-of-string
(as opposed to $which is end-of-line) so I'll grab everything, for now. --> <xsl:regexp-template match="^ ([^ ].*)$(.|\n)*\'">
<item>
<xsl:apply-regexp-templates select="current-match()[2]" />
</item>
</xsl:regexp-template>

<!-- matches pairs -->
<xsl:regexp-template match="^     ([^:= ]+)\s+[:=]\s+([^$$]*)(\([^\($$]*\))?\$">
<pair name="{current-match()[1]}"
value="{current-match()[2]}">
<xsl:apply-regexp-templates select="current-match()[3]"
mode="pair" />
</pair>
</xsl:regexp-template>

<!-- matches nested pairs -->
<xsl:regexp-template match="^([^:= ]+)\s+[:=]\s+([^,]+),?"
mode="pair">
<pair name="{current-match()[1]}" value="{current-match()[2]}" />
</xsl:regexp-template>

<!-- matches sentences -->
<xsl:regexp-template match="^     ([^\.]+)." priority="-1">
<sentence>
<xsl:value-of select="concat(current-match()[1], '.')" />
</sentence>
</xsl:regexp-template>

_____________________________________________________________________
This message has been checked for all known viruses by Star Internet
delivered through the MessageLabs Virus Scanning Service. For further
information visit http://www.star.net.uk/stats.asp or alternatively call
Star Internet for details on the Virus Scanning Service.
`
Received on Tuesday, 22 January 2002 18:58:36 UTC

This archive was generated by hypermail 2.3.1 : Tuesday, 6 January 2015 20:44:22 UTC