RE: regexp support

Many thanks for this note. We had a discussion on regular expressions at the
XSL WG meeting earlier this week; we agreed that the facilities currently
specified in Functions & Operators are probably inadequate to meet the XSLT
requirement, and that it would be useful to start by producing a set of use
cases to test any proposed improvements against. Your input is therefore
very valuable.

Mike Kay

> -----Original Message-----
> From: David Carlisle [mailto:davidc@nag.co.uk]
> Sent: 22 January 2002 23:58
> To: xsl-editors@w3.org
> Subject: regexp support
> 
> 
> 
> This got a bit longer than perhaps is wise as an "initial comment"
> but some thoughts on regexp in XSLT...
> 
> David
> 
> 
> 
> Regular expression support in XPath/XSLT/Xquery
> =============================================== 
> 
> 1)Abstract
> ----------
> 
> This note proposes the addition of support for regular 
> expressions in XSLT 2
> beyond that currently proposed by the functions in the XPath 2 draft.
> 
> It is loosely based on discussions on and off xsl-list, 
> principally with
> Jeni Tennison, although the details of this proposal are mine 
> and Jeni and
> others who took part in the xsl-list discussions should feel free to
> comment (and disagree!) with any parts of this.
> 
> It does propose some possible syntax for this functionality 
> but this is
> just a draft; the main aim of the note is to put forward some 
> use cases and 
> requirements. The exact syntax would probaby need to be refined.
> 
> One of the main issues raised by the use cases to be 
> presented here is the
> requirement to build a tree fragment (as opposed to a string) 
> based on the
> matching (or not) of regular expressions to an input string. 
> The Functions
> and Operators draft currently suggests a regexp-replace 
> function but this
> is restricted to constructing strings, so is of limited 
> usefulness in an
> XSLT (or Xquery) context.
> 
> Whilst the extended function definition possibilities in 
> XPath 2 may, in
> principle, mean that it would be possible to add such functionality to
> Xpath, I think that it is a useful distinction in XSLT that should be
> preserved that the principle node creation mechanism is via XSLT, with
> Xpath being primarily a selection and expression language. It is quite
> likely that Xquery will require similar functionality, and in Xquery a
> function form may be natural, but this note concentrates on XSLT and
> suggests an extension of the XSLT template mechanism. This should not
> inhibit he addition of similar functionality to Xquery with a 
> different
> syntax, as differing syntax for node creation is in any case 
> one of the
> major distinguishing features between the two languages.
> 
> David Carlisle
> 
> 
> 
> 2) Contents
> ----------
> 
> 1) Abstract
> 2) Contents 
> 3) Regexp Syntax
> 4) Use Cases
> 5) Possible XSLT2 Regexp syntax
> 6) Variants on the propsosal
> 7) Suggested solutions to the use cases.
> 
> 
> 3) Regexp Syntax
> ----------------
> 
> It is clearly desirable that the regexp syntax in XPath is largely
> compatible with that of XML Schema, however I feel that the 
> requirements of
> searching and replacing substrings within a larger input string (the
> typical scenarios presented here) are rather different from the
> requirements for specifying  regexp that fully match the 
> (typically smaller)
> complete character content of a typed element. 
> Thus in my examples below I will use extended regexp syntax 
> if this seems
> appropriate staying compatible with perl wherever possible, (although
> personally, I'm more familiar with the slightly different emacs
> conventions).
> 
> In particular, regexp used for search and replace will need to be
> unanchored and special characters will need to be introduced, 
> for anchoring
> to start and end of the string and/or lines (^ $ \z and \Z in 
> perl regexp
> syntax) and possibly other meta characters or character 
> classes will need
> to be added, depending on perceived requirements. However 
> this note does
> not concentrate on the regexp syntax but rather on the 
> possibilities for
> making structured replacements.  This is the classic "up 
> translation" task
> of moving from unstructured (or insufficiently structured) data to
> structured matkup.
> 
> 
> 3) Use Cases
> ------------
> 
> RE-1: HTML Line Break
>   In an input string replace all line end characters (which 
> we may assume
>   have already been normalised to #xA) by the element <br/>.
> 
>   This is one of the more common requests on XSL-list. It 
> does not actually
>   require any regular expression support as it is searching 
> for a single
>   character (although one may possibly want to search for 
> other line end
>   characters if the string has not been through an XML line end
>   normalisation, as presumably (?) is the case for the unparsed-text()
>   function.) It does however demonstrate the need to generate 
> element nodes
>   at positions determined by searching a string.
> 
> RE-2: Natural language date strings.
>   Here the aim is to split up the (complete) string which is 
> known to be a
>   date in (say) English language format such as "17th January, 2002"
>   and produce the string in ISO format 2002-01-17 suitable 
> for coercing to
>   a dateTime, and being used with Xpath date expressions.
> 
> RE-3: Parsing a CSV file into an XML tree fragment.
>   Given an input string
>   1,2
>   3,4
>   produce the tree fragment equivalent to 
>   <table>
>   <row><cell>1</cell><cell>2</cell></row>
>   <row><cell>3</cell><cell>4</cell></row>
>   </table>
> 
> RE-4: Multiple regexp-replace.
>   The proposed replace function in F&O replaces substrings matching a
>   single regexp but often one wants to replace many strings 
> in parallel.
> 
>   I am assuming here that the normal XSLT creation model is 
> followed that
>   _all_ replacements take place (where possible, with a 
> suitable priority
>   mechanism for controlling clashes) on (substrings of) the original
>   string, and a new node tree is constructed. Even when 
> generating strings
>   (as here) this differs from  the result of repeatedly 
> calling the replace
>   function proposed in the F&O draft as that would, most 
> naturally, apply
>   later regexp matching to the _result_ of earlier matches.
> 
>   An example recently mentioned on xml-dev:
> 
> RE-4a: Going from an XML unicode string to TeX:
>     replace &     by \&
>             $     by \$
>             #169  by \copyright
>             #233  by \'{e}
>             <     by \lt
>             #322 by \l
>             ...
> RE-4b: The reverse of this transformation.
> 
> RE-5: Calculation of "dynamic" regular expressions.
>   Whilst most suggested "template match" extensions to regexp matching
>   have assumed constant match strings, at least some support should be
>   available to build up regular expressions as the result of XPath
>   expressions. For example a stylesheet might accept a word 
> (or list of
>   words) as a parameter and build up the regexp adding word boundary
>   markers (\b in perl) and and alternation (|). This string valued
>   expression should then be usable as  a regular expression.
>   (From a user perspective, all uses of regexp could be replaced by
>   string valued expressions (or avt) although efficiency and other
>   considerations may not allow all uses of regexp to have this
>   flexibility).
> 
> RE-6: Nested structures.
>   Input strings with arbitrary nested structure for example, well
>   formed HTML, TeX \aaa{...} syntax, lisp (...) syntax, are not
>   regular languages and so can not (by definition) be parsed by a
>   single regular expression. However in many cases (including all of
>   the examples cited above) the tokens delimiting the nesting may be
>   matched by regular expressions. This should allow the input string
>   to be tokenised using regexp into a format in which the recursion
>   and/or counting required to handle the nested structure may be
>   handled by standard constructs in the language (XPath or XSLT or
>   Xquery).
> 
>   Some people have suggested that XSLT should be directly extended to
>   support the specification of more general grammars, in the style of
>   lex/yacc. But the proposal here is that regexp support, if 
> sufficiently
>   integrated into the existing functionality of the language, 
> should be able
>   to handle a large range of cases without the complications 
> of explictly
>   adding more general parsing support.
> 
> RE-6a: TeX (simplified)
>   Convert 
>     \\([a-z])+{....}  to <\1>...</\1>
>     but with special case of
>     \\frac{....}{....}  to 
> <frac><num>...</num><denom>...</denom></frac>
> 
>   For example
>    \frac{1 + \sin{2}}{3 + \cos{4}}
>   to
>    <frac>
>      <num>1 + <sin>2</sin></num>
>      <denom>3 + <cos>4</cos></denom>
>   </frac>
> 
> RE-6b: Well formed XML markup
>   Parse a well formed XML instance that has no DOCTYPE, entity or
>   character references or attributes. (These could be added 
> but without
>   adding any major new issues.)
> 
>   For example convert the input xml
>    <entry><![CDATA[<abc>12 <x/>  <wxyz>34</wxyz></abc>]]></entry>
>   To
>    <entry><abc>12 <x/>  <wxyz>34</wxyz></abc></entry>
> 
> RE-6c: HTML Markup.
>   As above but with HTML, in particular with implied end 
> tags. In general
>   this requires a DTD and knowledge of SGML omitted tag 
> rules. To handle
>   general HTML as it appears in the wild, arbitrarily complicated "tag
>   soup" parsing heuristics as implemented in the browsers would be
>   needed. However this appears to be a very common requirement often
>   generated by storing HTML fragments as strings in a 
> database. One may
>   hope that specific simple cases may be handled for example:
> 
>   Convert a list 
>    <entry><![CDATA[<ol><li>aaa <li>bbb </ol>]]></entry>
> to
>    <entry><ol><li>aaa </li><li>bbb </li></ol></entry>
> 
> RE-7: Transliteration
>   Take an input string in AMS cyrillic transliteration scheme 
> and convert
>   to Unicode characters. The exact scheme will be omitted here but the
>   details are available at http://www.tex.org.
>   This differs from the "multiple regexp" example
>   in the way conflicting regexp matches need to be handled. 
> For multiple
>   regexp matching above one needs a priority mechanism so that certain
>   regexp are matched first and lower priority regexp are only 
> applied to
>   remaining strings. Transliteration matches need to be applied by
>   matching the start of the input string with the longest 
> possible match,
>   replacing this by the transliterated sequence, and then finding he
>   longest possible match at the start of the remaining string.
>   Thus if abc transliterates to X and 
>           bcd transliterates to Y
>           xab                   Z
>           c                     C
>           d                     D
>   then
>     abcd  -> XD
>     xabcd -> ZCD
>   Thus you could not, for example, start by replacing all abc by X.
> 
> 
> RE-8: Free format text input.
>       This example is based on a (real) question in xsl-list.
>       (using | denote line start, ignoring indentation for this mail)
> 
>   |Some heading, with subphrases:
>   |   An item without a bullet.
>   |     Name = value pair.
>   |     Property: value.
>   |     Score = 7 (a = 1, b =3, c = 4).
>   |     A full sentence that has so many words that it spans
>   |         multiple lines.
>   |     Sometimes we can't even trust whether people get the
>   |indention consistent.
>   |
>   |and making it:
>   |
>   |<entry>
>   |   <heading>
>   |      Some heading
>   |      <subheading>with subheading:</subheading>
>   |   </heading>
>   |   <item>
>   |     <heading>An item without a bullet.</heading>
>   |        <pair name='name' value='value pair.'/>
>   |        <pair name='property' value='value.'/>
>   |        <pair name='Score' value='7'>
>   |            <pair name='a' value='1'/>
>   |            <pair name='b' value='3'/>
>   |            <pair name='c' value='4'/>
>   |        </pair>
>   |        <sentence>A full sentence that has so many words 
> that it spans
>   |         multiple lines.</sentence>
>   |        <sentence>Sometimes we can't even trust whether 
> people get the
>   |indention consistent.</sentence>
>   |   </item>
>   |</entry>
>   |
>   |
> 
> 
> 
> 5) Possible XSLT2 Regexp syntax
> -------------------------------
> 
> The basic idea outlined in the proposal below is that the main task in
> all the above use cases is the construction of a result tree given
> some input. The construction aspects of the new functionality should
> therefore be designed to match existing construction possibilities,
> with the only difference being that they are triggered by a substring
> of an input string matching a regexp rather than a node in an input
> tree matching an Xpath.
> 
> A new instruction: <xsl:apply-regexp-templates>
> 
>  Taking same attributes and content as apply-templates.
>  The select attribute should evaluate to a sequence of string-valued
>  items. If more than one string is in the sequence, the result is the
>  sequence produced by concatenating the result of processing each
>  string. In fact all the examples presented will always have a
>  sequence of at most one (that is, a string) and it would be possible
>  to specify that the argument should be a single string if that proves
>  to be a useful simplification.
> 
> A new top level instruction: <xsl:regexp-template>
> 
>   Taking a mandatory match attribute
>   optional priority attribute.
>   and optional mode attribute.
> 
>   The mode attribute works as for xsl:template, if xsl:apply-regexp
>   templates specifies a mode then only regexp-templates's declared for
>   that mode will be considered.
> 
>   The match attribute takes a regular expression, ie a restricted form
>   of string. It is assumed here that these are essentially fixed
>   strings, if implementation/efficiency concerns allow they could
>   perhaps be attribute value templates to allow more dynamic choice in
>   the regular expressions. Or, equivalently to AVT, but with slightly
>   different syntax, the match attribute could take arbitrary xpath
>   expressions  so long as they evaluated to a string that was a legal
>   regexp. (As another variant not further explored here one could
>   consider regexp to be a derived type from string rather than typing
>   regexps as strings and just stating at a meta level that they have
>   to match regexp syntax).
> 
> So a typical example, meeting the first use case, would be
>   <xsl:template match="xx"/>
>   <div>
>   <xsl:apply-regexp-templates match="."/>
>   </div>
>   </xsl:template>
>   ...
>   <xsl:regexp-template match="&#10;">
>   <br/>
>   </xsl:regexp-template>
> 
>   The template matching <xx> would then result in the character data
>   of xx being copied into a div element in the output, with all new
>   line characters become html br elements.
> 
> A new XSLT-specific XPath function current-match().
>   Within a regexp-template current-match will return a sequence of 1
>   or more strings the ith item being the substring matching the ith
>   parenthised expression in the regexp of the template.
>   thus given a regexp of (a*)(b*) matching aaabb then within the
>   template, . will be "aaabb" current-match()[1] will be "aaa" and
>   current-match()[2] will be "bb". In the presence of alternation (|)
>   and repeat clauses ({3}) it isn't always immediately clear how
>   subexpressions should be numbered but perl semantics should be
>   followed (as schema explicitly tries to be perl like in its regexp
>   semantics).
>   
> 
> In detail the execution model would be as follows.
> 
> a. The select attribute of apply-regexp-templates is evaluated.
>    If it is not a sequence of strings an error is raised.
>    If it is a sequence each is processed separately and the result is
>    the sequence of results.
>    So we need only consider a single string.
> 
> b. If a mode is supplied all regexp-templates in that mode are now
>    consided, otherwise all the regexp templates in the default mode
>    are considered.
> 
> c. The templates being considered are then ordered by priority.
>    If two templates of equal priority could potentially match
>    overlapping strings then eitherthis would be an error or a default
>    priority scheme would enforce an ordering (either implementation
>    defined or order in stylesheet) (to be decided).
> 
>    For each template in turn, starting with highest priority,
>    the regexp is matched on the subsequences of the original input
>    string that have yet to be matched. Once a match is found the
>    sequence of substrings is extended by splitting the current
>    substring into three: the substring-before the matched substring
>    and the substring after. This continues, finally using a default
>    template of which matches every character "(.|&#10;)" until the end
>    result is that the original string is now a sequence of
>    substrings each associated with a template whose match regexp
>    produced the substring.
>    Now the focus is set such that this derived sequence is the current
>    sequence, and each of the associated regexp-templates is executed
>    in order with the current item being the matched substring and
>    position() being the position in the derived sequence.
> 
>    This means that the regexp's can access the matched string using .
>    finer control, accessing subexpressions can be achieved using
>    the match-string() function described above.
> 
> d. The resulting sequence is the concatenation of the results of each
>    of these templates.
> 
> 
> 
> 
> 6) Variants on the propsosal
> ----------------------------
> This section discusses some variants on the above. The variations are
> often mutually exclusive, it is not proposed that all these features
> are added simultaneously, although of course a system that
> incorporated some features of more than one of these variants would
> also be possible.
> 
> V1: Tokenise Function
>   It would be possible to make the implicit splitting up of the input
>   string into a sequence of substrings explicit.
>   <xsl:apply-regexp-templates select="$string"/>
>   would be replaced by  
>   <xsl:apply-regexp-templates select="regexp-tokenise($string)"/>
>   where regexp-tokenise would a set of regexp's (specified by a method
>   to be determined) and split up the string as above. The difference
>   would be that the sequence resulting would "just" be a sequence of
>   strings, which would make it a first class object in the data model.
>   Of course apply-regexp-templates would take any sequence of strings
>   they would not be forced to be generated by tokenise() (although in
>   practice that would be most common).
> 
>   In this model the semantics of the match expression in
>   <xsl:regexp-template
>   is slightly different. rather than matching on a substring of some
>   input string, it would be an _anchored_ match and the template would
>   fire if the regexp matched an entire string in the input sequence.
>   As in the main proposal . and position() etc would reflect the
>   position of the matched item in the input sequence.
> 
>   The advantage of this model is that the sequence is made explicit
>   as a standard XPath sequence. The disadvantage is that the regexps
>   may have to be specified (and executed) twice. Once as unanchored
>   regexps to tokenise the input string into a sequence of substrings,
>   and then again as anchored regexps to associate templates with each
>   of the matched substrings.
> 
> 
> V2: Immediate template execution
>   Rather than first building an implicit sequence of substrings the
>   mechanism could be that as each regexp is matched against a 
> substring
>   of the original string, a sequence is built as in the main proposal
>   with the string-before and string-after the match but in this case
>   between these is placed the result of executing the template.
> 
>   This avoids having to build the sequence of strings "associating"
>   each one with a template, but it is harder to suggest good values
>   for . and position() in this case.
>   possibly y position() should be 1 and . should be the original
>   string (in the case that apply-regexp-templates was just given a
>   select expression of a single string. In particular the focus would
>   not change as regexp-templates were applied. (Similar to named
>   templates.)
> 
> V1 and V2 produce (at least for templates not using an implicit
> setting of . or position()) the same results as the main proposal.
> The last variant has a different model of conflict resolution for
> overlapping matches and will typically produce different a 
> result given
> similar looking regexp.
> 
> V3: left-to-right matching.
>   Rather than match regexp in priority-order, an alternative matching
>   scheme would be to start from the start of the string and find the
>   longest possible match from all the templates under consideration.
>   priority specifications would choose between regexps if more than
>   one matched the longest possible initial string.
> 
>   The template for this regexp would fire. Processing could either
>   then immediately proceed to the remaining substring, with the system
>   finding the longest initial match on the remainder, or processing
>   could effectively stop as soon as a match was found, with teh
>   remaining string being available (say with a string-sfter-match()
>   function) and so the template could explicitly invoke
>   apply-regexp-templates select="substring-after-match()"
>   at some suitable point in its execution.
> 
>   In this model . would be the matched substring and position()
>   would be (say) always 1.
> 
> V4: Support for matching pairs.
>   Many of the examples (as in the RE-6 use cases) require matching a
>   nested structure which is beyond what is possible with a single
>   regexp. Existing XSLT facilities are sufficient to "fill the gap"
>   providing the necessary arithmetic and state to handle the nested
>   parse tree. However one could consider adding faclities to make the
>   required transformations easier. It is essentially a grouping
>   problem although the new Grouping constructs in XSLT2 didn't seem
>   immediately applicable. As an alternative to adding general grouping
>   support for this kind of task it has been suggested that a special
>   template that matches on substrings between matching tokens matched
>   by a "start" and and "end" regexp could be provided.
> 
>   <xsl:regexp-balancing-template match-start="\\func\{"
>                                match-end="\}">
>   This would take two regexp match attributes/ The template would be
>   handed the intervening string as well as the two matching strings.
>   The system would handle the necessary counting to ensure that
>   the start and end expressions were correctly paired.
> 
> 7) Suggested solutions to the use cases.
> ---------------------------------------
> 
> Only solutions using the "main proposal" are given here.
> 
> 
> RE-1
>   <xsl:template match="xx"/>
>   <div>
>   <xsl:apply-regexp-templates match="."/>
>   </div>
>   </xsl:template>
>   ...
>   <xsl:regexp-template match="&#10;">
>   <br/>
>   </xsl:regexp-template>
> 
> 
>   As commented above this doesn't use regexp but is a natural simple
>   example, that is quite hard to do in XSLT1 (or even XSLT2 as
>   currently drafted) If the input string has not been through an XML
>   parser (as will be possible in XSLT2) then even this case might
>   benefit from simple regexp support, changing the match to
>   &#13;&#10;?|&#10;
> 
> RE-2
>  <xsl:apply-regexp-templates select="'17th January, 2002'"/>
>  ...
>   <xsl:regexp-template match="^ *([0-9][0-9]) 
> +([A-Za-z]{3})[a-z]* +([0-9]+) *$">
>    <xsl:value-of select="format-number(current-match()[1],'00')"/>
>    <xsl:choose>
>    <xsl:when test="lower(current-match()[1])='jan'">-01-</xsl:when>
>    <xsl:when test="lower(current-match()[1])='feb'">-02-</xsl:when>
>    ...
>    </xsl:choose>
>    <xsl:variable name="y" select="number(current-match()[3])"/>
>    <xsl:value-of select="format-number(
>    <xsl:value-of select="if($x &lt; 10) then 
>      (1900 + $x) else 
>      (if($x &lt; 1000) then( 2000 + $x) else $x) "/>
>   </xsl:regexp-template>
> ....
> 
> 
> RE-3
>   Assuming the CSV string is the current item.
>   
>   <table>
>   <xsl:apply-regexp-templates mode="row" select="."/>
>   </table>
>   ..
>   <xsl:regexp-template mode="row" match="^(.*)$">
>   <row>
>   <xsl:apply-regexp-templates mode="cell" select="."/>
>   </row>
>   </xsl:template>
> 
>   <xsl:regexp-template mode="cell" match="([^,]*)(,|$)">
>   <cell>
>   <xsl:value-of select="current-match()[1]"/>
>   </cell>
>   </xsl:template>
>   
>   Note this assumes that ^ . and $ are line based even within a larger
>   string. (This is like emacs regexp, but unlike sed. Perl has a
>   switch that allows ^ and $ to change between matching ends of lines
>   and matching the ends of the string.)
> 
> RE-4a
>     <xsl:regexp-template mode="unicodetotex" match="\$">
>      <xsl:text>\$</xsl:text>
>     </xsl:regexp-template>
>   ...
> 
> RE-4b
>     <xsl:regexp-template mode="textounicodetotex" match="\\$">
>      <xsl:text>$</xsl:text>
>     </xsl:regexp-template>
> 
>    The only interest here would be the use or priority to control
>    tex macro names which would match the same regexp.
> 
>     <xsl:regexp-template mode="textounicodetotex" 
> match="\\lt" priority="2">
>     <xsl:regexp-template mode="textounicodetotex" match="\\l" 
> priority="1">
> 
>   although an alternative would be to explicitly end each regexp with
>   a match for a non-letter (as TeX macro names only consist of letters
>   (or a single non-letter, such as \$))
> 
> 
> RE-5
>   Given a top level param $keyword containing a word to be highlighted
>   one or more construct should allow
>   string expression such as concat('\b',$keyword,`\b') or an AVT
>                                    \b{$keyword}\b
>   in order to construct the required regexp to match this word.
> 
> RE-6a
>    <xsl:regexp-template match="\\([a-z]+){" priority="2">
>      <start name="current-match()[1]"/>
>    </xsl:regexp-template>
> 
>    <xsl:regexp-template match="{" priority="1">
>      <start name=""/>
>    </xsl:regexp-template>
> 
> 
>    <xsl:regexp-template match="}">
>      <end/>
>    </xsl:regexp-template>
> 
> 
> Applying the above regexp templates to the example \frac{1 + 
> \sin{2}}{3 + \cos{4}} would
>   produce the following sequence (of text nodes and elements), putting
>   them in a containing <x> element, so as to test the following
>   stylesheet.
> 
>    <start name="frac"/>1 + <start 
> name="sin"/>2<end/><end/><start name=""/>3 + <start 
> name="cos"/>24<end/><end/>
> 
> To get to here
>    <frac>
>      <num>1 + <sin>2</sin></num>
>      <denom>3 + <cos>4</cos></denom>
>   </frac>
> 
> we just need to use standard XSLT constructs, for example this XSLT1
> stylesheet does the job. It is however noticeable that to handle the
> nesting in this sequence the current XPath2 constructs are very
> limited, primarily due to the lack of higher order functions.
> The provided "for" operator is only really suitable when each 
> item of a
> sequence is to be processed independently. A standed 
> functional operator
> such as fold would allow accumulation of information along 
> the sequence.
> Here this problem is circumvented by first converting the sequence to
> a tree so that the sibling-axis gives the required access, but this
> seems at odds with the apparent desire to make such operations
> possible at the sequence level.
> 
> <xsl:stylesheet 
> xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
>     
>     <xsl:template match="x">
>     <xsl:text>
>     </xsl:text>
>     <x>
>     <xsl:apply-templates mode="match" select="node()[1]"/>
>     </x>
>     </xsl:template>
>     
>     <xsl:template match="text()" mode="match">
>     <xsl:param name="n" select="0"/>
>      <xsl:copy-of select="."/>
>      <xsl:apply-templates mode="match" 
> select="following-sibling::node()[1]">
>      <xsl:with-param name="n" select="$n"/>
>     </xsl:apply-templates>
>     </xsl:template>
>     
>     <xsl:template match="start" mode="match">
>     <xsl:param name="n" select="0"/>
>     <xsl:variable name="end" 
>      
> select="following-sibling::end[count(following-sibling::end)-c
> ount(following-sibling::start)=$n][1]"/>
>     <xsl:element name="{@name}">
>     <xsl:apply-templates 
> select="following-sibling::node()[1]" mode="match">
>      <xsl:with-param name="n" select="$n+1"/>
>     </xsl:apply-templates>
>     </xsl:element>
>     <xsl:apply-templates 
> select="$end/following-sibling::node()[1]" mode="match">
>      <xsl:with-param name="n" select="$n"/>
>     </xsl:apply-templates>
>     </xsl:template>
>     
>     <xsl:template match="start[@name='frac']" mode="match">
>     <xsl:param name="n" select="0"/>
>     <frac>
>     <num>
>     <xsl:apply-templates 
> select="following-sibling::node()[1]" mode="match">
>      <xsl:with-param name="n" select="$n+1"/>
>     </xsl:apply-templates>
>     </num>
>     <denom>
>     <xsl:apply-templates 
> select="following-sibling::end[count(following-sibling::end)-c
> ount(following-sibling::start)=$n][1]/following-sibling::node(
> )[2]" mode="match">
>      <xsl:with-param name="n" select="$n+1"/>
>     </xsl:apply-templates>
>     </denom>
>     </frac>
>     <xsl:apply-templates 
> select="following-sibling::end[count(following-sibling::end)-c
> ount(following-sibling::start)=$n][2]/following-sibling::node(
> )[1]" mode="match">
>      <xsl:with-param name="n" select="$n"/>
>     </xsl:apply-templates>
>     </xsl:template>
>     
>     <xsl:template match="end" mode="match"/>
>     
>     
>     </xsl:stylesheet>
>     
> 
> 
> RE-6b.
>   Parsing well formed XML does not really present any difficulties not
>   presented in the TeX case. The complication of macros taking more
>   than one {} group as arguments (as in the frac example above)
>   does not occur, although the regexps would need to be extended to
>   deal with empty element syntax and attributes. Full details are not
>   presented here.
> 
> RE-6c.
>   As mentioned above, the general case of parsing HTML is out of scope
>   however simple cases of omitted tags could be dealt with using the
>   priority attribute on templates.
>   a high priority template matching "</li>\s-*<li>" would handle the
>   case where the end tag was explicit, and a lower priority template
>   matching "<li>" would match in other cases, handling the implied
>   closing of the previous element.
> 
> RE-7
>  This case differs from RE-4 as there is a strong left-to-right (or at
>  least reading direction) bias. Replacements should happen at the
>  start of the string. One possibe solution is to simply prefix all
>  regexp by a ^ character to denote the start of the string.
>  One slight subtlety is that this use of ^ relies on the the fact that
>  teh string is "split up" as each string is found, and later regexp
>  apply to the sequence of remaining unmatched portions. If ^ always
>  denotes the start of the original string then prefixing all the
>  transliteration replacements by ^ would clearly not have the desired
>  effect and only the initial characters in teh original string would
>  be replaced.
> 
> RE-8
> 
> <xsl:regexp-template match="^ +([^:= ]+)\s+[:=]\s+(.*)$"
> 
> ...some standard template contains ...
>   <entry>
>     <xsl:apply-regexp-templates select="." />
>   </entry>
> 
> <!-- matches headings -->
> <xsl:regexp-template match="^([^,]+), +\([^:]+\):$">
>   <heading>
>     <xsl:value-of select="current-match()[1]" />
>     <subheading>
>       <xsl:value-of select="current-match()[2]" />
>     </subheading>
>   </heading>
> </xsl:regexp-template>
> 
> 
> 
> <!-- matches items -->
> <!-- note by having an explicit regexp to pick up the item 
> text you can
> easily extend to the case where you want to match as far as the next
> item or heading,
> here I'm using the regexp \' which is emacs-regexp for end-of-string
> (as opposed to $ which is end-of-line) so I'll  grab everything,
> for now.
> -->
> <xsl:regexp-template match="^   ([^ ].*)$(.|\n)*\'">
>   <item>
>     <heading><xsl:value-of select="current-match()[1]" /></heading>
>     <xsl:apply-regexp-templates select="current-match()[2]" />
>   </item>
> </xsl:regexp-template>
> 
> 
> <!-- matches pairs -->
> <xsl:regexp-template match="^     ([^:= 
> ]+)\s+[:=]\s+([^\(]*)(\([^\(\)]*\))?$">
>   <pair name="{current-match()[1]}"
>         value="{current-match()[2]}">
>           <xsl:apply-regexp-templates select="current-match()[3]"
>                                       mode="pair" />
>   </pair>
> </xsl:regexp-template>
> 
> 
> 
> <!-- matches nested pairs -->
> <xsl:regexp-template match="^([^:= ]+)\s+[:=]\s+([^,]+),?"
>                      mode="pair">
>   <pair name="{current-match()[1]}" value="{current-match()[2]}" />
> </xsl:regexp-template>
> 
> 
> 
> <!-- matches sentences -->
> <xsl:regexp-template match="^     ([^\.]+)." priority="-1">
>   <sentence>
>     <xsl:value-of select="concat(current-match()[1], '.')" />
>   </sentence>
> </xsl:regexp-template>
> 
> 
> 
> 
> 
> 
> 
> 
> 
> _____________________________________________________________________
> This message has been checked for all known viruses by Star Internet
> delivered through the MessageLabs Virus Scanning Service. For further
> information visit http://www.star.net.uk/stats.asp or 
> alternatively call
> Star Internet for details on the Virus Scanning Service.
> 

Received on Wednesday, 23 January 2002 13:26:44 UTC