regexp support

This got a bit longer than perhaps is wise as an "initial comment"
but some thoughts on regexp in XSLT...

David



Regular expression support in XPath/XSLT/Xquery
=============================================== 

1)Abstract
----------

This note proposes the addition of support for regular expressions in XSLT 2
beyond that currently proposed by the functions in the XPath 2 draft.

It is loosely based on discussions on and off xsl-list, principally with
Jeni Tennison, although the details of this proposal are mine and Jeni and
others who took part in the xsl-list discussions should feel free to
comment (and disagree!) with any parts of this.

It does propose some possible syntax for this functionality but this is
just a draft; the main aim of the note is to put forward some use cases and 
requirements. The exact syntax would probaby need to be refined.

One of the main issues raised by the use cases to be presented here is the
requirement to build a tree fragment (as opposed to a string) based on the
matching (or not) of regular expressions to an input string. The Functions
and Operators draft currently suggests a regexp-replace function but this
is restricted to constructing strings, so is of limited usefulness in an
XSLT (or Xquery) context.

Whilst the extended function definition possibilities in XPath 2 may, in
principle, mean that it would be possible to add such functionality to
Xpath, I think that it is a useful distinction in XSLT that should be
preserved that the principle node creation mechanism is via XSLT, with
Xpath being primarily a selection and expression language. It is quite
likely that Xquery will require similar functionality, and in Xquery a
function form may be natural, but this note concentrates on XSLT and
suggests an extension of the XSLT template mechanism. This should not
inhibit he addition of similar functionality to Xquery with a different
syntax, as differing syntax for node creation is in any case one of the
major distinguishing features between the two languages.

David Carlisle



2) Contents
----------

1) Abstract
2) Contents 
3) Regexp Syntax
4) Use Cases
5) Possible XSLT2 Regexp syntax
6) Variants on the propsosal
7) Suggested solutions to the use cases.


3) Regexp Syntax
----------------

It is clearly desirable that the regexp syntax in XPath is largely
compatible with that of XML Schema, however I feel that the requirements of
searching and replacing substrings within a larger input string (the
typical scenarios presented here) are rather different from the
requirements for specifying  regexp that fully match the (typically smaller)
complete character content of a typed element. 
Thus in my examples below I will use extended regexp syntax if this seems
appropriate staying compatible with perl wherever possible, (although
personally, I'm more familiar with the slightly different emacs
conventions).

In particular, regexp used for search and replace will need to be
unanchored and special characters will need to be introduced, for anchoring
to start and end of the string and/or lines (^ $ \z and \Z in perl regexp
syntax) and possibly other meta characters or character classes will need
to be added, depending on perceived requirements. However this note does
not concentrate on the regexp syntax but rather on the possibilities for
making structured replacements.  This is the classic "up translation" task
of moving from unstructured (or insufficiently structured) data to
structured matkup.


3) Use Cases
------------

RE-1: HTML Line Break
  In an input string replace all line end characters (which we may assume
  have already been normalised to #xA) by the element <br/>.

  This is one of the more common requests on XSL-list. It does not actually
  require any regular expression support as it is searching for a single
  character (although one may possibly want to search for other line end
  characters if the string has not been through an XML line end
  normalisation, as presumably (?) is the case for the unparsed-text()
  function.) It does however demonstrate the need to generate element nodes
  at positions determined by searching a string.

RE-2: Natural language date strings.
  Here the aim is to split up the (complete) string which is known to be a
  date in (say) English language format such as "17th January, 2002"
  and produce the string in ISO format 2002-01-17 suitable for coercing to
  a dateTime, and being used with Xpath date expressions.

RE-3: Parsing a CSV file into an XML tree fragment.
  Given an input string
  1,2
  3,4
  produce the tree fragment equivalent to 
  <table>
  <row><cell>1</cell><cell>2</cell></row>
  <row><cell>3</cell><cell>4</cell></row>
  </table>

RE-4: Multiple regexp-replace.
  The proposed replace function in F&O replaces substrings matching a
  single regexp but often one wants to replace many strings in parallel.

  I am assuming here that the normal XSLT creation model is followed that
  _all_ replacements take place (where possible, with a suitable priority
  mechanism for controlling clashes) on (substrings of) the original
  string, and a new node tree is constructed. Even when generating strings
  (as here) this differs from  the result of repeatedly calling the replace
  function proposed in the F&O draft as that would, most naturally, apply
  later regexp matching to the _result_ of earlier matches.

  An example recently mentioned on xml-dev:

RE-4a: Going from an XML unicode string to TeX:
    replace &     by \&
            $     by \$
            #169  by \copyright
            #233  by \'{e}
            <     by \lt
            #322 by \l
            ...
RE-4b: The reverse of this transformation.

RE-5: Calculation of "dynamic" regular expressions.
  Whilst most suggested "template match" extensions to regexp matching
  have assumed constant match strings, at least some support should be
  available to build up regular expressions as the result of XPath
  expressions. For example a stylesheet might accept a word (or list of
  words) as a parameter and build up the regexp adding word boundary
  markers (\b in perl) and and alternation (|). This string valued
  expression should then be usable as  a regular expression.
  (From a user perspective, all uses of regexp could be replaced by
  string valued expressions (or avt) although efficiency and other
  considerations may not allow all uses of regexp to have this
  flexibility).

RE-6: Nested structures.
  Input strings with arbitrary nested structure for example, well
  formed HTML, TeX \aaa{...} syntax, lisp (...) syntax, are not
  regular languages and so can not (by definition) be parsed by a
  single regular expression. However in many cases (including all of
  the examples cited above) the tokens delimiting the nesting may be
  matched by regular expressions. This should allow the input string
  to be tokenised using regexp into a format in which the recursion
  and/or counting required to handle the nested structure may be
  handled by standard constructs in the language (XPath or XSLT or
  Xquery).

  Some people have suggested that XSLT should be directly extended to
  support the specification of more general grammars, in the style of
  lex/yacc. But the proposal here is that regexp support, if sufficiently
  integrated into the existing functionality of the language, should be able
  to handle a large range of cases without the complications of explictly
  adding more general parsing support.

RE-6a: TeX (simplified)
  Convert 
    \\([a-z])+{....}  to <\1>...</\1>
    but with special case of
    \\frac{....}{....}  to <frac><num>...</num><denom>...</denom></frac>

  For example
   \frac{1 + \sin{2}}{3 + \cos{4}}
  to
   <frac>
     <num>1 + <sin>2</sin></num>
     <denom>3 + <cos>4</cos></denom>
  </frac>

RE-6b: Well formed XML markup
  Parse a well formed XML instance that has no DOCTYPE, entity or
  character references or attributes. (These could be added but without
  adding any major new issues.)

  For example convert the input xml
   <entry><![CDATA[<abc>12 <x/>  <wxyz>34</wxyz></abc>]]></entry>
  To
   <entry><abc>12 <x/>  <wxyz>34</wxyz></abc></entry>

RE-6c: HTML Markup.
  As above but with HTML, in particular with implied end tags. In general
  this requires a DTD and knowledge of SGML omitted tag rules. To handle
  general HTML as it appears in the wild, arbitrarily complicated "tag
  soup" parsing heuristics as implemented in the browsers would be
  needed. However this appears to be a very common requirement often
  generated by storing HTML fragments as strings in a database. One may
  hope that specific simple cases may be handled for example:

  Convert a list 
   <entry><![CDATA[<ol><li>aaa <li>bbb </ol>]]></entry>
to
   <entry><ol><li>aaa </li><li>bbb </li></ol></entry>

RE-7: Transliteration
  Take an input string in AMS cyrillic transliteration scheme and convert
  to Unicode characters. The exact scheme will be omitted here but the
  details are available at http://www.tex.org.
  This differs from the "multiple regexp" example
  in the way conflicting regexp matches need to be handled. For multiple
  regexp matching above one needs a priority mechanism so that certain
  regexp are matched first and lower priority regexp are only applied to
  remaining strings. Transliteration matches need to be applied by
  matching the start of the input string with the longest possible match,
  replacing this by the transliterated sequence, and then finding he
  longest possible match at the start of the remaining string.
  Thus if abc transliterates to X and 
          bcd transliterates to Y
          xab                   Z
          c                     C
          d                     D
  then
    abcd  -> XD
    xabcd -> ZCD
  Thus you could not, for example, start by replacing all abc by X.


RE-8: Free format text input.
      This example is based on a (real) question in xsl-list.
      (using | denote line start, ignoring indentation for this mail)

  |Some heading, with subphrases:
  |   An item without a bullet.
  |     Name = value pair.
  |     Property: value.
  |     Score = 7 (a = 1, b =3, c = 4).
  |     A full sentence that has so many words that it spans
  |         multiple lines.
  |     Sometimes we can't even trust whether people get the
  |indention consistent.
  |
  |and making it:
  |
  |<entry>
  |   <heading>
  |      Some heading
  |      <subheading>with subheading:</subheading>
  |   </heading>
  |   <item>
  |     <heading>An item without a bullet.</heading>
  |        <pair name='name' value='value pair.'/>
  |        <pair name='property' value='value.'/>
  |        <pair name='Score' value='7'>
  |            <pair name='a' value='1'/>
  |            <pair name='b' value='3'/>
  |            <pair name='c' value='4'/>
  |        </pair>
  |        <sentence>A full sentence that has so many words that it spans
  |         multiple lines.</sentence>
  |        <sentence>Sometimes we can't even trust whether people get the
  |indention consistent.</sentence>
  |   </item>
  |</entry>
  |
  |



5) Possible XSLT2 Regexp syntax
-------------------------------

The basic idea outlined in the proposal below is that the main task in
all the above use cases is the construction of a result tree given
some input. The construction aspects of the new functionality should
therefore be designed to match existing construction possibilities,
with the only difference being that they are triggered by a substring
of an input string matching a regexp rather than a node in an input
tree matching an Xpath.

A new instruction: <xsl:apply-regexp-templates>

 Taking same attributes and content as apply-templates.
 The select attribute should evaluate to a sequence of string-valued
 items. If more than one string is in the sequence, the result is the
 sequence produced by concatenating the result of processing each
 string. In fact all the examples presented will always have a
 sequence of at most one (that is, a string) and it would be possible
 to specify that the argument should be a single string if that proves
 to be a useful simplification.

A new top level instruction: <xsl:regexp-template>

  Taking a mandatory match attribute
  optional priority attribute.
  and optional mode attribute.

  The mode attribute works as for xsl:template, if xsl:apply-regexp
  templates specifies a mode then only regexp-templates's declared for
  that mode will be considered.

  The match attribute takes a regular expression, ie a restricted form
  of string. It is assumed here that these are essentially fixed
  strings, if implementation/efficiency concerns allow they could
  perhaps be attribute value templates to allow more dynamic choice in
  the regular expressions. Or, equivalently to AVT, but with slightly
  different syntax, the match attribute could take arbitrary xpath
  expressions  so long as they evaluated to a string that was a legal
  regexp. (As another variant not further explored here one could
  consider regexp to be a derived type from string rather than typing
  regexps as strings and just stating at a meta level that they have
  to match regexp syntax).

So a typical example, meeting the first use case, would be
  <xsl:template match="xx"/>
  <div>
  <xsl:apply-regexp-templates match="."/>
  </div>
  </xsl:template>
  ...
  <xsl:regexp-template match="&#10;">
  <br/>
  </xsl:regexp-template>

  The template matching <xx> would then result in the character data
  of xx being copied into a div element in the output, with all new
  line characters become html br elements.

A new XSLT-specific XPath function current-match().
  Within a regexp-template current-match will return a sequence of 1
  or more strings the ith item being the substring matching the ith
  parenthised expression in the regexp of the template.
  thus given a regexp of (a*)(b*) matching aaabb then within the
  template, . will be "aaabb" current-match()[1] will be "aaa" and
  current-match()[2] will be "bb". In the presence of alternation (|)
  and repeat clauses ({3}) it isn't always immediately clear how
  subexpressions should be numbered but perl semantics should be
  followed (as schema explicitly tries to be perl like in its regexp
  semantics).
  

In detail the execution model would be as follows.

a. The select attribute of apply-regexp-templates is evaluated.
   If it is not a sequence of strings an error is raised.
   If it is a sequence each is processed separately and the result is
   the sequence of results.
   So we need only consider a single string.

b. If a mode is supplied all regexp-templates in that mode are now
   consided, otherwise all the regexp templates in the default mode
   are considered.

c. The templates being considered are then ordered by priority.
   If two templates of equal priority could potentially match
   overlapping strings then eitherthis would be an error or a default
   priority scheme would enforce an ordering (either implementation
   defined or order in stylesheet) (to be decided).

   For each template in turn, starting with highest priority,
   the regexp is matched on the subsequences of the original input
   string that have yet to be matched. Once a match is found the
   sequence of substrings is extended by splitting the current
   substring into three: the substring-before the matched substring
   and the substring after. This continues, finally using a default
   template of which matches every character "(.|&#10;)" until the end
   result is that the original string is now a sequence of
   substrings each associated with a template whose match regexp
   produced the substring.
   Now the focus is set such that this derived sequence is the current
   sequence, and each of the associated regexp-templates is executed
   in order with the current item being the matched substring and
   position() being the position in the derived sequence.

   This means that the regexp's can access the matched string using .
   finer control, accessing subexpressions can be achieved using
   the match-string() function described above.

d. The resulting sequence is the concatenation of the results of each
   of these templates.




6) Variants on the propsosal
----------------------------
This section discusses some variants on the above. The variations are
often mutually exclusive, it is not proposed that all these features
are added simultaneously, although of course a system that
incorporated some features of more than one of these variants would
also be possible.

V1: Tokenise Function
  It would be possible to make the implicit splitting up of the input
  string into a sequence of substrings explicit.
  <xsl:apply-regexp-templates select="$string"/>
  would be replaced by  
  <xsl:apply-regexp-templates select="regexp-tokenise($string)"/>
  where regexp-tokenise would a set of regexp's (specified by a method
  to be determined) and split up the string as above. The difference
  would be that the sequence resulting would "just" be a sequence of
  strings, which would make it a first class object in the data model.
  Of course apply-regexp-templates would take any sequence of strings
  they would not be forced to be generated by tokenise() (although in
  practice that would be most common).

  In this model the semantics of the match expression in
  <xsl:regexp-template
  is slightly different. rather than matching on a substring of some
  input string, it would be an _anchored_ match and the template would
  fire if the regexp matched an entire string in the input sequence.
  As in the main proposal . and position() etc would reflect the
  position of the matched item in the input sequence.

  The advantage of this model is that the sequence is made explicit
  as a standard XPath sequence. The disadvantage is that the regexps
  may have to be specified (and executed) twice. Once as unanchored
  regexps to tokenise the input string into a sequence of substrings,
  and then again as anchored regexps to associate templates with each
  of the matched substrings.


V2: Immediate template execution
  Rather than first building an implicit sequence of substrings the
  mechanism could be that as each regexp is matched against a substring
  of the original string, a sequence is built as in the main proposal
  with the string-before and string-after the match but in this case
  between these is placed the result of executing the template.

  This avoids having to build the sequence of strings "associating"
  each one with a template, but it is harder to suggest good values
  for . and position() in this case.
  possibly y position() should be 1 and . should be the original
  string (in the case that apply-regexp-templates was just given a
  select expression of a single string. In particular the focus would
  not change as regexp-templates were applied. (Similar to named
  templates.)

V1 and V2 produce (at least for templates not using an implicit
setting of . or position()) the same results as the main proposal.
The last variant has a different model of conflict resolution for
overlapping matches and will typically produce different a result given
similar looking regexp.

V3: left-to-right matching.
  Rather than match regexp in priority-order, an alternative matching
  scheme would be to start from the start of the string and find the
  longest possible match from all the templates under consideration.
  priority specifications would choose between regexps if more than
  one matched the longest possible initial string.

  The template for this regexp would fire. Processing could either
  then immediately proceed to the remaining substring, with the system
  finding the longest initial match on the remainder, or processing
  could effectively stop as soon as a match was found, with teh
  remaining string being available (say with a string-sfter-match()
  function) and so the template could explicitly invoke
  apply-regexp-templates select="substring-after-match()"
  at some suitable point in its execution.

  In this model . would be the matched substring and position()
  would be (say) always 1.

V4: Support for matching pairs.
  Many of the examples (as in the RE-6 use cases) require matching a
  nested structure which is beyond what is possible with a single
  regexp. Existing XSLT facilities are sufficient to "fill the gap"
  providing the necessary arithmetic and state to handle the nested
  parse tree. However one could consider adding faclities to make the
  required transformations easier. It is essentially a grouping
  problem although the new Grouping constructs in XSLT2 didn't seem
  immediately applicable. As an alternative to adding general grouping
  support for this kind of task it has been suggested that a special
  template that matches on substrings between matching tokens matched
  by a "start" and and "end" regexp could be provided.

  <xsl:regexp-balancing-template match-start="\\func\{"
                               match-end="\}">
  This would take two regexp match attributes/ The template would be
  handed the intervening string as well as the two matching strings.
  The system would handle the necessary counting to ensure that
  the start and end expressions were correctly paired.

7) Suggested solutions to the use cases.
---------------------------------------

Only solutions using the "main proposal" are given here.


RE-1
  <xsl:template match="xx"/>
  <div>
  <xsl:apply-regexp-templates match="."/>
  </div>
  </xsl:template>
  ...
  <xsl:regexp-template match="&#10;">
  <br/>
  </xsl:regexp-template>


  As commented above this doesn't use regexp but is a natural simple
  example, that is quite hard to do in XSLT1 (or even XSLT2 as
  currently drafted) If the input string has not been through an XML
  parser (as will be possible in XSLT2) then even this case might
  benefit from simple regexp support, changing the match to
  &#13;&#10;?|&#10;

RE-2
 <xsl:apply-regexp-templates select="'17th January, 2002'"/>
 ...
  <xsl:regexp-template match="^ *([0-9][0-9]) +([A-Za-z]{3})[a-z]* +([0-9]+) *$">
   <xsl:value-of select="format-number(current-match()[1],'00')"/>
   <xsl:choose>
   <xsl:when test="lower(current-match()[1])='jan'">-01-</xsl:when>
   <xsl:when test="lower(current-match()[1])='feb'">-02-</xsl:when>
   ...
   </xsl:choose>
   <xsl:variable name="y" select="number(current-match()[3])"/>
   <xsl:value-of select="format-number(
   <xsl:value-of select="if($x &lt; 10) then 
     (1900 + $x) else 
     (if($x &lt; 1000) then( 2000 + $x) else $x) "/>
  </xsl:regexp-template>
....


RE-3
  Assuming the CSV string is the current item.
  
  <table>
  <xsl:apply-regexp-templates mode="row" select="."/>
  </table>
  ..
  <xsl:regexp-template mode="row" match="^(.*)$">
  <row>
  <xsl:apply-regexp-templates mode="cell" select="."/>
  </row>
  </xsl:template>

  <xsl:regexp-template mode="cell" match="([^,]*)(,|$)">
  <cell>
  <xsl:value-of select="current-match()[1]"/>
  </cell>
  </xsl:template>
  
  Note this assumes that ^ . and $ are line based even within a larger
  string. (This is like emacs regexp, but unlike sed. Perl has a
  switch that allows ^ and $ to change between matching ends of lines
  and matching the ends of the string.)

RE-4a
    <xsl:regexp-template mode="unicodetotex" match="\$">
     <xsl:text>\$</xsl:text>
    </xsl:regexp-template>
  ...

RE-4b
    <xsl:regexp-template mode="textounicodetotex" match="\\$">
     <xsl:text>$</xsl:text>
    </xsl:regexp-template>

   The only interest here would be the use or priority to control
   tex macro names which would match the same regexp.

    <xsl:regexp-template mode="textounicodetotex" match="\\lt" priority="2">
    <xsl:regexp-template mode="textounicodetotex" match="\\l" priority="1">

  although an alternative would be to explicitly end each regexp with
  a match for a non-letter (as TeX macro names only consist of letters
  (or a single non-letter, such as \$))


RE-5
  Given a top level param $keyword containing a word to be highlighted
  one or more construct should allow
  string expression such as concat('\b',$keyword,`\b') or an AVT
                                   \b{$keyword}\b
  in order to construct the required regexp to match this word.

RE-6a
   <xsl:regexp-template match="\\([a-z]+){" priority="2">
     <start name="current-match()[1]"/>
   </xsl:regexp-template>

   <xsl:regexp-template match="{" priority="1">
     <start name=""/>
   </xsl:regexp-template>


   <xsl:regexp-template match="}">
     <end/>
   </xsl:regexp-template>


Applying the above regexp templates to the example \frac{1 + \sin{2}}{3 + \cos{4}} would
  produce the following sequence (of text nodes and elements), putting
  them in a containing <x> element, so as to test the following
  stylesheet.

   <start name="frac"/>1 + <start name="sin"/>2<end/><end/><start name=""/>3 + <start name="cos"/>24<end/><end/>

To get to here
   <frac>
     <num>1 + <sin>2</sin></num>
     <denom>3 + <cos>4</cos></denom>
  </frac>

we just need to use standard XSLT constructs, for example this XSLT1
stylesheet does the job. It is however noticeable that to handle the
nesting in this sequence the current XPath2 constructs are very
limited, primarily due to the lack of higher order functions.
The provided "for" operator is only really suitable when each item of a
sequence is to be processed independently. A standed functional operator
such as fold would allow accumulation of information along the sequence.
Here this problem is circumvented by first converting the sequence to
a tree so that the sibling-axis gives the required access, but this
seems at odds with the apparent desire to make such operations
possible at the sequence level.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    
    <xsl:template match="x">
    <xsl:text>
    </xsl:text>
    <x>
    <xsl:apply-templates mode="match" select="node()[1]"/>
    </x>
    </xsl:template>
    
    <xsl:template match="text()" mode="match">
    <xsl:param name="n" select="0"/>
     <xsl:copy-of select="."/>
     <xsl:apply-templates mode="match" select="following-sibling::node()[1]">
     <xsl:with-param name="n" select="$n"/>
    </xsl:apply-templates>
    </xsl:template>
    
    <xsl:template match="start" mode="match">
    <xsl:param name="n" select="0"/>
    <xsl:variable name="end" 
     select="following-sibling::end[count(following-sibling::end)-count(following-sibling::start)=$n][1]"/>
    <xsl:element name="{@name}">
    <xsl:apply-templates select="following-sibling::node()[1]" mode="match">
     <xsl:with-param name="n" select="$n+1"/>
    </xsl:apply-templates>
    </xsl:element>
    <xsl:apply-templates select="$end/following-sibling::node()[1]" mode="match">
     <xsl:with-param name="n" select="$n"/>
    </xsl:apply-templates>
    </xsl:template>
    
    <xsl:template match="start[@name='frac']" mode="match">
    <xsl:param name="n" select="0"/>
    <frac>
    <num>
    <xsl:apply-templates select="following-sibling::node()[1]" mode="match">
     <xsl:with-param name="n" select="$n+1"/>
    </xsl:apply-templates>
    </num>
    <denom>
    <xsl:apply-templates select="following-sibling::end[count(following-sibling::end)-count(following-sibling::start)=$n][1]/following-sibling::node()[2]" mode="match">
     <xsl:with-param name="n" select="$n+1"/>
    </xsl:apply-templates>
    </denom>
    </frac>
    <xsl:apply-templates select="following-sibling::end[count(following-sibling::end)-count(following-sibling::start)=$n][2]/following-sibling::node()[1]" mode="match">
     <xsl:with-param name="n" select="$n"/>
    </xsl:apply-templates>
    </xsl:template>
    
    <xsl:template match="end" mode="match"/>
    
    
    </xsl:stylesheet>
    


RE-6b.
  Parsing well formed XML does not really present any difficulties not
  presented in the TeX case. The complication of macros taking more
  than one {} group as arguments (as in the frac example above)
  does not occur, although the regexps would need to be extended to
  deal with empty element syntax and attributes. Full details are not
  presented here.

RE-6c.
  As mentioned above, the general case of parsing HTML is out of scope
  however simple cases of omitted tags could be dealt with using the
  priority attribute on templates.
  a high priority template matching "</li>\s-*<li>" would handle the
  case where the end tag was explicit, and a lower priority template
  matching "<li>" would match in other cases, handling the implied
  closing of the previous element.

RE-7
 This case differs from RE-4 as there is a strong left-to-right (or at
 least reading direction) bias. Replacements should happen at the
 start of the string. One possibe solution is to simply prefix all
 regexp by a ^ character to denote the start of the string.
 One slight subtlety is that this use of ^ relies on the the fact that
 teh string is "split up" as each string is found, and later regexp
 apply to the sequence of remaining unmatched portions. If ^ always
 denotes the start of the original string then prefixing all the
 transliteration replacements by ^ would clearly not have the desired
 effect and only the initial characters in teh original string would
 be replaced.

RE-8

<xsl:regexp-template match="^ +([^:= ]+)\s+[:=]\s+(.*)$"

...some standard template contains ...
  <entry>
    <xsl:apply-regexp-templates select="." />
  </entry>

<!-- matches headings -->
<xsl:regexp-template match="^([^,]+), +\([^:]+\):$">
  <heading>
    <xsl:value-of select="current-match()[1]" />
    <subheading>
      <xsl:value-of select="current-match()[2]" />
    </subheading>
  </heading>
</xsl:regexp-template>



<!-- matches items -->
<!-- note by having an explicit regexp to pick up the item text you can
easily extend to the case where you want to match as far as the next
item or heading,
here I'm using the regexp \' which is emacs-regexp for end-of-string
(as opposed to $ which is end-of-line) so I'll  grab everything,
for now.
-->
<xsl:regexp-template match="^   ([^ ].*)$(.|\n)*\'">
  <item>
    <heading><xsl:value-of select="current-match()[1]" /></heading>
    <xsl:apply-regexp-templates select="current-match()[2]" />
  </item>
</xsl:regexp-template>


<!-- matches pairs -->
<xsl:regexp-template match="^     ([^:= ]+)\s+[:=]\s+([^\(]*)(\([^\(\)]*\))?$">
  <pair name="{current-match()[1]}"
        value="{current-match()[2]}">
          <xsl:apply-regexp-templates select="current-match()[3]"
                                      mode="pair" />
  </pair>
</xsl:regexp-template>



<!-- matches nested pairs -->
<xsl:regexp-template match="^([^:= ]+)\s+[:=]\s+([^,]+),?"
                     mode="pair">
  <pair name="{current-match()[1]}" value="{current-match()[2]}" />
</xsl:regexp-template>



<!-- matches sentences -->
<xsl:regexp-template match="^     ([^\.]+)." priority="-1">
  <sentence>
    <xsl:value-of select="concat(current-match()[1], '.')" />
  </sentence>
</xsl:regexp-template>









_____________________________________________________________________
This message has been checked for all known viruses by Star Internet
delivered through the MessageLabs Virus Scanning Service. For further
information visit http://www.star.net.uk/stats.asp or alternatively call
Star Internet for details on the Virus Scanning Service.

Received on Tuesday, 22 January 2002 18:58:36 UTC