# RE: regexp support

From: Kay, Michael <Michael.Kay@softwareag.com>
Date: Wed, 23 Jan 2002 19:26:35 +0100
Message-ID: <DFF2AC9E3583D511A21F0008C7E6210602679BAC@daemsg02.software-ag.de>
To: "'David Carlisle'" <davidc@nag.co.uk>, xsl-editors@w3.org
Many thanks for this note. We had a discussion on regular expressions at the
XSL WG meeting earlier this week; we agreed that the facilities currently
specified in Functions & Operators are probably inadequate to meet the XSLT
requirement, and that it would be useful to start by producing a set of use
cases to test any proposed improvements against. Your input is therefore
very valuable.

Mike Kay

> -----Original Message-----
> From: David Carlisle [mailto:davidc@nag.co.uk]
> Sent: 22 January 2002 23:58
> To: xsl-editors@w3.org
> Subject: regexp support
>
>
>
> This got a bit longer than perhaps is wise as an "initial comment"
> but some thoughts on regexp in XSLT...
>
> David
>
>
>
> Regular expression support in XPath/XSLT/Xquery
> ===============================================
>
> 1)Abstract
> ----------
>
> This note proposes the addition of support for regular
> expressions in XSLT 2
> beyond that currently proposed by the functions in the XPath 2 draft.
>
> It is loosely based on discussions on and off xsl-list,
> principally with
> Jeni Tennison, although the details of this proposal are mine
> and Jeni and
> others who took part in the xsl-list discussions should feel free to
> comment (and disagree!) with any parts of this.
>
> It does propose some possible syntax for this functionality
> but this is
> just a draft; the main aim of the note is to put forward some
> use cases and
> requirements. The exact syntax would probably need to be refined.
>
> One of the main issues raised by the use cases to be
> presented here is the
> requirement to build a tree fragment (as opposed to a string)
> based on the
> matching (or not) of regular expressions to an input string.
> The Functions
> and Operators draft currently suggests a regexp-replace
> function but this
> is restricted to constructing strings, so is of limited
> usefulness in an
> XSLT (or Xquery) context.
>
> Whilst the extended function definition possibilities in
> XPath 2 may, in
> principle, mean that it would be possible to add such functionality to
> Xpath, I think that it is a useful distinction in XSLT that should be
> preserved that the principal node creation mechanism is via XSLT, with
> Xpath being primarily a selection and expression language. It is quite
> likely that Xquery will require similar functionality, and in Xquery a
> function form may be natural, but this note concentrates on XSLT and
> suggests an extension of the XSLT template mechanism. This should not
> inhibit the addition of similar functionality to Xquery with a
> different
> syntax, as differing syntax for node creation is in any case
> one of the
> major distinguishing features between the two languages.
>
> David Carlisle
>
>
>
> 2) Contents
> ----------
>
> 1) Abstract
> 2) Contents
> 3) Regexp Syntax
> 4) Use Cases
> 5) Possible XSLT2 Regexp syntax
> 6) Variants on the proposal
> 7) Suggested solutions to the use cases.
>
>
> 3) Regexp Syntax
> ----------------
>
> It is clearly desirable that the regexp syntax in XPath is largely
> compatible with that of XML Schema, however I feel that the
> requirements of
> searching and replacing substrings within a larger input string (the
> typical scenarios presented here) are rather different from the
> requirements for specifying  regexp that fully match the
> (typically smaller)
> complete character content of a typed element.
> Thus in my examples below I will use extended regexp syntax
> if this seems
> appropriate staying compatible with perl wherever possible, (although
> personally, I'm more familiar with the slightly different emacs
> conventions).
>
> In particular, regexp used for search and replace will need to be
> unanchored and special characters will need to be introduced,
> for anchoring
> to start and end of the string and/or lines (^, $, \z and \Z in perl
> regexp syntax) and possibly other meta characters or character classes
> will need to be added, depending on perceived requirements. However
> this note does not concentrate on the regexp syntax but rather on the
> possibilities for making structured replacements. This is the classic
> "up translation" task of moving from unstructured (or insufficiently
> structured) data to structured markup.
>
>
> 4) Use Cases
> ------------
>
> RE-1: HTML Line Break
> In an input string replace all line end characters (which we may
> assume have already been normalised to #xA) by the element <br/>.
>
> This is one of the more common requests on XSL-list. It does not
> actually require any regular expression support as it is searching
> for a single character (although one may possibly want to search for
> other line end characters if the string has not been through an XML
> line end normalisation, as presumably (?) is the case for the
> unparsed-text() function.) It does however demonstrate the need to
> generate element nodes at positions determined by searching a string.
>
> RE-2: Natural language date strings.
> Here the aim is to split up the (complete) string which is known to
> be a date in (say) English language format such as
> "17th January, 2002" and produce the string in ISO format 2002-01-17
> suitable for coercing to a dateTime, and being used with Xpath date
> expressions.
>
> RE-3: Parsing a CSV file into an XML tree fragment.
> Given an input string
> 1,2
> 3,4
> produce the tree fragment equivalent to
> <table>
> <row><cell>1</cell><cell>2</cell></row>
> <row><cell>3</cell><cell>4</cell></row>
> </table>
>
> RE-4: Multiple regexp-replace.
> The proposed replace function in F&O replaces substrings matching a
> single regexp but often one wants to replace many strings in parallel.
>
> I am assuming here that the normal XSLT creation model is followed:
> _all_ replacements take place (where possible, with a suitable
> priority mechanism for controlling clashes) on (substrings of) the
> original string, and a new node tree is constructed. Even when
> generating strings (as here) this differs from the result of
> repeatedly calling the replace function proposed in the F&O draft as
> that would, most naturally, apply later regexp matching to the
> _result_ of earlier matches.
>
> An example recently mentioned on xml-dev:
>
> RE-4a: Going from an XML unicode string to TeX:
> replace & by \&
>         $ by \$
>         #169 by \copyright
>         #233 by \'{e}
>         < by \lt
>         #322 by \l
>         ...
> RE-4b: The reverse of this transformation.
>
> RE-5: Calculation of "dynamic" regular expressions.
> Whilst most suggested "template match" extensions to regexp matching
> have assumed constant match strings, at least some support should be
> available to build up regular expressions as the result of XPath
> expressions. For example a stylesheet might accept a word (or list of
> words) as a parameter and build up the regexp adding word boundary
> markers (\b in perl) and alternation (|). This string valued
> expression should then be usable as a regular expression.
> (From a user perspective, all uses of regexp could be replaced by
> string valued expressions (or avt) although efficiency and other
> considerations may not allow all uses of regexp to have this
> flexibility).
>
> RE-6: Nested structures.
> Input strings with arbitrary nested structure, for example well
> formed HTML, TeX \aaa{...} syntax, lisp (...) syntax, are not
> regular languages and so can not (by definition) be parsed by a
> single regular expression. However in many cases (including all of
> the examples cited above) the tokens delimiting the nesting may be
> matched by regular expressions. This should allow the input string
> to be tokenised using regexp into a format in which the recursion
> and/or counting required to handle the nested structure may be
> handled by standard constructs in the language (XPath or XSLT or
> Xquery).
>
> Some people have suggested that XSLT should be directly extended to
> support the specification of more general grammars, in the style of
> lex/yacc. But the proposal here is that regexp support, if
> sufficiently integrated into the existing functionality of the
> language, should be able to handle a large range of cases without the
> complications of explicitly adding more general parsing support.
>
> RE-6a: TeX (simplified)
> Convert
> \\([a-z]+){....} to <\1>...</\1>
> but with special case of
> \\frac{....}{....} to
> <frac><num>...</num><denom>...</denom></frac>
>
> For example
> \frac{1 + \sin{2}}{3 + \cos{4}}
> to
> <frac>
> <num>1 + <sin>2</sin></num>
> <denom>3 + <cos>4</cos></denom>
> </frac>
>
> RE-6b: Well formed XML markup
> Parse a well formed XML instance that has no DOCTYPE, entity or
> character references or attributes. (These could be added but without
> adding any major new issues.)
>
> For example convert the input xml
> <entry><![CDATA[<abc>12 <x/> <wxyz>34</wxyz></abc>]]></entry>
> To
> <entry><abc>12 <x/> <wxyz>34</wxyz></abc></entry>
>
> RE-6c: HTML Markup.
> As above but with HTML, in particular with implied end tags. In
> general this requires a DTD and knowledge of SGML omitted tag rules.
> To handle general HTML as it appears in the wild, arbitrarily
> complicated "tag soup" parsing heuristics as implemented in the
> browsers would be needed. However this appears to be a very common
> requirement, often generated by storing HTML fragments as strings in a
> database. One may hope that specific simple cases may be handled, for
> example:
>
> Convert a list
> <entry><![CDATA[<ol><li>aaa <li>bbb </ol>]]></entry>
> to
> <entry><ol><li>aaa </li><li>bbb </li></ol></entry>
>
> RE-7: Transliteration
> Take an input string in AMS cyrillic transliteration scheme and
> convert to Unicode characters. The exact scheme will be omitted here
> but the details are available at http://www.tex.org.
> This differs from the "multiple regexp" example in the way
> conflicting regexp matches need to be handled. For multiple regexp
> matching above one needs a priority mechanism so that certain regexp
> are matched first and lower priority regexp are only applied to
> remaining strings. Transliteration matches need to be applied by
> matching the start of the input string with the longest possible
> match, replacing this by the transliterated sequence, and then finding
> the longest possible match at the start of the remaining string.
> Thus if abc transliterates to X and
>         bcd transliterates to Y
>         xab                   Z
>         c                     C
>         d                     D
> then
>         abcd  -> XD
>         xabcd -> ZCD
> Thus you could not, for example, start by replacing all abc by X.
>
>
> RE-8: Free format text input.
> This example is based on a (real) question in xsl-list.
> (using | to denote line start, ignoring indentation for this mail)
>
> |Some heading, with subphrases:
> | An item without a bullet.
> | Name = value pair.
> | Property: value.
> | Score = 7 (a = 1, b =3, c = 4).
> | A full sentence that has so many words that it spans
> | multiple lines.
> | Sometimes we can't even trust whether people get the
> |indention consistent.
> |
> |and making it:
> |
> |<entry>
> |  <heading>
> |    Some heading
> |    <subheading>with subheading:</subheading>
> |  </heading>
> |  <item>
> |    <heading>An item without a bullet.</heading>
> |    <pair name='name' value='value pair.'/>
> |    <pair name='property' value='value.'/>
> |    <pair name='Score' value='7'>
> |      <pair name='a' value='1'/>
> |      <pair name='b' value='3'/>
> |      <pair name='c' value='4'/>
> |    </pair>
> |    <sentence>A full sentence that has so many words that it spans
> | multiple lines.</sentence>
> |    <sentence>Sometimes we can't even trust whether people get the
> |indention consistent.</sentence>
> |  </item>
> |</entry>
> |
> |
>
>
>
> 5) Possible XSLT2 Regexp syntax
> -------------------------------
>
> The basic idea outlined in the proposal below is that the main task in
> all the above use cases is the construction of a result tree given
> some input. The construction aspects of the new functionality should
> therefore be designed to match existing construction possibilities,
> with the only difference being that they are triggered by a substring
> of an input string matching a regexp rather than a node in an input
> tree matching an Xpath.
>
> A new instruction: <xsl:apply-regexp-templates>
>
> Taking the same attributes and content as apply-templates.
> The select attribute should evaluate to a sequence of string-valued
> items. If more than one string is in the sequence, the result is the
> sequence produced by concatenating the result of processing each
> string. In fact all the examples presented will always have a
> sequence of at most one (that is, a string) and it would be possible
> to specify that the argument should be a single string if that proves
> to be a useful simplification.
>
> A new top level instruction: <xsl:regexp-template>
>
> Taking a mandatory match attribute,
> an optional priority attribute,
> and an optional mode attribute.
>
> The mode attribute works as for xsl:template: if
> xsl:apply-regexp-templates specifies a mode then only the
> regexp-templates declared for that mode will be considered.
>
> The match attribute takes a regular expression, ie a restricted form
> of string. It is assumed here that these are essentially fixed
> strings; if implementation/efficiency concerns allow they could
> perhaps be attribute value templates to allow more dynamic choice in
> the regular expressions. Or, equivalently to AVT, but with slightly
> different syntax, the match attribute could take arbitrary xpath
> expressions so long as they evaluated to a string that was a legal
> regexp. (As another variant not further explored here one could
> consider regexp to be a derived type from string rather than typing
> regexps as strings and just stating at a meta level that they have
> to match regexp syntax).
>
> So a typical example, meeting the first use case, would be
> <xsl:template match="xx">
>   <div>
>     <xsl:apply-regexp-templates select="."/>
>   </div>
> </xsl:template>
> ...
> <xsl:regexp-template match="&#10;">
>   <br/>
> </xsl:regexp-template>
>
> The template matching <xx> would then result in the character data
> of xx being copied into a div element in the output, with all new
> line characters becoming html br elements.
>
> A new XSLT-specific XPath function current-match().
> Within a regexp-template current-match() will return a sequence of 1
> or more strings, the ith item being the substring matching the ith
> parenthesised expression in the regexp of the template.
> Thus given a regexp of (a*)(b*) matching aaabb then within the
> template, . will be "aaabb", current-match()[1] will be "aaa" and
> current-match()[2] will be "bb". In the presence of alternation (|)
> and repeat clauses ({3}) it isn't always immediately clear how
> subexpressions should be numbered but perl semantics should be
> followed (as schema explicitly tries to be perl like in its regexp
> semantics).
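The numbering described for current-match() can be tried out with an existing Perl-like regexp engine. The following is only a rough sketch in Python: the helper name current_match is hypothetical, and Python's re module stands in for whatever regexp dialect XSLT would finally adopt.

```python
import re


def current_match(pattern, text):
    """Return (whole match, list of parenthesised sub-matches), or None.

    The whole match plays the role of "." inside a regexp-template and
    the list plays the role of current-match(). Note that the list is
    0-based here, whereas current-match()[1] is the first group.
    """
    m = re.search(pattern, text)
    if m is None:
        return None
    return m.group(0), list(m.groups())


print(current_match(r"(a*)(b*)", "aaabb"))
```

A group that takes no part in a match is reported as None by Python; how the proposed function should report that case is left open in the note.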
>
>
> In detail the execution model would be as follows.
>
> a. The select attribute of apply-regexp-templates is evaluated.
>    If it is not a sequence of strings an error is raised.
>    If it is a sequence each is processed separately and the result is
>    the sequence of results.
>    So we need only consider a single string.
>
> b. If a mode is supplied all regexp-templates in that mode are now
>    considered, otherwise all the regexp templates in the default mode
>    are considered.
>
> c. The templates being considered are then ordered by priority.
>    If two templates of equal priority could potentially match
>    overlapping strings then either this would be an error or a default
>    priority scheme would enforce an ordering (either implementation
>    defined or order in stylesheet) (to be decided).
>
>    For each template in turn, starting with highest priority,
>    the regexp is matched on the subsequences of the original input
>    string that have yet to be matched. Once a match is found the
>    sequence of substrings is extended by splitting the current
>    substring into three: the substring before, the matched substring
>    and the substring after. This continues, finally using a default
>    template which matches every character "(.|&#10;)", until the end
>    result is that the original string is now a sequence of
>    substrings each associated with a template whose match regexp
>    produced the substring.
>    Now the focus is set such that this derived sequence is the current
>    sequence, and each of the associated regexp-templates is executed
>    in order with the current item being the matched substring and
>    position() being the position in the derived sequence.
>
>    This means that the regexp-templates can access the matched string
>    using "."; finer control, accessing subexpressions, can be achieved
>    using the current-match() function described above.
>
> d. The resulting sequence is the concatenation of the results of each
>    of these templates.
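Steps a to d can be sketched outside XSLT to check how the priority ordering behaves. The Python below is a minimal sketch under several assumptions: results are plain strings rather than nodes, a template is a hypothetical (priority, regexp, body) triple, equal priorities fall back to list order, empty matches are ignored, and the default template simply copies unmatched text.

```python
import re


def apply_regexp_templates(text, templates):
    """Priority-ordered matching on the not-yet-matched substrings.

    templates: list of (priority, regexp, body), where body is a
    function taking the match object and returning a string.
    """
    # A str entry is still unmatched; a tuple is (match, body).
    pieces = [text]
    # Highest priority first; the sort is stable for equal priorities.
    for _prio, pattern, body in sorted(templates, key=lambda t: t[0], reverse=True):
        regexp = re.compile(pattern)
        remaining = []
        for piece in pieces:
            if not isinstance(piece, str):
                remaining.append(piece)  # already claimed by a template
                continue
            pos = 0
            for m in regexp.finditer(piece):
                if m.start() == m.end():
                    continue  # ignore empty matches in this sketch
                if m.start() > pos:
                    remaining.append(piece[pos:m.start()])  # substring before
                remaining.append((m, body))                 # matched substring
                pos = m.end()
            if pos < len(piece):
                remaining.append(piece[pos:])               # substring after
        pieces = remaining
    # Default template: unmatched text is copied through unchanged.
    return "".join(p if isinstance(p, str) else p[1](p[0]) for p in pieces)


tex_to_unicode = [
    (1, r"\\l", lambda m: "\u0142"),
    (2, r"\\lt", lambda m: "<"),
]
print(apply_regexp_templates(r"a \lt b \l c", tex_to_unicode))
```

Because each regexp only ever sees a remaining unmatched piece, an anchor such as ^ applies to the start of that piece, not to the start of the original string.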
>
>
>
>
> 6) Variants on the proposal
> ---------------------------
> This section discusses some variants on the above. The variations are
> often mutually exclusive; it is not proposed that all these features
> are added simultaneously, although of course a system that
> incorporated some features of more than one of these variants would
> also be possible.
>
> V1: Tokenise Function
> It would be possible to make the implicit splitting up of the input
> string into a sequence of substrings explicit.
>   <xsl:apply-regexp-templates select="string"/>
> would be replaced by
>   <xsl:apply-regexp-templates select="regexp-tokenise(string)"/>
> where regexp-tokenise would take a set of regexps (specified by a
> method to be determined) and split up the string as above. The
> difference would be that the sequence resulting would "just" be a
> sequence of strings, which would make it a first class object in the
> data model. Of course apply-regexp-templates would take any sequence
> of strings; they would not be forced to be generated by tokenise()
> (although in practice that would be most common).
>
> In this model the semantics of the match expression in
> <xsl:regexp-template> is slightly different. Rather than matching on
> a substring of some input string, it would be an _anchored_ match and
> the template would fire if the regexp matched an entire string in the
> input sequence.
> As in the main proposal . and position() etc would reflect the
> position of the matched item in the input sequence.
>
> The advantage of this model is that the sequence is made explicit
> as a standard XPath sequence. The disadvantage is that the regexps
> may have to be specified (and executed) twice. Once as unanchored
> regexps to tokenise the input string into a sequence of substrings,
> and then again as anchored regexps to associate templates with each
> of the matched substrings.
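The two passes of V1 can be sketched as follows. This is an illustration only: the way the set of regexps is supplied is left open in the note, so here it is simply a Python list combined by alternation, and the function names regexp_tokenise and apply_anchored are hypothetical.

```python
import re


def regexp_tokenise(text, patterns):
    """Pass 1: unanchored matching splits text into a flat list of strings."""
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    tokens, pos = [], 0
    for m in combined.finditer(text):
        if m.start() == m.end():
            continue  # ignore empty matches
        if m.start() > pos:
            tokens.append(text[pos:m.start()])  # text between tokens
        tokens.append(m.group(0))
        pos = m.end()
    if pos < len(text):
        tokens.append(text[pos:])
    return tokens


def apply_anchored(tokens, templates):
    """Pass 2: a template fires only if its regexp matches a whole token."""
    out = []
    for token in tokens:
        for pattern, body in templates:
            m = re.fullmatch(pattern, token)
            if m:
                out.append(body(m))
                break
        else:
            out.append(token)  # default: copy the token unchanged
    return out


tokens = regexp_tokenise(r"\frac{1 + \sin{2}}{3 + \cos{4}}",
                         [r"\\[a-z]+\{", r"\{", r"\}"])
templates = [
    (r"\\([a-z]+)\{", lambda m: f'<start name="{m.group(1)}"/>'),
    (r"\{", lambda m: '<start name=""/>'),
    (r"\}", lambda m: "<end/>"),
]
print("".join(apply_anchored(tokens, templates)))
```

The sketch makes the stated disadvantage visible: every regexp appears once in the tokenising list and again, anchored, in the template list.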
>
>
> V2: Immediate template execution
> Rather than first building an implicit sequence of substrings the
> mechanism could be that as each regexp is matched against a substring
> of the original string, a sequence is built as in the main proposal
> with the string-before and string-after the match but in this case
> between these is placed the result of executing the template.
>
> This avoids having to build the sequence of strings "associating"
> each one with a template, but it is harder to suggest good values
> for . and position() in this case.
> Possibly position() should be 1 and . should be the original
> string (in the case that apply-regexp-templates was just given a
> select expression of a single string). In particular the focus would
> not change as regexp-templates were applied. (Similar to named
> templates.)
>
> V1 and V2 produce (at least for templates not using an implicit
> setting of . or position()) the same results as the main proposal.
> The last variant has a different model of conflict resolution for
> overlapping matches and will typically produce a different result
> given similar looking regexp.
>
> V3: left-to-right matching.
> Rather than match regexp in priority-order, an alternative matching
> scheme would be to start from the start of the string and find the
> longest possible match from all the templates under consideration.
> Priority specifications would choose between regexps if more than
> one matched the longest possible initial string.
>
> The template for this regexp would fire. Processing could either
> then immediately proceed to the remaining substring, with the system
> finding the longest initial match on the remainder, or processing
> could effectively stop as soon as a match was found, with the
> remaining string being available (say with a substring-after-match()
> function) and so the template could explicitly invoke
>   apply-regexp-templates select="substring-after-match()"
> at some suitable point in its execution.
>
> In this model . would be the matched substring and position()
> would be (say) always 1.
>
> V4: Support for matching pairs.
> Many of the examples (as in the RE-6 use cases) require matching a
> nested structure which is beyond what is possible with a single
> regexp. Existing XSLT facilities are sufficient to "fill the gap"
> providing the necessary arithmetic and state to handle the nested
> parse tree. However one could consider adding facilities to make the
> required transformations easier. It is essentially a grouping
> problem although the new Grouping constructs in XSLT2 didn't seem
> immediately applicable. As an alternative to adding general grouping
> support for this kind of task it has been suggested that a special
> template that matches on substrings between matching tokens matched
> by a "start" and an "end" regexp could be provided.
>
>   <xsl:regexp-balancing-template match-start="\\func\{"
>                                  match-end="\}">
> This would take two regexp match attributes. The template would be
> handed the intervening string as well as the two matching strings.
> The system would handle the necessary counting to ensure that
> the start and end expressions were correctly paired.
>
> 7) Suggested solutions to the use cases.
> ---------------------------------------
>
> Only solutions using the "main proposal" are given here.
>
>
> RE-1
> <xsl:template match="xx">
>   <div>
>     <xsl:apply-regexp-templates select="."/>
>   </div>
> </xsl:template>
> ...
>
> <xsl:regexp-template match="&#10;">
>   <br/>
> </xsl:regexp-template>
>
>
> As commented above this doesn't use regexp but is a natural simple
> example, that is quite hard to do in XSLT1 (or even XSLT2 as
> currently drafted). If the input string has not been through an XML
> parser (as will be possible in XSLT2) then even this case might
> benefit from simple regexp support, changing the match to
> &#13;&#10;?|&#10;
>
> RE-2
> <xsl:apply-regexp-templates select="'17th January, 2002'"/>
> ...
> <xsl:regexp-template
>     match="^ *([0-9][0-9]?)[a-z]* +([A-Za-z]{3})[a-z]*,? +([0-9]+) *">
>   <xsl:variable name="y" select="number(current-match()[3])"/>
>   <xsl:value-of select="if ($y &lt; 10) then (1900 + $y) else
>       (if ($y &lt; 1000) then (2000 + $y) else $y)"/>
>   <xsl:choose>
>     <xsl:when test="lower(current-match()[2])='jan'">-01-</xsl:when>
>     <xsl:when test="lower(current-match()[2])='feb'">-02-</xsl:when>
>     ...
>   </xsl:choose>
>   <xsl:value-of select="format-number(current-match()[1],'00')"/>
> </xsl:regexp-template>
> ....
>
>
> RE-3
> Assuming the CSV string is the current item.
>
> <table>
>   <xsl:apply-regexp-templates mode="row" select="."/>
> </table>
> ..
> <xsl:regexp-template mode="row" match="^(.*)$">
>   <row>
>     <xsl:apply-regexp-templates mode="cell" select="."/>
>   </row>
> </xsl:regexp-template>
>
> <xsl:regexp-template mode="cell" match="([^,]*)(,|)">
>   <cell>
>     <xsl:value-of select="current-match()[1]"/>
>   </cell>
> </xsl:regexp-template>
>
> Note this assumes that ^ . and $ are line based even within a larger
> string. (This is like emacs regexp, but unlike sed. Perl has a
> switch that allows ^ and $ to change between matching ends of lines
> and matching the ends of the string.)
>
> RE-4a
> <xsl:regexp-template mode="unicodetotex" match="\$">
>   <xsl:text>\$</xsl:text>
> </xsl:regexp-template>
> ...
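The two-mode RE-3 solution above (rows first, then cells within each row) can be checked against an existing engine. The Python below is a sketch under assumptions: it builds the result as a string rather than a node tree, the function name csv_to_table is hypothetical, and the line-based reading of ^ and $ is obtained with re.MULTILINE, which corresponds to the Perl switch mentioned in the note.

```python
import re
from xml.sax.saxutils import escape


def csv_to_table(text):
    """Apply the "row" regexp per line, then the "cell" regexp per row."""
    rows = []
    for row in re.finditer(r"^(.*)$", text, re.MULTILINE):
        line = row.group(1)
        if not line:
            continue  # skip blank lines (e.g. after a trailing newline)
        cells = []
        for cell in re.finditer(r"([^,]*)(,|)", line):
            if cell.group(0) == "":
                # ([^,]*)(,|) also matches the empty string at the end of
                # the line; this sketch drops it, so a trailing empty
                # cell is lost as well.
                continue
            cells.append(f"<cell>{escape(cell.group(1))}</cell>")
        rows.append("<row>" + "".join(cells) + "</row>")
    return "<table>" + "".join(rows) + "</table>"


print(csv_to_table("1,2\n3,4"))
```

The empty match at the end of each line is a detail that an implementation of the proposal would also need to settle, since the cell regexp can match zero characters.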
>
> RE-4b
> <xsl:regexp-template mode="textounicodetotex" match="\\\$">
>   <xsl:text>$</xsl:text>
> </xsl:regexp-template>
>
> The only interest here would be the use of priority to control
> tex macro names which would match the same regexp.
>
> <xsl:regexp-template mode="textounicodetotex"
>     match="\\lt" priority="2">
> <xsl:regexp-template mode="textounicodetotex" match="\\l"
>     priority="1">
>
> although an alternative would be to explicitly end each regexp with
> a match for a non-letter (as TeX macro names only consist of letters
> (or a single non-letter, such as \)).
>
>
> RE-5
> Given a top level param keyword containing a word to be highlighted
> one or more constructs should allow a
> string expression such as concat('\b',$keyword,'\b') or an AVT
> \b{$keyword}\b
> in order to construct the required regexp to match this word.
>
> RE-6a
> <xsl:regexp-template match="\\([a-z]+){" priority="2">
>   <start name="{current-match()[1]}"/>
> </xsl:regexp-template>
>
> <xsl:regexp-template match="{" priority="1">
>   <start name=""/>
> </xsl:regexp-template>
>
>
> <xsl:regexp-template match="}">
>   <end/>
> </xsl:regexp-template>
>
>
> Applying the above regexp templates to the example
> \frac{1 + \sin{2}}{3 + \cos{4}} would produce the following sequence
> (of text nodes and elements), putting them in a containing <x>
> element, so as to test the following stylesheet.
>
> <start name="frac"/>1 + <start name="sin"/>2<end/><end/><start
> name=""/>3 + <start name="cos"/>4<end/><end/>
>
> To get to here
> <frac>
> <num>1 + <sin>2</sin></num>
> <denom>3 + <cos>4</cos></denom>
> </frac>
>
> we just need to use standard XSLT constructs, for example this XSLT1
> stylesheet does the job. It is however noticeable that to handle the
> nesting in this sequence the current XPath2 constructs are very
> limited, primarily due to the lack of higher order functions.
> The provided "for" operator is only really suitable when each item of
> a sequence is to be processed independently. A standard functional
> operator such as fold would allow accumulation of information along
> the sequence.
> Here this problem is circumvented by first converting the sequence to
> a tree so that the sibling-axis gives the required access, but this
> seems at odds with the apparent desire to make such operations
> possible at the sequence level.
>
> <xsl:stylesheet
>     xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
>
> <xsl:template match="x">
>   <xsl:text>
> </xsl:text>
>   <x>
>     <xsl:apply-templates mode="match" select="node()[1]"/>
>   </x>
> </xsl:template>
>
> <xsl:template match="text()" mode="match">
>   <xsl:param name="n" select="0"/>
>   <xsl:copy-of select="."/>
>   <xsl:apply-templates mode="match"
>       select="following-sibling::node()[1]">
>     <xsl:with-param name="n" select="$n"/>
>   </xsl:apply-templates>
> </xsl:template>
>
> <xsl:template match="start" mode="match">
>   <xsl:param name="n" select="0"/>
>   <xsl:variable name="end"
>       select="following-sibling::end[count(following-sibling::end)-count(following-sibling::start)=$n][1]"/>
>   <xsl:element name="{@name}">
>     <xsl:apply-templates
>         select="following-sibling::node()[1]" mode="match">
>       <xsl:with-param name="n" select="$n+1"/>
>     </xsl:apply-templates>
>   </xsl:element>
>   <xsl:apply-templates
>       select="$end/following-sibling::node()[1]" mode="match">
>     <xsl:with-param name="n" select="$n"/>
>   </xsl:apply-templates>
> </xsl:template>
>
> <xsl:template match="start[@name='frac']" mode="match">
>   <xsl:param name="n" select="0"/>
>   <frac>
>     <num>
>       <xsl:apply-templates
>           select="following-sibling::node()[1]" mode="match">
>         <xsl:with-param name="n" select="$n+1"/>
>       </xsl:apply-templates>
>     </num>
>     <denom>
>       <xsl:apply-templates
>           select="following-sibling::end[count(following-sibling::end)-count(following-sibling::start)=$n][1]/following-sibling::node()[2]" mode="match">
>         <xsl:with-param name="n" select="$n+1"/>
>       </xsl:apply-templates>
>     </denom>
>   </frac>
>   <xsl:apply-templates
>       select="following-sibling::end[count(following-sibling::end)-count(following-sibling::start)=$n][2]/following-sibling::node()[1]" mode="match">
>     <xsl:with-param name="n" select="$n"/>
>   </xsl:apply-templates>
> </xsl:template>
>
> <xsl:template match="end" mode="match"/>
>
>
> </xsl:stylesheet>
>
>
>
> RE-6b.
> Parsing well formed XML does not really present any difficulties not
> presented in the TeX case. The complication of macros taking more
> than one {} group as arguments (as in the frac example above)
> does not occur, although the regexps would need to be extended to
> deal with empty element syntax and attributes. Full details are not
> presented here.
>
> RE-6c.
> As mentioned above, the general case of parsing HTML is out of scope,
> however simple cases of omitted tags could be dealt with using the
> priority attribute on templates.
> A high priority template matching "</li>\s-*<li>" would handle the
> case where the end tag was explicit, and a lower priority template
> matching "<li>" would match in other cases, handling the implied
> closing of the previous element.
>
> RE-7
> This case differs from RE-4 as there is a strong left-to-right (or at
> least reading direction) bias. Replacements should happen at the
> start of the string. One possible solution is to simply prefix all
> regexp by a ^ character to denote the start of the string.
> One slight subtlety is that this use of ^ relies on the fact that
> the string is "split up" as each string is found, and later regexp
> apply to the sequence of remaining unmatched portions. If ^ always
> denotes the start of the original string then prefixing all the
> transliteration replacements by ^ would clearly not have the desired
> effect and only the initial characters in the original string would
> be replaced.
>
> RE-8
>
> <xsl:regexp-template match="^ +([^:= ]+)\s+[:=]\s+(.*)"
>
> ...some standard template contains ...
> <entry>
>   <xsl:apply-regexp-templates select="." />
> </entry>
>
> <!-- matches headings -->
> <xsl:regexp-template match="^([^,]+), +([^:]+):$">
>   <heading>
>     <xsl:value-of select="current-match()[1]" />
>     <subheading>
>       <xsl:value-of select="current-match()[2]" />
>     </subheading>
>   </heading>
> </xsl:regexp-template>
>
>
>
> <!-- matches items -->
> <!-- note by having an explicit regexp to pick up the item text you can
> easily extend to the case where you want to match as far as the next
> item or heading,
> here I'm using the regexp \' which is emacs-regexp for end-of-string
> (as opposed to $ which is end-of-line) so I'll grab everything,
> for now.
> -->
> <xsl:regexp-template match="^ ([^ ].*)$(.|\n)*\'">
>   <item>
>     <heading><xsl:value-of select="current-match()[1]" /></heading>
>     <xsl:apply-regexp-templates select="current-match()[2]" />
>   </item>
> </xsl:regexp-template>
>
>
> <!-- matches pairs -->
> <xsl:regexp-template
>     match="^     ([^:= ]+)\s+[:=]\s+([^\(]*)(\([^\(\)]*\))?$">
>   <pair name="{current-match()[1]}"
>         value="{current-match()[2]}">
>           <xsl:apply-regexp-templates select="current-match()[3]"
>                                       mode="pair" />
>   </pair>
> </xsl:regexp-template>
>
>
>
> <!-- matches nested pairs -->
> <xsl:regexp-template match="^([^:= ]+)\s+[:=]\s+([^,]+),?"
>                      mode="pair">
>   <pair name="{current-match()[1]}" value="{current-match()[2]}" />
> </xsl:regexp-template>
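The nested-pair regexp can be tried against the sample text "a = 1, b =3, c = 4" from the RE-8 input. The Python sketch below suggests that, as written, it requires whitespace on both sides of the separator and so does not accept "b =3". The RELAXED pattern is an assumption of this sketch, not part of the proposal, and the name nested_pairs is hypothetical.

```python
import re

# The regexp as given in the "pair" mode template.
AS_WRITTEN = r"^([^:= ]+)\s+[:=]\s+([^,]+),?"

# A relaxed variant (an assumption): optional whitespace around the
# separator, and no ^ anchor so it can be applied repeatedly.
RELAXED = r"\s*([^:=\s,]+)\s*[:=]\s*([^,]+),?"


def nested_pairs(text):
    """Return (name, value) tuples found left to right in text."""
    return [(m.group(1), m.group(2).strip())
            for m in re.finditer(RELAXED, text)]


print(re.match(AS_WRITTEN, "b =3"))        # no match: "=" is not followed by whitespace
print(nested_pairs("a = 1, b =3, c = 4"))
```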
>
>
>
> <!-- matches sentences -->
> <xsl:regexp-template match="^     ([^\.]+)." priority="-1">
>   <sentence>
>     <xsl:value-of select="concat(current-match()[1], '.')" />
>   </sentence>
> </xsl:regexp-template>
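The left-to-right, longest-match-first scheme of V3, which RE-7 depends on, can be checked against the small transliteration table given in the use case. This Python sketch makes a simplifying assumption: the keys are literal strings rather than regexps. The names transliterate and TABLE are hypothetical.

```python
def transliterate(text, table):
    """Repeatedly replace the longest key that matches at the current position."""
    keys = sorted(table, key=len, reverse=True)  # try longest keys first
    out, pos = [], 0
    while pos < len(text):
        for key in keys:
            if text.startswith(key, pos):
                out.append(table[key])
                pos += len(key)
                break
        else:
            out.append(text[pos])  # no key matches here: copy one character
            pos += 1
    return "".join(out)


# The table from the RE-7 use case.
TABLE = {"abc": "X", "bcd": "Y", "xab": "Z", "c": "C", "d": "D"}

print(transliterate("abcd", TABLE))
print(transliterate("xabcd", TABLE))
```

Under the priority-ordered main proposal the same table would need "xab" to outrank "abc" for the second input and gives no way to express that in general, which is the difference the use case is meant to expose.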
>
Received on Wednesday, 23 January 2002 13:26:44 UTC

This archive was generated by hypermail 2.3.1 : Tuesday, 6 January 2015 20:44:22 UTC