- From: Kay, Michael <Michael.Kay@softwareag.com>
- Date: Wed, 23 Jan 2002 19:26:35 +0100
- To: "'David Carlisle'" <davidc@nag.co.uk>, xsl-editors@w3.org
Many thanks for this note. We had a discussion on regular expressions at the
XSL WG meeting earlier this week; we agreed that the facilities currently
specified in Functions & Operators are probably inadequate to meet the XSLT
requirement, and that it would be useful to start by producing a set of use
cases to test any proposed improvements against. Your input is therefore
very valuable.
Mike Kay
> -----Original Message-----
> From: David Carlisle [mailto:davidc@nag.co.uk]
> Sent: 22 January 2002 23:58
> To: xsl-editors@w3.org
> Subject: regexp support
>
>
>
> This got a bit longer than perhaps is wise as an "initial comment"
> but some thoughts on regexp in XSLT...
>
> David
>
>
>
> Regular expression support in XPath/XSLT/Xquery
> ===============================================
>
> 1) Abstract
> ----------
>
> This note proposes the addition of support for regular
> expressions in XSLT 2
> beyond that currently proposed by the functions in the XPath 2 draft.
>
> It is loosely based on discussions on and off xsl-list,
> principally with
> Jeni Tennison, although the details of this proposal are mine
> and Jeni and
> others who took part in the xsl-list discussions should feel free to
> comment (and disagree!) with any parts of this.
>
> It does propose some possible syntax for this functionality
> but this is
> just a draft; the main aim of the note is to put forward some
> use cases and
> requirements. The exact syntax would probably need to be refined.
>
> One of the main issues raised by the use cases to be
> presented here is the
> requirement to build a tree fragment (as opposed to a string)
> based on the
> matching (or not) of regular expressions to an input string.
> The Functions
> and Operators draft currently suggests a regexp-replace
> function but this
> is restricted to constructing strings, so is of limited
> usefulness in an
> XSLT (or Xquery) context.
>
> Whilst the extended function definition possibilities in
> XPath 2 may, in
> principle, mean that it would be possible to add such functionality to
> Xpath, I think that it is a useful distinction in XSLT that should be
> preserved that the principal node creation mechanism is via XSLT, with
> Xpath being primarily a selection and expression language. It is quite
> likely that Xquery will require similar functionality, and in Xquery a
> function form may be natural, but this note concentrates on XSLT and
> suggests an extension of the XSLT template mechanism. This should not
> inhibit the addition of similar functionality to Xquery with a
> different
> syntax, as differing syntax for node creation is in any case
> one of the
> major distinguishing features between the two languages.
>
> David Carlisle
>
>
>
> 2) Contents
> ----------
>
> 1) Abstract
> 2) Contents
> 3) Regexp Syntax
> 4) Use Cases
> 5) Possible XSLT2 Regexp syntax
> 6) Variants on the proposal
> 7) Suggested solutions to the use cases.
>
>
> 3) Regexp Syntax
> ----------------
>
> It is clearly desirable that the regexp syntax in XPath is largely
> compatible with that of XML Schema, however I feel that the
> requirements of
> searching and replacing substrings within a larger input string (the
> typical scenarios presented here) are rather different from the
> requirements for specifying regexp that fully match the
> (typically smaller)
> complete character content of a typed element.
> Thus in my examples below I will use extended regexp syntax
> if this seems
> appropriate, staying compatible with perl wherever possible (although
> personally, I'm more familiar with the slightly different emacs
> conventions).
>
> In particular, regexp used for search and replace will need to be
> unanchored and special characters will need to be introduced,
> for anchoring
> to start and end of the string and/or lines (^ $ \z and \Z in
> perl regexp
> syntax) and possibly other meta characters or character
> classes will need
> to be added, depending on perceived requirements. However
> this note does
> not concentrate on the regexp syntax but rather on the
> possibilities for
> making structured replacements. This is the classic "up
> translation" task
> of moving from unstructured (or insufficiently structured) data to
> structured markup.
>
>
> 4) Use Cases
> ------------
>
> RE-1: HTML Line Break
> In an input string replace all line end characters (which
> we may assume
> have already been normalised to #xA) by the element <br/>.
>
> This is one of the more common requests on XSL-list. It
> does not actually
> require any regular expression support as it is searching
> for a single
> character (although one may possibly want to search for
> other line end
> characters if the string has not been through an XML line end
> normalisation, as presumably (?) is the case for the unparsed-text()
> function.) It does however demonstrate the need to generate
> element nodes
> at positions determined by searching a string.
>
> RE-2: Natural language date strings.
> Here the aim is to split up the (complete) string which is
> known to be a
> date in (say) English language format such as "17th January, 2002"
> and produce the string in ISO format 2002-01-17 suitable
> for coercing to
> a dateTime, and being used with Xpath date expressions.
>
> RE-3: Parsing a CSV file into an XML tree fragment.
> Given an input string
> 1,2
> 3,4
> produce the tree fragment equivalent to
> <table>
> <row><cell>1</cell><cell>2</cell></row>
> <row><cell>3</cell><cell>4</cell></row>
> </table>
>
> RE-4: Multiple regexp-replace.
> The proposed replace function in F&O replaces substrings matching a
> single regexp but often one wants to replace many strings
> in parallel.
>
> I am assuming here that the normal XSLT creation model is
> followed that
> _all_ replacements take place (where possible, with a
> suitable priority
> mechanism for controlling clashes) on (substrings of) the original
> string, and a new node tree is constructed. Even when
> generating strings
> (as here) this differs from the result of repeatedly
> calling the replace
> function proposed in the F&O draft as that would, most
> naturally, apply
> later regexp matching to the _result_ of earlier matches.
>
> An example recently mentioned on xml-dev:
>
> RE-4a: Going from an XML unicode string to TeX:
> replace & by \&
> $ by \$
> #169 by \copyright
> #233 by \'{e}
> < by \lt
> #322 by \l
> ...
> RE-4b: The reverse of this transformation.
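A minimal sketch in Python (not part of the proposal) of why parallel replacement differs from repeated calls to a replace function. The helper names `replace_sequential` and `replace_parallel` are hypothetical, and the backslash rule is an assumption added to the RE-4a table so that one rule's output contains another rule's input.

```python
import re

# Subset of the RE-4a table. The backslash rule is an assumption added
# here so that the output of one rule contains the input of another.
RULES = [
    ("&", r"\&"),
    ("$", r"\$"),
    ("\\", r"\backslash "),
]

def replace_sequential(text):
    """Apply each rule in turn to the result of the previous rule."""
    for old, new in RULES:
        text = text.replace(old, new)
    return text

def replace_parallel(text):
    """Match all rules against the original string in a single pass."""
    table = dict(RULES)
    pattern = re.compile("|".join(re.escape(old) for old, _ in RULES))
    # A function replacement is inserted literally, with no escape processing.
    return pattern.sub(lambda m: table[m.group(0)], text)

sample = "R&D \\ $5"
print(replace_sequential(sample))
print(replace_parallel(sample))
```

In the sequential version the backslashes introduced by the first two rules are themselves rewritten by the third; the single-pass version only ever looks at the original string.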
>
> RE-5: Calculation of "dynamic" regular expressions.
> Whilst most suggested "template match" extensions to regexp matching
> have assumed constant match strings, at least some support should be
> available to build up regular expressions as the result of XPath
> expressions. For example a stylesheet might accept a word
> (or list of
> words) as a parameter and build up the regexp adding word boundary
> markers (\b in perl) and alternation (|). This string valued
> expression should then be usable as a regular expression.
> (From a user perspective, all uses of regexp could be replaced by
> string valued expressions (or avt) although efficiency and other
> considerations may not allow all uses of regexp to have this
> flexibility).
>
> RE-6: Nested structures.
> Input strings with arbitrary nested structure for example, well
> formed HTML, TeX \aaa{...} syntax, lisp (...) syntax, are not
> regular languages and so can not (by definition) be parsed by a
> single regular expression. However in many cases (including all of
> the examples cited above) the tokens delimiting the nesting may be
> matched by regular expressions. This should allow the input string
> to be tokenised using regexp into a format in which the recursion
> and/or counting required to handle the nested structure may be
> handled by standard constructs in the language (XPath or XSLT or
> Xquery).
>
> Some people have suggested that XSLT should be directly extended to
> support the specification of more general grammars, in the style of
> lex/yacc. But the proposal here is that regexp support, if
> sufficiently
> integrated into the existing functionality of the language,
> should be able
> to handle a large range of cases without the complications
> of explicitly
> adding more general parsing support.
>
> RE-6a: TeX (simplified)
> Convert
> \\([a-z])+{....} to <\1>...</\1>
> but with special case of
> \\frac{....}{....} to
> <frac><num>...</num><denom>...</denom></frac>
>
> For example
> \frac{1 + \sin{2}}{3 + \cos{4}}
> to
> <frac>
> <num>1 + <sin>2</sin></num>
> <denom>3 + <cos>4</cos></denom>
> </frac>
>
> RE-6b: Well formed XML markup
> Parse a well formed XML instance that has no DOCTYPE, entity or
> character references or attributes. (These could be added
> but without
> adding any major new issues.)
>
> For example convert the input xml
> <entry><![CDATA[<abc>12 <x/> <wxyz>34</wxyz></abc>]]></entry>
> To
> <entry><abc>12 <x/> <wxyz>34</wxyz></abc></entry>
>
> RE-6c: HTML Markup.
> As above but with HTML, in particular with implied end
> tags. In general
> this requires a DTD and knowledge of SGML omitted tag
> rules. To handle
> general HTML as it appears in the wild, arbitrarily complicated "tag
> soup" parsing heuristics as implemented in the browsers would be
> needed. However this appears to be a very common requirement often
> generated by storing HTML fragments as strings in a
> database. One may
> hope that specific simple cases may be handled for example:
>
> Convert a list
> <entry><![CDATA[<ol><li>aaa <li>bbb </ol>]]></entry>
> to
> <entry><ol><li>aaa </li><li>bbb </li></ol></entry>
>
> RE-7: Transliteration
> Take an input string in AMS cyrillic transliteration scheme
> and convert
> to Unicode characters. The exact scheme will be omitted here but the
> details are available at http://www.tex.org.
> This differs from the "multiple regexp" example
> in the way conflicting regexp matches need to be handled.
> For multiple
> regexp matching above one needs a priority mechanism so that certain
> regexp are matched first and lower priority regexp are only
> applied to
> remaining strings. Transliteration matches need to be applied by
> matching the start of the input string with the longest
> possible match,
> replacing this by the transliterated sequence, and then finding the
> longest possible match at the start of the remaining string.
> Thus if abc transliterates to X and
> bcd transliterates to Y
> xab Z
> c C
> d D
> then
> abcd -> XD
> xabcd -> ZCD
> Thus you could not, for example, start by replacing all abc by X.
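The longest-match-at-the-start rule can be sketched in Python as follows. This is only an illustration using the five-entry table from the example above, not the AMS scheme; the function name `transliterate` is hypothetical, and plain string keys stand in for regular expressions.

```python
# Table from the RE-7 example above.
TABLE = {"abc": "X", "bcd": "Y", "xab": "Z", "c": "C", "d": "D"}

def transliterate(text, table=TABLE):
    """Repeatedly replace the longest table key found at the current position."""
    keys = sorted(table, key=len, reverse=True)  # try the longest keys first
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No key matches at this position: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)

print(transliterate("abcd"))
print(transliterate("xabcd"))
```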
>
>
> RE-8: Free format text input.
> This example is based on a (real) question in xsl-list.
> (using | to denote line start, ignoring indentation for this mail)
>
> |Some heading, with subphrases:
> | An item without a bullet.
> | Name = value pair.
> | Property: value.
> | Score = 7 (a = 1, b =3, c = 4).
> | A full sentence that has so many words that it spans
> | multiple lines.
> | Sometimes we can't even trust whether people get the
> |indention consistent.
> |
> |and making it:
> |
> |<entry>
> | <heading>
> | Some heading
> | <subheading>with subphrases:</subheading>
> | </heading>
> | <item>
> | <heading>An item without a bullet.</heading>
> | <pair name='name' value='value pair.'/>
> | <pair name='property' value='value.'/>
> | <pair name='Score' value='7'>
> | <pair name='a' value='1'/>
> | <pair name='b' value='3'/>
> | <pair name='c' value='4'/>
> | </pair>
> | <sentence>A full sentence that has so many words that it spans
> | multiple lines.</sentence>
> | <sentence>Sometimes we can't even trust whether people get the
> |indention consistent.</sentence>
> | </item>
> |</entry>
> |
> |
>
>
>
> 5) Possible XSLT2 Regexp syntax
> -------------------------------
>
> The basic idea outlined in the proposal below is that the main task in
> all the above use cases is the construction of a result tree given
> some input. The construction aspects of the new functionality should
> therefore be designed to match existing construction possibilities,
> with the only difference being that they are triggered by a substring
> of an input string matching a regexp rather than a node in an input
> tree matching an Xpath.
>
> A new instruction: <xsl:apply-regexp-templates>
>
> Taking the same attributes and content as apply-templates.
> The select attribute should evaluate to a sequence of string-valued
> items. If more than one string is in the sequence, the result is the
> sequence produced by concatenating the result of processing each
> string. In fact all the examples presented will always have a
> sequence of at most one (that is, a string) and it would be possible
> to specify that the argument should be a single string if that proves
> to be a useful simplification.
>
> A new top level instruction: <xsl:regexp-template>
>
> Taking a mandatory match attribute
> optional priority attribute.
> and optional mode attribute.
>
> The mode attribute works as for xsl:template: if
> xsl:apply-regexp-templates specifies a mode then only regexp-templates
> declared for that mode will be considered.
>
> The match attribute takes a regular expression, ie a restricted form
> of string. It is assumed here that these are essentially fixed
> strings, if implementation/efficiency concerns allow they could
> perhaps be attribute value templates to allow more dynamic choice in
> the regular expressions. Or, equivalently to AVT, but with slightly
> different syntax, the match attribute could take arbitrary xpath
> expressions so long as they evaluated to a string that was a legal
> regexp. (As another variant not further explored here one could
> consider regexp to be a derived type from string rather than typing
> regexps as strings and just stating at a meta level that they have
> to match regexp syntax).
>
> So a typical example, meeting the first use case, would be
> <xsl:template match="xx">
> <div>
> <xsl:apply-regexp-templates select="."/>
> </div>
> </xsl:template>
> ...
> <xsl:regexp-template match="&#xA;">
> <br/>
> </xsl:regexp-template>
>
> The template matching <xx> would then result in the character data
> of xx being copied into a div element in the output, with all
> newline characters becoming HTML br elements.
>
> A new XSLT-specific XPath function current-match().
> Within a regexp-template, current-match() will return a sequence of 1
> or more strings, the ith item being the substring matching the ith
> parenthesised expression in the regexp of the template.
> Thus given a regexp of (a*)(b*) matching aaabb, within the
> template . will be "aaabb", current-match()[1] will be "aaa" and
> current-match()[2] will be "bb". In the presence of alternation (|)
> and repeat clauses ({3}) it isn't always immediately clear how
> subexpressions should be numbered but perl semantics should be
> followed (as schema explicitly tries to be perl like in its regexp
> semantics).
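As an illustration only, Python's `re` module numbers groups by the position of the opening parenthesis, as Perl does, so it can be used to preview what a current-match() sequence might contain. The helper name `current_match` is hypothetical; note that Python is 0-indexed where the proposal is 1-indexed, and that a group which took no part in the match is reported as `None`.

```python
import re

def current_match(pattern, text):
    """Return the parenthesised submatches of a whole-string match as a list."""
    m = re.fullmatch(pattern, text)
    if m is None:
        raise ValueError("pattern does not match the whole string")
    return list(m.groups())

print(current_match(r"(a*)(b*)", "aaabb"))
# With alternation, the group on the branch not taken is None.
print(current_match(r"(a)|(b)", "b"))
```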
>
>
> In detail the execution model would be as follows.
>
> a. The select attribute of apply-regexp-templates is evaluated.
> If it is not a sequence of strings an error is raised.
> If it is a sequence each is processed separately and the result is
> the sequence of results.
> So we need only consider a single string.
>
> b. If a mode is supplied all regexp-templates in that mode are now
> considered, otherwise all the regexp templates in the default mode
> are considered.
>
> c. The templates being considered are then ordered by priority.
> If two templates of equal priority could potentially match
> overlapping strings then either this would be an error or a default
> priority scheme would enforce an ordering (either implementation
> defined or order in stylesheet) (to be decided).
>
> For each template in turn, starting with highest priority,
> the regexp is matched on the subsequences of the original input
> string that have yet to be matched. Once a match is found the
> sequence of substrings is extended by splitting the current
> substring into three: the substring before the match, the matched
> substring, and the substring after. This continues, finally using a
> default template which matches every character "(.|&#xA;)", until the
> end result is that the original string is now a sequence of
> substrings each associated with a template whose match regexp
> produced the substring.
> Now the focus is set such that this derived sequence is the current
> sequence, and each of the associated regexp-templates is executed
> in order with the current item being the matched substring and
> position() being the position in the derived sequence.
>
> This means that the regexp-templates can access the matched string
> using . and that finer control, accessing subexpressions, can be
> achieved using the current-match() function described above.
>
> d. The resulting sequence is the concatenation of the results of each
> of these templates.
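Steps a to d can be sketched in Python for a single input string. This is a rough model under stated assumptions, not a definitive implementation: templates are hypothetical `(priority, pattern, handler)` tuples, handlers return strings rather than nodes, ties in priority fall back to list order, empty matches are ignored, and unmatched runs are copied as one piece instead of being passed character by character to the default template.

```python
import re

def apply_regexp_templates(text, templates):
    """templates: list of (priority, pattern, handler).
    handler(match, position) returns the replacement string."""
    # Each segment is (substring, match or None, handler or None).
    segments = [(text, None, None)]
    # sorted() is stable, so equal priorities keep their list order.
    for _priority, pattern, handler in sorted(templates, key=lambda t: -t[0]):
        regex = re.compile(pattern)
        split = []
        for seg_text, seg_match, seg_handler in segments:
            if seg_handler is not None:       # already claimed by a template
                split.append((seg_text, seg_match, seg_handler))
                continue
            pos = 0
            for m in regex.finditer(seg_text):
                if m.end() == m.start():      # ignore empty matches
                    continue
                if m.start() > pos:           # substring before the match
                    split.append((seg_text[pos:m.start()], None, None))
                split.append((m.group(0), m, handler))
                pos = m.end()
            if pos < len(seg_text):           # substring after the last match
                split.append((seg_text[pos:], None, None))
        segments = split
    out = []
    for position, (seg_text, m, handler) in enumerate(segments, start=1):
        out.append(handler(m, position) if handler else seg_text)
    return "".join(out)

# RE-1: every newline becomes a br element.
line_break = [(0, "\n", lambda m, position: "<br/>")]
print(apply_regexp_templates("one\ntwo\n", line_break))
```

Because each template only sees text that no higher-priority template has claimed, a rule for `\\lt` at priority 2 is never undermined by a rule for `\\l` at priority 1.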
>
>
>
>
> 6) Variants on the proposal
> ---------------------------
> This section discusses some variants on the above. The variations are
> often mutually exclusive, it is not proposed that all these features
> are added simultaneously, although of course a system that
> incorporated some features of more than one of these variants would
> also be possible.
>
> V1: Tokenise Function
> It would be possible to make the implicit splitting up of the input
> string into a sequence of substrings explicit.
> <xsl:apply-regexp-templates select="$string"/>
> would be replaced by
> <xsl:apply-regexp-templates select="regexp-tokenise($string)"/>
> where regexp-tokenise would take a set of regexps (specified by a method
> to be determined) and split up the string as above. The difference
> would be that the sequence resulting would "just" be a sequence of
> strings, which would make it a first class object in the data model.
> Of course apply-regexp-templates would take any sequence of strings
> they would not be forced to be generated by tokenise() (although in
> practice that would be most common).
>
> In this model the semantics of the match expression in
> <xsl:regexp-template>
> are slightly different. Rather than matching on a substring of some
> input string, it would be an _anchored_ match and the template would
> fire if the regexp matched an entire string in the input sequence.
> As in the main proposal . and position() etc would reflect the
> position of the matched item in the input sequence.
>
> The advantage of this model is that the sequence is made explicit
> as a standard XPath sequence. The disadvantage is that the regexps
> may have to be specified (and executed) twice. Once as unanchored
> regexps to tokenise the input string into a sequence of substrings,
> and then again as anchored regexps to associate templates with each
> of the matched substrings.
>
>
> V2: Immediate template execution
> Rather than first building an implicit sequence of substrings the
> mechanism could be that as each regexp is matched against a
> substring
> of the original string, a sequence is built as in the main proposal
> with the string-before and string-after the match but in this case
> between these is placed the result of executing the template.
>
> This avoids having to build the sequence of strings "associating"
> each one with a template, but it is harder to suggest good values
> for . and position() in this case.
> Possibly position() should be 1 and . should be the original
> string (in the case that apply-regexp-templates was just given a
> select expression of a single string). In particular the focus would
> not change as regexp-templates were applied. (Similar to named
> templates.)
>
> V1 and V2 produce (at least for templates not using an implicit
> setting of . or position()) the same results as the main proposal.
> The last variant has a different model of conflict resolution for
> overlapping matches and will typically produce a different
> result given
> similar looking regexps.
>
> V3: left-to-right matching.
> Rather than match regexp in priority-order, an alternative matching
> scheme would be to start from the start of the string and find the
> longest possible match from all the templates under consideration.
> Priority specifications would choose between regexps if more than
> one matched the longest possible initial string.
>
> The template for this regexp would fire. Processing could either
> then immediately proceed to the remaining substring, with the system
> finding the longest initial match on the remainder, or processing
> could effectively stop as soon as a match was found, with the
> remaining string being available (say with a substring-after-match()
> function) and so the template could explicitly invoke
> apply-regexp-templates select="substring-after-match()"
> at some suitable point in its execution.
>
> In this model . would be the matched substring and position()
> would be (say) always 1.
>
> V4: Support for matching pairs.
> Many of the examples (as in the RE-6 use cases) require matching a
> nested structure which is beyond what is possible with a single
> regexp. Existing XSLT facilities are sufficient to "fill the gap"
> providing the necessary arithmetic and state to handle the nested
> parse tree. However one could consider adding facilities to make the
> required transformations easier. It is essentially a grouping
> problem although the new Grouping constructs in XSLT2 didn't seem
> immediately applicable. As an alternative to adding general grouping
> support for this kind of task it has been suggested that a special
> template that matches on substrings between matching tokens matched
> by a "start" and an "end" regexp could be provided.
>
> <xsl:regexp-balancing-template match-start="\\func\{"
> match-end="\}">
> This would take two regexp match attributes. The template would be
> handed the intervening string as well as the two matching strings.
> The system would handle the necessary counting to ensure that
> the start and end expressions were correctly paired.
>
> 7) Suggested solutions to the use cases.
> ---------------------------------------
>
> Only solutions using the "main proposal" are given here.
>
>
> RE-1
> <xsl:template match="xx">
> <div>
> <xsl:apply-regexp-templates select="."/>
> </div>
> </xsl:template>
> ...
> <xsl:regexp-template match="&#xA;">
> <br/>
> </xsl:regexp-template>
>
>
> As commented above this doesn't use regexp but is a natural simple
> example that is quite hard to do in XSLT1 (or even XSLT2 as
> currently drafted). If the input string has not been through an XML
> parser (as will be possible in XSLT2) then even this case might
> benefit from simple regexp support, changing the match to
> &#xD;?&#xA;|&#xD;
>
> RE-2
> <xsl:apply-regexp-templates select="'17th January, 2002'"/>
> ...
> <xsl:regexp-template
> match="^ *([0-9][0-9]?)[a-z]* +([A-Za-z]{3})[a-z]*,? +([0-9]+) *$">
> <!-- groups: 1 = day, 2 = first three letters of month, 3 = year -->
> <xsl:variable name="y" select="number(current-match()[3])"/>
> <xsl:value-of select="if ($y &lt; 10) then
> (1900 + $y) else
> (if ($y &lt; 1000) then (2000 + $y) else $y)"/>
> <xsl:choose>
> <xsl:when test="lower(current-match()[2])='jan'">-01-</xsl:when>
> <xsl:when test="lower(current-match()[2])='feb'">-02-</xsl:when>
> ...
> </xsl:choose>
> <xsl:value-of select="format-number(current-match()[1],'00')"/>
> </xsl:regexp-template>
> ....
>
>
> RE-3
> Assuming the CSV string is the current item.
>
> <table>
> <xsl:apply-regexp-templates mode="row" select="."/>
> </table>
> ..
> <xsl:regexp-template mode="row" match="^(.*)$">
> <row>
> <xsl:apply-regexp-templates mode="cell" select="."/>
> </row>
> </xsl:regexp-template>
>
> <xsl:regexp-template mode="cell" match="([^,]*)(,|$)">
> <cell>
> <xsl:value-of select="current-match()[1]"/>
> </cell>
> </xsl:regexp-template>
>
> Note this assumes that ^ . and $ are line based even within a larger
> string. (This is like emacs regexp, but unlike sed. Perl has a
> switch that allows ^ and $ to change between matching ends of lines
> and matching the ends of the string.)
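A rough Python equivalent of the two-level row and cell processing, shown only to make the line-based assumption concrete. The function name `csv_to_table` is hypothetical. `re.MULTILINE` plays the role of the Perl switch mentioned above, empty lines are skipped, and the cell step splits on the comma rather than using the `([^,]*)(,|$)` pattern, since in Python that pattern also yields an empty match at the end of each row. Quoted fields are not handled.

```python
import re
from xml.sax.saxutils import escape

def csv_to_table(text):
    """Turn simple comma-separated text into a table/row/cell markup string."""
    rows = []
    # With re.MULTILINE, ^ and $ match at the start and end of every line.
    for line in re.finditer(r"^(.+)$", text, flags=re.MULTILINE):
        cells = "".join(
            "<cell>%s</cell>" % escape(cell)
            for cell in re.split(",", line.group(1))
        )
        rows.append("<row>%s</row>" % cells)
    return "<table>%s</table>" % "".join(rows)

print(csv_to_table("1,2\n3,4"))
```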
>
> RE-4a
> <xsl:regexp-template mode="unicodetotex" match="\$">
> <xsl:text>\$</xsl:text>
> </xsl:regexp-template>
> ...
>
> RE-4b
> <xsl:regexp-template mode="textounicode" match="\\\$">
> <xsl:text>$</xsl:text>
> </xsl:regexp-template>
>
> The only interest here would be the use of priority to control
> tex macro names which would match the same regexp.
>
> <xsl:regexp-template mode="textounicode"
> match="\\lt" priority="2">
> <xsl:regexp-template mode="textounicode" match="\\l"
> priority="1">
>
> although an alternative would be to explicitly end each regexp with
> a match for a non-letter (as TeX macro names only consist of letters
> (or a single non-letter, such as \$))
>
>
> RE-5
> Given a top level param $keyword containing a word to be highlighted,
> one or more constructs should allow a
> string expression such as concat('\b',$keyword,'\b') or an AVT
> \b{$keyword}\b
> in order to construct the required regexp to match this word.
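For comparison, a small Python sketch of building such a regexp at run time from a list of words. The function names and the `<b>` wrapper are hypothetical choices for illustration. One point the sketch makes explicit is that the words should be escaped before being spliced into the expression, otherwise a keyword containing a metacharacter changes the meaning of the regexp.

```python
import re

def keyword_regexp(words):
    """Build a word-boundary-delimited alternation from plain words."""
    alternatives = "|".join(re.escape(word) for word in words)
    return r"\b(?:" + alternatives + r")\b"

def highlight(text, words):
    """Wrap every whole-word occurrence of the given words in <b>...</b>."""
    return re.sub(keyword_regexp(words),
                  lambda m: "<b>" + m.group(0) + "</b>",
                  text)

print(highlight("the cat sat on the catalog", ["cat"]))
```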
>
> RE-6a
> <xsl:regexp-template match="\\([a-z]+){" priority="2">
> <start name="{current-match()[1]}"/>
> </xsl:regexp-template>
>
> <xsl:regexp-template match="{" priority="1">
> <start name=""/>
> </xsl:regexp-template>
>
>
> <xsl:regexp-template match="}">
> <end/>
> </xsl:regexp-template>
>
>
> Applying the above regexp templates to the example \frac{1 +
> \sin{2}}{3 + \cos{4}} would
> produce the following sequence (of text nodes and elements), putting
> them in a containing <x> element, so as to test the following
> stylesheet.
>
> <start name="frac"/>1 + <start name="sin"/>2<end/><end/><start
> name=""/>3 + <start name="cos"/>4<end/><end/>
>
> To get to here
> <frac>
> <num>1 + <sin>2</sin></num>
> <denom>3 + <cos>4</cos></denom>
> </frac>
>
> we just need to use standard XSLT constructs, for example this XSLT1
> stylesheet does the job. It is however noticeable that to handle the
> nesting in this sequence the current XPath2 constructs are very
> limited, primarily due to the lack of higher order functions.
> The provided "for" operator is only really suitable when each
> item of a
> sequence is to be processed independently. A standard
> functional operator
> such as fold would allow accumulation of information along
> the sequence.
> Here this problem is circumvented by first converting the sequence to
> a tree so that the sibling-axis gives the required access, but this
> seems at odds with the apparent desire to make such operations
> possible at the sequence level.
>
> <xsl:stylesheet
> xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
>
> <xsl:template match="x">
> <xsl:text>
> </xsl:text>
> <x>
> <xsl:apply-templates mode="match" select="node()[1]"/>
> </x>
> </xsl:template>
>
> <xsl:template match="text()" mode="match">
> <xsl:param name="n" select="0"/>
> <xsl:copy-of select="."/>
> <xsl:apply-templates mode="match"
> select="following-sibling::node()[1]">
> <xsl:with-param name="n" select="$n"/>
> </xsl:apply-templates>
> </xsl:template>
>
> <xsl:template match="start" mode="match">
> <xsl:param name="n" select="0"/>
> <xsl:variable name="end"
> select="following-sibling::end[count(following-sibling::end)
> - count(following-sibling::start)=$n][1]"/>
> <xsl:element name="{@name}">
> <xsl:apply-templates
> select="following-sibling::node()[1]" mode="match">
> <xsl:with-param name="n" select="$n+1"/>
> </xsl:apply-templates>
> </xsl:element>
> <xsl:apply-templates
> select="$end/following-sibling::node()[1]" mode="match">
> <xsl:with-param name="n" select="$n"/>
> </xsl:apply-templates>
> </xsl:template>
>
> <xsl:template match="start[@name='frac']" mode="match">
> <xsl:param name="n" select="0"/>
> <frac>
> <num>
> <xsl:apply-templates
> select="following-sibling::node()[1]" mode="match">
> <xsl:with-param name="n" select="$n+1"/>
> </xsl:apply-templates>
> </num>
> <denom>
> <xsl:apply-templates
> select="following-sibling::end[count(following-sibling::end)
> - count(following-sibling::start)=$n][1]/following-sibling::node()[2]"
> mode="match">
> <xsl:with-param name="n" select="$n+1"/>
> </xsl:apply-templates>
> </denom>
> </frac>
> <xsl:apply-templates
> select="following-sibling::end[count(following-sibling::end)
> - count(following-sibling::start)=$n][2]/following-sibling::node()[1]"
> mode="match">
> <xsl:with-param name="n" select="$n"/>
> </xsl:apply-templates>
> </xsl:template>
>
> <xsl:template match="end" mode="match"/>
>
>
> </xsl:stylesheet>
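The same tokenise-then-count idea can be sketched outside XSLT. The Python below is an illustration under stated assumptions rather than a translation of the stylesheet: a single regexp finds the three kinds of delimiter, and an explicit stack does the pairing. The name `tex_to_xml` is hypothetical, text is not escaped, a bare `{` is accepted only as the second argument of \frac, and an unbalanced `}` raises `IndexError`.

```python
import re

# \name{  or a bare {  or  }
TOKEN = re.compile(r"\\([a-z]+)\{|\{|\}")

def tex_to_xml(text):
    """Convert the simplified TeX of RE-6a into an XML string."""
    out, stack, pos = [], [], 0
    expect_denom = False              # set once the <num> of a frac is closed
    for m in TOKEN.finditer(text):
        out.append(text[pos:m.start()])   # plain text before the token
        pos = m.end()
        if m.group(0) == "}":
            name = stack.pop()
            out.append("</%s>" % name)
            if name == "num":
                expect_denom = True
            elif name == "denom":
                out.append("</%s>" % stack.pop())   # closes <frac>
        elif m.group(1) == "frac":
            stack += ["frac", "num"]
            out.append("<frac><num>")
        elif m.group(1):
            stack.append(m.group(1))
            out.append("<%s>" % m.group(1))
        elif expect_denom:
            expect_denom = False
            stack.append("denom")
            out.append("<denom>")
        else:
            raise ValueError("unexpected '{' at offset %d" % m.start())
    out.append(text[pos:])
    return "".join(out)

print(tex_to_xml(r"\frac{1 + \sin{2}}{3 + \cos{4}}"))
```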
>
>
>
> RE-6b.
> Parsing well formed XML does not really present any difficulties not
> presented in the TeX case. The complication of macros taking more
> than one {} group as arguments (as in the frac example above)
> does not occur, although the regexps would need to be extended to
> deal with empty element syntax and attributes. Full details are not
> presented here.
>
> RE-6c.
> As mentioned above, the general case of parsing HTML is out of scope
> however simple cases of omitted tags could be dealt with using the
> priority attribute on templates.
> A high priority template matching "</li>\s-*<li>" would handle the
> case where the end tag was explicit, and a lower priority template
> matching "<li>" would match in other cases, handling the implied
> closing of the previous element.
>
> RE-7
> This case differs from RE-4 as there is a strong left-to-right (or at
> least reading direction) bias. Replacements should happen at the
> start of the string. One possible solution is to simply prefix all
> regexp by a ^ character to denote the start of the string.
> One slight subtlety is that this use of ^ relies on the fact that
> the string is "split up" as each string is found, and later regexp
> apply to the sequence of remaining unmatched portions. If ^ always
> denotes the start of the original string then prefixing all the
> transliteration replacements by ^ would clearly not have the desired
> effect and only the initial characters in the original string would
> be replaced.
>
> RE-8
>
> <xsl:regexp-template match="^ +([^:= ]+)\s+[:=]\s+(.*)$"
>
> ...some standard template contains ...
> <entry>
> <xsl:apply-regexp-templates select="." />
> </entry>
>
> <!-- matches headings -->
> <xsl:regexp-template match="^([^,]+), +([^:]+):$">
> <heading>
> <xsl:value-of select="current-match()[1]" />
> <subheading>
> <xsl:value-of select="current-match()[2]" />
> </subheading>
> </heading>
> </xsl:regexp-template>
>
>
>
> <!-- matches items -->
> <!-- note by having an explicit regexp to pick up the item text you can
> easily extend to the case where you want to match as far as the next
> item or heading.
> Here I'm using the regexp \' which is emacs-regexp for end-of-string
> (as opposed to $ which is end-of-line) so I'll grab everything, for now.
> -->
> <xsl:regexp-template match="^ ([^ ].*)$(.|\n)*\'">
> <item>
> <heading><xsl:value-of select="current-match()[1]" /></heading>
> <xsl:apply-regexp-templates select="current-match()[2]" />
> </item>
> </xsl:regexp-template>
>
>
> <!-- matches pairs -->
> <xsl:regexp-template
> match="^ ([^:= ]+)\s+[:=]\s+([^\(]*)(\([^\(\)]*\))?$">
> <pair name="{current-match()[1]}"
> value="{current-match()[2]}">
> <xsl:apply-regexp-templates select="current-match()[3]"
> mode="pair" />
> </pair>
> </xsl:regexp-template>
>
>
>
> <!-- matches nested pairs -->
> <xsl:regexp-template match="^([^:= ]+)\s+[:=]\s+([^,]+),?"
> mode="pair">
> <pair name="{current-match()[1]}" value="{current-match()[2]}" />
> </xsl:regexp-template>
>
>
>
> <!-- matches sentences -->
> <xsl:regexp-template match="^ ([^\.]+)\." priority="-1">
> <sentence>
> <xsl:value-of select="concat(current-match()[1], '.')" />
> </sentence>
> </xsl:regexp-template>
>
>
>
>
>
>
>
>
>
Received on Wednesday, 23 January 2002 13:26:44 UTC