Re: [XSLT2] OB06 xsl:analyze-string from Oliver Becker on 2004-02-06 (public-qt-comments@w3.org from February 2004)

From: Oliver Becker <obecker@informatik.hu-berlin.de>
Date: Fri, 6 Feb 2004 13:50:37 +0100 (MET)
To: public-qt-comments@w3.org, mhk@mhk.me.uk
Message-Id: <200402061250.i16Cobx8013868@mail.informatik.hu-berlin.de>
Mike,

> I think the working group members would find it very much easier to
> assess the value of what you are proposing if you could provide some
> examples or use cases that show how the facility  would be used, ideally
> comparing the existing solution with the new solution.

Well, David Carlisle already provided a list of use cases in
http://lists.w3.org/Archives/Public/xsl-editors/2002JanMar/0083.html

I cite RE-4 and RE-7:
<citation>
RE-4: Multiple regexp-replace.
  The proposed replace function in F&O replaces substrings matching a
  single regexp but often one wants to replace many strings in parallel.

  I am assuming here that the normal XSLT creation model is followed that
  _all_ replacements take place (where possible, with a suitable priority
  mechanism for controlling clashes) on (substrings of) the original
  string, and a new node tree is constructed. Even when generating strings
  (as here) this differs from  the result of repeatedly calling the replace
  function proposed in the F&O draft as that would, most naturally, apply
  later regexp matching to the _result_ of earlier matches.

  An example recently mentioned on xml-dev:

RE-4a: Going from an XML unicode string to TeX:
    replace &     by \&
            $     by \$
            #169  by \copyright
            #233  by \'{e}
            <     by \lt
            #322 by \l
            ...
RE-4b: The reverse of this transformation.
</citation>
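
To make the point of RE-4 concrete, here is a minimal Python sketch (not XSLT, and not part of the cited text) contrasting parallel replacement on the original string with repeated replacement on intermediate results. The function names `replace_parallel` and `replace_sequential` are hypothetical; the TeX table is the one from RE-4a with the code points written as characters, and the two-entry escaping table is the < and & replacement that comes up later in this message.

```python
import re

# Replacement table from RE-4a (numeric code points written as characters).
TEX = {
    "&": r"\&",
    "$": r"\$",
    "\u00a9": r"\copyright",  # #169
    "\u00e9": r"\'{e}",       # #233
    "<": r"\lt",
    "\u0142": r"\l",          # #322
}

# Table whose replacement text contains another key ("&").
ESCAPES = {"<": "&lt;", "&": "&amp;"}


def replace_parallel(text, table):
    """Replace all keys in one pass over the ORIGINAL text."""
    # Longer keys first, so a longer key is preferred over its own prefix.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # A function replacement is inserted literally (no backslash processing).
    return pattern.sub(lambda m: table[m.group(0)], text)


def replace_sequential(text, table):
    """Apply each replacement to the RESULT of the previous one."""
    for old, new in table.items():
        text = text.replace(old, new)
    return text


tex_result = replace_parallel("R&D caf\u00e9", TEX)
parallel_result = replace_parallel("a<b & c", ESCAPES)      # a&lt;b &amp; c
sequential_result = replace_sequential("a<b & c", ESCAPES)  # a&amp;lt;b &amp; c
```

In the sequential version the "&" produced by the first replacement is escaped again by the second one, which is the clash the parallel model avoids.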

<citation>
RE-7: Transliteration
  Take an input string in AMS cyrillic transliteration scheme and convert
  to Unicode characters. The exact scheme will be omitted here but the
  details are available at http://www.tex.org.
  This differs from the "multiple regexp" example
  in the way conflicting regexp matches need to be handled. For multiple
  regexp matching above one needs a priority mechanism so that certain
  regexp are matched first and lower priority regexp are only applied to
  remaining strings. Transliteration matches need to be applied by
  matching the start of the input string with the longest possible match,
  replacing this by the transliterated sequence, and then finding the
  longest possible match at the start of the remaining string.
  Thus if abc transliterates to X and 
          bcd transliterates to Y
          xab                   Z
          c                     C
          d                     D
  then
    abcd  -> XD
    xabcd -> ZCD
  Thus you could not, for example, start by replacing all abc by X.
</citation>
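
The longest-match-at-the-start rule of RE-7 can be sketched in a few lines of Python. This is only an illustration of the rule using the small table quoted above, not the AMS scheme itself; the function name `transliterate` is hypothetical, and copying unmapped characters through unchanged is an assumption, since the cited text does not say what happens to them.

```python
# Table from the RE-7 example.
TABLE = {"abc": "X", "bcd": "Y", "xab": "Z", "c": "C", "d": "D"}


def transliterate(text, table):
    """Repeatedly replace the longest key that starts at the current position."""
    keys = sorted(table, key=len, reverse=True)  # try longest keys first
    out = []
    pos = 0
    while pos < len(text):
        for key in keys:
            if text.startswith(key, pos):
                out.append(table[key])
                pos += len(key)
                break
        else:
            # Assumption: characters without a mapping are copied unchanged.
            out.append(text[pos])
            pos += 1
    return "".join(out)


print(transliterate("abcd", TABLE))   # XD
print(transliterate("xabcd", TABLE))  # ZCD

# Replacing all "abc" first gives a different result for the second input.
naive = transliterate("xabcd".replace("abc", "X"), TABLE)
print(naive)                          # xXD
```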


A use case from my own work: create an HTML representation (verbatim
with syntax highlighting) for a given XML source. Without discussing whether
this is the right approach, I have to replace within the text content
  newlines by <br>
  spaces by #160
  < by &lt;
  & by &amp;
The current specification of xsl:analyze-string requires a nested
invocation like this:
<xsl:analyze-string select="." regex="\n">
  <xsl:matching-substring><br /></xsl:matching-substring>
  <xsl:non-matching-substring>
    <xsl:analyze-string select="." regex=" ">
      <xsl:matching-substring>&#160;</xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:analyze-string select="." regex="[&lt;]">
          <xsl:matching-substring>&amp;lt;</xsl:matching-substring>
          <xsl:non-matching-substring>
            <xsl:analyze-string select="." regex="[&amp;]">
              <xsl:matching-substring>&amp;amp;</xsl:matching-substring>
              <xsl:non-matching-substring>
                <xsl:value-of select="." />
              </xsl:non-matching-substring>
            </xsl:analyze-string>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:non-matching-substring>
</xsl:analyze-string>

(I don't use regular expressions very often, so please excuse possible mistakes.)

The new proposal allows a shorter notation:
<xsl:analyze-string select=".">
  <xsl:matching-substring regex="\n"><br /></xsl:matching-substring>
  <xsl:matching-substring regex=" ">&#160;</xsl:matching-substring>
  <xsl:matching-substring regex="[&lt;]">&amp;lt;</xsl:matching-substring>
  <xsl:matching-substring regex="[&amp;]">&amp;amp;</xsl:matching-substring>
  <xsl:non-matching-substring>
    <xsl:value-of select="." />
  </xsl:non-matching-substring>
</xsl:analyze-string>

(It looks a little bit like choose/when/otherwise, but the semantics is
left-to-right matching with the longest possible initial string.
If two or more branches would match the same longest string, then the
first branch is used, i.e. priority is given by the order of the
matching-substring branches.)
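
As a rough illustration of these semantics outside XSLT, here is a minimal Python sketch. It is one possible reading of the rule, not a specification: at each position every branch is tried, the longest match wins, ties go to the earlier branch, and characters matched by no branch are collected into a non-matching run. The name `analyze` and the handler-function interface are hypothetical. Note also that Python picks the first alternative inside a single regex such as `do|double`, which is not necessarily the longest one.

```python
import re


def analyze(text, branches, non_matching=lambda s: s):
    """branches: list of (regex, handler); a handler maps a match to a string."""
    compiled = [(re.compile(rx), fn) for rx, fn in branches]
    out, pending, pos = [], [], 0
    while pos < len(text):
        best = None
        for rx, fn in compiled:
            m = rx.match(text, pos)  # match anchored at the current position
            # Ignore empty matches; only a strictly longer match displaces
            # an earlier branch, so ties keep the first branch.
            if m and m.end() > pos and (best is None or m.end() > best[0].end()):
                best = (m, fn)
        if best is None:
            pending.append(text[pos])  # no branch matches here
            pos += 1
            continue
        if pending:
            out.append(non_matching("".join(pending)))
            pending = []
        m, fn = best
        out.append(fn(m))
        pos = m.end()
    if pending:
        out.append(non_matching("".join(pending)))
    return "".join(out)


# The four replacements from the HTML use case above.
html_branches = [
    (r"\n", lambda m: "<br>"),
    (r" ", lambda m: "\u00a0"),
    (r"<", lambda m: "&lt;"),
    (r"&", lambda m: "&amp;"),
]
html_result = analyze("a <b\n&", html_branches)

# A keyword branch ahead of an identifier branch: "do" is a keyword,
# "double" is an identifier because the identifier match is longer.
code_branches = [
    (r"if|while|for|do", lambda m: "<b>" + m.group(0) + "</b>"),
    (r"[a-zA-Z_][a-zA-Z0-9_]*", lambda m: m.group(0)),
]
code_result = analyze("do double", code_branches)
print(code_result)  # <b>do</b> double
```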


Another use case: pretty printing of code examples for common programming
languages. Suppose someone writes a book about Java, C or whatever in
XML (say DocBook) and wants to include code examples (real code!).
The function unparsed-text() gives access to that code.
Now the author wants to print all keywords in bold, all strings in
italics, etc.
Currently this task seems to be very difficult to solve with the given
regular expression semantics (IMHO; everybody is free to prove me wrong,
of course). In particular, it is difficult to distinguish keywords from
identifiers that contain keywords literally. The only practical solution
currently seems to be to preprocess that code before using XSLT.

Here's a fragment of a solution with the proposed semantics:
<xsl:analyze-string select=".">
  <!-- keywords -->
  <xsl:matching-substring regex="if|while|for|do| ....">   <!-- etc -->
     <b><xsl:value-of select="." /></b>
  </xsl:matching-substring>
  <!-- strings (simplified) -->
  <xsl:matching-substring regex="&quot;([^&quot;]*)&quot;">
     <xsl:text>"</xsl:text>
     <i><xsl:value-of select="regex-group(1)" /></i>
     <xsl:text>"</xsl:text>
  </xsl:matching-substring>
  <!-- identifiers (might contain keywords as substrings) -->
  <xsl:matching-substring regex="[a-zA-Z_][a-zA-Z0-9_]*">
     <xsl:value-of select="." />
  </xsl:matching-substring>
  <!-- characters that need escaping: &lt; &amp; (not shown, see above)-->
     ....
  <!-- everything else -->
  <xsl:non-matching-substring>
     <xsl:value-of select="." />
  </xsl:non-matching-substring>
</xsl:analyze-string>

     
I hope these examples are convincing.
Best regards,
Oliver


/-------------------------------------------------------------------\
|  ob|do        Dipl.Inf. Oliver Becker                             |
|  --+--        E-Mail: obecker@informatik.hu-berlin.de             |
|  op|qo        WWW:    http://www.informatik.hu-berlin.de/~obecker |
\-------------------------------------------------------------------/
Received on Friday, 6 February 2004 07:50:58 UTC