- From: Oliver Becker <obecker@informatik.hu-berlin.de>
- Date: Fri, 6 Feb 2004 13:50:37 +0100 (MET)
- To: public-qt-comments@w3.org, mhk@mhk.me.uk
Mike,

> I think the working group members would find it very much easier to
> assess the value of what you are proposing if you could provide some
> examples or use cases that show how the facility would be used, ideally
> comparing the existing solution with the new solution.

Well, David Carlisle already provided a list of use cases in
http://lists.w3.org/Archives/Public/xsl-editors/2002JanMar/0083.html

I cite RE-4 and RE-7:

<citation>
RE-4: Multiple regexp-replace.

The proposed replace function in F&O replaces substrings matching a
single regexp but often one wants to replace many strings in parallel.
I am assuming here that the normal XSLT creation model is followed
that _all_ replacements take place (where possible, with a suitable
priority mechanism for controlling clashes) on (substrings of) the
original string, and a new node tree is constructed. Even when
generating strings (as here) this differs from the result of
repeatedly calling the replace function proposed in the F&O draft as
that would, most naturally, apply later regexp matching to the
_result_ of earlier matches.

An example recently mentioned on xml-dev:

RE-4a: Going from an XML unicode string to TeX:
replace
  &    by \&
  $    by \$
  #169 by \copyright
  #233 by \'{e}
  <    by \lt
  #322 by \l
  ...

RE-4b: The reverse of this transformation.
</citation>

<citation>
RE-7: Transliteration

Take an input string in AMS cyrillic transliteration scheme and convert
to Unicode characters. The exact scheme will be omitted here but the
details are available at http://www.tex.org.

This differs from the "multiple regexp" example in the way conflicting
regexp matches need to be handled. For multiple regexp matching above
one needs a priority mechanism so that certain regexp are matched first
and lower priority regexp are only applied to remaining strings.
Transliteration matches need to be applied by matching the start of the
input string with the longest possible match, replacing this by the
transliterated sequence, and then finding the longest possible match at
the start of the remaining string.

Thus if
  abc transliterates to X and
  bcd transliterates to Y
  xab                   Z
  c                     C
  d                     D
then
  abcd  -> XD
  xabcd -> ZCD

Thus you could not, for example, start by replacing all abc by X.
</citation>

A use case from my own work: create an HTML representation (verbatim
with syntax highlighting) for a given XML source. Without discussing
whether this is the right approach, I have to replace within the text
content

  newlines by <br>
  spaces   by #160
  <        by &lt;
  &        by &amp;

The current specification of xsl:analyze-string requires a nested
invocation like this:

  <xsl:analyze-string select="." regex="\n">
    <xsl:matching-substring><br /></xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:analyze-string select="." regex=" ">
        <xsl:matching-substring>&#160;</xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:analyze-string select="." regex="[&lt;]">
            <xsl:matching-substring>&amp;lt;</xsl:matching-substring>
            <xsl:non-matching-substring>
              <xsl:analyze-string select="." regex="[&amp;]">
                <xsl:matching-substring>&amp;amp;</xsl:matching-substring>
                <xsl:non-matching-substring>
                  <xsl:value-of select="." />
                </xsl:non-matching-substring>
              </xsl:analyze-string>
            </xsl:non-matching-substring>
          </xsl:analyze-string>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:non-matching-substring>
  </xsl:analyze-string>

(I don't use regular expressions very often, so please excuse possible
mistakes.)

The new proposal allows a shorter notation:

  <xsl:analyze-string select=".">
    <xsl:matching-substring regex="\n"><br /></xsl:matching-substring>
    <xsl:matching-substring regex=" ">&#160;</xsl:matching-substring>
    <xsl:matching-substring regex="[&lt;]">&amp;lt;</xsl:matching-substring>
    <xsl:matching-substring regex="[&amp;]">&amp;amp;</xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="." />
    </xsl:non-matching-substring>
  </xsl:analyze-string>

(It looks a little bit like choose/when/otherwise, but the semantics is
left-to-right matching with the longest possible initial string. If
there are two or more branches that would match the same longest string
then the first branch will be used, i.e. a priority is given by the
order of the matching-substring branches.)

Another use case: pretty printing of code examples for common
programming languages. Suppose someone writes a book about Java, C or
whatever in XML (say DocBook) and wants to include code examples (real
code!). The function unparsed-text() gives access to that code. Now the
author wants to print all keywords in bold font, all strings in italics
etc ...

Currently this task seems to be very difficult to solve with the given
regular expression semantics (IMHO; everybody is free to prove me wrong,
of course). In particular, it is difficult to distinguish keywords from
identifiers that contain keywords literally. The only practical solution
currently seems to be to preprocess that code before using XSLT.

Here's a fragment of a solution with the proposed semantics:

  <xsl:analyze-string select=".">
    <!-- keywords -->
    <xsl:matching-substring regex="if|while|for|do| ...."> <!-- etc -->
      <b><xsl:value-of select="." /></b>
    </xsl:matching-substring>
    <!-- strings (simplified) -->
    <xsl:matching-substring regex="&quot;([^&quot;]*)&quot;">
      <xsl:text>"</xsl:text>
      <i><xsl:value-of select="regex-group(1)" /></i>
      <xsl:text>"</xsl:text>
    </xsl:matching-substring>
    <!-- identifiers (might contain keywords as substrings) -->
    <xsl:matching-substring regex="[a-zA-Z_][a-zA-Z0-9_]*">
      <xsl:value-of select="." />
    </xsl:matching-substring>
    <!-- characters that need escaping: < & (not shown, see above) -->
    ....
    <!-- everything else -->
    <xsl:non-matching-substring>
      <xsl:value-of select="." />
    </xsl:non-matching-substring>
  </xsl:analyze-string>

I hope these examples are convincing.
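The matching rule described above (scan left to right, take the longest match among the branches at the current position, and let the earlier branch win a tie) can be tried out outside XSLT. What follows is only a rough Python sketch under stated assumptions, not part of the proposal: the function name analyze_string, the handler callbacks and the two rule tables are hypothetical, Python's re dialect differs from the XPath regex dialect, and "longest" is compared across branches only (within one branch Python returns its own preferred match).

```python
import re
from typing import Callable, List, Tuple

# One branch = (regex, handler producing the replacement text).
# Hypothetical stand-in for a matching-substring branch.
Branch = Tuple[str, Callable[[re.Match], str]]


def analyze_string(text: str, branches: List[Branch],
                   non_matching: Callable[[str], str] = lambda s: s) -> str:
    """Left-to-right scan; at each position the longest branch match wins,
    and the earlier branch wins when two matches have the same length."""
    compiled = [(re.compile(rx), handler) for rx, handler in branches]
    out: List[str] = []
    pending: List[str] = []  # characters where no branch matched
    pos = 0
    while pos < len(text):
        best = None
        best_handler = None
        for rx, handler in compiled:
            m = rx.match(text, pos)  # anchored at the current position
            # Empty matches are ignored; only a strictly longer match
            # replaces the current best, so ties keep the earlier branch.
            if m and m.end() > pos and (best is None or m.end() > best.end()):
                best, best_handler = m, handler
        if best is None:
            pending.append(text[pos])
            pos += 1
            continue
        if pending:
            out.append(non_matching("".join(pending)))
            pending = []
        out.append(best_handler(best))
        pos = best.end()
    if pending:
        out.append(non_matching("".join(pending)))
    return "".join(out)


# Rule table from the RE-7 citation.
TRANSLIT: List[Branch] = [
    ("abc", lambda m: "X"),
    ("bcd", lambda m: "Y"),
    ("xab", lambda m: "Z"),
    ("c", lambda m: "C"),
    ("d", lambda m: "D"),
]

# Two branches from the pretty-printing fragment: keywords, then identifiers.
HIGHLIGHT: List[Branch] = [
    (r"if|while|for|do", lambda m: f"<b>{m.group(0)}</b>"),
    (r"[a-zA-Z_][a-zA-Z0-9_]*", lambda m: m.group(0)),
]

print(analyze_string("abcd", TRANSLIT))
print(analyze_string("xabcd", TRANSLIT))
print(analyze_string("if (iffy) do", HIGHLIGHT))
```

With these tables the identifier "iffy" is left alone because the identifier branch yields a longer match than the keyword branch, while a bare "if" goes to the keyword branch because it comes first.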
Best regards,
Oliver

/-------------------------------------------------------------------\
|  ob|do        Dipl.Inf. Oliver Becker                             |
|  --+--        E-Mail: obecker@informatik.hu-berlin.de             |
|  op|qo        WWW:    http://www.informatik.hu-berlin.de/~obecker |
\-------------------------------------------------------------------/
Received on Friday, 6 February 2004 07:50:58 UTC