
Re: [F&O] regular expressions: non-capturing groups

From: Tobias Reif <tobiasreif@pinkjuice.com>
Date: Tue, 23 Nov 2004 10:50:59 +0100
To: Svgdeveloper@aol.com
Cc: public-qt-comments@w3.org
Message-ID: <20041123095059.GA2552@linux.local>

Hi

Andrew: I'll try to answer your questions below. If you need further
info, please consider contacting me offlist.

WGs, spec editors and authors, et al: The project and example I
describe are just one potential use case. Non-capturing groups would be
generally useful for many different use cases and many users, whether
or not they're required for this particular one.

The functionality can be considered to be inside the explicitly stated
scope of the spec:

http://www.w3.org/TR/xpath-functions/#regex-syntax
:
"7.6.1 Regular Expression Syntax

The regular expression syntax used by these functions is defined in
terms of the regular expression syntax specified in XML Schema (see
[XML Schema Part 2: Datatypes]), which in turn is based on the
established conventions of languages such as Perl. However, because
XML Schema uses regular expressions only for validity checking, it
omits some facilities that are widely-used with languages such as
Perl. This section, therefore, describes extensions to the XML Schema
regular expressions syntax that reinstate these capabilities."

I don't know whether non-capturing groups are widely used (that depends
on the POV, I suspect), but AFAIK they're an important part of popular
regex implementations such as Perl's.

Since the feature is part of popular regex implementations it probably
wouldn't add more than a few lines to most XSLT2 implementations. It
also would add just a few lines to the spec.

Thanks in advance for considering my late request.

On Mon 2004-11-22 Svgdeveloper@aol.com wrote:
> Do you really need to capture the whitespace character(s) and do an 
> xsl:copy-of of it/them?

As I said, the possible delimiters include whitespace, but they're not
always just whitespace. You could also check the linked XSLT files
for actual examples.

Here are the input files containing the program listings which are to
be marked up:
http://www.pinkjuice.com/howto/vimxml/docbook/

Here's an example of the output:
http://www.pinkjuice.com/howto/vimxml/tasks.xml#creatingdocuments
(e.g. the first XML listing; all other syntax markup is currently
disabled)

> Would a simple replace with a single space character before and 
> after work?

I don't really understand what you mean here.

> If so the following looks, after a brief look at your use case, to be a 
> possible solution:
> 
> regex="\s(((while|true|if|else|end)\s*)+)\s"
> 
> If that doesn't work

Thanks for taking the time to try to help. Unfortunately I don't
understand what you mean, or how it would work. You could either
detail all the changes I should make to the example XSLT, or (probably
much more efficient) apply them yourself and see if your proposed
solution works (I can't do this because I don't know which changes
you're proposing, except for the new regex). If it works for the
following, please send it (on- or off-list).

Here are the input strings from the example with the desired output:

while  true
<span class="keyword">while</span>  <span class="keyword">true</span>

while true
<span class="keyword">while</span> <span class="keyword">true</span>
 
Multiple spaces must not be collapsed (and no (space or other)
characters should be added). Just the keyword is to be marked up,
excluding any delimiter (because the keyword might get styled to be
underlined, or the delimiter might be non-whitespace).

Here, nothing should be matched:

whiletrue

Other possible delimiters include parentheses:
(not included in the current example regex)

(while true end)
(<span class="keyword">while</span> <span class="keyword">true</span> <span class="keyword">end</span>)

A keyword can also be preceded by nothing (start of string, ^) and
followed by nothing (end of string, $).
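
The delimiter rules above can be sketched outside XSLT. What follows is a
minimal illustration in Python's re module, which is not the XPath 2.0 regex
dialect: it relies on a non-capturing group (?:...) and a lookahead (?=...),
and the names KEYWORD and mark_keywords are hypothetical, chosen only for
this example. The leading delimiter is grouped without being numbered, and
the trailing delimiter is checked without being consumed, so neighbouring
keywords can share it.

```python
import re

# Assumption: keyword list taken from the example regex quoted in this thread.
# (?:^|[\s(])  leading delimiter, grouped but not captured
# (...)        the keyword itself, the only numbered group
# (?=$|[\s)])  trailing delimiter, checked but not consumed
KEYWORD = re.compile(r"(?:^|[\s(])(while|true|if|else|end)(?=$|[\s)])")


def mark_keywords(text):
    """Wrap each delimited keyword in a span, leaving delimiters untouched."""

    def wrap(match):
        # Everything matched before group 1 is the leading delimiter
        # (possibly empty when the keyword starts the string).
        delimiter = match.group(0)[: match.start(1) - match.start(0)]
        return f'{delimiter}<span class="keyword">{match.group(1)}</span>'

    return KEYWORD.sub(wrap, text)


if __name__ == "__main__":
    for sample in ["while  true", "while true", "whiletrue", "(while true end)"]:
        print(mark_keywords(sample))
```

Because only the keyword is captured, the replacement never has to know how
many groups the delimiter part of the pattern contains.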

> perhaps you could explain in greater detail what it is you want to
> match.

Lots of different things, as is typical when applying syntax markup to
various programming languages.

Examples:

$(command)
"literal string"
object.method
  # object and method names share the dot as delimiter
object.method(args)
function_call(args)
  # a recognized function
`command`
if true;
  then true;
end
(if x true else false end)
# etc etc

I don't want to use XSLT2 (plus regexen) to fully parse complex code
(I'd have to build a parser for this), but to apply simple markup to
basic stuff such as keywords.

The linked input and XSLT files actually answer your question much
better.

Currently, when I have
  def true
inside a Ruby program in ch04.xml, I get
  <code class="keyword">def</code> true
instead of
  <code class="keyword">def</code> <code class="keyword">true</code>
in the output (moresetup.xml).

But I thought it would simplify things if I created and posted a
short example.
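
One plausible way to arrive at the "def true" output above is a pattern that
consumes the delimiter after each keyword, so the next keyword has no leading
delimiter left to match. The sketch below is a guess at that behaviour, not
the regex from the linked stylesheet; it uses Python's re module rather than
XSLT 2.0, and CONSUMING and mark_consuming are hypothetical names.

```python
import re

# Assumption: both delimiters are captured, and therefore consumed by the match.
CONSUMING = re.compile(r"(^|\s)(def|true)(\s|$)")


def mark_consuming(text):
    """Mark keywords, re-emitting the captured delimiters around them."""
    return CONSUMING.sub(r'\1<code class="keyword">\2</code>\3', text)


if __name__ == "__main__":
    # The single space is consumed as the delimiter after "def",
    # so "true" is never considered a delimited keyword.
    print(mark_consuming("def true"))
    # prints: <code class="keyword">def</code> true
```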

Tobi

-- 
to
  bi
    as
  re
if
Received on Tuesday, 23 November 2004 09:51:05 UTC
