[F&O] regex syntax too limited from Gunther Schadow on 2004-12-15 (public-qt-comments@w3.org from December 2004)

From: Gunther Schadow <gunther@aurora.regenstrief.org>
Date: Wed, 15 Dec 2004 13:33:43 -0500
To: public-qt-comments@w3.org
Message-ID: <41C08387.9040203@aurora.regenstrief.org>
This comment is against the XQuery 1.0 and XPath 2.0 Functions and Operators
W3C Working Draft 29 October 2004. An extended version of this comment 
has been submitted as an XSLT 2.0 comment as well, to include 
considerations that are specific to XSLT.


ISSUE

The regular expression syntax allowed by F&O is so limited that it
unduly restricts XSLT's analyze-string instruction.
The editors of XPath 2.0 F&O have referred to the XML Schema Datatypes 
subset of common regex syntax but added back a few features. In 
turn, the editors of XML Schema Datatypes felt that Perl 5 regex 
was a de-facto standard, but narrowed this standard down considerably.

Unfortunately they kept much of the extraneous syntactic sugar
like \p{...} and \s, \d etc. that is very easy to replace with generic
character classes [...]. But they removed other features such as 
boundary matches \b, non-capturing groups (?:...), and positive and 
negative look-ahead (?=...) (?!...), that are extremely hard to 
emulate but invaluable when using regex for string processing in 
XPath fn:replace rather than for simple string pattern validation,
which is all that is needed in XML Schema Datatypes.

MOTION #1: XPath 2.0 F&O to adopt a more complete set of Perl 5 
regex features into the standard, regardless of XML Schema. 

XML Schema's use case for regex is simply validation of complete 
strings. XPath and XSLT use cases are far beyond simple validation, 
and cover actual string parsing and processing including head-matches
or substring seek. 
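
The difference between these use cases can be sketched as follows. This
is only an illustration using Python's re module as a stand-in for a
Perl-5-style engine; the pattern, the sample strings, and the variable
names are assumptions made up for the example.

```python
import re

# Hypothetical pattern for a date such as 2004-12-15.
iso_date = re.compile(r"\d{4}-\d{2}-\d{2}")

# Validation (the XML Schema use case): the whole string must match.
is_valid = iso_date.fullmatch("2004-12-15") is not None
rejects_longer = iso_date.fullmatch("Date: 2004-12-15 UTC") is None

# Head match: the pattern must match at the start, but may stop early.
head = iso_date.match("2004-12-15 13:33:43")

# Substring seek: the pattern may match anywhere in the input.
found = iso_date.search("Received on 2004-12-16 at 10:49")
```

Only the first of the three modes is needed for validation; the other
two are where boundary matches and look-around become important.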

MOTION #2: If motion #1 is not accepted, XPath should PERMIT XPath
processors to offer a more complete subset of Perl 5 regex syntax 
as a user option. 

Since all XPath processors can use readily available regex 
implementations rather than developing their own, it is 
paradoxical that they have to spend considerable effort restricting
the regex language to the little that is permitted by XPath 2.0 F&O.
Everyone has to do more work to deal with the restriction than they
would have to do to deal with a more complete regex syntax.


WORKAROUND

For the match and replace XPath functions an easy workaround exists:
simply do not use them and use user extension functions instead.
Those are easy to write in Saxon. However, user extension functions
severely impact interoperability, even more so than allowing
XPath processors to offer a more complete set of regex features would.


DETAIL OF MISSING FEATURES

Below is a detailed analysis of regex features starting from the 
documentation of java.util.regex.Pattern syntax (which in turn is
largely Perl 5). I am also referring to the Perl 5 and Microsoft .NET
regular expression language specifications to make sure that each 
feature proposed for addition is widely supported.


BOUNDARY MATCHES

\b A word boundary 
\B A non-word boundary 
\A The beginning of the input (needed besides ^ when in multi-line mode)
\G The end of the previous match 
\Z The end of the input but for the final terminator, if any 
\z The end of the input 

All of the above boundary matches are supported in Perl 5, java.util.regex,
and Microsoft .NET, as well as many, many other regex implementations.

Boundary matches are very difficult to emulate by any other means, 
and at the same time, boundary matches were part of even the earliest 
and otherwise most spartan regex formalisms as in Unix grep and sed.
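
A minimal sketch of why the emulation is awkward, using Python's re
module as a stand-in for a Perl-5-style engine (the sample text and
names are made up for illustration). Without \b, the characters around
the word have to be consumed by the pattern, so they must be captured
and put back on replacement, and adjacent words are missed because the
separator between them has already been consumed.

```python
import re

text = "cat cat"

# With \b the pattern is zero-width at the word edges.
with_boundary = re.sub(r"\bcat\b", "dog", text)

# Emulation without \b: consume a non-word character (or the string
# edge) on each side and put both back in the replacement.
emulated_pattern = r"(^|[^A-Za-z0-9_])cat([^A-Za-z0-9_]|$)"
emulated = re.sub(emulated_pattern, r"\g<1>dog\g<2>", text)

# The emulation misses the second word: the space that separates the
# two words was consumed as the trailing edge of the first match.
print(with_boundary)  # dog dog
print(emulated)       # dog cat
```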

MOTION #1.1.1 Add all of the boundary matches with the possible 
exception of \G.

MOTION #1.1.2 If #1.1.1 is not accepted, at the very least the 
word boundary match \b should be added.

Rationale: \G could be emulated by feeding a substring to the 
subsequent match and is supported automatically by the XSLT
analyze-string instruction.
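
The substring approach to emulating \G might look like the sketch
below, again in Python as a stand-in. The token pattern and the
tokenize helper are hypothetical names introduced only for this
example; each match is anchored at the start of the remaining input,
which is then cut off at the end of the match.

```python
import re

# Hypothetical token: optional whitespace, then a number or a word.
token = re.compile(r"\s*(\d+|[A-Za-z]+)")

def tokenize(text):
    """Split text into tokens, stopping at the first unmatched input."""
    tokens = []
    rest = text
    while rest:
        m = token.match(rest)   # anchored at the start of the substring
        if m is None:
            break               # input does not continue with a token
        tokens.append(m.group(1))
        rest = rest[m.end():]   # feed the remainder to the next match
    return tokens, rest

tokens, rest = tokenize("abc 123 def !! ghi")
print(tokens)  # ['abc', '123', 'def']
```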


NON-CAPTURING GROUPS

(?:X) X, as a non-capturing group 

Widely supported in Perl 5, java.util.regex, .NET, and most others. 
Important for using capturing groups more efficiently. Non-capturing 
parentheses are needed very frequently in conjunction with the 
alternative operation, e.g. (?:STATE|EMPLOY)MENT.

It is not only a waste of memory to make all those parentheses capturing,
but it is also a problem when wanting to use back-references, as you
can have at most 10 capturing groups for use with back-references.

MOTION #1.2 to add non capturing groups.
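
The effect on group numbering can be sketched as follows, using
Python's re module as a stand-in for a Perl-5-style engine. The
trailing number group and the sample input are assumptions added for
the example.

```python
import re

# Plain parentheses: the alternation takes up group 1.
capturing = re.compile(r"(STATE|EMPLOY)MENT (\d+)")

# Non-capturing parentheses: the number is group 1.
non_capturing = re.compile(r"(?:STATE|EMPLOY)MENT (\d+)")

m1 = capturing.search("EMPLOYMENT 42")
m2 = non_capturing.search("EMPLOYMENT 42")

print(m1.groups())  # ('EMPLOY', '42')
print(m2.groups())  # ('42',)
```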


(?idmsux-idmsux:X)   X, as a non-capturing group with the given flags on - off 

This would allow turning flags on and off inside more complex regexes. 
Supported by Perl, Java, and .NET.

MOTION #1.3 to add these flags for non-capturing groups
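
A small sketch of a locally scoped flag, in Python as a stand-in
(Python accepts this syntax from version 3.6 on, with a somewhat
different flag set than Java). The keyword and sample strings are made
up for illustration.

```python
import re

# Case-insensitive only inside the group; "Name" stays case-sensitive.
scoped = re.compile(r"(?i:select) Name")

# For comparison: the flag applied to the whole expression.
whole = re.compile(r"select Name", re.IGNORECASE)

upper_keyword = scoped.fullmatch("SELECT Name") is not None
lower_name = scoped.fullmatch("select name") is not None
lower_name_whole = whole.fullmatch("select name") is not None

print(upper_keyword, lower_name, lower_name_whole)  # True False True
```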


(?=X) X, via zero-width positive lookahead 
(?!X) X, via zero-width negative lookahead 

These constructs are very useful for partial head matching. Supported
by Perl, Java, and .NET.

MOTION #1.4 to add positive and negative lookahead
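
Two short illustrations, in Python as a stand-in for a Perl-5-style
engine; the sample strings are assumptions chosen for the example. The
lookahead tests what follows without consuming it, so the matched text
and any replacement stay limited to the part of interest.

```python
import re

# Positive lookahead: numbers that are followed by "px"; the unit
# itself is not part of the match.
pixel_values = re.findall(r"\d+(?=px)", "10px 20em 30px")

# Negative lookahead: escape only those ampersands that do not
# already start the entity "&amp;".
escaped = re.sub(r"&(?!amp;)", "&amp;", "a & b &amp; c")

print(pixel_values)  # ['10', '30']
print(escaped)       # a &amp; b &amp; c
```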


(?<=X) X, via zero-width positive lookbehind 
(?<!X) X, via zero-width negative lookbehind 
(?>X) X, as an independent, non-capturing group 

The final three are a bit more esoteric, but they are supported by
Perl, Java, and .NET.
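
The two lookbehind forms can be sketched as below, in Python as a
stand-in; the sample string is made up. The independent group (?>X) is
left out of the sketch because, as far as I know, Python's re module
only accepts it from version 3.11 on.

```python
import re

text = "cost $15 or 20 EUR"

# Positive lookbehind: digits directly preceded by a dollar sign.
dollar_amounts = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: whole numbers not preceded by a dollar sign.
# The \b keeps the engine from matching the "5" inside "$15".
other_amounts = re.findall(r"(?<!\$)\b\d+", text)

print(dollar_amounts)  # ['15']
print(other_amounts)   # ['20']
```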


-- 
Gunther Schadow, M.D., Ph.D.                  gschadow@regenstrief.org
Associate Professor           Indiana University School of Informatics
Regenstrief Institute, Inc.      Indiana University School of Medicine
tel:1(317)630-7960                       http://aurora.regenstrief.org
Received on Thursday, 16 December 2004 10:49:49 UTC