- From: Gunther Schadow <gunther@aurora.regenstrief.org>
- Date: Wed, 15 Dec 2004 13:33:43 -0500
- To: public-qt-comments@w3.org
This comment is against the XQuery 1.0 and XPath 2.0 Functions and Operators W3C Working Draft 29 October 2004. An extended version of this comment has been submitted as an XSLT 2.0 comment as well, to include considerations that are specific to XSLT.

ISSUE

The regular expression syntax allowed by F&O is so limited that it unduly restricts XSLT's analyze-string instruction. The editors of XPath 2.0 F&O have referred to the XML Schema Datatypes subset of common regex syntax but added back a few features. In turn, the editors of XML Schema Datatypes felt that Perl 5 regex was a de-facto standard, but narrowed this standard down considerably. Unfortunately they kept much of the extraneous syntactic sugar like \p{...} and \s, \d etc. that is very easy to replace with generic character classes [...]. But they removed other features such as boundary matches \b, non-capturing groups (?:...), and positive and negative look-ahead (?=...) (?!...), which are extremely hard to emulate but invaluable when using regex for string processing in XPath fn:replace, rather than for simple string pattern validation, which is all that XML Schema Datatypes needs.

MOTION #1: XPath 2.0 F&O should adopt a more complete set of Perl 5 regex features into the standard, regardless of XML Schema. XML Schema's use case for regex is simply validation of complete strings. XPath and XSLT use cases go far beyond simple validation, and cover actual string parsing and processing, including head-matches or substring seek.

MOTION #2: If motion #1 is not accepted, XPath should PERMIT XPath processors to offer a more complete subset of Perl 5 regex syntax as a user option. Since all XPath processors can use readily available regex implementations rather than developing their own, it is paradoxical that they have to spend considerable effort restricting the regex language to the little that is permitted by XPath 2.0 F&O.
Everyone has to do more work to deal with the restriction than they would have to do to deal with a more complete regex syntax.

WORKAROUND

For the match and replace XPath functions an easy workaround exists: simply do not use them, and use user extension functions instead. Those are easy to write in Saxon. However, user extension functions severely impact interoperability, even more so than allowing XPath processors to offer a more complete set of regex features would.

DETAIL OF MISSING FEATURES

Below is a detailed analysis of regex features, starting from the documentation of the java.util.regex.Pattern syntax (which in turn is largely Perl 5). I am also referring to the Perl 5 and Microsoft .NET regular expression language specifications to make sure that each feature proposed for addition is widely supported.

BOUNDARY MATCHES

  \b   A word boundary
  \B   A non-word boundary
  \A   The beginning of the input (needed besides ^ when in multi-line mode)
  \G   The end of the previous match
  \Z   The end of the input but for the final terminator, if any
  \z   The end of the input

All of the above boundary matches are supported in Perl 5, java.util.regex, and Microsoft .NET, as well as many, many other regex implementations. Boundary matches are very difficult to emulate by any other means, and at the same time, boundary matches were part of even the earliest and otherwise most spartan regex formalisms, as in Unix grep and sed.

MOTION #1.1.1: Add all of the boundary matches with the possible exception of \G.

MOTION #1.1.2: If #1.1.1 is not accepted, at the very least the word boundary match \b should be added.

Rationale: \G could be emulated by feeding a substring to the subsequent match and is supported automatically by the XSLT analyze-string instruction.

NON-CAPTURING GROUPS

  (?:X)   X, as a non-capturing group

Widely supported in Perl 5, java.util.regex, .NET, and most others. Important to use capturing groups more efficiently.
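As a rough illustration of the word boundary match and the non-capturing group, here is a minimal sketch. It uses Python's re module purely as a stand-in for a Perl 5 style regex engine; it is not an XPath 2.0 F&O implementation, and the sample strings and variable names are invented for the example.

```python
import re

# Hypothetical sample input, invented for this illustration.
text = "cat concatenate cat's bobcat"

# \b anchors the match at word boundaries, so only the free-standing
# occurrences of "cat" are replaced.
with_boundary = re.sub(r"\bcat\b", "dog", text)

# Without \b every occurrence is replaced, including those inside
# longer words.
without_boundary = re.sub(r"cat", "dog", text)

# (?:...) groups the alternation without creating a capturing group,
# so the only numbered group is the one holding the digits.
pattern = re.compile(r"(?:STATE|EMPLOY)MENT (\d+)")
match = pattern.search("see EMPLOYMENT 42")

print(with_boundary)
print(without_boundary)
print(pattern.groups, match.group(1))
```

Without \b, the same effect would need explicit character classes on both sides of the word plus special handling for the start and end of the input, which is the kind of emulation effort described above.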
Non-capturing parentheses are needed very frequently in conjunction with the alternative operation, e.g. (?:STATE|EMPLOY)MENT. Making all those parentheses capturing is not only a waste of memory, it is also a problem when you want to use back-references, as you can have at most 10 capturing groups for use with back-references.

MOTION #1.2: Add non-capturing groups.

  (?idmsux-idmsux:X)   X, as a non-capturing group with the given flags on - off

This would allow turning flags on and off inside more complex regexes. Supported by Perl, Java, and .NET.

MOTION #1.3: Add these flags for non-capturing groups.

  (?=X)   X, via zero-width positive lookahead
  (?!X)   X, via zero-width negative lookahead

These constructs are very useful for partial head matching. Supported by Perl, Java, and .NET.

MOTION #1.4: Add positive and negative lookahead.

  (?<=X)  X, via zero-width positive lookbehind
  (?<!X)  X, via zero-width negative lookbehind
  (?>X)   X, as an independent, non-capturing group

The final three are a bit more esoteric, but they are supported by Perl, Java, and .NET.

-- 
Gunther Schadow, M.D., Ph.D.    gschadow@regenstrief.org
Associate Professor             Indiana University School of Informatics
Regenstrief Institute, Inc.     Indiana University School of Medicine
tel:1(317)630-7960              http://aurora.regenstrief.org
Received on Thursday, 16 December 2004 10:49:49 UTC