[Bug 4106] [F+O] regex syntax: position of backreference from bugzilla@wiggum.w3.org on 2006-12-21 (public-qt-comments@w3.org from December 2006)

From: <bugzilla@wiggum.w3.org>
Date: Thu, 21 Dec 2006 23:35:46 +0000
To: public-qt-comments@w3.org
CC:
Message-Id: <E1GxXRy-0008Qp-Fq@wiggum.w3.org>

http://www.w3.org/Bugs/Public/show_bug.cgi?id=4106

           Summary: [F+O] regex syntax: position of backreference
           Product: XPath / XQuery / XSLT
           Version: Proposed Recommendation
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Functions and Operators
        AssignedTo: ashok.malhotra@oracle.com
        ReportedBy: mike@saxonica.com
         QAContact: public-qt-comments@w3.org


In the syntax of regular expressions, a backreference is currently allowed as a
charClassEsc. As such, it can appear either outside square brackets, for example:

(abc)\1

or within square brackets:

(abc)[\1]

However, it doesn't make sense to allow a backreference within square brackets,
because constructs allowed within square brackets always match a single
character (perhaps one of a set of possible characters, but never a sequence of
more than one character), while a backreference will in general match a
sequence of characters.
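
The difference between the two constructs can be sketched as follows. This is
only an illustration, assuming Python's re module as a stand-in: it is not the
XPath/XQuery regex dialect under discussion, and the variable names are
hypothetical. It shows that a bracketed class consumes exactly one character,
while a backreference outside brackets consumes the whole captured sequence.

```python
import re

# Assumption: Python's re module stands in for the F+O regex dialect here.
# A character class consumes exactly one character of the input.
char_class = re.compile(r"(abc)[abc]")
# A backreference outside brackets repeats whatever group 1 captured.
backref = re.compile(r"(abc)\1")

# One extra character after "abc": only the character class can match it.
class_matches_one_char = char_class.fullmatch("abca") is not None
backref_matches_one_char = backref.fullmatch("abca") is not None

# A repeated three-character sequence: only the backreference can match it.
class_matches_sequence = char_class.fullmatch("abcabc") is not None
backref_matches_sequence = backref.fullmatch("abcabc") is not None

print(class_matches_one_char, backref_matches_one_char)
print(class_matches_sequence, backref_matches_sequence)
```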

I think that backreferences should appear in the syntax at the level of an
atom:

[9] atom ::= Char | charClass | ( '(' regExp ')' ) | backReference

I have not been able to find a similar restriction documented for regular
expressions in Perl, Java, or .NET. However, none of these languages attempts to
define the syntax of regular expressions using a BNF grammar, or to give a
rigorous exposition of the semantics. Experiments with Java suggest that
"(abc)[\1]" is accepted as a valid regular expression, but its semantics appear
to be undefined: I am unable to identify any string that it matches.
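
As a further data point from an engine not covered above, the same kind of
experiment can be run against Python's re module. This is a sketch under that
assumption, not a statement about Perl, Java, or .NET, and the variable names
are hypothetical. Python documents that numeric escapes inside square brackets
are treated as characters, so "\1" in a class is read as an octal escape for
U+0001 rather than as a backreference.

```python
import re

# Assumption: this probes Python's re module only; other engines may differ.
pattern = re.compile(r"(abc)[\1]")

# Inside brackets Python reads \1 as an octal escape for U+0001,
# so the class matches that single control character.
matches_control_char = pattern.fullmatch("abc\x01") is not None

# The captured text is not matched again, as a real backreference would do.
matches_repeated_group = pattern.fullmatch("abcabc") is not None

print(matches_control_char, matches_repeated_group)
```

Here the pattern compiles, but the bracketed "\1" never refers back to the
captured group, which is consistent with the view that a backreference has no
sensible meaning inside a character class.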

Received on Thursday, 21 December 2006 23:36:00 UTC