[Bug 5348] [F&O] Back-references: "sufficiently many"

http://www.w3.org/Bugs/Public/show_bug.cgi?id=5348

------- Comment #1 from mike@saxonica.com  2008-03-11 15:44 -------
In action A-358-06 I was asked to review what Perl does about this.

There is of course no formal specification of Perl. The man page 

http://www.perl.com/doc/manual/html/pod/perlre.html

states: "Within the pattern, \10, \11, etc. refer back to substrings if there
have been at least that many left parentheses before the backreference."

This clearly doesn't make sense. If \10 occurs after the 10th left paren, but
before the right paren that matches the 10th left paren, then it cannot "refer
back" to that substring.

The Java 5 statement is informal but more defensible: "In this class, \1
through \9 are always interpreted as back references, and a larger number is
accepted as a back reference if at least that many subexpressions exist at that
point in the regular expression, otherwise the parser will drop digits until
the number is smaller or equal to the existing number of groups or it is one
digit." But it shares the problem in our text: a back-reference can appear in
the middle of subexpression 10, before that subexpression is complete, even
though 15 other subexpressions have already been completed.
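
As a rough illustration of the digit-dropping rule quoted above, the sketch
below models it as a standalone function. It is not JDK code: the class name
BackRefDigits and the method resolve are hypothetical, and it assumes that
"subexpressions exist at that point" means "left parens seen so far", which is
one possible reading of the Java text.

```java
// Hypothetical helper, not part of the JDK: a model of the quoted rule.
class BackRefDigits {
    /**
     * Returns the group number that a back-reference would resolve to.
     *
     * @param digits       the decimal digits after the backslash, e.g. "11"
     * @param groupsOpened left parens seen so far (an assumed reading of
     *                     "subexpressions exist at that point")
     */
    static int resolve(String digits, int groupsOpened) {
        // The first digit is always taken, so \1 through \9 are always
        // back-references.
        int ref = digits.charAt(0) - '0';
        for (int i = 1; i < digits.length(); i++) {
            int candidate = ref * 10 + (digits.charAt(i) - '0');
            if (candidate > groupsOpened) {
                break; // remaining digits would be treated as literals
            }
            ref = candidate;
        }
        return ref;
    }

    public static void main(String[] args) {
        System.out.println(resolve("11", 2));  // only 2 groups opened
        System.out.println(resolve("11", 12)); // 12 groups opened
    }
}
```

Extending the number one digit at a time gives the same answer as dropping
digits from the right, because each longer prefix is at least as large as the
shorter one.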

Pragmatically, with Java 5:

Pattern.matches("(X)(\\11)", "XX1")

    true - the backreference is to subexp 1, followed by a literal "1"

Pattern.matches("(X)(2)?(3)?(4)?(5)?(6)?(7)?(8)?(9)?(10)?(Y)(\\11)", "XYY")

    true - the backreference is to subexp 11

Pattern.matches("(X)(2)?(3)?(4)?(5)?(6)?(7)?(8)?(9)?(10)?((Y)(\\11))", "XYX1")

    false. Here the back-reference \11 appears within the 11th subexpression. I
can't find any string that matches this regex. It seems to be treating it as a
reference to subexpression 11, which can never be matched, rather than treating
it as a reference to subexpression 1.

Pattern.matches("(X)(2)?(3)?(4)?(5)?(6)?(7)?(8)?(9)?(10)?((Y)(\\12))", "XYY")

    true. The back-reference \12 is recognized as referring to subexp 12,
although it actually appears within subexp 11.

So the (provisional) conclusion for Java is that it does what Perl says: it
recognizes \11 as a back-reference once 11 left parens have been seen; if
subexpression 11 has not yet been closed at that point, the back-reference
will never match anything.
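
For anyone wanting to repeat these observations, here is a minimal sketch of a
test harness. The class name BackRefProbe and its helper methods are
illustrative only; the patterns and inputs are the four given above. The
results reported above are for Java 5, and I am assuming rather than claiming
that other releases behave the same way. The last probe, "(a\\1)", is an extra
case not tested above: a one-group pattern whose back-reference sits inside
the group it refers to.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative harness; the names here are not from any library.
class BackRefProbe {
    // Nine optional groups (numbers 2 to 10) used to pad the group count.
    static final String OPT = "(2)?(3)?(4)?(5)?(6)?(7)?(8)?(9)?(10)?";

    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Text captured by group n on a full match, or null if there is no match.
    static String group(String regex, String input, int n) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(n) : null;
    }

    public static void main(String[] args) {
        // \11 with only two groups opened.
        System.out.println(matches("(X)(\\11)", "XX1"));
        System.out.println(group("(X)(\\11)", "XX1", 2));

        // \11 after subexpression 11 has been closed.
        System.out.println(matches("(X)" + OPT + "(Y)(\\11)", "XYY"));

        // \11 inside subexpression 11.
        System.out.println(matches("(X)" + OPT + "((Y)(\\11))", "XYX1"));

        // \12 inside subexpression 11, after subexpression 12 has been closed.
        System.out.println(matches("(X)" + OPT + "((Y)(\\12))", "XYY"));

        // Extra probe: \1 inside subexpression 1.
        System.out.println(matches("(a\\1)", "aa"));
    }
}
```

Printing group 2 in the first case shows which digits were consumed as part of
the back-reference and which were left as literals.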

I don't immediately have the ability to test what Perl does.

Received on Tuesday, 11 March 2008 15:45:17 UTC