are regex quantifiers ambiguous? from C. M. Sperberg-McQueen on 2004-02-06 (www-xml-schema-comments@w3.org from January to March 2004)

From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
Date: 06 Feb 2004 08:29:33 -0700
To: W3C XML Schema Comments list <www-xml-schema-comments@w3.org>
Message-Id: <1076081373.7122.7.camel@localhost>

While thinking about our regular expression language yesterday and
this morning, I have run into something that puzzles me about
production [10] and the definitions of metacharacter and normal
character.

Consider the regular expression x{5}, which I believe should match a
sequence of five 'x' characters.  Somewhat to my surprise, my parser
tells me this regex is ambiguous.

Parse 1:
Start symbol:   <regex>
By [1]:         <branch>
By [2]:         <piece>
By [3]:         <atom> <quantifier>
By [9]:         <char> <quantifier>
By [10]:        x <quantifier>
By [4]:         x { <quantity> }
By [5]:         x { <quantExact> }
By [8]:         x { 5 }

Parse 2:
Start symbol:   <regex>
By [1]:         <branch>
By [2]:         <piece> <piece> <piece> <piece>
By [3]:         <atom> <atom> <atom> <atom> 
By [9]:         <char> <char> <char> <char> 
By [10]:        x { 5 } 

I appear to be missing something crucial here; I can't believe we have
had a fundamental ambiguity in our spec for so long without any
implementors noticing it.  (I believe the relevant parts of the
grammar are the same in 1.0 and in 2E.  At least, the 2E I just
checked at [1] indicates no changes from 1.0.)

[1]
http://www.w3.org/XML/Group/2003/09/xmlschema-2/datatypes-with-errata.html#regexs)

The problem seems to me to be that we define 'normal character' in
prose as:

  [Definition:] A normal character is any XML character that is not a
  metacharacter. In ·regular expression·s, a normal character is an
  atom that denotes the singleton set of strings containing only
  itself.

We define 'metacharacter' in turn as 

  [Definition:] A metacharacter is either ., \, ?, *, +, {, } (, ), [
  or ]. These characters have special meanings in regular expressions,
  but can be escaped to form atoms that denote the sets of strings
  containing only themselves, i.e., an escaped metacharacter behaves
  like a normal character.

But production [10] defines Char (the non-terminal we use to denote
normal characters) thus:

  [10] Char ::= [^.\?*+()|#x5B#x5D]

Both definitions are 'any character but ... (list) ...' but the 
lists are different.

Prose:    . \ ? * + { } ( )   [ ]
Grammar:  . \ ? * +     ( ) | [ ]

The grammar rule [10] seems to omit curly braces, and to include
vertical bar, and the prose vice versa.

I think the correct set of metacharacters is the union of the two
sets; can someone else look into this and confirm?

-C. M. Sperberg-McQueen

Received on Friday, 6 February 2004 10:30:52 UTC