final resolutions of 2e comments (was Re: are regex quantifiers ambiguous?)

Thanks for the comment.

The error you point out (yes, it's certainly an error)
is in the Rec Comments list as R-41; no erratum for it
has been drafted, and I regret to inform you that the
Working Group is unwilling to delay 2E to add one more
correction, particularly since implementers have not
reported trouble detecting the problem and doing the 
right thing. 

It *is* on the list and it *will* get fixed.  But there
are only so many things we can do at a time.  I hope
you can understand.

-C. M. Sperberg-McQueen

On Fri, 2004-02-06 at 08:29, C. M. Sperberg-McQueen wrote:
> While thinking about our regular expression language yesterday and
> this morning, I have run into something that puzzles me about
> production [10] and the definitions of metacharacter and normal
> character.
> 
> Consider the regular expression x{5}, which I believe should match a
> sequence of five 'x' characters.  Somewhat to my surprise, my parser
> tells me this regex is ambiguous.
> 
> Parse 1:
> Start symbol:   <regex>
> By [1]:         <branch>
> By [2]:         <piece>
> By [3]:         <atom> <quantifier>
> By [9]:         <char> <quantifier>
> By [10]:        x <quantifier>
> By [4]:         x { <quantity> }
> By [5]:         x { <quantExact> }
> By [8]:         x { 5 }
> 
> Parse 2:
> Start symbol:   <regex>
> By [1]:         <branch>
> By [2]:         <piece> <piece> <piece> <piece>
> By [3]:         <atom> <atom> <atom> <atom> 
> By [9]:         <char> <char> <char> <char> 
> By [10]:        x { 5 } 
> 
> I appear to be missing something crucial here; I can't believe we have
> had a fundamental ambiguity in our spec for so long without any
> implementors noticing it.  (I believe the relevant parts of the
> grammar are the same in 1.0 and in 2E.  At least, the 2E I just
> checked at [1] indicates no changes from 1.0.)
> 
> [1]
> http://www.w3.org/XML/Group/2003/09/xmlschema-2/datatypes-with-errata.html#regexs
> 
> The problem seems to me to be that we define 'normal character' in
> prose as:
> 
>   [Definition:] A normal character is any XML character that is not a
>   metacharacter. In ·regular expression·s, a normal character is an
>   atom that denotes the singleton set of strings containing only
>   itself.
> 
> We define 'metacharacter' in turn as 
> 
>   [Definition:] A metacharacter is either ., \, ?, *, +, {, } (, ), [
>   or ]. These characters have special meanings in regular expressions,
>   but can be escaped to form atoms that denote the sets of strings
>   containing only themselves, i.e., an escaped metacharacter behaves
>   like a normal character.
> 
> But production [10] defines Char (the non-terminal we use to denote
> normal characters) thus:
> 
>   [10] Char ::= [^.\?*+()|#x5B#x5D]
> 
> Both definitions are 'any character but ... (list) ...' but the 
> lists are different.
> 
> Prose:    . \ ? * + { } ( )   [ ]
> Grammar:  . \ ? * +     ( ) | [ ]
> 
> The grammar rule [10] seems to omit curly braces, and to include
> vertical bar, and the prose vice versa.
> 
> I think the correct set of metacharacters is the union of the two
> sets; can someone else look into this and confirm?
> 
> -C. M. Sperberg-McQueen
> 
> 
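
For readers who want to reproduce the observation in the quoted message,
here is a minimal sketch in Python. It is not taken from the spec or from
any implementation. The names PROSE_META, GRAMMAR_EXCLUDED, QUANTIFIER and
count_parses are hypothetical, and the parser covers only a simplified
subset of the grammar: a single branch whose atoms are all Char, with no
groups, character classes, escapes or alternation. Under those assumptions
it compares the two character lists and counts the parses of x{5}.

```python
import re

# The two lists as quoted in the message above.
PROSE_META = set(".\\?*+{}()[]")        # prose definition of 'metacharacter'
GRAMMAR_EXCLUDED = set(".\\?*+()|[]")   # characters excluded by production [10]

# Assumed shape of a quantifier: ?, *, +, or {n}, {n,}, {n,m}.
QUANTIFIER = re.compile(r"[?*+]|\{[0-9]+(,[0-9]*)?\}")


def count_parses(regex, excluded):
    """Count derivations of `regex` as piece*, where piece ::= Char quantifier?
    and Char is any character not in `excluded` (simplified grammar subset)."""
    n = len(regex)
    ways = [0] * (n + 1)   # ways[i] = number of parses of regex[i:]
    ways[n] = 1            # the empty remainder has exactly one parse
    for i in range(n - 1, -1, -1):
        if regex[i] in excluded:
            continue       # not a Char, so no piece can start here
        ways[i] = ways[i + 1]               # piece = Char with no quantifier
        m = QUANTIFIER.match(regex, i + 1)
        if m:
            ways[i] += ways[m.end()]        # piece = Char followed by quantifier
    return ways[0]


print(sorted(PROSE_META - GRAMMAR_EXCLUDED))    # ['{', '}']
print(sorted(GRAMMAR_EXCLUDED - PROSE_META))    # ['|']
print(count_parses("x{5}", GRAMMAR_EXCLUDED))   # 2
print(count_parses("x{5}", PROSE_META | GRAMMAR_EXCLUDED))  # 1
```

With Char as written in production [10], the sketch finds the same two
parses listed in the message; excluding the union of the two lists leaves
only the quantifier reading.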

Received on Tuesday, 13 July 2004 21:18:13 UTC