- From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
- Date: 13 Jul 2004 19:17:13 -0600
- To: W3C XML Schema Comments list <www-xml-schema-comments@w3.org>
- Cc: W3C XML Schema IG <w3c-xml-schema-ig@w3.org>
Thanks for the comment. The error you point out (yes, it's certainly
an error) is in the Rec Comments list as R-41; no erratum for it has
been drafted, and I regret to inform you that the Working Group is
unwilling to delay 2E to add one more correction, particularly since
implementers have not reported trouble detecting the problem and
doing the right thing.

It *is* on the list and it *will* get fixed. But there are only so
many things we can do at a time. I hope you can understand.

-C. M. Sperberg-McQueen

On Fri, 2004-02-06 at 08:29, C. M. Sperberg-McQueen wrote:
> While thinking about our regular expression language yesterday and
> this morning, I have run into something that puzzles me about
> production [10] and the definitions of metacharacter and normal
> character.
>
> Consider the regular expression x{5}, which I believe should match a
> sequence of five 'x' characters. Somewhat to my surprise, my parser
> tells me this regex is ambiguous.
>
> Parse 1:
>   Start symbol: <regex>
>   By [1]:  <branch>
>   By [2]:  <piece>
>   By [3]:  <atom> <quantifier>
>   By [9]:  <char> <quantifier>
>   By [10]: x <quantifier>
>   By [4]:  x { <quantity> }
>   By [5]:  x { <quantExact> }
>   By [8]:  x { 5 }
>
> Parse 2:
>   Start symbol: <regex>
>   By [1]:  <branch>
>   By [2]:  <piece> <piece> <piece> <piece>
>   By [3]:  <atom> <atom> <atom> <atom>
>   By [9]:  <char> <char> <char> <char>
>   By [10]: x { 5 }
>
> I appear to be missing something crucial here; I can't believe we have
> had a fundamental ambiguity in our spec for so long without any
> implementors noticing it. (I believe the relevant parts of the
> grammar are the same in 1.0 and in 2E. At least, the 2E I just
> checked at [1] indicates no changes from 1.0.)
>
> [1]
> http://www.w3.org/XML/Group/2003/09/xmlschema-2/datatypes-with-errata.html#regexs)
>
> The problem seems to me to be that we define 'normal character' in
> prose as:
>
>   [Definition:] A normal character is any XML character that is not a
>   metacharacter. In ·regular expression·s, a normal character is an
>   atom that denotes the singleton set of strings containing only
>   itself.
>
> We define 'metacharacter' in turn as
>
>   [Definition:] A metacharacter is either ., \, ?, *, +, {, } (, ), [
>   or ]. These characters have special meanings in regular expressions,
>   but can be escaped to form atoms that denote the sets of strings
>   containing only themselves, i.e., an escaped metacharacter behaves
>   like a normal character.
>
> But production [10] defines Char (the non-terminal we use to denote
> normal characters) thus:
>
>   [10] Char ::= [^.\?*+()|#x5B#x5D]
>
> Both definitions are 'any character but ... (list) ...' but the
> lists are different.
>
>   Prose:   . \ ? * + { } ( ) [ ]
>   Grammar: . \ ? * + ( ) | [ ]
>
> The grammar rule [10] seems to omit curly braces, and to include
> vertical bar, and the prose vice versa.
>
> I think the correct set of metacharacters is the union of the two
> sets; can someone else look into this and confirm?
>
> -C. M. Sperberg-McQueen
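
For illustration only, the comparison of the two character lists in the quoted message can be checked mechanically. The sketch below is not part of the spec or of any implementation; the names PROSE_META, GRAMMAR_EXCLUDED and is_char_per_production_10 are hypothetical, and the sets are transcribed by hand from the prose definition and from production [10] as quoted above.

```python
# Metacharacters named in the prose definition: . \ ? * + { } ( ) [ ]
PROSE_META = set(".\\?*+{}()[]")

# Characters excluded by production [10]: [^.\?*+()|#x5B#x5D]
# (#x5B is '[' and #x5D is ']')
GRAMMAR_EXCLUDED = set(".\\?*+()|[]")

# Characters on which the two definitions disagree.
only_in_prose = PROSE_META - GRAMMAR_EXCLUDED
only_in_grammar = GRAMMAR_EXCLUDED - PROSE_META

# The union suggested in the message as the likely intended set.
union = PROSE_META | GRAMMAR_EXCLUDED


def is_char_per_production_10(c: str) -> bool:
    """True if the single character c matches Char as written in [10]."""
    return len(c) == 1 and c not in GRAMMAR_EXCLUDED


if __name__ == "__main__":
    print("only in prose:  ", sorted(only_in_prose))
    print("only in grammar:", sorted(only_in_grammar))
    print("union:          ", "".join(sorted(union)))
    # Under [10] as written, '{' and '}' are accepted as normal
    # characters, which is what permits Parse 2 of x{5}.
    print("'{' is a Char under [10]:", is_char_per_production_10("{"))
```

Under these assumptions the curly braces appear only in the prose list and the vertical bar only in the grammar list, matching the observation in the quoted message.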
Received on Tuesday, 13 July 2004 21:18:13 UTC