[Bug 11125] Regex grammar for 1.1 renders some 1.0 regexes invalid

http://www.w3.org/Bugs/Public/show_bug.cgi?id=11125

--- Comment #5 from C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com> 2011-01-18 01:38:35 UTC ---
Some additional data comes from looking carefully at things (again) and
checking (again) some regexes using Xerophily (aka MSM's regex parser):

(1) In comment 4, DE summarizes some results for 1.0 and 1.1 rules for regexes,
but in a couple of cases the results given don't agree with what Xerophily
says.  These are correct:

regex              1.0  1.1
[-+]               ok   ok
[+-]               ok   x
[a-z+-]            ok   x
[a-z-+]            x    ok

But these two are not correct:

[--z]              ok   x
[a--k--z]          ok   x        

Neither of these is accepted by 1.0, because in 1.0 an unescaped hyphen is not
allowed as the end-point of a range, and may itself be a (single-character)
range only at the beginning or end of a positive character group.
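
For anyone who wants to check the 1.0 column by machine, the rule just stated
can be sketched in a few lines of Python.  This is a minimal sketch under
stated assumptions, not Xerophily and not the full 1.0 grammar: the names
atoms and ok_in_1_0 are hypothetical, the input is the body of a positive
character group without its brackets, every backslash escape is treated as a
single character, and nothing else (range ordering, the caret, multi-character
escapes as endpoints) is checked.

```python
def atoms(body):
    """Split a character-group body into escapes (two chars) and plain chars."""
    out, i = [], 0
    while i < len(body):
        step = 2 if body[i] == "\\" else 1
        out.append(body[i:i + step])
        i += step
    return out


def ok_in_1_0(body):
    """Apply only the unescaped-hyphen rule described for 1.0 (assumption)."""
    a = atoms(body)
    i, last = 0, len(a) - 1
    while i <= last:
        # A range is x-y where neither end-point is an unescaped hyphen.
        is_range = (i + 2 <= last and a[i + 1] == "-"
                    and a[i] != "-" and a[i + 2] != "-")
        if is_range:
            i += 3
        elif a[i] == "-" and i not in (0, last):
            # An unescaped hyphen outside a range must be first or last.
            return False
        else:
            i += 1
    return True


for body in ["-+", "+-", "a-z+-", "a-z-+", "--z", "a--k--z"]:
    print(f"[{body}]", "ok" if ok_in_1_0(body) else "x")
```

Under these assumptions the sketch accepts the bodies of [-+], [+-] and
[a-z+-] and rejects those of [a-z-+], [--z] and [a--k--z].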

(2) The grammar in 1.1 has an ambiguity we had not detected before, which may
affect the rule after production 81.  A single-character escape (e.g. \n)
satisfies both the non-terminal singleChar and the non-terminal charClassEsc,
each of which appears on the right-hand side of the rule for charGroupPart, so
there are two different ways in which a single-character escape can be a
charGroupPart.  In the case of \n and others of the class, the difference is
semantically unimportant: in both cases, the enclosing character group includes
the character indicated.  (As a result, Xerophily does not register this
ambiguity: both parses produce the same abstract syntax tree.)  

But in the case of \- the ambiguity may have consequences.  The prose following
production 81 imposes certain constraints on charGroupPart strings that begin
with a singleChar followed by a hyphen.  But \- can be either a singleChar or
not a singleChar; the rule says nothing about a charGroupPart which begins with
a charClassEsc which happens to be a singleCharEsc, and the rule may be thought
not to apply to that parse.  For this reason, Xerophily currently produces two
parses for [\--z]:  one for the range from hyphen to z, and one for the
character class containing hyphen (escaped), hyphen (unescaped), and z.

We either need to remove the ambiguity, or we need to recast the wording of the
prose rule to make it cover the case.
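
To make the two readings concrete, here is a minimal sketch in Python.  It is
a toy model, not Xerophily and not the 1.1 productions verbatim: the names
tokenize, parses and allowed are hypothetical, every backslash escape is
assumed to be a single-character escape, and subtraction and the restrictions
on range end-points are not modeled.  It enumerates the grammar-level parses
of a character-group body, then filters them with the prose rule read narrowly
(singleChar only) or broadly (also a charClassEsc that is a single-character
escape).

```python
def tokenize(body):
    """Split a character-group body into escapes (two chars) and plain chars."""
    out, i = [], 0
    while i < len(body):
        step = 2 if body[i] == "\\" else 1
        out.append(body[i:i + step])
        i += step
    return out


def parses(atoms):
    """Yield every reading of the atoms as a list of charGroupParts."""
    if not atoms:
        yield []
        return
    head, rest = atoms[0], atoms[1:]
    # charGroupPart as a range: atom '-' atom.
    if len(atoms) >= 3 and atoms[1] == "-":
        for tail in parses(atoms[3:]):
            yield [("range", head, atoms[2])] + tail
    # charGroupPart as a lone singleChar.
    for tail in parses(rest):
        yield [("singleChar", head)] + tail
    # An escape can also be read as a charClassEsc (the ambiguity).
    if head.startswith("\\"):
        for tail in parses(rest):
            yield [("charClassEsc", head)] + tail


def allowed(parse, broad=False):
    """Reject a part followed by a hyphen that is neither a range nor last."""
    kinds = ("singleChar", "charClassEsc") if broad else ("singleChar",)
    for i in range(len(parse) - 1):
        part, nxt = parse[i], parse[i + 1]
        nxt_is_final_hyphen = nxt[0] == "singleChar" and i + 1 == len(parse) - 1
        if part[0] in kinds and nxt[1] == "-" and not nxt_is_final_hyphen:
            return False
    return True


candidates = list(parses(tokenize(r"\--z")))
print([p for p in candidates if allowed(p)])              # narrow reading
print([p for p in candidates if allowed(p, broad=True)])  # broad reading
```

For the body \--z this sketch finds three grammar-level parses.  The narrow
reading keeps two of them (the range, and the class of escaped hyphen,
unescaped hyphen, and z); the broad reading keeps only the range.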

Having thought about this a bit, I think I favor changing the prose after
production 81 along the lines suggested, to specify that if a charGroupPart
begins with a singleChar (or a charClassEsc which is a singleCharEsc) followed
by a hyphen, then one of the following must be true:

  (1) The hyphen is followed by [ and the hyphen indicates character-class
subtraction.

  (2) The hyphen is followed by ] and it is treated as a singleChar, the last
charGroupPart of the character group.

  (3) The hyphen is followed by -[ and it is treated as a singleChar, the last
charGroupPart of the character group.

  (4) The hyphen is followed by a singleChar and indicates a range.
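
The four cases can be sketched as a simple lookahead on what follows the
hyphen.  This is only an illustration under assumptions: the function name
hyphen_role and its return labels are hypothetical, the caller is assumed to
have already established that the hyphen follows a singleChar, and the
separate constraint on unescaped hyphens as range end-points is not modeled.

```python
def hyphen_role(text, i):
    """Classify the hyphen at text[i], assumed to follow a singleChar."""
    if text[i] != "-":
        raise ValueError("position does not hold a hyphen")
    rest = text[i + 1:]
    if rest.startswith("["):
        return "subtraction"                 # case (1)
    if rest.startswith("]"):
        return "literal-last"                # case (2)
    if rest.startswith("-["):
        return "literal-before-subtraction"  # case (3)
    if rest:
        return "range"                       # case (4)
    raise ValueError("unterminated character group")


for regex in ["[a-[b]]", "[a-]", "[a--[b]]", "[a-z]"]:
    print(regex, hyphen_role(regex, 2))
```

Checking "-[" before falling through to the range case matters: otherwise the
hyphen in case (3) would be taken as the start of a range whose end-point is
the second hyphen.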

Personally, I'd like to get rid of the constraint forbidding unescaped hyphens
as character-range endpoints, but I'm not sure we can do so without adding
ambiguity.

So in addition to the change just outlined, I think I favor asking each member
of the WG to contribute two or more regular expressions involving character
classes, with a strong preference for the twisted, the devious, and the
deceptively simple-looking, and then testing the 1.0 and current 1.1 grammars
and the proposed change(s) on all the samples provided as well as on a few
thousand randomly generated test strings.
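
The random half of that test could look something like the following sketch.
Everything here is an assumption made for illustration: ALPHABET is an
arbitrary choice weighted toward hyphens, and accept_a and accept_b stand for
whichever two recognizers are being compared (1.0 versus 1.1, or 1.1 versus a
proposed change); they are hypothetical callables, not an existing API.

```python
import random

# Arbitrary building blocks, weighted toward hyphens (assumption).
ALPHABET = ["a", "k", "z", "+", "-", "-", "\\-", "\\n", "^"]


def random_group(rng, max_parts=6):
    """Build one bracketed character group from random building blocks."""
    n = rng.randint(1, max_parts)
    return "[" + "".join(rng.choice(ALPHABET) for _ in range(n)) + "]"


def disagreements(accept_a, accept_b, count=1000, seed=0):
    """Return the generated regexes on which the two recognizers differ."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    samples = {random_group(rng) for _ in range(count)}
    return sorted(s for s in samples if accept_a(s) != accept_b(s))


# Stand-in recognizers, only to show the calling convention.
for regex in disagreements(lambda s: True, lambda s: "--" not in s)[:5]:
    print(regex)
```

Seeding the generator means any disagreement that turns up can be reproduced
and added to the hand-written samples.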

We should also decide whether we want to eliminate the ambiguity identified
above or not.

Received on Tuesday, 18 January 2011 01:38:37 UTC