- From: <bugzilla@jessica.w3.org>
- Date: Tue, 18 Jan 2011 01:38:35 +0000
- To: www-xml-schema-comments@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11125

--- Comment #5 from C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com> 2011-01-18 01:38:35 UTC ---

Some additional data comes from looking carefully at things (again) and checking (again) some regexes using Xerophily (aka MSM's regex parser):

(1) In comment 4, DE summarizes some results for the 1.0 and 1.1 rules for regexes, but in a couple of cases the results given don't agree with what Xerophily says.

These are correct:

    regex        1.0   1.1
    [-+]         ok    ok
    [+-]         ok    x
    [a-z+-]      ok    x
    [a-z-+]      x     ok

But these two are not correct:

    regex        1.0   1.1
    [--z]        ok    x
    [a--k--z]    ok    x

Neither of these is accepted by 1.0, because in 1.0 an unescaped hyphen is not allowed as the end-point of a range, and may itself be a (single-character) range only at the beginning or end of a positive character group.

(2) The grammar in 1.1 has an ambiguity we had not detected before, which may affect the rule after production 81.

A single-character escape (e.g. \n) satisfies both the non-terminal singleChar and the non-terminal charClassEsc, each of which appears on the right-hand side of the rule for charGroupPart, so there are two different ways in which a single-character escape can be a charGroupPart.

In the case of \n and others of the class, the difference is semantically unimportant: in both cases, the enclosing character group includes the character indicated. (As a result, Xerophily does not register this ambiguity: both parses produce the same abstract syntax tree.)

But in the case of \- the ambiguity may have consequences. The prose following production 81 imposes certain constraints on charGroupPart strings that begin with a singleChar followed by a hyphen. But \- can be parsed either as a singleChar or as a charClassEsc that is not a singleChar; the rule says nothing about a charGroupPart which begins with a charClassEsc which happens to be a singleCharEsc, and the rule may be thought not to apply to that parse.
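To make the ambiguity in (2) concrete, here is a minimal Python sketch of a toy parser for the body of a positive character group. It is not Xerophily and not the 1.1 grammar: the names `tokenize`, `parses` and `rule_covers_escapes` are hypothetical, character-class subtraction and multi-character escapes are ignored, and the prose rule after production 81 is reduced to the assumption "an item followed by an unescaped hyphen must open a range". Unescaped hyphens are also assumed to be forbidden as range end-points.

```python
from typing import List, Tuple

Part = Tuple[str, ...]  # ("char", c) or ("range", start, end)


def tokenize(body: str) -> List[str]:
    """Split a character-group body into escapes (backslash + char) and plain chars."""
    tokens, i = [], 0
    while i < len(body):
        step = 2 if body[i] == "\\" else 1
        tokens.append(body[i:i + step])
        i += step
    return tokens


def parses(body: str, rule_covers_escapes: bool = False) -> List[List[Part]]:
    """Enumerate parses of `body` under the toy model described above.

    rule_covers_escapes=False: the hyphen rule binds only the singleChar reading,
    so an escape read as a charClassEsc slips past it (the ambiguity).
    rule_covers_escapes=True: the rule also binds an escape read as a charClassEsc.
    """
    toks = tokenize(body)
    found: List[List[Part]] = []

    def hyphen_follows(i: int) -> bool:
        return i + 1 < len(toks) and toks[i + 1] == "-"

    def walk(i: int, acc: List[Part]) -> None:
        if i == len(toks):
            if acc:  # a positive character group needs at least one part
                found.append(acc)
            return
        tok = toks[i]
        is_esc = tok.startswith("\\")
        # Stand-alone reading. Both readings of an escape give the same
        # ("char", tok) node, mirroring "same abstract syntax tree".
        as_single = not hyphen_follows(i)
        as_class_esc = is_esc and not (rule_covers_escapes and hyphen_follows(i))
        if as_single or as_class_esc:
            walk(i + 1, acc + [("char", tok)])
        # Range reading: tok '-' end, with no unescaped hyphen at either end (assumption).
        if hyphen_follows(i) and i + 2 < len(toks) and tok != "-" and toks[i + 2] != "-":
            walk(i + 3, acc + [("range", tok, toks[i + 2])])

    walk(0, [])
    return found


if __name__ == "__main__":
    for flag in (False, True):
        print(flag, parses(r"\--z", rule_covers_escapes=flag))
```

Under these assumptions, `parses(r"\--z")` yields two results: the three separate characters, and the range from hyphen to z. With `rule_covers_escapes=True` only the range survives. The toy model also happens to agree with the 1.1 column for the six regexes quoted in (1), but that is no evidence that it matches the 1.1 grammar in general.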
For this reason, Xerophily currently produces two parses for [\--z]: one for the range from hyphen to z, and one for the character class containing hyphen (escaped), hyphen (unescaped), and z.

We either need to remove the ambiguity, or we need to recast the wording of the prose rule to make it cover the case.

Having thought about this a bit, I think I favor changing the prose after production 81 along the lines suggested, to specify that if a charGroupPart begins with a singleChar (or a charClassEsc which is a singleCharEsc) followed by a hyphen, then one of the following must be true:

    (1) The hyphen is followed by [ and the hyphen indicates character-class subtraction.

    (2) The hyphen is followed by ] and it is treated as a singleChar, the last charGroupPart of the character group.

    (3) The hyphen is followed by -[ and it is treated as a singleChar, the last charGroupPart of the character group.

    (4) The hyphen is followed by a singleChar and indicates a range.

Personally, I'd like to get rid of the constraint forbidding unescaped hyphens as character-range endpoints, but I'm not sure we can do so without adding ambiguity.

So in addition to the change just outlined, I think I favor asking each member of the WG to contribute two or more regular expressions involving character classes, with a strong preference for the twisted, the devious, and the deceptively simple-looking, and that we test the 1.0 and current 1.1 grammars, and the proposed change(s), on all the samples provided as well as on a few thousand randomly generated test strings.

We should also decide whether we want to eliminate the ambiguity identified above or not.

--
Configure bugmail: http://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.
Received on Tuesday, 18 January 2011 01:38:37 UTC