- From: <bugzilla@jessica.w3.org>
- Date: Fri, 24 May 2013 05:32:59 +0000
- To: public-script-coord@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=19778 Nils Barth <nbarth+w3bugzilla@google.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |nbarth+w3bugzilla@google.co | |m --- Comment #2 from Nils Barth <nbarth+w3bugzilla@google.com> --- Thanks for fixing this! Stylistically, I think this would be even clearer (and slightly shorter): -?([1-9][0-9]*|0[Xx][0-9A-Fa-f]+|0[0-7]*) ...namely having 3 alternatives – decimal, hex, octal – rather than nesting ((hex + octal) + decimal). To elaborate: The spec was formally correct, since it requires greedy matching (longest possible match), but in practice the previous regex doesn't work. Using regexes (and production rules) in the spec that can be directly used in real-world engines helps avoid errors and make validation simpler: it's much easier to check that two regexes are identical than equivalent. This caused the following bug in Chromium, now fixed (using revised regex): Chromium 243263: IDL lexer mistokenizes hexadecimals https://code.google.com/p/chromium/issues/detail?id=243263 The problem is that most regex engines (including Perl and Python) have eager matching on alternation, hence matching isn't completely greedy (as required by the spec). In this case the octal pattern always matches before the hex, and thus hex numbers are tokenized as 0 + identifier, e.g., '0x123' becomes integer '0' + identifier 'x123'. As Robin suggested, this problem is avoided by swapping the patterns, putting the longer hex pattern before the octal pattern, so the eager matching ends up being greedy. We can make it slightly clearer by splitting by base (instead of combining the leading 0 in hex and octal), which makes it a bit more legible to my eye and slightly shorter (b/c of no nesting). I don't know if there's performance impact either way, and it's not necessary for correctness. For reference, the Regular Expressions Cookbook has a recipe in 7.3 Numeric Constants (p. 413), where they split by base (in their re order doesn't matter because they require word boundaries, but for tokenizing we don't have this). http://books.google.com/books?id=6k7IfACN_P8C&pg=PA413 Completely unambiguous would be (decimal|hex|octal|0), but combining octal with zero is common and fine. So the revised regex in fine and works correctly in real-world engines, though I'd suggest a stylistic revision to: -?([1-9][0-9]*|0[Xx][0-9A-Fa-f]+|0[0-7]*) -- You are receiving this mail because: You are on the CC list for the bug.
Received on Friday, 24 May 2013 05:33:01 UTC