[Bug 19778] Wrong regex for integer from bugzilla@jessica.w3.org on 2013-05-24 (public-script-coord@w3.org from April to June 2013)

From: <bugzilla@jessica.w3.org>
Date: Fri, 24 May 2013 05:32:59 +0000
To: public-script-coord@w3.org
Message-ID: <bug-19778-3890-HpKMthSX1x@http.www.w3.org/Bugs/Public/>

https://www.w3.org/Bugs/Public/show_bug.cgi?id=19778

Nils Barth <nbarth+w3bugzilla@google.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nbarth+w3bugzilla@google.co
                   |                            |m

--- Comment #2 from Nils Barth <nbarth+w3bugzilla@google.com> ---
Thanks for fixing this!

Stylistically, I think this would be even clearer (and slightly shorter):
    -?([1-9][0-9]*|0[Xx][0-9A-Fa-f]+|0[0-7]*)
...namely having 3 alternatives – decimal, hex, octal – rather than nesting
((hex + octal) + decimal).

To elaborate:
The spec was formally correct, since it requires greedy matching (longest
possible match), but in practice the previous regex doesn't work.

Using regexes (and production rules) in the spec that can be directly used in
real-world engines helps avoid errors and make validation simpler: it's much
easier to check that two regexes are identical than equivalent.

This caused the following bug in Chromium, now fixed (using revised regex):
Chromium 243263: IDL lexer mistokenizes hexadecimals
https://code.google.com/p/chromium/issues/detail?id=243263

The problem is that most regex engines (including Perl and Python) have eager
matching on alternation, hence matching isn't completely greedy (as required by
the spec).
In this case the octal pattern always matches before the hex, and thus hex
numbers are tokenized as 0 + identifier, e.g., '0x123' becomes integer '0' +
identifier 'x123'.

As Robin suggested, this problem is avoided by swapping the patterns, putting
the longer hex pattern before the octal pattern, so the eager matching ends up
being greedy.


We can make it slightly clearer by splitting by base (instead of combining the
leading 0 in hex and octal), which makes it a bit more legible to my eye and
slightly shorter (b/c of no nesting).
I don't know if there's performance impact either way, and it's not necessary
for correctness.


For reference, the Regular Expressions Cookbook has a recipe in 7.3 Numeric
Constants (p. 413), where they split by base (in their re order doesn't matter
because they require word boundaries, but for tokenizing we don't have this).
http://books.google.com/books?id=6k7IfACN_P8C&pg=PA413

Completely unambiguous would be (decimal|hex|octal|0), but combining octal with
zero is common and fine.

So the revised regex in fine and works correctly in real-world engines, though
I'd suggest a stylistic revision to:
    -?([1-9][0-9]*|0[Xx][0-9A-Fa-f]+|0[0-7]*)

-- 
You are receiving this mail because:
You are on the CC list for the bug.

Received on Friday, 24 May 2013 05:33:01 UTC