An easy-to-miss error from Norm Tovey-Walsh on 2023-11-04 (public-ixml@w3.org from November 2023)

From: Norm Tovey-Walsh <norm@saxonica.com>
Date: Sat, 04 Nov 2023 09:46:13 +0000
To: ixml <public-ixml@w3.org>
Message-ID: <m2cywp3k6i.fsf@saxonica.com>

Good morning.

Consider the following grammar for an odd little iXML-like language:

      S = s, alts.
   alts = alt++(';', s).
    alt = term**(',', s).
   term = ('a' ; 'b'), s.
     -s = (-' '|comment)*.
comment = -'{', ~[{}]*, -'}'.

This comes from a unit test. (I can’t actually recall what I was
testing, but this was some minimal subset of iXML that demonstrated a
bug.)

That’s not the interesting part. The interesting part is the
easy-to-overlook error. I’ll give you a moment to see if you can spot
it.















                  S P O I L E R  B R E A K















Here’s a hint. It’s in this rule: comment = -'{', ~[{}]*, -'}'.






That middle bit is an exclusion that contains an empty comment, not an
exclusion that excludes ‘{’ and ‘}’. The input for this unit test is
“a;{comment} b” so the difference between “any character at all” and
“any character except ‘{’ or ‘}’” never comes into play.
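
One way to write what was presumably intended is to quote the braces,
so that they are string members of the set rather than comment
delimiters. This is a sketch based on a reading of the iXML set syntax,
not something taken from the unit test itself:

comment = -'{', ~["{}"]*, -'}'.

Writing the members separately, as ~['{'; '}'], should be equivalent.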

Because we’re all used to reading regular expressions, it’s really
*really* easy to misread character sets. “[{}]” just isn’t what you
think it is. Neither is “["cat"|"dog"]”. I’m not sure we can do anything
about it at this point though. We could exclude comments from character
sets, but I don’t think that would be a very good trade-off, really.
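
To make the second example concrete (again a sketch, with hypothetical
rule names): a string inside a set is a list of alternative characters,
so the first two rules below should accept exactly the same inputs, a
single one of the characters a, c, d, g, o or t, and never the words
“cat” or “dog”. Matching the words themselves needs alternation outside
of a set, as in the third rule.

   pet1 = ["cat"|"dog"].
   pet2 = ["acdgot"].
   pet3 = "cat" | "dog".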

                                        Be seeing you,
                                          norm

--
Norm Tovey-Walsh
Saxonica

Received on Saturday, 4 November 2023 10:08:31 UTC