- From: Norm Tovey-Walsh <norm@saxonica.com>
- Date: Sat, 04 Nov 2023 09:46:13 +0000
- To: ixml <public-ixml@w3.org>
- Message-ID: <m2cywp3k6i.fsf@saxonica.com>
Good morning.
Consider the following grammar for an odd little iXML-like language:
S = s, alts.
alts = alt++(';', s).
alt = term**(',', s).
term = ('a' ; 'b'), s.
-s = (-' '|comment)*.
comment = -'{', ~[{}]*, -'}'.
This comes from a unit test. (I can’t actually recall what I was
testing, but this was some minimal subset of iXML that demonstrated a
bug.)
That’s not the interesting part. The interesting part is the
easy-to-overlook error. I’ll give you a moment to see if you can spot
it.
S P O I L E R B R E A K
Here’s a hint. It’s in this rule: comment = -'{', ~[{}]*, -'}'.
That middle bit is an exclusion that contains an empty comment, not an
exclusion that excludes ‘{’ and ‘}’. The set itself is empty, so the
exclusion matches any character at all. The input for this unit test is
“a;{comment} b”, so the difference between “any character at all” and
“any character except ‘{’ or ‘}’” never comes into play.
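
For comparison, here is a sketch of the rule as it reads to an iXML
processor next to what was presumably intended. These are two
alternative versions of the same rule, not one grammar, and the second
one is an assumption about the intent: it quotes the braces so that
they become a string member of the set.

```ixml
{ as written: the set holds only a comment, so this excludes nothing }
comment = -'{', ~[]*, -'}'.

{ presumably intended: both brace characters are members of the set }
comment = -'{', ~['{}']*, -'}'.
```

With the second form, the body of a comment stops at the first brace,
so an input with a stray brace inside a comment would no longer parse.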
Because we’re all used to reading regular expressions, it’s really
*really* easy to misread character sets. “[{}]” just isn’t what you
think it is. Neither is “["cat"|"dog"]”. I’m not sure we can do anything
about it at this point though. We could exclude comments from character
sets, but I don’t think that would be a very good trade-off, really.
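
To make the “cat”/“dog” case concrete, here is a small sketch. The rule
name “animal” is hypothetical and used only for illustration, and the
three lines are alternative versions, not one grammar. A string inside
a character set contributes each of its characters as a member, so the
first two forms describe the same set; the third is what you would
write if you meant a choice between the two words.

```ixml
{ a set of six characters: c, a, t, d, o, g }
animal = ["cat"|"dog"].

{ the same set, written as a single string member }
animal = ["acdgot"].

{ a choice between two literal strings }
animal = "cat" | "dog".
```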
Be seeing you,
norm
--
Norm Tovey-Walsh
Saxonica
Received on Saturday, 4 November 2023 10:08:31 UTC