- From: Norm Tovey-Walsh <norm@saxonica.com>
- Date: Sat, 04 Nov 2023 09:46:13 +0000
- To: ixml <public-ixml@w3.org>
- Message-ID: <m2cywp3k6i.fsf@saxonica.com>
Good morning.

Consider the following grammar for an odd little iXML-like language:

    S = s, alts.
    alts = alt++(';', s).
    alt = term**(',', s).
    term = ('a' ; 'b'), s.
    -s = (-' '|comment)*.
    comment = -'{', ~[{}]*, -'}'.

This comes from a unit test. (I can’t actually recall what I was testing, but this was some minimal subset of iXML that demonstrated a bug.)

That’s not the interesting part. The interesting part is the easy-to-overlook error. I’ll give you a moment to see if you can spot it.

S P O I L E R   B R E A K

Here’s a hint. It’s in this rule:

    comment = -'{', ~[{}]*, -'}'.

That middle bit is an exclusion that contains an empty comment, not an exclusion that excludes ‘{’ and ‘}’. The input for this unit test is “a;{comment} b”, so the difference between “any character at all” and “any character except ‘{’ or ‘}’” never comes into play.

Because we’re all used to reading regular expressions, it’s really *really* easy to misread character sets. “[{}]” just isn’t what you think it is. Neither is “["cat"|"dog"]”.

I’m not sure we can do anything about it at this point, though. We could exclude comments from character sets, but I don’t think that would be a very good trade-off, really.

Be seeing you,
  norm

--
Norm Tovey-Walsh
Saxonica
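
For reference, a sketch of the rule as it was presumably intended. The corrected form is not given in the message; it is an assumption based on the description above. Quoting the braces turns them into a string member of the set rather than a comment inside it:

```ixml
{ As written: the set holds only a comment, so it is empty,
  and the exclusion matches any character at all. }
comment = -'{', ~[{}]*, -'}'.

{ Presumably intended: the string '{}' is a set member,
  so the exclusion matches any character except the two braces. }
comment = -'{', ~['{}']*, -'}'.
```

The second example in the message is misread in the same way: a string inside a character set contributes its individual characters, so `["cat"|"dog"]` matches one character out of c, a, t, d, o, g rather than either word.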
Received on Saturday, 4 November 2023 10:08:31 UTC