[heycam/webidl] Ellipsis token quirkiness? (#812)

The ellipsis token has a unique property: it’s the only terminal whose source text isn’t matched by any of the seven regular expressions given at the start of the grammar section.

This isn’t, to my knowledge, an error: the regular expressions only describe the ‘named terminals,’ which are considered distinct from the unnamed terminals that are given in literal teletype throughout the grammar and which take precedence:

> If the longest possible match could match one of the above named terminal symbols or one of the other terminal symbols from the grammar, it must be tokenized as the latter.

That said, this has big footgun energy imo. It’s tempting to lex using only the regular expressions and then refine the result by changing the token type to 'unnamed' if the matched value is a member of that set. This initially appears to be possible because every unnamed terminal can be matched as one of the named terminals first ... except the ellipsis. I think it’s pretty easy to miss that when there’s one exception out of 83.
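To make the pitfall concrete, here is a minimal Python sketch of that "match first, relabel afterwards" approach. It only covers the `other` pattern as quoted in this issue; the function name `naive_lex`, the `UNNAMED` set (a small illustrative subset of the unnamed terminals) and the token labels are made up for the example and are not taken from the spec.

```python
import re

# The `other` pattern as quoted in this issue; it matches one character at a time.
OTHER = re.compile(r"[^\t\n\r 0-9A-Za-z]")

# Hypothetical, illustrative subset of the unnamed terminals.
UNNAMED = {"...", ".", "(", ")", ",", ";"}

def naive_lex(source):
    """Match with the named pattern first, then relabel known literals."""
    tokens = []
    for match in OTHER.finditer(source):
        text = match.group()
        kind = "unnamed" if text in UNNAMED else "other"
        tokens.append((kind, text))
    return tokens

# The ellipsis never reaches the relabelling step as a single value.
print(naive_lex("..."))
# [('unnamed', '.'), ('unnamed', '.'), ('unnamed', '.')]
```

Because the pattern can only ever produce a single character, the lookup against the unnamed set never sees `...` as one value, so the ellipsis comes out as three separate `.` tokens.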

My suggestion would be to change the definition of the `other` named terminal in order to restore the property that every terminal can be matched with these patterns:

`/[^\t\n\r 0-9A-Za-z]/` -> `/\.{3}|[^\t\n\r 0-9A-Za-z]/`
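As a rough check of the effect, the two patterns can be compared directly in Python. This is only a sketch: the helper `split_with` and the constant names are hypothetical, and it exercises the `other` pattern in isolation rather than a full Web IDL tokenizer.

```python
import re

# Current and proposed `other` patterns, copied from the line above.
CURRENT_OTHER = re.compile(r"[^\t\n\r 0-9A-Za-z]")
PROPOSED_OTHER = re.compile(r"\.{3}|[^\t\n\r 0-9A-Za-z]")

def split_with(pattern, source):
    """Return every non-overlapping match of `pattern` in `source`."""
    return [m.group() for m in pattern.finditer(source)]

print(split_with(CURRENT_OTHER, "long... args"))   # ['.', '.', '.']
print(split_with(PROPOSED_OTHER, "long... args"))  # ['...']
print(split_with(PROPOSED_OTHER, "(a, b)"))        # ['(', ',', ')']
```

Putting `\.{3}` first in the alternation matters here: Python's `re` tries alternatives left to right, so three consecutive dots are consumed as one match before the single-character class gets a chance. Input without an ellipsis is split the same way by both patterns.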

It’s possible this doesn’t matter to other folks — the spec isn’t actually ambiguous here or anything — in which case feel free to close this, but it also seems possible this terminal being unique in this regard was unintentional to begin with.

https://github.com/heycam/webidl/issues/812

Received on Saturday, 5 October 2019 06:04:19 UTC