The longest token rule

We have for many years had this rule in A.2:

"When tokenizing, the longest possible match that is consistent with the EBNF is used."

and I have often wondered if there were cases where the phrase "that is consistent with the EBNF" actually affected the outcome. It suggests that the tokenization is sensitive to the grammatical context, which is a considerable complication.

I have submitted a test case MapConstructor-025 which does this:

let $m := map{'a':1} return map:size(map{$m?a:true()})

Although Saxon can't handle this, I believe it is permitted according to this rule. After the "?", an NCName is consistent with the EBNF but a QName containing a colon is not, so the longest token "consistent with the EBNF" is "a" rather than "a:true".
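
To make the two readings concrete, here is a rough Python sketch of a context-sensitive longest match. It is only an illustration: the function name longest_name_token is made up, the name patterns are ASCII-only simplifications of NCName and QName, and the qname_allowed flag is a stand-in for whatever the parser knows about the grammatical context. It is not how Saxon or any other processor actually tokenizes.

```python
import re

# Simplified, ASCII-only approximations of the NCName and QName productions
# (an assumption for illustration; the real productions allow many more characters).
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"
QNAME = NCNAME + r"(?::" + NCNAME + r")?"


def longest_name_token(text, pos, qname_allowed):
    """Return the longest name token starting at pos that the context permits.

    qname_allowed is a hypothetical flag standing in for "a QName is
    consistent with the EBNF at this point".
    """
    pattern = QNAME if qname_allowed else NCNAME
    match = re.compile(pattern).match(text, pos)
    return match.group(0) if match else None


expr = "$m?a:true()"
start = expr.index("?") + 1  # position just after the "?"

# Context-free longest match: the colon is swallowed into a QName.
print(longest_name_token(expr, start, qname_allowed=True))   # a:true

# Longest match "consistent with the EBNF" after "?": only an NCName fits.
print(longest_name_token(expr, start, qname_allowed=False))  # a
```

The point of the sketch is that the tokenizer cannot pick the second result without being told by the parser that it is sitting just after a "?", which is exactly the sensitivity to grammatical context mentioned above.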

Any views on whether this is a correct interpretation of the rules?

Michael Kay
Saxonica

Received on Tuesday, 23 February 2016 09:56:23 UTC