- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Thu, 6 Jan 2022 11:43:07 -0700
- To: Steven Pemberton <steven.pemberton@cwi.nl>
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, Tomos Hillman <yamahito@gmail.com>, Norm Tovey-Walsh <norm@saxonica.com>, ixml <public-ixml@w3.org>
> On 6,Jan2022, at 9:13 AM, Steven Pemberton <steven.pemberton@cwi.nl> wrote: > > ... > > Since we don't normatively demand any particular parsing algorithm, the only restrictions on the algorithm are therefore: > * It must find at least one parse of any input that matches the grammar > * if it finds more than one parse, it must report that fact. > > So I think the fact that Michael and I report differently about > > s: a*; b*. > > is covered, since we are using different algorithms. I think I agree with the spirit of what Steven says here, but the specifics of the wording worry me. In the context of BNF, ambiguity is a property of a given sentence with respect to a given grammar and does not in any way depend on the parsing algorithm used. So I’m nervous about attributing the differences in our reports to differences of parsing algorithms. I think there are two sources of different behavior (at least — I have no proof that this list is exhaustive) with respect to the reporting of ambiguity: one is differences in interpreting the meaning of EBNF rules and one is differences in the rewriting of EBNF to BNF. Grune and Jacobs seem to me to provide the best handle on the latter issue. I won’t repeat the long quotation I included in my mail of the other day, but they distinguish an 'iterative' and a 'recursive' approach. Under the iterative approach, a regular right part R is taken as denoting a (possibly infinite) set of sequences of terminal and non-terminal symbols -- namely, the set of sequences in L(R), which is a subset of V*, where V is the union of the nonterminal and terminal symbols of the grammar. Any one of these right-hand sides can be used in derivation, and derivation has its usual form and definition. Consider the grammar G: S= 'a'*; 'b'*. The iterative interpretation understands G's rule for S as a way of writing the infinite BNF rule S = (); 'a'; 'b'; 'aa'; 'bb'; 'aaa'; 'bbb'; 'aaaa'; 'bbbb' ... To derive the empty string, we take the first alternative and rewrite S as the empty string. That gives us an empty sentential form, and derivation is complete. There is only one such derivation, and so on this understanding of the meaning of EBNF, the sentence is not ambiguous. The other approach Grune and Jacob describe, they call the 'recursive' approach because they are focusing on repetitions. But more generally it might be called the 'invisible nonterminal' approach, because it takes a regular-right-part grammar G as shorthand for a grammar in which each repeat0, repeat1, option, nested choice, etc., in G has been replaced with a new nonterminal (new in the sense that it does not already occur in G), and appropriate definitions for those new nonterminals have been added to the grammar's set of production rules. (Also, if we are being very careful, in which the new nonterminals are added to the vocabulary.) More generally, the invisible-nonterminal approach takes an EBNF grammar G to be shorthand for an equivalent BNF grammar G'. In the ixml context, we have an implicit proviso that every non-hidden nonterminal in G must reappear in G', and must recognize the same language in G' as in G; this or some similar constraint is necessary in order to guarantee that the XML produced as output will be correct. On the invisible-nonterminal approach as just described, the grammar is understood as shorthand for some) equivalent BNF grammar, like for example G2: S = a-star; b-star. a-star = empty; 'a', a-star. b-star = empty; 'b', b-star. empty = { nil } . But since we don't constrain the rewrite rules used by an implementation, it could equally well turn into something different, like G3: S = a-star; b-star. a-star = empty; a-star, 'a'. b-star = empty; b-star, 'b'. empty = { nil } . Or G4: S = a-star; b-star. a-star = empty; a-plus. b-star = empty; b-plus. a-plus = 'a'; 'a', a-plus. b-plus = 'b'; 'b', b-plus. empty = { nil } . Under any of these grammars, there are multiple left-most derivations and multiple parse trees and the empty string is ambiguous. But if we take the more general form of the invisible-nonterminal approach -- any equivalent grammar that provides for the correct XML output -- then the grammar could also be rewritten as G5: S = empty; a-plus; b-plus. a-plus = 'a'; 'a', a-plus. b-plus = 'b'; 'b', b-plus. empty = { nil } . Note that G5 recognizes the same language as the original grammar G, and also that each unhidden nonterminal in G (namely S) also recognizes the same language in G5 as in G. Parsed with G5, the empty string is not ambiguous. We can conclude that the ambiguity found in G2, G3, and G4 here is not required by the invisible-nonterminal approach, only by the particular way in which we chose to rewrite G to get it into BNF. If we think of the original EBNF grammar as ambiguous (because G2 is ambiguous), then the rewrite into G5 eliminated an ambiguity. If we think of G as unambiguous (as in fact I do), then the rewrites into G2, G3, and G4 introduced ambiguity. In general, I think it is evident that there is no universally accepted definition of derivation, or parse tree, or "a parse", for EBNF grammars. The formalists define those terms for BNF and say nothing about EBNF, or only that for every EBNF grammar there is a BNF grammar generating the same set of strings. The variation in behavior we have seen so far stems from the difference between the two interpretations of EBNF notation identified by Grune and Jacobs. But further variation is possible within the invisible-nonterminal interpretation, because there are many ways to rewrite an EBNF grammar into an equivalent BNF grammar. I am coming to think that the best we are going to be able to do is to decide: (1) that the ixml spec should not / will not / does not prescribe either the iterative or the recursive / invisible-nonterminal interpretation of EBNF grammars, and (2) that the spec should not / will not / does not constrain the rewritings used by processors taking the invisible-nonterminal approach, other than that they must allow the processor to produce the correct XML output, and (3) that a processor should / may / must report ambiguity if there is more than one parse tree for the sentence, against the grammar actually used for parsing. Note that right now there is no requirement that the processor be able to show the user how the grammar got rewritten, so the exact form of the grammar actually used for parsing is not necessarily visible to the user. If we do make these decisions, I fear that it may challenge even Steven's dexterity with prose to describe them clearly enough to hold water without getting into details that will alarm end users. I hope we can do better than For technical reasons not described here, different conforming processors MAY differ in whether they report a given sentence as ambiguous or not. But that's the best I have managed so far. Michael
Received on Thursday, 6 January 2022 18:43:29 UTC