- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Wed, 5 Jan 2022 12:43:59 -0700
- To: Norm Tovey-Walsh <norm@saxonica.com>
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, ixml <public-ixml@w3.org>
On 5,Jan2022, at 5:35 AM, Norm Tovey-Walsh <norm@saxonica.com> wrote: > > Apologies for what may be a naive question. > > Consider this grammar: > > lnnl: "a", number, number, "a" . > -number: oddnumber ; evennumber . > -oddnumber: "1" ; . > -evennumber: "2" ; . > > The grammar is ambiguous in that there are multiple nonterminals that > can match nothing. > > But there is only one possible parse for “a12a”. I assume that we don’t > report any ambiguity in this case. > > There are several possible parses for “aa”, but they’ll all produce > exactly the same XML serialization. Does that count as ambiguous? On the hunch that this kind of question is best handled by defining or clarifying terms, I have just checked my shelf for definitions of ambiguity by people who know what they are doing. The short answer is: They all agree, and they leave us with the task of defining what ambiguity means for our spec. There is no off-the-shelf answer for our situation. Longer answer: First, how do they define ambiguity? György E. Révész, Introduction to formal languages (New York: McGraw-Hill, 1983, rpt. Mineola, NY: Dover, 1991) says (p. 138): For a context-free Grammar G, a word P in L(G) is said to be ambiguous iff P has two or more derivation trees (leftmost derivations) in G. John E. Hopcroft and Jeffrey D. Ulllman, Introduction to automata theory, languages, and computation (Reading, MA: Addison-Wesley, 19790, say (p. 87): A context-free grammar G such that some word has two parse trees is said to be *ambiguous*. From what we have said above, an equivalent definition of ambiguity is that some word has more than one leftmost derivation or more than one rightmost derivation. Dick Grune and Ceriel J. H. Jacobs, Parsing techniques: A practical guide (New York: Ellis Horwood, 1990, 2d ed. New York: Springer, 2008), say (p. 62 of 1st ed., p. 63 of 2d ed.) Not surprisingly, a sentence with more than one production tree is called *ambiguous*, but we must immediately distinguish between *essential ambigurity* and *spurious ambiguity*. ... An ambiguous sentence is spuriously ambiguous if all its production trees describe the same semantics; if some of them differ in their semantics, the ambiguity is essential. Some things seem worth noting. Révésc and Hopcroft/Ullman explicitly define ambiguity with respect to a grammar / string pair. This is not surprising, since a sentence may have multiple parse trees for one grammar and only one parse tree for a different grammar that defines the same language. All three books define ambiguity in terms of trees, not directly of derivations. All three books treat trees as effectively a visualization of a derivation, with one node for each derivation step. In their accounts, any derivation gives rise to exactly one tree, any tree has one leftmost and one rightmost derivation, and in every derivation step exactly one nonterminal is rewritten by being replaced by its right-hand side. I won't quote the wording, but in all three books, derivation or production graphs are described as using nonterminals to label nodes. The problem is that for us, these don't all hold. I'll take the example that came up the other day, since it's shorter than Norm's example: S = 'a'* | 'b'*. The empty string '' is a sentence in this language. Is the empty string ambiguous under this grammar? How many ways are there to derive it? How many parse trees does it have? We have not defined how to derive a string from an ixml grammar, and the answers will vary depending on what we decide. Approach A: BNF One approach is to rewrite the ixml grammar to BNF and use what the formal-language theorists give us. If we use the rewrites in the spec, we get S = a-star; b-star. -a-star: 'a', a-star; . -b-star: 'b', b-star; . and this is clearly ambiguous: we can derive the empty string either via a-star or via b-star. However, this is not the grammar we started with, and ambiguity is not defined with respect to the set of all possible grammars for a language, but with respect to one particular grammar. Approach B: EBNF (derivation by nonterminals) If parse trees are labeled with nonterminals, then there is only one parse tree for the empty string in this grammar: <S></S>. There is no nonterminal defined as 'a'*, no nonterminal for 'b'*, so they don't go into the parse tree. Approach C: EBNF (derivation by expressions) We could say there are obviously two ways to derive the sentence in the EBNF: 1 S 2 'a'* 3 '' and another is 1 S 2 'b'* 3 '' The parse trees for this grammar would, on this account, be labeled not with nonterminals but with expressions of any kind -- here a repeat-0 over a terminal symbol. The problem is that neither we nor anyone else have offered a formal definition of derivation from an EBNF grammar. So we could equally well take a different approach. In their section 2.3.2.4 (p. 28 of 2d ed; 2.3.2.3 on p. 34 of 1st ed.), Grune and Jacobs describe our problem concisely without resolving it in a way that binds us. There are two different schools of thought about the structural meaning of a regular right-hand side. One school maintains that a rule like Book --> Preface Chapter+ Conclusion is an abbreviation of Book --> Preface α Conclusion α --> Chapter | Chapter α as shown above. This is the "(right)recursive" interpretation. It has the advantage that it is easy to explain and that the transformatoin to "normal" CF is simple. Disadvantages are that the transformation entails anonymous rules (identified by α here) and that the lopsided production tree, for, for example, a book of four chapters does not correspond to our idea of the structure of the book; ... The second school claims that Book --> Preface Chapter+ Conclusion is an abbreviation of Book --> Preface Chapter Conclusion | Preface Chapter Chapter Conclusion | Preface Chapter Chapter Chapter Conclusion | ... ... This is the "iterative" interpretation. It has the advantage that it yields a beautiful production tree ..., but the disadvantages are that it involves an infinite number of production rules and that the nodes in the production tree have a varying fan-out. Since the implementation of the iterative interpretation is far from trivial, most practical parser generators use the recursive interpretation in some form or another, whereas most research has been done on the iterative interpretation. Hot dog: when I figured out how to apply the Earley algorithm directly to EBNF grammars I apparently solved a non-trivial problem. Cool! It will be seen that approaches A and C described above take the recursive interpretation, and B the iterative interpretation. For implementors, it looks to me as if it's easiest to detect ambiguities by looking at what might be called the raw parse tree: the parse tree you get if you ignore all marks. One complication is that since some implementations rewrite the grammar (and this is implicitly allowed by our Hints to Implementors section), not all processors are going to produce the same raw parse tree. For the user, I think Grune and Jacobs's disinction between essential ambiguity and spurious ambiguity is likely to be the distinction between ambiguities that produce more than one XML form and ambiguities that don't. What tradeoff do we want between simplicity for implementors and simplicity for users? At the moment, every answer I see to Norm's question looks unacceptable. Michael
Received on Wednesday, 5 January 2022 20:59:54 UTC