Re: How is ambiguity defined? from C. M. Sperberg-McQueen on 2022-01-05 (public-ixml@w3.org from January 2022)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Wed, 5 Jan 2022 12:43:59 -0700
To: Norm Tovey-Walsh <norm@saxonica.com>
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, ixml <public-ixml@w3.org>
Message-Id: <5826A01A-B59F-41AE-BE4A-9BDE1AF2E8E7@blackmesatech.com>
On 5,Jan2022, at 5:35 AM, Norm Tovey-Walsh <norm@saxonica.com> wrote:
> 
> Apologies for what may be a naive question.
> 
> Consider this grammar:
> 
> lnnl: "a", number, number, "a" .
> -number: oddnumber ; evennumber .
> -oddnumber: "1" ; .
> -evennumber: "2" ; .
> 
> The grammar is ambiguous in that there are multiple nonterminals that
> can match nothing.
> 
> But there is only one possible parse for “a12a”. I assume that we don’t
> report any ambiguity in this case.
> 
> There are several possible parses for “aa”, but they’ll all produce
> exactly the same XML serialization. Does that count as ambiguous?

On the hunch that this kind of question is best handled by defining or
clarifying terms, I have just checked my shelf for definitions of
ambiguity by people who know what they are doing.  

The short answer is: They all agree, and they leave us with the task
of defining what ambiguity means for our spec.  There is no
off-the-shelf answer for our situation.

Longer answer:

First, how do they define ambiguity?

György E. Révész, Introduction to formal languages (New York:
McGraw-Hill, 1983, rpt. Mineola, NY: Dover, 1991) says (p. 138):

    For a context-free Grammar G, a word P in L(G) is said to be
    ambiguous iff P has two or more derivation trees (leftmost
    derivations) in G.

John E. Hopcroft and Jeffrey D. Ulllman, Introduction to automata
theory, languages, and computation (Reading, MA:  Addison-Wesley,
19790, say (p. 87):

    A context-free grammar G such that some word has two parse trees
    is said to be *ambiguous*.  From what we have said above, an
    equivalent definition of ambiguity is that some word has more than
    one leftmost derivation or more than one rightmost derivation.

Dick Grune and Ceriel J. H. Jacobs, Parsing techniques: A practical
guide (New York: Ellis Horwood, 1990, 2d ed. New York: Springer,
2008), say (p. 62 of 1st ed., p. 63 of 2d ed.)

   Not surprisingly, a sentence with more than one production tree is
   called *ambiguous*, but we must immediately distinguish between
   *essential ambigurity* and *spurious ambiguity*. ... An ambiguous
   sentence is spuriously ambiguous if all its production trees
   describe the same semantics; if some of them differ in their
   semantics, the ambiguity is essential.

Some things seem worth noting.

Révésc and Hopcroft/Ullman explicitly define ambiguity with respect to
a grammar / string pair.  This is not surprising, since a sentence may
have multiple parse trees for one grammar and only one parse tree for
a different grammar that defines the same language.

All three books define ambiguity in terms of trees, not directly of
derivations.

All three books treat trees as effectively a visualization of a
derivation, with one node for each derivation step.  In their
accounts, any derivation gives rise to exactly one tree, any tree has
one leftmost and one rightmost derivation, and in every derivation
step exactly one nonterminal is rewritten by being replaced by its
right-hand side.

I won't quote the wording, but in all three books, derivation or
production graphs are described as using nonterminals to label nodes.

The problem is that for us, these don't all hold.  I'll take the
example that came up the other day, since it's shorter than Norm's
example:

    S = 'a'* | 'b'*.

The empty string '' is a sentence in this language.  Is the empty
string ambiguous under this grammar?  How many ways are there to
derive it?  How many parse trees does it have?

We have not defined how to derive a string from an ixml grammar, and
the answers will vary depending on what we decide.

Approach A:  BNF

One approach is to rewrite the ixml grammar to BNF and use what the
formal-language theorists give us.  If we use the rewrites in the
spec, we get

   S = a-star; b-star.
   -a-star: 'a', a-star; . 
   -b-star: 'b', b-star; . 

and this is clearly ambiguous: we can derive the empty string either
via a-star or via b-star.

However, this is not the grammar we started with, and ambiguity is not
defined with respect to the set of all possible grammars for a
language, but with respect to one particular grammar.


Approach B:  EBNF (derivation by nonterminals)

If parse trees are labeled with nonterminals, then there is only one
parse tree for the empty string in this grammar: <S></S>.  There is no
nonterminal defined as 'a'*, no nonterminal for 'b'*, so they don't go
into the parse tree.


Approach C:  EBNF (derivation by expressions)

We could say there are obviously two ways to derive the sentence in
the EBNF:

  1 S
  2 'a'*
  3 ''

and another is 

  1 S
  2 'b'*
  3 ''

The parse trees for this grammar would, on this account, be labeled
not with nonterminals but with expressions of any kind -- here a
repeat-0 over a terminal symbol.

The problem is that neither we nor anyone else have offered a formal
definition of derivation from an EBNF grammar.  So we could equally
well take a different approach.


In their section 2.3.2.4 (p. 28 of 2d ed; 2.3.2.3 on p. 34 of 1st
ed.), Grune and Jacobs describe our problem concisely without
resolving it in a way that binds us.

    There are two different schools of thought about the structural
    meaning of a regular right-hand side.  One school maintains that a
    rule like

        Book --> Preface Chapter+ Conclusion

    is an abbreviation of

        Book --> Preface α Conclusion
 α --> Chapter | Chapter α

    as shown above.  This is the "(right)recursive" interpretation.
    It has the advantage that it is easy to explain and that the
    transformatoin to "normal" CF is simple.  Disadvantages are that
    the transformation entails anonymous rules (identified by α here)
    and that the lopsided production tree, for, for example, a book of
    four chapters does not correspond to our idea of the structure of
    the book; ...

    The second school claims that

        Book --> Preface Chapter+ Conclusion

    is an abbreviation of

        Book --> Preface Chapter Conclusion
           | Preface Chapter Chapter Conclusion 
           | Preface Chapter Chapter Chapter Conclusion 
           | ...
   ...

    This is the "iterative" interpretation.  It has the advantage that
    it yields a beautiful production tree ..., but the disadvantages
    are that it involves an infinite number of production rules and
    that the nodes in the production tree have a varying fan-out.

    Since the implementation of the iterative interpretation is far
    from trivial, most practical parser generators use the recursive
    interpretation in some form or another, whereas most research has
    been done on the iterative interpretation.

Hot dog: when I figured out how to apply the Earley algorithm directly
to EBNF grammars I apparently solved a non-trivial problem.  Cool!

It will be seen that approaches A and C described above take the
recursive interpretation, and B the iterative interpretation.

For implementors, it looks to me as if it's easiest to detect
ambiguities by looking at what might be called the raw parse tree:
the parse tree you get if you ignore all marks.  One complication is
that since some implementations rewrite the grammar (and this is
implicitly allowed by our Hints to Implementors section), not all
processors are going to produce the same raw parse tree.

For the user, I think Grune and Jacobs's disinction between essential
ambiguity and spurious ambiguity is likely to be the distinction
between ambiguities that produce more than one XML form and
ambiguities that don't.

What tradeoff do we want between simplicity for implementors
and simplicity for users?


At the moment, every answer I see to Norm's question looks
unacceptable.

Michael
Received on Wednesday, 5 January 2022 20:59:54 UTC