Re: How is ambiguity defined?

> On 6,Jan2022, at 9:13 AM, Steven Pemberton <steven.pemberton@cwi.nl> wrote:
> 
> ...
> 
> Since we don't normatively demand any particular parsing algorithm, the only restrictions on the algorithm are therefore:
> * It must find at least one parse of any input that matches the grammar
> * if it finds more than one parse, it must report that fact.
> 
> So I think the fact that Michael and I report differently about
> 
>   s: a*; b*.
> 
> is covered, since we are using different algorithms.

I think I agree with the spirit of what Steven says here, but the
specifics of the wording worry me.  In the context of BNF, ambiguity
is a property of a given sentence with respect to a given grammar and
does not in any way depend on the parsing algorithm used.  So I’m
nervous about attributing the differences in our reports to
differences of parsing algorithms.

I think there are two sources of different behavior (at least — I have
no proof that this list is exhaustive) with respect to the reporting
of ambiguity: one is differences in interpreting the meaning of EBNF
rules and one is differences in the rewriting of EBNF to BNF.  

Grune and Jacobs seem to me to provide the best handle on the latter
issue.  I won’t repeat the long quotation I included in my mail of the
other day, but they distinguish an 'iterative' and a 'recursive' approach.

Under the iterative approach, a regular right part R is taken as
denoting a (possibly infinite) set of sequences of terminal and
non-terminal symbols -- namely, the set of sequences in L(R), which is
a subset of V*, where V is the union of the nonterminal and terminal
symbols of the grammar.  Any one of these right-hand sides can be used
in derivation, and derivation has its usual form and definition.

Consider the grammar G:

    S= 'a'*; 'b'*.

The iterative interpretation understands G's rule for S as a way of
writing the infinite BNF rule

    S = (); 'a'; 'b'; 'aa'; 'bb'; 'aaa'; 'bbb'; 'aaaa'; 'bbbb' ... 

To derive the empty string, we take the first alternative and rewrite
S as the empty string.  That gives us an empty sentential form, and
derivation is complete.  There is only one such derivation, and so on
this understanding of the meaning of EBNF, the sentence is not
ambiguous.

The other approach Grune and Jacob describe, they call the 'recursive'
approach because they are focusing on repetitions.

But more generally it might be called the 'invisible nonterminal'
approach, because it takes a regular-right-part grammar G as shorthand
for a grammar in which each repeat0, repeat1, option, nested choice,
etc., in G has been replaced with a new nonterminal (new in the sense
that it does not already occur in G), and appropriate definitions for
those new nonterminals have been added to the grammar's set of
production rules.  (Also, if we are being very careful, in which the
new nonterminals are added to the vocabulary.)  More generally, the
invisible-nonterminal approach takes an EBNF grammar G to be shorthand
for an equivalent BNF grammar G'.  In the ixml context, we have an
implicit proviso that every non-hidden nonterminal in G must reappear
in G', and must recognize the same language in G' as in G; this or
some similar constraint is necessary in order to guarantee that the
XML produced as output will be correct.

On the invisible-nonterminal approach as just described, the grammar
is understood as shorthand for some) equivalent BNF grammar, like for
example G2:

    S = a-star; b-star. 
    a-star = empty; 'a', a-star. 
    b-star = empty; 'b', b-star.
    empty = { nil } . 

But since we don't constrain the rewrite rules used by an
implementation, it could equally well turn into something different,
like G3:

    S = a-star; b-star. 
    a-star = empty; a-star, 'a'. 
    b-star = empty; b-star, 'b'. 
    empty = { nil } . 

Or G4:

    S = a-star; b-star. 
    a-star = empty; a-plus.
    b-star = empty; b-plus.
    a-plus = 'a'; 'a', a-plus. 
    b-plus = 'b'; 'b', b-plus. 
    empty = { nil } . 

Under any of these grammars, there are multiple left-most derivations
and multiple parse trees and the empty string is ambiguous.  But if we
take the more general form of the invisible-nonterminal approach --
any equivalent grammar that provides for the correct XML output --
then the grammar could also be rewritten as G5:

    S = empty; a-plus; b-plus.
    a-plus = 'a'; 'a', a-plus. 
    b-plus = 'b'; 'b', b-plus. 
    empty = { nil } . 

Note that G5 recognizes the same language as the original grammar G,
and also that each unhidden nonterminal in G (namely S) also
recognizes the same language in G5 as in G.

Parsed with G5, the empty string is not ambiguous.

We can conclude that the ambiguity found in G2, G3, and G4 here is not
required by the invisible-nonterminal approach, only by the particular
way in which we chose to rewrite G to get it into BNF.  If we think of
the original EBNF grammar as ambiguous (because G2 is ambiguous), then
the rewrite into G5 eliminated an ambiguity.  If we think of G as
unambiguous (as in fact I do), then the rewrites into G2, G3, and G4
introduced ambiguity.

In general, I think it is evident that there is no universally
accepted definition of derivation, or parse tree, or "a parse", for
EBNF grammars.  The formalists define those terms for BNF and say
nothing about EBNF, or only that for every EBNF grammar there is a BNF
grammar generating the same set of strings.  The variation in behavior
we have seen so far stems from the difference between the two
interpretations of EBNF notation identified by Grune and Jacobs.  But
further variation is possible within the invisible-nonterminal
interpretation, because there are many ways to rewrite an EBNF grammar
into an equivalent BNF grammar.

I am coming to think that the best we are going to be able to do is to
decide:

  (1) that the ixml spec should not / will not / does not prescribe
either the iterative or the recursive / invisible-nonterminal
interpretation of EBNF grammars, and

  (2) that the spec should not / will not / does not constrain the
rewritings used by processors taking the invisible-nonterminal
approach, other than that they must allow the processor to produce the
correct XML output, and

  (3) that a processor should / may / must report ambiguity if there
is more than one parse tree for the sentence, against the grammar
actually used for parsing.

Note that right now there is no requirement that the processor be able
to show the user how the grammar got rewritten, so the exact form of
the grammar actually used for parsing is not necessarily visible to
the user.

If we do make these decisions, I fear that it may challenge even
Steven's dexterity with prose to describe them clearly enough to hold
water without getting into details that will alarm end users. I hope
we can do better than

    For technical reasons not described here, different conforming
    processors MAY differ in whether they report a given sentence as
    ambiguous or not.

But that's the best I have managed so far.

 
Michael

Received on Thursday, 6 January 2022 18:43:29 UTC