Re: Side issue - ambiguity from Steven Pemberton on 2022-06-21 (public-ixml@w3.org from June 2022)

From: Steven Pemberton <steven.pemberton@cwi.nl>
Date: Tue, 21 Jun 2022 13:47:53 +0000
To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, public-ixml@w3.org
Message-Id: <1655817561083.3507719345.1152846774@cwi.nl>
 > G1 follows the pattern of the test case that led us to realize that
 > different processors produce different results in cases of ambiguity.
 > Accepting the empty string can be justified by appeal to either the left
 > or the right choice, so it feels inherently ambiguous to some. On the
 > other hand, there is only one parse tree, namely <S/>.
Here I disagree: there are two parse trees that are identical, and my 
implementation records both.

 > {G2} S = A; B. A = 'a'*. B = 'b'*. {input: empty string}
 >
 > {G3} S = A; B. A = ; 'a', A. B = ; 'b', B. {input: empty string}
 >
 > G2 gives names to the left and right choices in the rule for S. G3 goes
 > further an rewrites the grammar as BNF; it is the BNF to which an
 > implementation might choose to reduce G1 or G2. For both of these
 > grammars, there are two parse trees for the empty string: <S><A/></S>
 > and <S><B/></S>. So the sentence is clearly ambiguous.

I think 'unarguably' is a better adjective ;-)

 > {G4} S = A; B. -A = 'a'*. -B = 'b'*. {input: empty string}
 >
 > G4 highlights a thorny issue. The base grammar, ignoring the marks on
 > the symbols, is the same as G2. But while there are two 'raw' parse
 > trees (if I can use that term for a parse tree with a node for every
 > nonterminal regardless of its marking), there is only one result tree,
 > because the 'A' and 'B' nonterminals are marked hidden.
 >
 > Should that count as ambiguous because there are multiple (raw) parse
 > trees, or as unambiguous because there is only one XML result tree?

We've talked about this in the past. Deciding if two serialisations are 
identical is a *lot* of work. And you may have hundreds to compare.
Ambiguity is in the parse-tree, and has nothing to do with the 
serialisation, and we say that too.

 > For what it's worth, our earlier discussion of this topic led me to
 > believe that the core of our difficulty is that formal definitions of
 > ambiguity refer to grammars in a particular form (which I am somewhat
 > loosely calling 'BNF'), and are not directly applicable to grammars in
 > EBNF.

I don't believe we really have a difficulty. An implementer told me he uses 
PEG grammars, which are allowable according to the spec because we say that 
the implementation only has to produce at least one parse of acceptable 
input, but PEGs never find ambiguities because they only produce one parse. 
We require an implementation to report ambiguities if it finds them, but 
that's all. 

 > The spec clearly allows implementations to parse the input by rewriting
 > the grammar from EBNF to BNF, but (correctly, I think) refrains from
 > prescribing a particular translation to BNF. Different ways of reducing
 > a grammar to an equivalent BNF may differ in whether the sentence is
 > ambiguous or not. G5, for example, is a BNF equivalent to G1 for which
 > the empty string is not ambiguous:
 >
 > {G5} S = nil; As; Bs. -nil=. -As='a'; As, 'a'. -Bs='b'; Bs, 'b'.
 >
 > The fact that ambiguity is not crisply defined for EBNF certainly
 > complicates the matter. But there appears to be more to it. Consider
 > the next two grammars.
 >
 > {G6} name = ['a'-'z']; ['a'-'z']. {input: 'p'}
 >
 > {G7} name = A; B. A = ['a'-'z']. B = ['a'-'z']. {input: 'p'}
 >
 > G6 and G7 seem to illustrate the same points as G1 and G2, but they are
 > already in BNF form, so whatever problem they raise does not seem to
 > have anything to do with the definition of ambiguity for EBNF.
 >
 > If I understand the definitions in my formal-languages books correctly,
 > G6 and G7 define the same language, and the sentence 'p' is ambiguous
 > against grammar G7 but not against grammar G6.

"Defining the same language" and "not producing ambiguous parses" are two 
different things.
Example:

    A: A, 'a'; 'a', A; .

defines the language 'a'* in a myriad of parses.

 > - It's not ambiguous against G6 because formally a grammar has a set
 > of productions of the form A --> alpha, where A is a nonterminal
 > symbol and alpha is a string of symbols from (N union T)*. If we
 > take ['a'-'z'] as a terminal symbol (which ixml tells us to do),
 > then the left and right sides of the choice do not represent two
 > different production rules but only one. So from a formal point of
 > view, G6 is just a slightly eccentric way of writing G8:
 >
 > {G8} name = ['a'-'z']. {input: 'p'}
 >
 > I would rather not require that processors be required to detect the
 > identity of the two sides of the choice in G6. I would also not like to
 > forbid them to do so.
 >
 > {G9} name = A; A. A = ['a'-'z']. {input: 'p'}
 >
 > G9 illustrates the same principle on the level of nonterminals;
 > formally, there is only one right-hand side for 'name', and the fact
 > that it is written twice makes no more difference than writing the set
 > of numbers from one to three as {1, 2, 3, 3} with four not three
 > expressions within the braces.

It depends whether you consider the alternatives as a set or a bag. I 
consider them as a bag, and so don't accept your premise.

 > There is only one parse tree here (<name><A>p</A></name>) regardless of
 > which branch of the choice one takes.
 >
 > {G10} name = ['a'-'z']+; ['a'-'z'; 'A' - 'Z']+. {input: 'p'}
 >
 > {G11} name = A; B. A = ['a'-'z']+. B = ['a'-'z'; 'A' - 'Z']+. {input: 
'p'}
 >
 > Grammars G10 and G11 illustrate the proposition that it's not *just* a
 > question of identifying identical right-hand sides: 'p' matches both
 > left and right sides of the choice, but they are not identical.

And I would call that an ambiguity.

 > G10 also illustrates the fact that for formal language theory,
 > ['a'-'z'] is not in fact a terminal symbol but just a shorthand for a
 > twenty-six way choice. Formally, 'name' is defined here as
 >
 > name = 'a'.
 > name = 'b'.
 > ...
 > name = 'z'.
 > name = 'A'.
 > ...
 > name = 'Z'.
 >
 > So the input 'p' is not ambiguous against G10, from a formal point of
 > view.
 >
 > The reason I wrote this note was to ask a question of those whose
 > intuition is that the inputs specified should be regarded as ambiguous
 > against G1, G6, and G10. Consider G12:
 >
 > {G12} name = ['a'-'z'; 'a'-'z'; 'A' - 'Z']+. {input: 'p'}
 >
 > Here there is only one terminal (as ixml defines the term), not two.
 >
 > Is your intuition that the input 'p' is ambiguous against grammar G12,
 > or unambiguous?

Unambiguous. We call those things [ ] a set.

Steven
(Your mails take a loooong time to answer ;-); this one about an hour, 
including a lot of pacing up and down thinking)
Received on Tuesday, 21 June 2022 13:48:24 UTC