Re: Side issue - ambiguity from C. M. Sperberg-McQueen on 2022-06-21 (public-ixml@w3.org from June 2022)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Tue, 21 Jun 2022 15:02:03 -0600
To: Steven Pemberton <steven.pemberton@cwi.nl>
Cc: public-ixml@w3.org
Message-ID: <87wnd9yblg.fsf@blackmesatech.com>
Responses to Steven interspersed.

Steven Pemberton writes:

>> G1 follows the pattern of the test case that led us to realize that
>> different processors produce different results in cases of ambiguity.
>> Accepting the empty string can be justified by appeal to either the left
>> or the right choice, so it feels inherently ambiguous to some. On the
>> other hand, there is only one parse tree, namely <S/>.

> Here I disagree: there are two parse trees that are identical, and my
> implementation records both.

Hmm.  I am not sure I can think of a definition of 'parse tree' that
would make that sentence make sense.

>> ...

>> {G4} S = A; B. -A = 'a'*. -B = 'b'*. {input: empty string}
>>
>> G4 highlights a thorny issue. The base grammar, ignoring the marks on
>> the symbols, is the same as G2. But while there are two 'raw' parse
>> trees (if I can use that term for a parse tree with a node for every
>> nonterminal regardless of its marking), there is only one result tree,
>> because the 'A' and 'B' nonterminals are marked hidden.
>>
>> Should that count as ambiguous because there are multiple (raw) parse
>> trees, or as unambiguous because there is only one XML result tree?

> We've talked about this in the past. Deciding if two serialisations
> are identical is a *lot* of work. And you may have hundreds to
> compare.
> Ambiguity is in the parse-tree, and has nothing to do with the
> serialisation, and we say that too.

I agree that we have talked about this and have decided not to define
ambiguity as meaning the existence of more than one XML serialization
for the given sentence and grammar.

But I'm not sure it's completely clear to all readers that the spec
distinguishes decisively between the parse tree and the serialized XML.
To be very concrete, it's not clear to me.  Perhaps I am not reading the
spec attentively enough, but it seems to me that the spec rather
encourages the reader to assume that they are effectively the same
thing, or stand in a 1:1 relation of abstract object and concrete
representation of that object.  The first occurrence of the word 'parse'
in the spec is in the section "How it works", which reads in part

    An input is parsed using this grammar, and the resulting parse tree
    is serialized as XML.

That has always seemed to me to be using the term 'parse tree' to denote
the XML form of the result, or more precisely to denote an abstract
structure isomorphic to the XML, and thus to be conflating the two
notions which some people distinguish as the (raw) parse tree and an
'abstract syntax tree' created by trimming the parse tree.  I attribute
it to the desire to make the spec easily readable and not to put up
barriers to comprehension by stressing complications and distinctions
that will be unhelpful to most readers.  The words "the resulting parse
tree is serialized as XML" does not make me think that the spec is
describing the relation of parse tree and AST.

If the form of words "X is serialized as XML" is intended to suggest
that X is or can be structurally different from the XML produced, then I
regret to say that it does not suggest it to this reader.


>> For what it's worth, our earlier discussion of this topic led me to
>> believe that the core of our difficulty is that formal definitions of
>> ambiguity refer to grammars in a particular form (which I am somewhat
>> loosely calling 'BNF'), and are not directly applicable to grammars in
>> EBNF.

> I don't believe we really have a difficulty.

Well, views may differ of course.

But we have been unable to reach a consensus on the meaning, in an ixml
context, of the term "ambiguity", or even a consensus that it would be
worth any effort to achieve consensus.  And in consequence we have been
forced to abandon the idea that different processors would produce the
same results with respect to the detection of ambiguity.  The end result
is that our spec continues to appeal to a notion of 'ambiguity' without
in fact providing a definition that can be reliably applied.

That seems to me to have been a difficulty, but perhaps I have
unrealistic expectations.

> An implementer told me he
> uses PEG grammars, which are allowable according to the spec because
> we say that the implementation only has to produce at least one parse
> of acceptable input, but PEGs never find ambiguities because they only
> produce one parse. We require an implementation to report ambiguities
> if it finds them, but that's all. 

Strictly speaking, I believe we ALLOW implementations to report
ambiguity, and the rhetoric of the spec does its best to encourage it.
But that's a side issue here, I think.

More salient is this: do PEG grammars define the same language as BNF
grammars which look the same?

I have not read anything about PEG grammars that made me eager to spend
time learning more, but by all the reports I have seen, a PEG parser
presented in the input "abc" and the grammar

   S = (B; A), C.
   A = 'a'.
   B = 'ab'.
   C = 'c'.

will report a parse of <S><B>ab</B><C>c</C></S>, while the same parser
presented with the same input and the grammar

   S = (A; B), C.
   A = 'a'.
   B = 'ab'.
   C = 'c'.

will report that the input does not match the grammar.

I believe that the reason is that PEG parsing allows the parser to
commit to the 'A' once it has been successfully parsed and that the
consequence of that approach is that PEG parsers may overlook parses
that exist, not just in cases of ambiguity but in cases where the PEG
parser goes down a blind alley, and its commitment to its choices means
it cannot find its way back.  More broadly, for any two nonterminals A
and B, in an ixml grammar the rule "N = A; B" and the rule "N = B; A"
define the same language, and for PEG parsers they do not.

If my understanding is correct, PEG parsers will sometimes not find any
parse, even when one exists.

If my understanding is incorrect, I will be glad of correction by those
with firmer knowledge of PEG parsing than can be gleaned from the
Wikipedia article.

>> If I understand the definitions in my formal-languages books correctly,
>> G6 and G7 define the same language, and the sentence 'p' is ambiguous
>> against grammar G7 but not against grammar G6.

> "Defining the same language" and "not producing ambiguous parses" are
> two different things.

Err, yes.  I am not sure how they could fail to be two different things,
since one is a predicate that applies to two or more grammars and the
other to a single grammar.

> Example:
>
>    A: A, 'a'; 'a', A; .
>
> defines the language 'a'* in a myriad of parses.

Yes.

The point I was trying to make (apparently without as much success as I
would have hoped) is that a sentence is or is not ambiguous with respect
to a given grammar, not with respect to a given language.

That means that when an ixml processor replaces the user-supplied
grammar G with an equivalent grammar G', claims about the ambiguity of
the input based on parsing it with G' are not guaranteed to say anything
reliable about the ambiguity of the input with respect to the original
grammar G.

In a spec which explicitly foresees that a processor may replace the
user-specified grammar with an equivalent implementation-dependent
grammar, it will therefore always be impossible to provide an
interoperable definition of ambiguity.

>> - It's not ambiguous against G6 because formally a grammar has a set
>> of productions of the form A --> alpha, where A is a nonterminal
>> symbol and alpha is a string of symbols from (N union T)*. If we
>> take ['a'-'z'] as a terminal symbol (which ixml tells us to do),
>> then the left and right sides of the choice do not represent two
>> different production rules but only one. So from a formal point of
>> view, G6 is just a slightly eccentric way of writing G8:
>>
>> {G8} name = ['a'-'z']. {input: 'p'}
>>
>> I would rather not require that processors be required to detect the
>> identity of the two sides of the choice in G6. I would also not like to
>> forbid them to do so.
>>
>> {G9} name = A; A. A = ['a'-'z']. {input: 'p'}
>>
>> G9 illustrates the same principle on the level of nonterminals;
>> formally, there is only one right-hand side for 'name', and the fact
>> that it is written twice makes no more difference than writing the set
>> of numbers from one to three as {1, 2, 3, 3} with four not three
>> expressions within the braces.

> It depends whether you consider the alternatives as a set or a bag. I
> consider them as a bag, and so don't accept your premise.

You appear, then, to be alone among the practitioners of formal language
and automata theory.

I will stand by my claim that in any formal treatment of grammars, the
production rules of a grammar are sets, not bags, of sequences.  It
follows that from the formal point of view taken by (say) Hopcroft and
Ullman, or Greibach, or Chomsky, or any textbook on automata theory and
formal languages, the string "name = A; A." and the string "name = A."
describe the same production rule for name.

>> There is only one parse tree here (<name><A>p</A></name>) regardless of
>> which branch of the choice one takes.
>>
>> {G10} name = ['a'-'z']+; ['a'-'z'; 'A' - 'Z']+. {input: 'p'}
>>
>> {G11} name = A; B. A = ['a'-'z']+. B = ['a'-'z'; 'A' -
>>   'Z']+. {input: 'p'}
>>
>> Grammars G10 and G11 illustrate the proposition that it's not *just* a
>> question of identifying identical right-hand sides: 'p' matches both
>> left and right sides of the choice, but they are not identical.

> And I would call that an ambiguity.

That seems to be consistent with the intuitions you have described thus
far.  You will be unsurprised, I guess, to learn that I think 'p' is
most plausibly described as ambiguous against G11 (where there are two
different parse trees for it) but not against G10 (where there is only
one).

>> The reason I wrote this note was to ask a question of those whose
>> intuition is that the inputs specified should be regarded as ambiguous
>> against G1, G6, and G10. Consider G12:
>>
>> {G12} name = ['a'-'z'; 'a'-'z'; 'A' - 'Z']+. {input: 'p'}
>>
>> Here there is only one terminal (as ixml defines the term), not two.
>>
>> Is your intuition that the input 'p' is ambiguous against grammar G12,
>> or unambiguous?

> Unambiguous. We call those things [ ] a set.

Thank you; that's very helpful, although it seems to me to diverge
strikingly from your intuitions with respect to G10.

> (Your mails take a loooong time to answer ;-); this one about an hour,
> including a lot of pacing up and down thinking)

Thank you very much for taking the time to answer.

Michael


-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
Received on Tuesday, 21 June 2022 21:02:21 UTC