Side issue - ambiguity from C. M. Sperberg-McQueen on 2022-06-12 (public-ixml@w3.org from June 2022)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Sun, 12 Jun 2022 13:07:07 -0600
To: public-ixml@w3.org
Message-ID: <87sfo94s6c.fsf@blackmesatech.com>
This is not a spec-related issue any more, since the spec is explicit
that conforming processors are allowed to vary in their flagging of
ambiguity -- but because I am interested in parsing I continue wishing
to understand the views of other group members on the topic, whether
they are illustrated concretely by the behavior of a processor or just
intuitions.

Consider the following grammars and input strings:

  {G1} S = 'a'*; 'b'*.  {input: empty string}

G1 follows the pattern of the test case that led us to realize that
different processors produce different results in cases of ambiguity.
Accepting the empty string can be justified by appeal to either the left
or the right choice, so it feels inherently ambiguous to some.  On the
other hand, there is only one parse tree, namely <S/>.

  {G2} S = A; B. A = 'a'*. B = 'b'*.  {input: empty string}

  {G3} S = A; B. A = ; 'a', A. B = ; 'b', B.  {input: empty string}

G2 gives names to the left and right choices in the rule for S.  G3 goes
further an rewrites the grammar as BNF; it is the BNF to which an
implementation might choose to reduce G1 or G2.  For both of these
grammars, there are two parse trees for the empty string: <S><A/></S>
and <S><B/></S>.  So the sentence is clearly ambiguous.

  {G4} S = A; B. -A = 'a'*. -B = 'b'*.  {input: empty string}

G4 highlights a thorny issue.  The base grammar, ignoring the marks on
the symbols, is the same as G2.  But while there are two 'raw' parse
trees (if I can use that term for a parse tree with a node for every
nonterminal regardless of its marking), there is only one result tree,
because the 'A' and 'B' nonterminals are marked hidden.

Should that count as ambiguous because there are multiple (raw) parse
trees, or as unambiguous because there is only one XML result tree?

As a user, I lean toward the latter (easier to understand), as an
implementor toward the former (easier to implement).

For what it's worth, our earlier discussion of this topic led me to
believe that the core of our difficulty is that formal definitions of
ambiguity refer to grammars in a particular form (which I am somewhat
loosely calling 'BNF'), and are not directly applicable to grammars in
EBNF.

The spec clearly allows implementations to parse the input by rewriting
the grammar from EBNF to BNF, but (correctly, I think) refrains from
prescribing a particular translation to BNF.  Different ways of reducing
a grammar to an equivalent BNF may differ in whether the sentence is
ambiguous or not.  G5, for example, is a BNF equivalent to G1 for which
the empty string is not ambiguous:

  {G5} S = nil; As; Bs. -nil=. -As='a'; As, 'a'. -Bs='b'; Bs, 'b'.

The fact that ambiguity is not crisply defined for EBNF certainly
complicates the matter.  But there appears to be more to it.  Consider
the next two grammars.

  {G6} name = ['a'-'z']; ['a'-'z'].  {input: 'p'}

  {G7} name = A; B. A = ['a'-'z']. B = ['a'-'z'].  {input: 'p'}

G6 and G7 seem to illustrate the same points as G1 and G2, but they are
already in BNF form, so whatever problem they raise does not seem to
have anything to do with the definition of ambiguity for EBNF.

If I understand the definitions in my formal-languages books correctly,
G6 and G7 define the same language, and the sentence 'p' is ambiguous
against grammar G7 but not against grammar G6.

  - It's ambiguous against G7 because there are two parse trees, namely
    <name><A>p</A></name> and <name><B>p</B></name>.

  - It's not ambiguous against G6 because formally a grammar has a set
    of productions of the form A --> alpha, where A is a nonterminal
    symbol and alpha is a string of symbols from (N union T)*.  If we
    take ['a'-'z'] as a terminal symbol (which ixml tells us to do),
    then the left and right sides of the choice do not represent two
    different production rules but only one.  So from a formal point of
    view, G6 is just a slightly eccentric way of writing G8:

      {G8} name = ['a'-'z'].  {input: 'p'}

I would rather not require that processors be required to detect the
identity of the two sides of the choice in G6.  I would also not like to
forbid them to do so.

  {G9} name = A; A. A = ['a'-'z']. {input: 'p'}

G9 illustrates the same principle on the level of nonterminals;
formally, there is only one right-hand side for 'name', and the fact
that it is written twice makes no more difference than writing the set
of numbers from one to three as {1, 2, 3, 3} with four not three
expressions within the braces.

There is only one parse tree here (<name><A>p</A></name>) regardless of
which branch of the choice one takes.

  {G10} name = ['a'-'z']+; ['a'-'z'; 'A' - 'Z']+.  {input: 'p'}
  
  {G11} name = A; B. A = ['a'-'z']+. B = ['a'-'z'; 'A' - 'Z']+.  {input: 'p'}

Grammars G10 and G11 illustrate the proposition that it's not *just* a
question of identifying identical right-hand sides:  'p' matches both
left and right sides of the choice, but they are not identical.

G10 also illustrates the fact that for formal language theory,
['a'-'z'] is not in fact a terminal symbol but just a shorthand for a
twenty-six way choice.  Formally, 'name' is defined here as

  name = 'a'.
  name = 'b'.
  ...
  name = 'z'.
  name = 'A'.
  ...
  name = 'Z'.

So the input 'p' is not ambiguous against G10, from a formal point of
view.

The reason I wrote this note was to ask a question of those whose
intuition is that the inputs specified should be regarded as ambiguous
against G1, G6, and G10.  Consider G12:

  {G12} name = ['a'-'z'; 'a'-'z'; 'A' - 'Z']+.  {input: 'p'}

Here there is only one terminal (as ixml defines the term), not two.

Is your intuition that the input 'p' is ambiguous against grammar G12,
or unambiguous?


Michael

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
Received on Sunday, 12 June 2022 19:07:26 UTC