Re: Side issue - ambiguity from Norm Tovey-Walsh on 2022-06-20 (public-ixml@w3.org from June 2022)

From: Norm Tovey-Walsh <norm@saxonica.com>
Date: Mon, 20 Jun 2022 10:14:01 +0100
To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
Cc: public-ixml@w3.org
Message-ID: <m2zgi7hc9j.fsf@Hackmatack.fritz.box>
"C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com> writes:
> This is not a spec-related issue any more, since the spec is explicit
> that conforming processors are allowed to vary in their flagging of
> ambiguity -- but because I am interested in parsing I continue wishing
> to understand the views of other group members on the topic, whether
> they are illustrated concretely by the behavior of a processor or just
> intuitions.

I’ve been holding back on replying to this message because I don’t
really trust my intuitions. I wanted to see what my processor said. It’s
finally working well enough again to run some tests. :-)

> Consider the following grammars and input strings:
>
>   {G1} S = 'a'*; 'b'*.  {input: empty string}
>
> G1 follows the pattern of the test case that led us to realize that
> different processors produce different results in cases of ambiguity.
> Accepting the empty string can be justified by appeal to either the left
> or the right choice, so it feels inherently ambiguous to some.  On the
> other hand, there is only one parse tree, namely <S/>.

Indeed. My processor does not report ambiguity here. But I’m suspicious
that this might be a bug. Or a feature. The grammar I get is:

$$ ::= S
S ::= $1_La-star
S ::= $2_Lb-star
$1_La-star ::= $3_La-optionⁿ
$2_Lb-star ::= $4_Lb-optionⁿ
$3_La-optionⁿ ::=
$3_La-optionⁿ ::= 'a', $1_La-star
$4_Lb-optionⁿ ::=
$4_Lb-optionⁿ ::= 'b', $2_Lb-star

Which sure looks ambiguous. But I consider those hidden nonterminals to
be “prunable” when I’m cleaning up the parse forest. This reduces the
number of nodes that have to be considered when producing trees.

The structure of ixml and my BNF means I get a lot of long tails of

  s -> $1 -> $2 -> $3 -> ε

and reducing that to

  s -> ε

seemed like a good idea. But I’m suspicious that in this case pruning
removes a choice from the graph:

    +-> $1_La-star -> $3_La-option -> ε
 S -+                       
    +-> $1_Lb-star -> $4_Lb-option -> ε

becomes

 S -> ε

And fails to detect the ambiguity. I’ll have to try turning off the
pruning and seeing what the result is.

(Time passes)

Yes, that’s exactly what happens. So is the pruning a bug or a feature?

>   {G2} S = A; B. A = 'a'*. B = 'b'*.  {input: empty string}
>
>   {G3} S = A; B. A = ; 'a', A. B = ; 'b', B.  {input: empty string}
>
> G2 gives names to the left and right choices in the rule for S.  G3 goes
> further an rewrites the grammar as BNF; it is the BNF to which an
> implementation might choose to reduce G1 or G2.  For both of these
> grammars, there are two parse trees for the empty string: <S><A/></S>
> and <S><B/></S>.  So the sentence is clearly ambiguous.

Yep.

>   {G4} S = A; B. -A = 'a'*. -B = 'b'*.  {input: empty string}

I say that’s ambiguous.

> As a user, I lean toward the latter (easier to understand), as an
> implementor toward the former (easier to implement).

I strongly favor saying this is ambiguous, and not just because that’s
what my parser does! :-) Ambiguity makes the parse forest larger and, as
a user, I want to know there’s ambiguity here because I (may) want to
fix it.

Would it even be possible to conclude that this was unambiguous without
comparing all of the possible trees? I can’t immediately think of a way
to make the parse unambiguous in the “raw” state.

> The fact that ambiguity is not crisply defined for EBNF certainly
> complicates the matter.  But there appears to be more to it.  Consider
> the next two grammars.
>
>   {G6} name = ['a'-'z']; ['a'-'z'].  {input: 'p'}
>
>   {G7} name = A; B. A = ['a'-'z']. B = ['a'-'z'].  {input: 'p'}
>
> G6 and G7 seem to illustrate the same points as G1 and G2, but they are
…
>     different production rules but only one.  So from a formal point of
>     view, G6 is just a slightly eccentric way of writing G8:
>
>       {G8} name = ['a'-'z'].  {input: 'p'}

Yep. My rewrite of G6 produces 

$$ ::= name
name ::= ["a"-"z"]

so there’s no ambiguity. My rewriter doesn’t add identical rules to the
grammar. I think it could add a duplicate rule, but I don’t think it
would be an improvement :-)

> I would rather not require that processors be required to detect the
> identity of the two sides of the choice in G6.  I would also not like to
> forbid them to do so.
>
>   {G9} name = A; A. A = ['a'-'z']. {input: 'p'}
>
> G9 illustrates the same principle on the level of nonterminals;

Same result for my implementation:

$$ ::= name
name ::= A
A ::= ["a"-"z"]

> There is only one parse tree here (<name><A>p</A></name>) regardless of
> which branch of the choice one takes.
>
>   {G10} name = ['a'-'z']+; ['a'-'z'; 'A' - 'Z']+.  {input: 'p'}
>   
>   {G11} name = A; B. A = ['a'-'z']+. B = ['a'-'z'; 'A' - 'Z']+.  {input: 'p'}
>
> Grammars G10 and G11 illustrate the proposition that it's not *just* a
> question of identifying identical right-hand sides:  'p' matches both
> left and right sides of the choice, but they are not identical.

Right. Both ambiguous because there are two different rules for “name”.

> G10 also illustrates the fact that for formal language theory,
> ['a'-'z'] is not in fact a terminal symbol but just a shorthand for a
> twenty-six way choice.  Formally, 'name' is defined here as

Mmmm. I don’t actually implement it that way. And for ~[] it wouldn’t be
*practical* to implement it that way!

> The reason I wrote this note was to ask a question of those whose
> intuition is that the inputs specified should be regarded as ambiguous
> against G1, G6, and G10.  Consider G12:
>
>   {G12} name = ['a'-'z'; 'a'-'z'; 'A' - 'Z']+.  {input: 'p'}
>
> Here there is only one terminal (as ixml defines the term), not two.

I don’t currently attempt to remove the duplicate range, but it makes no
difference:

$$ ::= name
name ::= $1_plus
$1_plus ::= ["a"-"z";"a"-"z";"A"-"Z"], $2_star
$2_star ::= $3_optionⁿ
$3_optionⁿ ::=
$3_optionⁿ ::= ["a"-"z";"a"-"z";"A"-"Z"], $2_star

I’m matching against that range and ‘p’ matches and there’s no
ambiguity.

(I did consider checking for duplicate ranges and avoiding them, but I
decided against it. They only arise if an author explicitly writes them,
which is unlikely, and there are no bugs in code that remains
unwritten.)

> Is your intuition that the input 'p' is ambiguous against grammar G12,
> or unambiguous?

I can see how it might be, if you expand name and allow duplicates,
you’d get name=p twice. But since that’s a completely impractical way to
implement it, I think it’s unlikely to be regarded as ambiguous even if
g10 is.

Alternatively, I suppose I could rewrite

 name = ['a'-'z'; 'a'-'z'; 'A' - 'Z']+.

as 

 name = (['a'-'z']; ['a'-'z']; ['A' - 'Z'])+.

but then I’d collapse the two [a-z] rules and it still wouldn’t be
ambiguous because “p” doesn’t match [A-Z].

Hope that’s helpful.

                                        Be seeing you,
                                          norm

--
Norm Tovey-Walsh
Saxonica
Received on Monday, 20 June 2022 10:15:23 UTC