Re: Is this ambiguous? from C. M. Sperberg-McQueen on 2022-02-14 (public-ixml@w3.org from February 2022)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Sun, 13 Feb 2022 19:34:51 -0700
To: Bethan Tovey-Walsh <accounts@bethan.wales>
Cc: Norm Tovey-Walsh <norm@saxonica.com>, public-ixml@w3.org
Message-ID: <87wnhych10.fsf@blackmesatech.com>
Bethan Tovey-Walsh writes:

> I would tentatively suggest we consider saying that implementations
> must report that there is more than one possible valid vxml
> output. That wouldn’t require committing to a single definition of
> ambiguity, and would (I suspect) be the most useful information for
> users. I don’t imagine a user being particularly interested in the
> intermediate (and potentially ambiguous) states their grammar goes
> through in order for my implementation to parse an input. I do imagine
> their being interested in knowing that the vxml output they’ve
> received is one amongst a larger number of possible outputs.

I think that amounts to saying that what must be reported is that there
is more than one XML document (wave hands here in the direction of the
canonicalization spec) that reflects the input string and the grammar
and markings in the input grammar.

I agree that that is likely to be the easiest to explain.

I see three complications that make me reluctant to adopt this approach
in the spec right now.

1 One motive for reporting ambiguity is that it tends to slow down an
Earley parser.  It does so whether the ambiguity results in a different
XML output or not, so reporting ambiguity only when it affects the XML
may be suboptimal.

2 Another motive is that ambiguity may be a symptom of something wrong
in the grammar.  That too is a question of the raw parse tree (what you
would get if you had no marks in the grammar), not the XML output.

On the other hand: I can alway inform the user of the ambiguity anyway,
using a different signal.  And even if I don't, the user who wants to
know whether a sentence is unambiguous against the grammar without marks
can simply remove all the marks from their grammar.  So neither 1 nor 2
seems conclusive.

3 I conjecture that if a processor creates a data structure representing
a shared parse forest, or a parse forest grammar, then it will be
possible to tell by inspection of the data structure / parse-forest
grammar whether there is more than one XML representation for the output
or not.  If that's so, then any implementation which builds such a data
structure (like nineml) or constructs a parse-forest grammar describing
the result (as Aparecium will do, someday, if the gods are willin' and
the creek don't rise) will have no problem with an 'ambiguity' rule
focus on the abstract syntax tree instead of on the parse tree.  (I put
scare quotes around it because when applied only to ASTs it is no longer
ambiguity as I normally understand the term.)

But what I have is a conjecture, not a proof.

And nothing in the spec currently requires implementations to create
such data structures.  For other ways of implementing ixml, I don't
think we have even a conjecture that the problem can be solved without
materializing all of the possible parses of the input as XML documents
and comparing them for equality, until it's known whether there is more
than one XML representation.  If there are 48 parses, we don't have to
compare all 48 trees pairwise:  we only have to compare the first two,
to see if they are different.  If they are, then we know the sentence
has multiple XML representations.  If they aren't, then we need to
compare either the first or the second (it doesn't matter which, since
they are the same structurally) to the third, and so on.  As soon as we
find a tree that differs from the first tree, we are done.  As soon as
we reach the end of the set of trees, we are done.  The problem is,
there is no guarantee that either of those events will ever happen.

Those not deeply interested in this question may stop reading now; I
have made my point.  What follows is a technical excursus on an example.

- Michael


Excursus: detection that an ambiguous sentence has a unique XML
representation

Consider the grammar

  S = A.
  A = A; B.
  B = '◎'.

I believe this grammar defines a language with one sentence: {'◎'}.  For
that sentence, there are a countably infinite number of parse trees,
with one A, with two, with three, and so on.  One for every positive
integer.  If we replace the second rule with

  -A = A; B.

then I believe that each of those parse trees reduces to the same XML,
namely <S><B>◎</B></S>.  I hope it's clear that they do.  What I do not
believe is that it will be equally clear to a routine in a programming
language tasked with taking the set of Earley items generated during
recognition of the sentence and generating parse trees.

In fact, thinking about this concrete example, I'm not entirely certain
I know how to make a processor see that all the XML representation are
the same by inspection of the parse forest grammar.  That grammar might
look like this:

  S·0·1 = A·0·1.
  -A·0·1 = A·0·1; B·0·1.
  B·0·1 = '◎'.

In a parse forest grammar, each nonterminal has three parts:  the first
part of the symbol is a nonterminal symbol from the original grammar,
the second is a starting position in the input, and the third is an
ending position in the input.  The utility of the labels is easier to
see in examples where the parse forest grammar is not exactly the same
as the original grammar but with funny affixes. 
 
The existence of two right-hand sides for A·0·1 tells us the sentence is
ambiguous.  The fact that there is a cycle from A·0·1 to A·0·1 tells us
the sentence is infinitely ambiguous.  The conjecture I mentioned
earlier says we (and more importantly a program) can also see from this
grammar that there is only one XML representation of the sentence; when
I try to say how that works, however, I am not confident I know how to
write the test.

Here's an attempt; let's see if it works for this case.

1 Take the closure of the parent/child relation among nodes in the parse
trees generated the first and the second right-hand sides of the
ambiguous nonterminal, preserving the effective markings (thus treating
N and -N and @N as three different symbols).

In the example that gives us:

  (first alt): { -A·0·1, B·0·1, substring($input, 1, 1) }
  (second alt): { B·0·1, substring($input, 1, 1) }
  
2 From each set, remove all symbols marked hidden.

In the example that gives us:

  (first alt): { B·0·1, substring($input, 1, 1) }
  (second alt): { B·0·1, substring($input, 1, 1) }

3 If the two sets are the same, the XML representations will be the
same.

Informally:  the nonterminals and terminals not marked hidden are those
which are going to show up in the XML representation.  If they are the
same, the XML representation will be the same.

Unfortunately, that last sentence is not true.  Here is a counterexample.

  S = A; B.
  -A = C.
  -B = D.
  C = D.
  D = C; 'x'.

In the parse forest grammar, every nonterminal will have ·0·1 appended
to its name; since they are all the same in this example, they convey no
information and I'll spare myself the effort.  S has two right-hand
sides, each of which generates the same set of non-hidden nodes: { C, D,
substring($input, 1, 1) }.  But <S><C><D>x</D></C></S> and
<S><D><C>x</C></D></S> aren't the same, even though they have the same
three node labels.

In this case, I think we can see that the grammar generates at least two
different trees because the two right-hand sides of D generate two
different sets of non-hidden nodes.

A third example:

S = A; B.
-A = C, D.
-B = D, C.
C = .
D = .

Here only S is ambiguous, and its two RHS generate the same set of
labeled nonterminals but not the same XML tree.

There may be a way to prove the conjecture, but so far what I've learned
this evening is that the method I had in mind can suffice to prove that
there are at least two different XML trees, but not, in general, to
prove that there is only one.  It might work on some subset of
parse-forest grammars.


Summing up: I can see the appeal of defining the ambiguity-symbol as
applying when there is more than one possible XML output for a sentence.

But I don't want the ixml spec to require that processors be able to
tell for certain whether a given sentence will generate more than one
different XML representation, because I do not know how to do it for all
cases.

Michael



> But there’s a good chance I’m overlooking something obvious which will make this suggestion unworkable. 
>
>
>
>> On 13 Feb 2022, at 23:03, C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com> wrote:
>> 
>> 
>> Norm Tovey-Walsh writes:
>> 
>>> Consider this grammar:
>> 
>>> list: word + -',' .
>>> word: c, v, c ; c, v, v .
>>> -c: ["bcdfghjklmnpqrstvwxyz"] .
>>> -v: ["aeiouy" ].
>> 
>>> and this input: “hey,bee”.
>> 
>>> By one reckoning, that’s ambiguous: “hey” can be either cvc or cvv
>>> because I’ve identified “y” as both a consonant and a vowel.
>> 
>>> But “c” and “v” are both elided from the output, so the generated XML
>>> is identical for both parses. By that reckoning, it isn’t ambiguous.
>> 
>> Nice example.
>> 
>>> I expect our intent is that it *isn’t* ambiguous…but I thought I’d
>>> check.
>> 
>> For what it's worth ...
>> 
>> If we as a group have an intent here, I don't know what it is.  Issue 26
>> asks how we wish to define "ambiguity", but the upshot of the discussion
>> starting from your message of 5 January [1] seems to me to have been
>> only that some members of the CG do not wish to define it; without a
>> definition of what counts as ambiguity, I don't think we can have a
>> coherent intent.
>> 
>> If I had to predict what the CG will end up doing, my money would be on
>> leaving the effective definition of ambiguity implementation-defined and
>> specifying that processors MAY report ambiguity, rather than MUST, or
>> possibly specifying that IF processors detect ambiguity they MUST report
>> it (modulo user option to suppress the ambiguity flag in the output),
>> but not requiring that they detect ambiguity whenever it exists.  (I
>> predict there will be CG members who would like to require the detection
>> of ambiguity, but without a crisp definition of ambiguity that
>> requirement lacks teeth.)
>> 
>> One difficulty is that we have already seen that different
>> implementations of ixml use different underlying parsing methods, even
>> when the implementors all say they are using Earley parsing.  But:
>> 
>> - Implementations that parse using the ixml grammar directly and those
>>  which translate the ixml grammar to BNF for parsing are working with
>>  different grammars; their raw parse trees will differ and ambiguity in
>>  one doesn't always mean ambiguity in another.
>> 
>> - Implementations which translate the ixml grammar to BNF will not
>>  necessarily use the same translations -- the obvious requirement is
>>  that the grammar be equivalent and allow the construction of the XML
>>  abstract syntax tree, and there is more than one BNF that meets those
>>  criteria.
>> 
>> - I think that some ways of recording parsing results may make it easy
>>  to see whether there is more than one XML AST for a given sentence,
>>  but I'm not sure that's true for every possible approach.
>> 
>> We don't want to constrain the internals of any implementation, and we
>> want interoperability, and we want ambiguity to be flagged.  I don't
>> think all of those three can be combined in their pure form; we are
>> going to have to weaken one or more of them.
>> 
>> My two cents.
>> 
>> Michael
>> 
>> -- 
>> C. M. Sperberg-McQueen
>> Black Mesa Technologies LLC
>> http://blackmesatech.com
>> 
>> [1] https://lists.w3.org/Archives/Public/public-ixml/2022Jan/0030.html
>> 


-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
Received on Monday, 14 February 2022 02:35:17 UTC