Re: resolving ambiguity from John Lumley on 2021-04-14 (public-ixml@w3.org from April 2021)

From: John Lumley <john@saxonica.com>
Date: Wed, 14 Apr 2021 22:42:54 +0100
To: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
Cc: Aleksei Guzev <aleksei@guzev.pro>, Tom Hillman <tom@expertml.com>, public-ixml@w3.org
Message-Id: <0A448622-C8CC-4571-9037-FE6EA571B001@saxonica.com>
Perhaps we are trying too much to protect the grammar writer too much? If you construct a grammar that is ambiguous (and I suspect that is too easy a helphalum-trap into which one falls) then any return that meets the grammar should be acceptable. This implies that you really should aim for an unambiguous grammar, but IXML might not be able to assist (or can it? Can we determIne whether the grammar can be ambiguous?) Or is prohibition on unambiguous grammars too restrictive?

Sent from my iPad

> On 14 Apr 2021, at 17:56, C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com> wrote:
> 
> Maybe I’ve worked too much with Prolog.  Maybe I’ve read too much Dijkstra.  But I don’t understand why we should expend any effort to eliminate non-determinism in the parsing of ambiguous sentences. 
> 
> If as a grammar writer I care which parse tree is produced, then I should write a grammar in which no sentence is ambiguous.  
> 
> If I cannot do that — for example because the language I am attempting to describe is inherently ambiguous — then I am probably writing the grammar to illustrate a point about ambiguity in context-free grammars and context-free languages.  And in that case I do not want a grammar notation or a parser that tries to be helpful by defining rules for resolving the ambiguity.  (I can’t prove that there are no languages of practical interest that are inherently ambiguous, but I have not seen any claim that they exist.)
> 
> In fact, if we were using the terminology used in XQuery, I would favor saying that if there is more than one parse tree, the rules for choosing which tree is returned to the user should be implementation-dependent.  (In XQuery speak, that means that it may vary from implementation to implementation or moment to moment, and implementations are actively discouraged from documenting how the choice is made.  This is a way of trying to discourage people from relying on implementation-dependent behavior.  It’s different from implementation-defined behaviors, which implementations are required to document.)
> 
> If I am the only one who feels this way, I won’t insist that the rules for choice of parse tree should be implementation-defined, when the user wants only one tree.  If everyone else wants it to be implementation-defined, I’ll give in.  But I do resist defining any rules for this case in the spec.  (And Aparecium will, whenever possible, use a random-number generator to choose which tree to return.)
> 
> 
> Michael
> 
> 
>> On 14,Apr2021, at 8:48 AM, Aleksei Guzev <aleksei@guzev.pro> wrote:
>> If we had the "greedy by default" rule, as regular expression do, then the first path <S><Left>ab</Left><Right>c</Right></S> wins because 'Left' consumed as much as it could. While preventing ambiguous grammars does not seem achievable, the rule could make a grammar more deterministic.
>> On Wed, 14 Apr 2021 at 14:52, Tom Hillman <tom@expertml.com> wrote:
>> In fairness, I suggested it nervously, and only then as part of an approach for the input consumption discussion: I would prefer it if we leave the choice open.
>> I think the answer to your first rule wins question is <S>(<a>a </a>)</S>. That is possibly because recursing into each non-terminal before handling its siblings is second nature to me because it mirrors the document order of XML trees ;)
>> Tom
>> _________________
>> Tomos Hillman
>> eXpertML Ltd
>> +44 7793 242058
>>> On 14 Apr 2021, 01:42 +0100, C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>, wrote:
>>> On today’s call, Tom suggested that the spec might usefully say which tree should be returned, in case of ambiguity.
>>> I am nervous about this suggestion; I think that the choice of tree is better left open.
>>> "First rule wins" seems plausible enough when it applies, but I have not thought enough about it to know whether it always applies, whether it always provides a unique solution, or how to extend the Earley algorithm (or any of the other general algorithms) to keep track of such things in a way that would guarantee the selection of the correct tree. Consider this grammar:
>>> S: Left, Right.
>>> Left: "a", "b"; "a".
>>> Right: "b", "c"; "c".
>>> If I'm not mistaken, this has two parse trees:
>>> <S><Left>ab</Left><Right>c</Right></S>
>>> <S><Left>a</Left><Right>bc</Right></S>
>>> The first tree uses the first RHS for Left and the second RHS for Right. The second tree is the other way round. Does the "first rule wins" principle give an answer here? I suppose we can say that it's the choice of rules for 'Left' that matters, because 'Left' precedes 'Right'. And I guess that's what you meant.
>>> But consider:
>>> S: '(', a, s?, ')'.
>>> a: s?, 'a', s?.
>>> s: #20+.
>>> and the input string "(a )". Which of the parses is preferred by the first-rule-wins rule?
>>> <S>(<a>a </a>)</S>
>>> <S>(<a>a</a> )</S>
>>> Michael
>>> ********************************************
>>> C. M. Sperberg-McQueen
>>> Black Mesa Technologies LLC
>>> cmsmcq@blackmesatech.com
>>> http://www.blackmesatech.com
>>> ********************************************
> 
> ********************************************
> C. M. Sperberg-McQueen
> Black Mesa Technologies LLC
> cmsmcq@blackmesatech.com
> http://www.blackmesatech.com
> ********************************************
Received on Wednesday, 14 April 2021 21:43:16 UTC