Re: resolving ambiguity from C. M. Sperberg-McQueen on 2021-04-14 (public-ixml@w3.org from April 2021)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Wed, 14 Apr 2021 17:24:34 -0600
To: John Lumley <john@saxonica.com>
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, Aleksei Guzev <aleksei@guzev.pro>, Tom Hillman <tom@expertml.com>, public-ixml@w3.org
Message-Id: <E0E60160-55A5-4E95-8C15-4BE719E9E5FF@blackmesatech.com>
John seems to me to hit the nail on the head.  An ixml parser should return a parse tree if a parse tree exists; if more than one parse tree exists, it should return one of them, and the ixml spec should not specify which.

On the question of ambiguity detection, if I understand correctly, the current state of play on unrestricted context-free grammars is that the problem is undecidable in the general case, but there are techniques that can be applied, which can produce both false positives and false negatives (i.e. they can flag things that are not in fact ambiguous or they can fail to detect some ambiguities).  I do not know whether the techniques found by a search for "ambiguity detection in context-free grammars” are useful in practice.  The reports are always enthusiastic, but then the reports are written by people who have just managed to invent a new method and get it to work, of course they are enthusiastic.

Implementing the ambiguity detection method reported in [1] is on the someday list for Gingersnap, my small but growing collection of tools for working with ixml grammars [2], but it won’t happen very soon, because writing an XSLT version of Aparecium and working to improve the test-case generation tools in Gingersnap are both ahead of it on my list.

One of the things I find most appealing about ixml as currently specified is the irreducible simplicity of the idea:  give it a context free grammar — *any* context-free grammar — and it will parse the input.  The grammar doesn’t have to satisfy any other constraints — it doesn’t have to be LALR(1) or LL(k) for some small k, or anything of the kind.  The grammar writer doesn’t need to understand, much less deal intelligently, with reports like “21 shift/reduce conflicts, 4 reduce/reduce conflicts”.

It’s not hard (I think) for people to write context-free grammars and get them right (in the sense that they generate the language intended, and the parse trees are useful).  It may be harder to get them completely unambiguous.  I often produce ambiguous grammars if I’m not careful about where whitespace is allowed, by allowing it in too many places.  Since in the languages I’m trying to define whitespace is not significant, any parse tree will do in those cases, and it’s only some misplaced sense of personal pride that makes me want to make the grammars unambiguous.  But in my experience it’s much harder to make a parser be acceptable to yacc or other parser generators, unless one is willing to learn more about the internals of the parsing mechanism than I initially wanted to learn.

Oddly enough, I suppose the simplicity and generality of ixml are in their own way a form of protecting the grammar writer:  ixml protects the grammar writer from having to care about whether the grammar is one that yacc or similar software can deal with or not.  It may be inconsistent of me to want to protect the grammar writer in this way, but not by prescribing a way for dealing with ambiguity.  

Michael

[1] https://cs.au.dk/~amoeller/papers/ambiguity/journal.pdf
[2] https://github.com/cmsmcq/gingersnap


> On 14,Apr2021, at 3:42 PM, John Lumley <john@saxonica.com> wrote:
> 
> Perhaps we are trying too much to protect the grammar writer too much? If you construct a grammar that is ambiguous (and I suspect that is too easy a helphalum-trap into which one falls) then any return that meets the grammar should be acceptable. This implies that you really should aim for an unambiguous grammar, but IXML might not be able to assist (or can it? Can we determIne whether the grammar can be ambiguous?) Or is prohibition on unambiguous grammars too restrictive?
> 
> Sent from my iPad
> 
>> On 14 Apr 2021, at 17:56, C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com> wrote:
>> 
>> Maybe I’ve worked too much with Prolog.  Maybe I’ve read too much Dijkstra.  But I don’t understand why we should expend any effort to eliminate non-determinism in the parsing of ambiguous sentences. 
>> 
>> If as a grammar writer I care which parse tree is produced, then I should write a grammar in which no sentence is ambiguous.  
>> 
>> If I cannot do that — for example because the language I am attempting to describe is inherently ambiguous — then I am probably writing the grammar to illustrate a point about ambiguity in context-free grammars and context-free languages.  And in that case I do not want a grammar notation or a parser that tries to be helpful by defining rules for resolving the ambiguity.  (I can’t prove that there are no languages of practical interest that are inherently ambiguous, but I have not seen any claim that they exist.)
>> 
>> In fact, if we were using the terminology used in XQuery, I would favor saying that if there is more than one parse tree, the rules for choosing which tree is returned to the user should be implementation-dependent.  (In XQuery speak, that means that it may vary from implementation to implementation or moment to moment, and implementations are actively discouraged from documenting how the choice is made.  This is a way of trying to discourage people from relying on implementation-dependent behavior.  It’s different from implementation-defined behaviors, which implementations are required to document.)
>> 
>> If I am the only one who feels this way, I won’t insist that the rules for choice of parse tree should be implementation-defined, when the user wants only one tree.  If everyone else wants it to be implementation-defined, I’ll give in.  But I do resist defining any rules for this case in the spec.  (And Aparecium will, whenever possible, use a random-number generator to choose which tree to return.)
>> 
>> 
>> Michael
>> 
>> 
>>> On 14,Apr2021, at 8:48 AM, Aleksei Guzev <aleksei@guzev.pro> wrote:
>>> If we had the "greedy by default" rule, as regular expression do, then the first path <S><Left>ab</Left><Right>c</Right></S> wins because 'Left' consumed as much as it could. While preventing ambiguous grammars does not seem achievable, the rule could make a grammar more deterministic.
>>> On Wed, 14 Apr 2021 at 14:52, Tom Hillman <tom@expertml.com> wrote:
>>> In fairness, I suggested it nervously, and only then as part of an approach for the input consumption discussion: I would prefer it if we leave the choice open.
>>> I think the answer to your first rule wins question is <S>(<a>a </a>)</S>. That is possibly because recursing into each non-terminal before handling its siblings is second nature to me because it mirrors the document order of XML trees ;)
>>> Tom
>>> _________________
>>> Tomos Hillman
>>> eXpertML Ltd
>>> +44 7793 242058
>>>> On 14 Apr 2021, 01:42 +0100, C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>, wrote:
>>>> On today’s call, Tom suggested that the spec might usefully say which tree should be returned, in case of ambiguity.
>>>> I am nervous about this suggestion; I think that the choice of tree is better left open.
>>>> "First rule wins" seems plausible enough when it applies, but I have not thought enough about it to know whether it always applies, whether it always provides a unique solution, or how to extend the Earley algorithm (or any of the other general algorithms) to keep track of such things in a way that would guarantee the selection of the correct tree. Consider this grammar:
>>>> S: Left, Right.
>>>> Left: "a", "b"; "a".
>>>> Right: "b", "c"; "c".
>>>> If I'm not mistaken, this has two parse trees:
>>>> <S><Left>ab</Left><Right>c</Right></S>
>>>> <S><Left>a</Left><Right>bc</Right></S>
>>>> The first tree uses the first RHS for Left and the second RHS for Right. The second tree is the other way round. Does the "first rule wins" principle give an answer here? I suppose we can say that it's the choice of rules for 'Left' that matters, because 'Left' precedes 'Right'. And I guess that's what you meant.
>>>> But consider:
>>>> S: '(', a, s?, ')'.
>>>> a: s?, 'a', s?.
>>>> s: #20+.
>>>> and the input string "(a )". Which of the parses is preferred by the first-rule-wins rule?
>>>> <S>(<a>a </a>)</S>
>>>> <S>(<a>a</a> )</S>
>>>> Michael
>>>> ********************************************
>>>> C. M. Sperberg-McQueen
>>>> Black Mesa Technologies LLC
>>>> cmsmcq@blackmesatech.com
>>>> http://www.blackmesatech.com
>>>> ********************************************
>> 
>> ********************************************
>> C. M. Sperberg-McQueen
>> Black Mesa Technologies LLC
>> cmsmcq@blackmesatech.com
>> http://www.blackmesatech.com
>> ********************************************
> 

********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
cmsmcq@blackmesatech.com
http://www.blackmesatech.com
********************************************
Received on Wednesday, 14 April 2021 23:24:58 UTC