Re: a surprising whitespace-related ambiguity from C. M. Sperberg-McQueen on 2024-03-19 (public-ixml@w3.org from March 2024)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Tue, 19 Mar 2024 07:56:38 -0600
To: "Liam R. E. Quin" <liam@fromoldbooks.org>
Cc: LdBeth <andpuke@foxmail.com>, public-ixml@w3.org
Message-ID: <87h6h2mhnm.fsf@blackmesatech.com>
Thank you.  I think you lay out some of the reasons it can be
challenging to create in ixml (or in any purely grammatical
notation) a grammar that captures exactly the same set of sentences,
and assigns the same (or: structurally similar) parse trees to them,
as existing tools for existing languages of any complexity.

I think there is a general question possibly worth enunciating and
possibly of interest from a computer-science point of view, as well
as a practical specific instance of that question:

(1) Given an ambiguous context-free grammar G and a rule R which
chooses, for any ambiguous sentence in L(G), one of the possible parse
trees, under what circumstances is it possible to create an unambiguous
context-free grammar G' such that (a) L(G') = L(G) and (b) the parse
trees produced by G' are compatible with those produced by G?  (There's
a little hand-waving here, because I don't have a clear definition of
'compatible' -- operationally, what I mean is that without extraordinary
efforts we can derive the same abstract syntax tree from them that we
would derive from the tree selected by R.  But I don't know of any
general formal definition for the mapping from parse trees to ASTs.)

It's clear that for some (G, R) pairs the task is impossible.  Any
inherently ambiguous language will defeat the attempt to create a
suitable G'.  There may be ways of formulating the rule R that will
always defeat any attempt to replicate the results grammatically in G'.

It's also clear that for some (G, R) pairs the task is straightforward.

(2) A more concrete formulation of the question: given a yacc grammar G
with some operator-precedence and -associativity declarations, can we
construct an unambiguous grammar G' that accepts the same sentences and
produces compatible parse trees?

If the only operator declarations are for left-associative binary
operators, then we can clearly construct G' mechanically from G and the
precedence declarations.  If we add unary operators which all bind more
tightly than any of the binary operators, the task is still
straightforward.

When some unary operators bind less tightly than some binary operators,
then can the task be performed or not?  I don't know the answer; I only
know that my attempts so far have failed, either by rejecting some
inputs accepted by yacc and yacc-style parsers or by rendering some
sentences ambiguous.

....

In the case of the comment "--- hi mom", the difficulty is precisely
that of finding a way to say, with purely grammatical means (ie in ixml
without extensions or pragmas), that it must be parsed as a comment and
not as "-" followed by a comment.  (Similar ambiguities could be
constructed with slash, for comments beginning // or bounded by /*
... */.)

Michael

"Liam R. E. Quin" <liam@fromoldbooks.org> writes:

> On Mon, 2024-03-18 at 22:09 -0600, C. M. Sperberg-McQueen wrote:
>> Thank you.  I'm n
>
> Some systems (e.g. lex) resolve ambiguities like - vs. -- with a
> longest matching token rule. But this relies on a separate tokenisation
> pass. This is used e.g. for a +++ b and ---c in the C programming
> language (both legal, and equivalent to a + (++b) and - (--c), where ++
> and -- are unary autoincrement/decrement operators).
>
> On the other hand sock ** shoe; relies on knowing wither "sock" has
> been declared earlier in the input as a variable or as a type, so there
> is also feedback from the parser to the lexer.
>
> Both CSS and XPath have the problem that identifiers can contain
> hyphens, so that space is needed in a subtraction a - b to distinguish
> it from an identifier called "a-b".
>
> Hmm, not sure i’m helping.


-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
Received on Tuesday, 19 March 2024 14:16:09 UTC