Re: A tricky grammar from C. M. Sperberg-McQueen on 2022-03-17 (public-ixml@w3.org from March 2022)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Thu, 17 Mar 2022 16:27:48 -0600
To: Dimitre Novatchev <dnovatchev@gmail.com>
Cc: Norm Tovey-Walsh <norm@saxonica.com>, public-ixml@w3.org
Message-ID: <874k3w9puj.fsf@blackmesatech.com>
Dimitre Novatchev writes:

> Sorry to interrupt your group's discussions and probably show lack of
> understanding.

That would, I think, be very difficult for you to do.  Thank you for
your intervention.

> But 30 years ago I had a post-doc scholarship with Masaru Tomita at CMU and
> at that time implemented in C his algorithm for NL parsing -- GLR
> (Generalized LR Parsing).

> This parsing algorithm can handle any NL grammar ambiguities, it even
> produced all possible parses for "Time flies like an arrow", and one of
> these interpreted "time-flies" as the subject. So, "time-flies" like (love)
> an arrow :)

> Maybe you could consider implementing and using Tomita's GLR?

That may be a good idea.  As Tom has mentioned, the spec does refer to
GLR as a possible implementation strategy (which is a change from the
earliest descriptions of invisible XML, which always mentioned Earley
parsing), but as far as I know none of the implementations now being
worked on uses Tomita / GLR parsing.

In considering your mail, I discover that I seem to feel a little guilty
about this, so I have spent a little time just now asking myself why
this is so.

It's not that I haven't heard of GLR parsing.  On the contrary:
somewhere in one of the stacks of partly-read and partly-understood
papers gradually taking over my study there are some recent papers on
variations and refinements of Tomita's parser.  The day may come when I
finish them, understand the algorithm better, and try to implement it.

So far, though, I have not done so.  Maybe it's just that I started by
studying Earley's algorithm and have had my hands full with other
things.  Maybe it's that GLR seems still not to be widely understood.

For what it's worth, though, I have the impression that GLR is harder to
understand than Earley -- although I won't claim that Earley is easy to
understand.  To me it felt like very hard work when I was trying to
understand why I should trust the Earley algorithm and how to express it
in declarative terms.  When reading the papers on GLR, I lost heart at
the point where I saw that the key to the generality of the GLR method
involves something that looked at first glance like simulating a
parallel processor in the parser and maintaining multiple parsing stacks
in parallel.  Apart from the conceptual complexity, it did not look at
first glance like a good match for a purely functional language like
XQuery or XSLT.

Like GLR, Earley parsing can handle ambiguity, so if (as I believe) both
Norm's program and Steven's program are imlementations of the Earley
algorithm, then in principle neither of them was doomed to crash on this
input.  Without having seen the internals of either, I will risk a guess
that the problems they encountered lie not in the recognition of the
sentence (or: in the construction of the chart) but in the extraction of
the parse tree from the chart. It is at that point that any failure to
detect a loop is likely to lead to a runaway process.  And that is also
the point at which Earley waves his hands and essentially leaves the
extraction of the parse tree form as an exercise for the reader ...

One reason for being interested in GLR parsing is the possibility that
on many grammars it may be faster than Earley parsing.  Earley parsing
is known to be cubic in the worst case, but is said to be linear for any
grammar for which a deterministic parsing method would work.  Grune and
Jacobs say, however, in their book Parsing Techniques, that

    if any of the general parsers [i.e. Unger's, Earley's, GLR] performs
    in linear time, it may still be a factor of ten or so slower than a
    deterministic method, due to the much heavier administration they
    need.

They also observe that GLR is harder to implement than Earley.

The crucial point, is probably one Tom made already: new implementations
of ixml are always welcome!

Michael


-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
Received on Thursday, 17 March 2022 22:28:10 UTC