Re: RelaxNG derivatives vs Earley parsing from C. M. Sperberg-McQueen on 2022-07-14 (public-ixml@w3.org from July 2022)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Thu, 14 Jul 2022 08:10:54 -0600
To: M Joel Dubinko <micah@dubinko.info>
Cc: public-ixml@w3.org
Message-ID: <871qunoj07.fsf@blackmesatech.com>
M Joel Dubinko <micah@dubinko.info> writes:

> There’s a lot in common between James Clark’s algorithm and Earley
> parsing. In particular, the “dot” notation (a la Wikipedia)

>  (X → α • β, i), consisting of 

> * the production currently being matched (X → α β)
> * the current position in that production (represented by the dot)
> * the position i in the input at which the matching of this production
>   began: the origin position

> Sounds an awful lot like the description of a derivative:
> "The derivative of a pattern p with respect to a node x is a pattern
> for what's left of p after matching x;"

> The portion after the dot IS a derivative, right? Are there subtle
> differences I’m missing?

Is the portion after the dot a derivative?  I think the answer is a firm
"maybe".  Or "sometimes".  Or "it depends on what the meaning of is,
is".

For grammars in BNF, where each production rule has one or more 'alt'
elements, each containing a sequence of terms, placing the dot anywhere
in a sequence divides the sequence into two subsequences, each of which
is a well-formed expression.  If the dot is at the beginning or the end,
at lease one of the subsequences is empty, but that's also a well-formed
sequence of terms.

But if a production rule is

    A = B; C, D, E; F.

and we match a C, which of the following do we have in mind as the dot
notation for the current parsing state?

    A = B; C · , D, E; F.
    A = C · , D, E.

If the first, then the part after the dot is

    , D, E; F

which doesn't look like a well formed expression.  If it's not, then it
cannot be a derivative.  Perhaps it's obvious that the leading comma
should be deleted, but then we have a well-formed expression

    D, E; F

which does NOT describe what remains to be matched: the original does
not foresee an F following a C.

If the second is what we have in mind as the dot notation for the
current parsing state, then what happened to the B and the F?

In the case of production rules that are not in BNF form but take the
form of arbitrary expressions over the grammar's vocabulary, the dot
notation is pretty clearly another way to identify "what remains to be
parsed", but it's also pretty clearly not a derivative in the sense of
being a stand-alone expression that describes the remaining suffix.  For
example, consider the production rule for 'alt' in the ixml spec
grammar:
    
    alt: term**(-",", s).

If we read a term, the dot notation would be (at least, given what I
think is the only plausible way to apply dot notation to EBNF)

    alt: term · **(-",", s).

But the part to the right of the dot is not meaningful as an expression
and is not really the derivative.  I would write the derivative of
term**(-",", s) with respect to term as

    (-",", s), term**(-",", s)



> I’m thinking that Rust, with it’s nice Haskell-like “data class”
> enums, might be able to adapt some of the logic from [1] to make a
> very clean, readable implementation.

I think that might well be so.

At an abstract level (or: *an* abstract level, possibly one of many), an
Earley item can be described as a 4-tuple (x, y, N, L), where x and y
are locations in the input (e.g. integers between 0 and |input|
inclusive), N is a nonterminal in the grammar, and L is a location in
the production rule for N.

From N and L we can see what remains to be parsed and we can, on demand,
produce an expression for that, which is a derivative in the strict
sense.

From N and a sequence of symbols in the vocabulary of the grammar, we
can calculate either a location in the production rule for N or the
derivative of the rule for N with respect to that sequence of symbols.

From N and L, or from a derivative, we can also answer questions like
"what could possibly come next" and "have we in fact already recognized
a complete N?" (i.e. "if nothing else comes next, are we done?").

And so on.  I cannot think of anything you can do with the pair (N, L)
that you can't do with the pair (N, D), where D is a derivative of the
rule for N, with respect to some sequence of symbols.  So while I don't
think that the dot notation *is* a derivative, or vice versa, I think
they are equivalent for all the purposes I have thought of so far.

A longer and perhaps more comprehensible attempt to describe Earley
parsing at an abstract non-procedural level can be found in a paper I
gave at Balisage a few years ago; I'll dig out the reference if you
can't find it.

Michael



-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
Received on Thursday, 14 July 2022 14:40:30 UTC