Re: Earley dot notation for X+ and X* from C. M. Sperberg-McQueen on 2022-07-26 (public-ixml@w3.org from July 2022)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Tue, 26 Jul 2022 08:07:21 -0600
To: M Joel Dubinko <micah@dubinko.info>
Cc: public-ixml@w3.org
Message-ID: <87wnc0nefk.fsf@blackmesatech.com>

M Joel Dubinko <micah@dubinko.info> writes:

> In Earley notation, it’s common to use “dot notation”. For example if
> the sequence a b c is partly parsed: a • b c. (It’s actually more
> complicated than this for already-parsed symbols, but that’s possibly
> not germane to this discussion.) Ultimately, in some fashion, the code
> needs to hold a representation equivalent to dot notation.

Yup.

> How do you manage this state with repeat0 / repeat1 expressions? You
> don’t know that you’re done until you try to grab the next item, and
> it fails.

> One option I that you don’t. These expressions can always be
> represented with simpler rules by adding intermediate rules, as in the
> implementation hints section. If this is your implementation path, I’d
> love to hear any details.

As far as I know, all the implementations except Aparecium translate the
user-specified EBNF into an equivalent BNF using either the equivalences
given in the spec or others, so as you conjecture they can use the
conventional Earley items.

> Otherwise, an expression like foo•+ looks odd, and feels more awkward
> to express in code.

For what it's worth, Aparecium turns each production rule into a finite
state automaton, using the Gluschkov construction (as described by Anne
Brüggemann-Klein in one of her papers); the location in the rule is then
given by the identifier for the state, and in the Gluschkov automaton
each basic symbol occurrence in the source regular expression is a
state.

Marking the state by putting a dot after the symbol is unambiguous, and
does not look any odder to me that in conventional Earley notation, but
ymmv.

Michael
-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

Received on Tuesday, 26 July 2022 14:21:12 UTC