Re: Ixml implementation design review notes from C. M. Sperberg-McQueen on 2022-07-20 (public-ixml@w3.org from July 2022)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Tue, 19 Jul 2022 18:31:11 -0600
To: M Joel Dubinko <micah@dubinko.info>
Cc: public-ixml@w3.org
Message-ID: <87lesofw0d.fsf@blackmesatech.com>
M Joel Dubinko <micah@dubinko.info> writes:

> Thinking out loud here, on the occasion of my first unit test
> passing. Appreciate any thoughts, comments, improvements, corrections,
> etc. A micro design review of my Rust implementation of ixml.

Interesting; thank you for this.

> The sample grammar for this walkthrough is close to as simple as it gets.
>     Doc: A, B
>     A: “a”
>     B: “b”

> Influenced by Rust’s strict ownership rules, I’m recording grammars in
> a way that stores each granular piece once, and only once, and from
> there refers to immutable pieces by reference. (A form of interning)
> For clarity, in this message, I’ll use lowercase Roman numerals to
> reference rules.

> Rules from the sample grammar get represented thusly:

Here and below you seem to use 'rule' to refer to what I would call
expressions -- the kinds of thing which can occur on the right-hand side
of a rule.  Of course, there is no requirement that your internal
terminaology match the terms used by others, but referring to

  A; B

as a rule instead of as an expression is going at some point to confuse
me and lead to a discussion in which we talk at cross purposes for a
while before realizing we have misunderstood each other.  

> Special case EMPTY = Empty()
>     i = Nonterm(A)
>     ii = Nonterm(B)
>     iii = Seq[i, ii]
>     iv = LitCharl(“a”)
>     v = LitChar(“b”)

You mention interning, which I take in the sense I learned it for Lisp
and Scheme, which guarantees that two uses of the same identifier will
turn into two references to the same object in memory (unless the
programmer is playing tricks with the environment).  OK.

Are you interning only literals and nonterminals, or will structures
like Seq() and OneOf() also be interned, so that in a grammar like

  S = A, B.
  A = 'a'; ('bc').
  B = 'a'; ('bc').

the production rules for A and B will point to the same OneOf() object?
I'm just wondering about the cost.

> In the process of the parse, there will be one additional state that
> is intermediate (much like a James Clark ‘derivative’ [1]) from rule
> iii, namely being partway through the sequence. But instead of storing
> a new rule, it just uses the original rule iii with an additional
> cursor. So basically Earley dot notation.

> In Rust, grammar rules are defined like:
>
>     pub enum Rule {
>         Empty,
>         LitChar(char),
>         LitCharOneOf(SmolStr),
>         //LitCharRange(char, char),
>         //LitCharUnicodeCat(UnicodeRange),
>         //LitStr(SmolStr),
>         Nonterm(SmolStr),
>         Seq(Vec<RuleRef>),
>         OneOf(Vec<RuleRef>),
>         //Unmatchable,
>     }

> As shown by the commented rule types, eventually this set will grow to
> encompass all the ixml rule types.

I think you may be missing insertions (of the form +"literal), unless
you are expecting to treat them as LitChar().

Am I right to read this as indicating you plan to level the differences
between

  ['a'; 'b'; '0'-'9']

and

  ('a'; 'b'; '0'-'9')

by representing them both as a OneOf() structure?

Hmm.  Now that I think about it, I think your list may also be missing
exclusions; at least, I don't see how to express ~['{}'] using the
structures given.

Michael

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
Received on Wednesday, 20 July 2022 00:50:22 UTC