Ixml implementation design review notes

Thinking out loud here, on the occasion of my first unit test passing. Appreciate any thoughts, comments, improvements, corrections, etc. A micro design review of my Rust implementation of ixml.

The sample grammar for this walkthrough is close to as simple as it gets.
    Doc: A, B
    A: “a”
    B: “b”

Influenced by Rust’s strict ownership rules, I’m recording grammars in a way that stores each granular piece once, and only once, and from there refers to immutable pieces by reference. (A form of interning) For clarity, in this message, I’ll use lowercase Roman numerals to reference rules.

Rules from the sample grammar get represented thusly:
Special case EMPTY = Empty()
    i = Nonterm(A)
    ii = Nonterm(B)
    iii = Seq[i, ii]
    iv = LitCharl(“a”)
    v = LitChar(“b”)

In the process of the parse, there will be one additional state that is intermediate (much like a James Clark ‘derivative’ [1]) from rule iii, namely being partway through the sequence. But instead of storing a new rule, it just uses the original rule iii with an additional cursor. So basically Earley dot notation.

In Rust, grammar rules are defined like:

    pub enum Rule {
        Empty,
        LitChar(char),
        LitCharOneOf(SmolStr),
        //LitCharRange(char, char),
        //LitCharUnicodeCat(UnicodeRange),
        //LitStr(SmolStr),
        Nonterm(SmolStr),
        Seq(Vec<RuleRef>),
        OneOf(Vec<RuleRef>),
        //Unmatchable,
    }
As shown by the commented rule types, eventually this set will grow to encompass all the ixml rule types.

[Aside 1: SmolStr is an immutable string that lives on the stack (if < 22 characters) and can be cloned in O(1). ]
[Aside 2: In this first pass, I’m using a fairly heavyweight pass-everything-by-value approach, to focus on getting the logic right. Later, during optimization, I’ll do something more intelligent and Rust-idiomatic with references and/or interior mutability and/or reference-counted storage. ]

Grammar sentences are a name paired with a rule. Thus our sample grammar looks like:
    Doc -> iii
    A -> iv
    B -> v

Processing is divided into Tasks, which are kept in a VecDeque.
 
Using the notation in Steven Pemberton’s XML London 2016 paper, there are 4 types of tasks that happen during parse. For clarity, these are referred to here with capital Roman numerals.
    I: (Completer) Finished rule (find “parent” states at same origin that can produce this rule; compute ‘derivative’ and queue at pos)
    II: (Predictor) Nonterminal rule (dereference and queue downstream rule at pos)
    III: (Scanner) Matched term (compute ‘derivative’ and queue at pos+1)
    IV: Unmatched (pass)

At the moment, I’m handling sequences in their own branch, but the logic is very nearly identical to II. Once things settle down more, I’ll make it pretty.

The parser internal state consists of:

    pub struct HacklesParser {
        queue: VecDeque<Task>,
        trace: HashMap<String, Task>, // cache key is "rulename[origin,pos]" where rulename = str, origin & pos are int
        parentage: HashMap<String, Vec<Task>>, // cache key is "ruleid@origin”
        pos: usize, // input position
}

With some additional work needed when I get to the point of tracking parse forests. And row/col into from the input. And half a dozen other things.

-j

[1] https://relaxng.org/jclark/derivative.html <https://relaxng.org/jclark/derivative.html>

Received on Tuesday, 19 July 2022 19:01:21 UTC