- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Thu, 14 Jul 2022 08:10:54 -0600
- To: M Joel Dubinko <micah@dubinko.info>
- Cc: public-ixml@w3.org
M Joel Dubinko <micah@dubinko.info> writes: > There’s a lot in common between James Clark’s algorithm and Earley > parsing. In particular, the “dot” notation (a la Wikipedia) > (X → α • β, i), consisting of > * the production currently being matched (X → α β) > * the current position in that production (represented by the dot) > * the position i in the input at which the matching of this production > began: the origin position > Sounds an awful lot like the description of a derivative: > "The derivative of a pattern p with respect to a node x is a pattern > for what's left of p after matching x;" > The portion after the dot IS a derivative, right? Are there subtle > differences I’m missing? Is the portion after the dot a derivative? I think the answer is a firm "maybe". Or "sometimes". Or "it depends on what the meaning of is, is". For grammars in BNF, where each production rule has one or more 'alt' elements, each containing a sequence of terms, placing the dot anywhere in a sequence divides the sequence into two subsequences, each of which is a well-formed expression. If the dot is at the beginning or the end, at lease one of the subsequences is empty, but that's also a well-formed sequence of terms. But if a production rule is A = B; C, D, E; F. and we match a C, which of the following do we have in mind as the dot notation for the current parsing state? A = B; C · , D, E; F. A = C · , D, E. If the first, then the part after the dot is , D, E; F which doesn't look like a well formed expression. If it's not, then it cannot be a derivative. Perhaps it's obvious that the leading comma should be deleted, but then we have a well-formed expression D, E; F which does NOT describe what remains to be matched: the original does not foresee an F following a C. If the second is what we have in mind as the dot notation for the current parsing state, then what happened to the B and the F? In the case of production rules that are not in BNF form but take the form of arbitrary expressions over the grammar's vocabulary, the dot notation is pretty clearly another way to identify "what remains to be parsed", but it's also pretty clearly not a derivative in the sense of being a stand-alone expression that describes the remaining suffix. For example, consider the production rule for 'alt' in the ixml spec grammar: alt: term**(-",", s). If we read a term, the dot notation would be (at least, given what I think is the only plausible way to apply dot notation to EBNF) alt: term · **(-",", s). But the part to the right of the dot is not meaningful as an expression and is not really the derivative. I would write the derivative of term**(-",", s) with respect to term as (-",", s), term**(-",", s) > I’m thinking that Rust, with it’s nice Haskell-like “data class” > enums, might be able to adapt some of the logic from [1] to make a > very clean, readable implementation. I think that might well be so. At an abstract level (or: *an* abstract level, possibly one of many), an Earley item can be described as a 4-tuple (x, y, N, L), where x and y are locations in the input (e.g. integers between 0 and |input| inclusive), N is a nonterminal in the grammar, and L is a location in the production rule for N. From N and L we can see what remains to be parsed and we can, on demand, produce an expression for that, which is a derivative in the strict sense. From N and a sequence of symbols in the vocabulary of the grammar, we can calculate either a location in the production rule for N or the derivative of the rule for N with respect to that sequence of symbols. From N and L, or from a derivative, we can also answer questions like "what could possibly come next" and "have we in fact already recognized a complete N?" (i.e. "if nothing else comes next, are we done?"). And so on. I cannot think of anything you can do with the pair (N, L) that you can't do with the pair (N, D), where D is a derivative of the rule for N, with respect to some sequence of symbols. So while I don't think that the dot notation *is* a derivative, or vice versa, I think they are equivalent for all the purposes I have thought of so far. A longer and perhaps more comprehensible attempt to describe Earley parsing at an abstract non-procedural level can be found in a paper I gave at Balisage a few years ago; I'll dig out the reference if you can't find it. Michael -- C. M. Sperberg-McQueen Black Mesa Technologies LLC http://blackmesatech.com
Received on Thursday, 14 July 2022 14:40:30 UTC