Re: design question? or puzzle?

> ...
>>> First, consider a simple grammar for arithmetic expressions:
>>> expression: sum.
>>> sum: product+addop.
>>> product: factor+mulop.
>>> factor: number; identifier; ‘(‘, expression, ‘).
>>> addop: ‘+’; ‘-‘.
>>> mulop: ‘/‘; ‘*’; ‘×’; ‘÷’.
>>> …
>>> An XML representation of this, as an ixml parser would produce it, works nicely to exhibit the structure of an expression like 2 x^3 + 17 x^2 -5x + 7:
> The above grammar neither recognises this expression, nor produces the following serialisation, since it doesn't accept an implicit multiply sign, nor convert x^2 to x x.

True; I was being sloppy.

>> …  you can elide all of the unnecessary bits: you never need product or factor, since they are there for syntactic reasons, not semantic:
> expression: expr. 
> -expr: term; sum.
> sum: term, "+", term+"+".
> -term: factor; prod.
> prod: factor, "×", factor+"×".
> -factor: id; number; "(", expr, ")".
> id: ["a"-"z"]+.
> number: ["0"-"9"]+, (".", ["0"-"9"]+)?.
> would give
> <expression><number>3</number></expression>

After hitting Send on my original mail (amazing to me, how hitting Send suddenly produces spurts of creative ideas on how to solve the problem you’ve just asked other people for help with), I ended up with the following.  It may be possible to automate this, but it might be tricky in the general case.

      -expression:  Xsum.
      -Xsum:  sum; ghost-sum. 
      sum:  Xproduct, addop, Xproduct+addop. 
      -ghost-sum:  Xproduct. 
      -Xproduct:  product; ghost-product. 
      product:  factor, multop, factor+mulop. 
      -ghost-product:  factor. 
      -factor:  number; identifier; ‘(‘, expression, ‘).
      addop:  ‘+’; ‘-‘.
      mulop: ‘/‘; ‘*’; ‘×’; ‘÷’.

This can be derived systematically thus:

1 Every nonterminal in the original grammar that never takes more than one child is marked - (that is expression and factor).

2 For every nonterminal N that can take one or more children, first all right-hand side references to N are changed to read XN, and then the rule for N is split into three rules: one (named N) for the case of two or more children, one (here named ghost-N) for the single-child case, and one (here named XN) which is defined a either N or ghost-N.  

This is just cumbersome enough that in many cases I think I will prefer to parse using the original grammar and then post-process using XSLT (as Liam suggested in the first place).  But if for some reason that’s not a good option, and I really want the parser to produce trees without unit rules, then it’s nice to know that there is a way to do it ‘in the camera’, as Tom put it, rather than only ‘in the darkroom’.


C. M. Sperberg-McQueen
Black Mesa Technologies LLC

Received on Wednesday, 14 April 2021 01:20:20 UTC