two solutions to Steven's challenge

The challenge as I interpreted it:  Write a grammar such that

(1) The following are all accepted as comments:

    <test-input>(**)</test-input>
    <test-input>(***)</test-input>
    <test-input>(****)</test-input>
    <test-input>(*****)</test-input>
    <test-input>(*abc*)</test-input>
    <test-input>(**abc*)</test-input>
    <test-input>(*abc**)</test-input>
    <test-input>(*abc*abc*)</test-input>
    <test-input>(**abc*abc*)</test-input>
    <test-input>(*abc**abc*)</test-input>
    <test-input>(*abc*abc**)</test-input>
    <test-input>(*abc* )(*abc*)</test-input>
    <test-input>(***a*)</test-input>
    <test-input>(**a*)</test-input>
    <test-input>(**a*a*)</test-input>
    <test-input>(**aa*)</test-input>
    <test-input>(*a*)</test-input>

(2) The following are all rejected as comments:

    <test-input>(*abc*)(*abc*)</test-input>
    <test-input>(</test-input>
    <test-input>(*</test-input>
    <test-input>(**</test-input>
    <test-input>(***</test-input>
    <test-input>(**a</test-input>
    <test-input>(**a*</test-input>
    <test-input>(**aa</test-input>
    <test-input>(*a</test-input>
    <test-input>(*a*</test-input>
    <test-input>(*aa</test-input>
    <test-input>a</test-input>
    <test-input>)</test-input>
    <test-input>*)</test-input>
    <test-input>**)</test-input>

The first item on this list is Steven's last example.  I made it a
negative case because it is not one comment, and I did not want to
bother trying to write a grammar for a series of comments interspersed
with non-comment data.

If you wish to solve this challenge independently, stop reading now;
spoiler alert.







My first solution is fairly straightforward:

    comment = '(*', comment-element, '*)'.
    -comment-element = (nonstar | ssbnsc)*, star*.
    -nonstar = ~['*'].
    -star = '*'.
    { star-something, but not star-closeparen }
    -ssbnsc = '*'+, ~['*)'].

Since it's not recursive, it can readily be reduced to a single
production:

    comment = '(*', (~['*'] | ('*'+, ~['*)']))*, '*'*, '*)'.

(Warning: I have not tested the single-production version.  It appears
to expose a bug in Aparecium.  Thank you, Steven!  Either that, or there
is something wrong with my manual simplification that I am not seeing.)


In order to write grammars for things like this, I find myself drawing a
finite-state automaton on paper and turning it into a regular expression
like the RHS of 'comment-element' above.  A simpler approach is to draw
a finite-state automaton and instead of turning it into a regular
expression, just turn it into a regular grammar.  An FSA for
Pascal-style comments is attached as a PNG image:
This can be turned into a regular grammar for comments:

    { Pascal-style comments.
      This is a regular language and does not require context-free power,
      so in fact a simple way to design it is as a finite state automaton,
      which can be transcribed as a regular grammar.
      }

    comment = '(*', q1.

    { At q1, we have just read '(' }
    -q1 = '*', q2
        | ')', qf
        | ~['*'], q3.

    { At q2, we have just read an asterisk. }
    -q2 = '*', q2
        | ')', qf
        | ~['*)'], q3.

    { At q3, we have just read something other than an asterisk or '*)' }
    -q3 = '*', q2
        | ~['*'], q3.

    { At qf, we are done. }
    -qf = {final}.


I find it interesting that Aparecium runs much faster on the first
grammar given than on the regular grammar; I suspect that reflects slow
algorithms in the tree-construction part of my code.

Michael

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

Received on Friday, 4 February 2022 01:22:26 UTC