Re: Grammars are not Regexps from Steven Pemberton on 2022-02-07 (public-ixml@w3.org from February 2022)

From: Steven Pemberton <steven.pemberton@cwi.nl>
Date: Mon, 07 Feb 2022 13:26:33 +0000
To: ixml <public-ixml@w3.org>
Message-Id: <1644234515120.3313171841.1813621065@cwi.nl>

So my solution was:


 comments: (comment, s?)+.
 -s: -[" "; #a; #9].


 comment: "(*", content, ")".
 -content: (c*, "*"+)+~["*)"].
 -c: ~["*"].


I consider the interesting bit to be the last "*" in the rule for content 
(the one inside ~["*)"]), which is only there to force the earlier "*"+ to 
match the maximal number of asterisks.
So
 (c*, "*"+)+~["*)"]
finds zero or more non-asterisks, followed by one or more asterisks. If the 
next character is not a closing bracket, it consumes that character and 
does it again.


If I expand the contained rules, it looks like:


 comment: "(*", (~["*"]*, "*"+)+~["*)"], ")".
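
To make this rule easy to experiment with, here is a rough Python 
transcription of the expanded rule as a regular expression. It is only a 
sketch, not an ixml processor. It assumes that "+" followed by ~["*)"] 
means "one or more, separated by" (written "++" in ixml 1.0), and the names 
STEVEN_COMMENT and is_single_comment are made up for illustration.

```python
import re

# Hypothetical transcription of
#   comment: "(*", (~["*"]*, "*"+)+~["*)"], ")".
# assuming "+" followed by a character set is a repetition with separator.
STEVEN_COMMENT = re.compile(
    r"\(\*"                 # opening "(*"
    r"[^*]*\*+"             # first item: non-asterisks, then asterisks
    r"(?:[^*)][^*]*\*+)*"   # separator (neither "*" nor ")"), then another item
    r"\)"                   # closing ")" directly after a run of asterisks
)


def is_single_comment(text: str) -> bool:
    """Return True if the whole of `text` is exactly one comment."""
    return STEVEN_COMMENT.fullmatch(text) is not None
```

Under these assumptions, is_single_comment("(*abc* )(*abc*)") is True, 
because the first ")" follows a space rather than an asterisk, while 
is_single_comment("(*abc*)(*abc*)") is False, because that input is two 
comments.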


Michael's solution is slightly longer:


 comment: '(*', (~['*'] | ('*'+, ~['*)']))*, '*'*, -'*)'.


but has the pleasant property of starting and ending with the comment 
delimiters, meaning you could write:


 comments: (pcomment, s?)+.
 -s: -[" "; #a; #9].
 -pcomment: -'(*', comment, -'*)'.
 comment: (~['*'] | ('*'+, ~['*)']))*, '*'*.


which, on my test set, gives the output:


<comments>
    <comment/>
    <comment>*</comment>
    <comment>**</comment>
    <comment>***</comment>
    <comment>abc</comment>
    <comment>*abc</comment>
    <comment>abc*</comment>
    <comment>abc*abc</comment>
    <comment>*abc*abc</comment>
    <comment>abc**abc</comment>
    <comment>abc*abc*</comment>
    <comment>abc* )(*abc</comment>
    <comment>abc</comment>
    <comment>abc</comment>
</comments>
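
The test inputs themselves are not shown, but the output can be 
approximately reproduced with a small Python sketch. It transcribes 
Michael's rule for comment into a regular expression and assumes that each 
input was simply the output string wrapped in (* and *). The names COMMENT 
and parse_comments are illustrative only. Note that a regular expression 
engine returns the first match it finds, whereas an ixml processor would 
report any ambiguity.

```python
import re

# Rough transcription of
#   -pcomment: -'(*', comment, -'*)'.
#   comment: (~['*'] | ('*'+, ~['*)']))*, '*'*.
# followed by the optional separator s (space, line feed or tab).
COMMENT = re.compile(
    r"\(\*"                      # -'(*'
    r"((?:[^*]|\*+[^*)])*\**)"   # comment body, captured as group 1
    r"\*\)"                      # -'*)'
    r"[ \n\t]?"                  # s?
)


def parse_comments(text: str) -> list[str]:
    """Return the body of every comment in `text`, or raise ValueError."""
    bodies = []
    pos = 0
    while pos < len(text):
        match = COMMENT.match(text, pos)
        if match is None:
            raise ValueError(f"no comment at offset {pos}")
        bodies.append(match.group(1))
        pos = match.end()
    if not bodies:
        raise ValueError("expected at least one comment")
    return bodies


if __name__ == "__main__":
    # Assumed inputs: the first twelve outputs, re-wrapped in delimiters.
    expected = ["", "*", "**", "***", "abc", "*abc", "abc*", "abc*abc",
                "*abc*abc", "abc**abc", "abc*abc*", "abc* )(*abc"]
    source = "\n".join(f"(*{body}*)" for body in expected)
    assert parse_comments(source) == expected
```

Because a body can never contain an asterisk directly followed by ")", each 
comment ends at the first "*)" after its opening delimiter, which is why 
"abc* )(*abc" comes out as a single comment.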




Norm's solution


 comment: -'(*', body, -'*)' .
 -body: ~[")"]* ; ~['*'], [")"] .


is very nice, but fails on one test case:


 (*abc* )(*abc*)


(Also note that [")"] can be simplified to ")".)


but we can fix that with:


 comment: "(*", body, "*)".
 -body: (~[")"];~["*"],")")*. 


which I think wins the prize.
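
The difference between the two versions of body can be checked with 
another hedged Python sketch, again transcribing the rules into regular 
expressions (the names NORM_ORIGINAL, NORM_FIXED and accepts are invented 
here). The original body is either a run of non-")" characters or a single 
two-character pair, whereas the fixed body repeats over both alternatives.

```python
import re

# -body: ~[")"]* ; ~['*'], [")"] .     (original)
NORM_ORIGINAL = re.compile(r"\(\*(?:[^)]*|[^*]\))\*\)")

# -body: (~[")"];~["*"],")")*.         (fixed)
NORM_FIXED = re.compile(r"\(\*(?:[^)]|[^*]\))*\*\)")


def accepts(pattern: re.Pattern, text: str) -> bool:
    """Return True if `text` as a whole is one comment under `pattern`."""
    return pattern.fullmatch(text) is not None
```

Under these assumptions, both patterns accept "(*abc*)", but only 
NORM_FIXED accepts the test case "(*abc* )(*abc*)" as a single comment, 
since the " )" pair can now appear in the middle of the body.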


Steven

Received on Monday, 7 February 2022 13:26:49 UTC