Re: Please make sure the grammar is directly machine consumable. from stan.devitt@agfa.com on 2005-08-22 (public-rdf-dawg-comments@w3.org from August 2005)

From: <stan.devitt@agfa.com>
Date: Mon, 22 Aug 2005 04:03:34 -0400
To: andy.seaborne@hp.com
Cc: Richard Newman <holygoat@gmail.com>, public-rdf-dawg-comments@w3.org, public-rdf-dawg-comments-request@w3.org, Yosi Scharf <syosi@mit.edu>, Tim Berners-Lee <timbl@w3.org>
Message-ID: <OF41B3E3B0.45C51074-ON85257064.006BEDC9-85257065.002C45BB@agfa.com>
This too is just a comment on the auto-generation of the parser.

For work with Euler Jos and I  have been using a representation of the n3 
grammar that is written  in an XML document structure we have developed to 
 automate the construction and validation of a jjtree grammar.

There are  additional structures in the XML representation of the grammar 
for automatically substituting  groups of tokens  or productions  and  for 
excluding individual entries from such  a group.  It also automatically 
generates   code for visitors to the nodes of the parse tree, and produces 
very readable sorted jjtree grammars.  This has approach been used to 
develop a latex to XML parser, for example, and was originally adopted as 
the hand coding of that grammar had become unmanageable.

XSLT is used to transform this XML grammar into an expanded and sorted 
jjtree grammar which is  in turn transformed into  a javacc grammar 
finally into a java parser complete with a visitor that generates XML 
documents.  Hyperlinked html previews of the grammar are also produced 
with undefined references color coded. 

The resulting parser with the default visitor produces an XML 
representation of the parse tree of the n3 document . There is provision 
for automatically applying  an XSLT stylesheet  to restructure the output 
(one of which turns it back into n3)

The grammar we use differs slightly from the tail-recursive master grammar 
in that we use repetition with semantic lookahead in place of the tail 
recursive parts of the grammar.  We find the resulting parse tree much 
more useful.  It avoids the deeply nested structures caused by  the 
tail-recursions of the original grammar and which usually need to be 
unraveled anyway before having a usable structure.   (Of course we could 
stick with the tail recursion and unravel later it in XSLT :-)  ) 
 
Our heavy reliance on semantic lookahead has had only a modest effect on 
the performance.  It turns out all the lookahead expressions tend to fail 
early except where we are looking for ( for example)  an optional "." at 
the end of a group  such as {a b c. d e f. } .    Even this has been 
manageable.

Auto-generation of the parser from n3 is certainly in our plans.  We would 
generate our XML form of the grammar directly from the n3 grammar, 
systematically transforming the tail recursion into our structure either 
using n3 rules or xslt as is appropriate.

Stan Devitt





"Seaborne, Andy" <andy.seaborne@hp.com>
Sent by: public-rdf-dawg-comments-request@w3.org
08/21/2005 12:38 PM

 
        To:     Richard Newman <holygoat@gmail.com>
        cc:     Tim Berners-Lee <timbl@w3.org>, public-rdf-dawg-comments@w3.org, Yosi 
Scharf <syosi@mit.edu>, (bcc: Stan Devitt/AWKCT/CAN/AGFA/CA/BAYER)
        Subject:        Re: Please make sure the grammar is directly machine consumable.





Richard Newman wrote:
> 
> 
> On 19 Aug 2005, at 04:12, Tim Berners-Lee wrote:
> 
>> Richard,
>>
>> I didn't realize the grammar in the spec is machine-generated.
>> Maybe it should be hand-edited and everything else
>> generated from it.
> 
> 
> I think that would be a good idea from one point of view (mine and 
> yours, certainly!), but we'd have to see what the current maintainers 
> of the SPARQL grammar think.
> 
>> Yosi (on vacation right now) has generated (with a small hand tweak)
>> the CFG grammar in RDF from the spec.   (See sparql* in
>> http://www.w3.org/2000/10/swap/grammar/
>> )  This is in plain BNF (  cfg:mustBeOneSequence properties
>> with nested RDF collections )
>>
>> See the bnf.n3 ontology in that directory as well as
>> the bnf-rules.n3 which go from some forms of ebnf to bnf,
>> also in that directory.
> 
> 
> Very handy (and pretty cool!). As it seems the tools are in place, it 
> would be nice to have a machine-readable 'spec' grammar that could be 
> re-purposed into presentation EBNF, JavaCC, plain BNF, etc. -- this 
> would certainly save me a lot of work whenever the grammar changes!
> 
> It is also nice, in an "eating one's own dog food" way, to have the 
> grammar itself in RDF.
> 
> -R
> 

This is not a response to the comment - just a description of some details 

in case it helps.

The grammar is written using JavaCC, which, while an LL parser generator, 
also provides tools to do LA checking.  JavaCC also provides a text output 

format.

The JavaCC text output is converted to the HTML for the document by a 
script 
although the tokens have to be manually described.  The process is 
converting javacc syntax to the EBNF syntax as described in
http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-notation.

The grammar in javacc is not quite LL(1) (there is a 2 state lookahead at 
the Triples production - related to the optional dots Richard commented 
on). 
  The document grammar is also fed into yacker (a W3C tool) which checks 
for 
conversion to bison/flex (LALR(1)).

There are trade-off between readability by humans and processable by 
machines in the current grammar.  Some people find the weighting towards a 

machine-processable grammar makes the grammar unclear (e.g. the use of 
recursive rules use rather than repetition).

                 Andy
Received on Monday, 22 August 2005 08:04:18 UTC