Turtle grammar -- LALR conflict free grammar from Andy Seaborne on 2012-12-18 (public-rdf-wg@w3.org from December 2012)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Tue, 18 Dec 2012 17:27:04 +0000
To: RDF-WG <public-rdf-wg@w3.org>
Message-ID: <50D0A768.9020409@epimorphics.com>

tl:dr:
The LC publication grammar can be implemented conflict-free in LALR.  No 
change to the grammar is needed.

------------------------------

I have made an attempt to mock up a Turtle parser using bison and flex 
in order to investigate:

[[ http://www.w3.org/2011/rdf-wg/meeting/2012-12-12#resolution_2
  1 change from prior publication
  2 that the WG wants "<><><>;;;." to be legal
  3 that we will look for a grammar representation which produces no S/R 
conflicts in LL(1) or LALR(1)
]]

The LC text is the rule:

[7] predicateObjectList  
      ::= verb objectList (';' predicateObjectList?)*

and I split that into two pieces so that the (...)* is a separate rule 
and can be left (or right) recursive.  So I think the LC grammar is 
sufficient.  Efficient LALR grammars are known to be hard to read for 
most people because of the counter-intuitive reduction from right [1].

---------------------
predicateObjectList
     : predicate objectList predicateObjectList2
     ;

predicateObjectList2
     :
     | predicateObjectList2 SEMICOLON
     | predicateObjectList2 SEMICOLON predicate objectList
     ;
---------------------

It's not necessary to split the rule because directly as per LC [7] 
because this seems to work:

---------------------
predicateObjectList
     : predicate objectList
     | predicate objectList  semi
     | predicate objectList  semi  predicateObjectList
     ;

semi : SEMICOLON | semi SEMICOLON
---------------------
but is right-recursive which Eric does not like.  Different writing of 
the grammar into LALR have different effects for the same language.


The full mock up is available at

http://people.apache.org/~andy/c-ttl-3.zip

and it has no conflicts.  There are three versions (ttl0.y & ttl1.y & 
ttl2.y) - the extracts above are ttl1.y and ttl2.y.  ttl1 is harder to 
understand but possibly more efficient at scale, depending on the 
details of the LALR generator.

It is not a complete Turtle parser - it skimps on the tokens but does 
cover all the language features.  It even generates triples (except in 
collections because getting the rdf:first/rdf:rest right is distracting 
from the main point), to test the language actually parsed.

It even has the other key LALR feature - only one error message "syntax 
error" without details.  Any real LALR parser would incorporate 
additional rules to capture error conditions.

 Andy

[1]
http://en.wikipedia.org/wiki/LALR_parser#Implementation_issues

[2]
See also: parser generators that it is claimed accept fairly EBNF 
directly: I have not tried any:

http://en.wikipedia.org/wiki/Comparison_of_parser_generators#Deterministic_context-free_languages

Received on Tuesday, 18 December 2012 17:27:33 UTC