Left-recursive LALR(1) grammar for turtle (ACTION-217) from Sandro Hawke on 2012-12-08 (public-rdf-wg@w3.org from December 2012)

From: Sandro Hawke <sandro@w3.org>
Date: Sat, 08 Dec 2012 10:53:04 -0500
To: Eric Prud'hommeaux <eric@w3.org>, W3C RDF WG <public-rdf-wg@w3.org>
Message-ID: <50C36260.9080401@w3.org>

I believe I've written a strict LALR(1) (aka yacc/bison) grammar for 
Turtle.  It's left-recursive (that is, it doesn't grow the stack) and 
unambiguous.  I haven't done the lexer or a test harness, so I haven't 
actually tested it on the real test suite.   Instead, the lexer just 
uses "words", and I test it with things like "a b c;d e,f".   I hope to 
do the lexer and pass the positive and negative syntax tests, but I 
think what I've got is enough to settle the question we need for 
ACTION-217 and going to CR at the next meeting.

A bit of commentary, while I'm thinking about grammars in specs. 
Basically, there are lots of possible grammars for a language.  In many 
cases, one could write a left-recursive version of a production, which 
would work well in an LALR (bottom-up) parser, or a right-recursive 
version, which would work well in a recursive descent (top-down) 
parser.   Which should we put in a spec?   In either case, there will be 
some parser generators who can't really handle the grammar in the spec.

I implemented this without looking at the spec, so I could keep my brain 
in LALR land.

Jison code below.   Oh, it doesn't do lists, or directives, or a few 
other things that seem irrelevant to ACTION-217.

        -- Sandro

%lex
%%

\s+                   /* skip whitespace */
[a-zA-Z_]+[0-9a-zA-Z_]* return 'WORD'
"."                   return 'DOT'
";"                   return 'SEMI'
","                   return 'COMMA'
"["                   return 'LBRACKET'
"]"                   return 'RBRACKET'
<<EOF>>               return 'EOF'
.                     return 'INVALID'

/lex

%start doc

%%

doc
     : slist EOF
	{ console.log('ended with trailing dot'); }
     | slist stmt EOF
	{ console.log('ended with NO trailing dot'); }
     ;

slist
     : /* nothing */
     | slist stmt DOT
     ;

stmt
     : subject polist
     ;

polist
     : predicate olist
     | polist SEMI
     | polist SEMI predicate olist
     ;

olist
     : object
     | olist COMMA object
     ;

subject
     : term
       { yy.turtle_subject = $1; }
     ;

predicate
     : term
       { yy.turtle_predicate = $1; }
     ;

object
     : term
       { console.log('got triple: ', yy.turtle_subject, yy.turtle_predicate, $1); }
     | bracket_expr
     // | list_expr
     ;

bracket_expr
     : start_bracket_expr polist end_bracket_expr
     ;

start_bracket_expr
     : LBRACKET
       {  if (!yy.bracket_stack) yy.bracket_stack = [];
          yy.bracket_stack.push( [yy.turtle_subject, yy.turtle_predicate] );
	 console.log('pushed to make:', yy.bracket_stack);
	 if (yy.bnode_counter) {
	    yy.bnode_counter++;
	 } else {
	    yy.bnode_counter = 1
          }
	 yy.turtle_subject = '_:u' + yy.bnode_counter;
	 yy.turtle_predicate = undefined;
       }
     ;

end_bracket_expr
     : RBRACKET
       {
          var value = yy.turtle_subject;
          sp = yy.bracket_stack.pop();
          yy.turtle_subject = sp[0];
          yy.turtle_predicate = sp[1];
          console.log('got triple: ', yy.turtle_subject, yy.turtle_predicate, value);
       }
     ;


term
     : WORD   // for now we're not worrying about lexing.
     ;

Received on Saturday, 8 December 2012 15:53:07 UTC