Turtle parser finished - some comments from Henry Story on 2012-02-29 (public-rdf-comments@w3.org from February 2012)

From: Henry Story <henry.story@bblfish.net>
Date: Wed, 29 Feb 2012 22:20:14 +0100
To: public-rdf-comments@w3.org
Cc: Alexandre Bertails <bertails@w3.org>
Message-Id: <7651BB28-4A18-48CB-828C-F4DAD113AF92@bblfish.net>
Thanks all for your answers to the questions I put recently on this list.
They helped me finish the Scala parser: it passes all the official
W3C tests (bar one).

For those of you interested, the main code for the parser is here:

https://github.com/betehess/pimp-my-rdf/blob/d64ae11514f4bd8402c0857cb29c203ec821bd67/n3/src/main/scala/Turtle.scala

It is written following the spec very closely - indeed it might seem to be nearly a
statement-for-statement transposition of the spec's EBNF (just upside down). It is
asynchronous, and should use only as much memory as needed. I am sure there is still
a lot more to do on optimising efficiency, but this is good enough for me right now.

1. EBNF change for '.'
----------------------

There is one change to the spec I would like to argue for. The current EBNF has the
following rules for prefixed names such as foaf:knows

  PrefixedName ::= PNAME_LN | PNAME_NS 
  <PNAME_NS> ::= (PN_PREFIX)? ":" 
  <PNAME_LN> ::= PNAME_NS PN_LOCAL  
  <PN_PREFIX> ::= PN_CHARS_BASE ( ( PN_CHARS | "." )* PN_CHARS )?
  <PN_LOCAL> ::= ( PN_CHARS_U | [0-9] ) ( ( PN_CHARS | "." )* PN_CHARS )? 


My issue is with the definitions of PN_PREFIX and PN_LOCAL. Both of those
are just really nasty, and I don't think they give much value. They are nasty
because the rule is a greedy run of ( PN_CHARS | "." )* that must then end in a
PN_CHARS, i.e. the same set minus the dot. This is aimed at allowing people to
write prefixed names such as

   foaf.duck:quack 

but without allowing 

   foaf.duck:quack. 

That last dot is reserved for the end of sentences. I spent a lot of time trying
to implement this. Alex Hall wrote that he had trouble with this too:

> FWIW, I had trouble implementing the same PN_PREFIX rule that you cite above using Antlr, and had to use Antlr's predicated production feature to work around the greediness. So I rewrote the rule as:
> 
> fragment PN_LOCAL_CHARS : '.' | PN_CHARS ;
> fragment PN_CHARS_SEQ :
>    ( ('.' PN_LOCAL_CHARS)=> '.' // '.' is not allowed at the end -- only match them if they're followed by another valid char
>    | PN_CHARS )* ;
> fragment PN_PREFIX : PN_CHARS_BASE PN_CHARS_SEQ ;
> 
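
The lookahead idea in the quoted rule can be sketched with Scala's standard
regex support. This is only an illustration, not the code in Turtle.scala: the
character classes are ASCII-only stand-ins for PN_CHARS_BASE and PN_CHARS (the
spec's classes also cover many Unicode ranges), and the names
DottedPrefixSketch, PnPrefixDotted and lexPrefix are made up for this example.
Note that the lookahead here skips over a run of dots before requiring a
PN_CHARS, which goes slightly beyond a one-character check.

```scala
import scala.util.matching.Regex

object DottedPrefixSketch {
  // ASCII-only stand-ins for the spec's character classes (assumption:
  // the real PN_CHARS_BASE and PN_CHARS also include Unicode ranges).
  val PnCharsBase = "[A-Za-z]"
  val PnChars     = "[A-Za-z0-9_-]"

  // A '.' is consumed only if a PN_CHARS eventually follows it
  // (possibly after more dots), so a match can never end on a dot.
  val PnPrefixDotted: Regex =
    s"$PnCharsBase(?:$PnChars|\\.(?=\\.*$PnChars))*".r

  // Returns (matched prefix, remaining input), or None if nothing matches
  // at the start of the input.
  def lexPrefix(input: String): Option[(String, String)] =
    PnPrefixDotted.findPrefixOf(input).map(m => (m, input.drop(m.length)))
}
```

With these definitions, DottedPrefixSketch.lexPrefix("foaf.duck.") yields
Some(("foaf.duck", ".")): the final dot is left in the input for the
end-of-sentence rule. The cost is that the lexer has to look ahead past each
dot before deciding, which is awkward in a parser that consumes its input
incrementally.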

For now I have simply disallowed dots in the names, which gives me the very simple rule

 lazy val PN_PREFIX  =  (PN_CHARS_BASE ++ PN_CHARS.many)

I could try to spend time implementing the dotted names, but I'd rather argue against
it. I really doubt that people make much use of dotted names when writing RDF by hand.
I think they can make Turtle less readable, and they also clash with the '.' notation
in N3 (though that may have its own problems). I.e. we would just have

 <PN_PREFIX> ::= PN_CHARS_BASE  PN_CHARS*
 <PN_LOCAL> ::= ( PN_CHARS_U | [0-9] )  PN_CHARS*
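
To make the trade-off concrete, here is a minimal sketch of the simplified
PN_PREFIX rule as a single greedy loop with no lookahead. As before, this is an
illustration rather than the parser's actual code: the character classes are
ASCII-only stand-ins for the spec's, the names SimplePrefixSketch,
PnPrefixSimple and lexPnameNs are made up for this example, and the empty
prefix allowed by PNAME_NS is left out for brevity.

```scala
object SimplePrefixSketch {
  // PN_CHARS_BASE PN_CHARS*, with ASCII-only stand-ins for both classes
  // (assumption: the spec's classes also include Unicode ranges).
  val PnPrefixSimple = "[A-Za-z][A-Za-z0-9_-]*".r

  // Returns the PNAME_NS token (prefix plus ':') at the start of the input,
  // or None if the greedy prefix is not immediately followed by ':'.
  def lexPnameNs(input: String): Option[String] =
    PnPrefixSimple
      .findPrefixOf(input)
      .filter(p => input.startsWith(":", p.length))
      .map(_ + ":")
}
```

Under this rule "foaf:knows" still gives Some("foaf:"), while
"foaf.duck:quack" gives None, because the prefix stops at "foaf" and the next
character is a dot rather than a colon. That is exactly the class of names the
proposal gives up.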

2. Fixes to test suite
----------------------

I found a few bugs in the test suites. The diffs can be found here:
   
https://github.com/betehess/pimp-my-rdf/commits/master/n3-test-suite/src/main/resources/www.w3.org/TR/turtle/tests

I added a test for <> as that caught me out.

3. TODO
-------

The code is open source. I tested it against Jena and Sesame using the framework 
https://github.com/betehess/pimp-my-rdf/blob/master/n3-test-suite/src/main/scala/TurtleParserTest.scala

(When testing against Jena there seem to be more bugs, perhaps something related
to bnode creation.)

I am sure this can still be optimised a lot further, but it should be good enough for me
at present. I welcome anyone to try it out, do some speed tests on it, and see
what optimisations can be made.

	All the best,

		Henry



Social Web Architect
http://bblfish.net/
Received on Wednesday, 29 February 2012 21:20:51 UTC