Turtle parser finished - some comments

From: Henry Story <henry.story@bblfish.net>
Date: Wed, 29 Feb 2012 22:20:14 +0100
Cc: Alexandre Bertails <bertails@w3.org>
Message-Id: <7651BB28-4A18-48CB-828C-F4DAD113AF92@bblfish.net>
To: public-rdf-comments@w3.org
Thanks all for your answers to the questions I put recently on this list. 
They helped me to finish the Scala parser: it passes all the official 
W3C tests (bar one).

For those of you interested, the main code for the parser is here.


It is written following the spec very closely - indeed it might seem to be nearly a
statement for statement transposition of the spec's EBNF (just upside down). It is
asynchronous, and should use only as much memory as needed. I am sure there is a lot
more to do on optimising efficiency still, but this is good enough for me right now.

1. EBNF change for '.'

There is one change to the spec I would like to argue for. The current EBNF has the
following rules for prefixed names such as foaf:knows

  PrefixedName ::= PNAME_LN | PNAME_NS 
  <PNAME_NS> ::= (PN_PREFIX)? ":" 
  <PN_LOCAL> ::= ( PN_CHARS_U | [0-9] ) ( ( PN_CHARS | "." )* PN_CHARS )? 

My issue is with the definitions of PN_PREFIX and PN_LOCAL. Both of those
are just really nasty, and I don't think they give much value. They are nasty
because each rule has a run of ( PN_CHARS | "." )* followed by a final PN_CHARS,
i.e. the same character set minus the dot. This is aimed at allowing people to
write prefixed names such as 


but without allowing 


That last dot is reserved for the end of a sentence. I spent a lot of time trying
to implement this. Alex Hall wrote that he had trouble with this too:

> FWIW, I had trouble implementing the same PN_PREFIX rule that you cite above using Antlr, and had to use Antlr's predicated production feature to work around the greediness. So I rewrote the rule as:
> fragment PN_LOCAL_CHARS : '.' | PN_CHARS ;
> fragment PN_CHARS_SEQ :
>    ( ('.' PN_LOCAL_CHARS)=> '.' // '.' is not allowed at the end -- only match them if they're followed by another valid char
>    | PN_CHARS )* ;
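
To make the difficulty concrete, here is a minimal sketch of one way to match
PN_LOCAL on a string that is already fully in memory: consume greedily, then hand
back any trailing dots. It is not the code of the parser discussed in this
message. The names PnLocalSketch and matchPnLocal are hypothetical, and the
character classes are ASCII-only stand-ins for the spec's PN_CHARS_U and PN_CHARS.
A streaming parser cannot simply back up like this; it would have to hold back
each dot until it has seen the character that follows.

```scala
object PnLocalSketch {
  // ASSUMPTION: ASCII-only approximations of the spec's character classes.
  def isAsciiDigit(c: Char): Boolean = c >= '0' && c <= '9'
  def isPnCharsU(c: Char): Boolean =
    (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_'
  def isPnChars(c: Char): Boolean =
    isPnCharsU(c) || isAsciiDigit(c) || c == '-'

  /** Length of the longest PN_LOCAL match at the start of `s`, or 0 if none.
    * Implements ( PN_CHARS_U | [0-9] ) ( ( PN_CHARS | "." )* PN_CHARS )?
    */
  def matchPnLocal(s: String): Int =
    if (s.isEmpty || !(isPnCharsU(s.head) || isAsciiDigit(s.head))) 0
    else {
      // Greedily take every name character or dot after the first character...
      var end = 1
      while (end < s.length && (isPnChars(s(end)) || s(end) == '.')) end += 1
      // ...then give back trailing dots, since a name must not end in '.'.
      while (end > 1 && s(end - 1) == '.') end -= 1
      end
    }
}
```

With this sketch, matchPnLocal("first.name .") consumes the whole of first.name,
while matchPnLocal("knows.") stops before the final dot and leaves it to
terminate the statement.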

Currently I just disallowed dots in the names, which gave me the very simple rule

 lazy val PN_PREFIX  =  (PN_CHARS_BASE ++ PN_CHARS.many)

I could try to spend time implementing the dotted names, but I'd rather argue against 
it. I really doubt that people make much use of dotted names when writing RDF by hand.
I think it can make the Turtle less readable, and it also clashes with the '.' notation
in N3 (though that may have its own problems). I.e. we would just have

 <PN_LOCAL> ::= ( PN_CHARS_U | [0-9] )  PN_CHARS*
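
As a rough illustration of what this change would accept and reject, the sketch
below writes both forms of the rule as regular expressions and checks a few
names against them. The object and value names are hypothetical, and the
character classes are ASCII-only approximations of PN_CHARS_U and PN_CHARS, so
this only approximates the real grammar.

```scala
object PnLocalRules {
  // ASSUMPTION: [A-Za-z0-9_] stands in for ( PN_CHARS_U | [0-9] ) and
  // [A-Za-z0-9_-] for PN_CHARS; the spec's Unicode ranges are omitted.

  // Current rule: ( PN_CHARS_U | [0-9] ) ( ( PN_CHARS | "." )* PN_CHARS )?
  val currentRule: String = "[A-Za-z0-9_]([A-Za-z0-9_.-]*[A-Za-z0-9_-])?"

  // Proposed rule: ( PN_CHARS_U | [0-9] ) PN_CHARS*
  val proposedRule: String = "[A-Za-z0-9_][A-Za-z0-9_-]*"

  // String.matches requires the whole name to match the pattern.
  def acceptedByCurrent(name: String): Boolean = name.matches(currentRule)
  def acceptedByProposed(name: String): Boolean = name.matches(proposedRule)
}
```

Under these assumptions a plain name such as knows is accepted by both rules, a
dotted name such as first.name is accepted only by the current rule, and a name
ending in a dot is rejected by both.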

2. Fixes to test suite

I found a few bugs in the test suite. The diffs can be found here:

I added a test for <> as that caught me out.


The code is open source. I tested it against Jena and Sesame using the framework 

(When testing against Jena there seem to be more bugs, perhaps something related
to bnode creation.)

I am sure this can still be optimised a lot further, but it should be good enough for me
at present. I welcome anyone to try it out, run some speed tests on it, and see
what optimisations can be made.

	All the best,


Social Web Architect
Received on Wednesday, 29 February 2012 21:20:51 UTC
