- From: Henry Story <henry.story@bblfish.net>
- Date: Wed, 29 Feb 2012 22:20:14 +0100
- To: public-rdf-comments@w3.org
- Cc: Alexandre Bertails <bertails@w3.org>
Thanks all for your answers to the questions I put recently on this list.
They helped me finish the Scala parser: it passes all the official
W3C tests (bar one).
For those of you interested, the main code for the parser is here:
https://github.com/betehess/pimp-my-rdf/blob/d64ae11514f4bd8402c0857cb29c203ec821bd67/n3/src/main/scala/Turtle.scala
It is written following very closely the spec - indeed it might seem to be nearly a
statement for statement transposition of the spec's EBNF (just upside down). It is
asynchronous, and should use only as much memory as needed. I am sure there is a lot
more to do on optimising efficiency still, but this is good enough for me right now.
1. EBNF change for '.'
----------------------
There is one change to the spec I would like to argue for. The current EBNF has the
following rules for prefixed names such as foaf:knows
PrefixedName ::= PNAME_LN | PNAME_NS
<PNAME_NS> ::= (PN_PREFIX)? ":"
<PNAME_LN> ::= PNAME_NS PN_LOCAL
<PN_PREFIX> ::= PN_CHARS_BASE ( ( PN_CHARS | "." )* PN_CHARS )?
<PN_LOCAL> ::= ( PN_CHARS_U | [0-9] ) ( ( PN_CHARS | "." )* PN_CHARS )?
My issue is with the definitions of PN_PREFIX and PN_LOCAL. Both of those
are just really nasty, and I don't think they give much value. They are nasty
because one has a rule where you have a number of ( PN_CHARS | "." )* followed by
the same PN_CHARS minus the dot. This is aimed at allowing people to write prefixed
names such as
foaf.duck:quack
but without allowing
foaf.duck:quack.
That last dot is reserved for end of sentences. I spent a lot of time trying
to implement this. Alex Hall wrote that he had trouble with this too:
> FWIW, I had trouble implementing the same PN_PREFIX rule that you cite above using Antlr, and had to use Antlr's predicated production feature to work around the greediness. So I rewrote the rule as:
>
> fragment PN_LOCAL_CHARS : '.' | PN_CHARS ;
> fragment PN_CHARS_SEQ :
> ( ('.' PN_LOCAL_CHARS)=> '.' // '.' is not allowed at the end -- only match them if they're followed by another valid char
> | PN_CHARS )* ;
> fragment PN_PREFIX : PN_CHARS_BASE PN_CHARS_SEQ ;
>
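To make the difficulty concrete, here is a minimal hand-written sketch of the dotted PN_LOCAL rule. It is not the code from Turtle.scala: the object and method names are made up for illustration, and the character classes are cut down to ASCII, which is an assumption made only to keep the example short. The point it shows is that a '.' can only be accepted into the name once a later PN_CHARS has been seen, so the scanner has to run ahead of what it has committed to.

```scala
object DottedNameSketch {
  // ASCII-only stand-ins for the spec's character classes (a simplifying assumption).
  def isPnCharsBase(c: Char): Boolean = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
  def isDigit(c: Char): Boolean = c >= '0' && c <= '9'
  def isPnCharsU(c: Char): Boolean = isPnCharsBase(c) || c == '_'
  def isPnChars(c: Char): Boolean = isPnCharsU(c) || isDigit(c) || c == '-'

  /** Splits `input` into (local name, remaining input) under the dotted PN_LOCAL rule. */
  def scanLocal(input: String): (String, String) =
    if (input.isEmpty || !(isPnCharsU(input.head) || isDigit(input.head))) ("", input)
    else {
      var end = 1 // index just past the last character known to belong to the name
      var i = 1   // scanning position; may run ahead of `end` over pending dots
      while (i < input.length && (isPnChars(input(i)) || input(i) == '.')) {
        // Dots are only committed once a PN_CHARS follows them.
        if (input(i) != '.') end = i + 1
        i += 1
      }
      (input.substring(0, end), input.substring(end))
    }
}
```

With this sketch, scanning "quack." gives the name "quack" and leaves the "." in the remaining input as the end of the sentence, while "a.b.c ." gives "a.b.c". The gap between `i` and `end` is exactly the lookahead that makes the rule awkward in a streaming parser.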
Currently I just disallowed dots in the names, which gave me the very simple rule
lazy val PN_PREFIX = (PN_CHARS_BASE ++ PN_CHARS.many)
I could try to spend time implementing the dotted names, but I'd rather argue against
it. I really doubt that people make much use of dotted names when writing RDF by hand.
I think it can make the Turtle less readable, and it also clashes with the '.' notation
in N3 (though that may have its own problems). I.e. we would just have
<PN_PREFIX> ::= PN_CHARS_BASE PN_CHARS*
<PN_LOCAL> ::= ( PN_CHARS_U | [0-9] ) PN_CHARS*
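As a rough illustration of why the simplified rules are easier to implement, here is a minimal sketch of PN_PREFIX without dots. Again this is not the parser's actual code: the names are hypothetical and the character classes are reduced to ASCII as an assumption. With no dot to worry about, a single greedy pass with no lookahead is enough.

```scala
object SimplePrefixSketch {
  // ASCII-only stand-ins for the spec's character classes (a simplifying assumption).
  def isPnCharsBase(c: Char): Boolean = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
  def isPnChars(c: Char): Boolean =
    isPnCharsBase(c) || c == '_' || c == '-' || (c >= '0' && c <= '9')

  /** Splits `input` into (prefix, remaining input) under PN_CHARS_BASE PN_CHARS*. */
  def scanPrefix(input: String): (String, String) =
    if (input.nonEmpty && isPnCharsBase(input.head)) {
      // Greedily take every following PN_CHARS; no character is ever given back.
      val (tail, rest) = input.tail.span(isPnChars)
      (input.head.toString + tail, rest)
    } else ("", input)
}
```

Scanning "foaf:knows" gives the prefix "foaf" with ":knows" remaining. Under this rule "foaf.duck:quack" stops at "foaf", so the dotted form would simply no longer be a prefixed name.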
2. Fixes to test suite
----------------------
I found a few bugs in the test suites. The diffs can be found here:
https://github.com/betehess/pimp-my-rdf/commits/master/n3-test-suite/src/main/resources/www.w3.org/TR/turtle/tests
I added a test for <> as that caught me out.
3. TODO
-------
The code is open source. I tested it against Jena and Sesame using the framework
https://github.com/betehess/pimp-my-rdf/blob/master/n3-test-suite/src/main/scala/TurtleParserTest.scala
(When testing against Jena there seem to be more bugs, perhaps something related
to bnode creation.)
I am sure this can still be optimised a lot further. But it should be good enough for me
at present. I welcome anyone to try it out and do some speed tests on it, and see
what optimisations can be made.
All the best,
Henry
Social Web Architect
http://bblfish.net/
Received on Wednesday, 29 February 2012 21:20:51 UTC