non blocking RDF/XML and NTriples Parser in Scala (and Java) from Henry Story on 2012-02-10 (public-lod@w3.org from February 2012)

From: Henry Story <henry.story@bblfish.net>
Date: Sat, 11 Feb 2012 00:05:58 +0100
To: WebID XG <public-xg-webid@w3.org>, Linked Data community <public-lod@w3.org>
Message-Id: <011AF8A2-F91B-4843-9492-B194D9356DBC@bblfish.net>

Hi,

I have been working on getting non-blocking parsers to work. The point is that
when you fetch RDF from the web you want to use as few resources as possible. If possible one should
only use a few KB of memory even for files that are 1GB long. Async parsing allows one to have 1000s
of open connections simultaneously on only a few threads, also saving on thread costs (0.5-1MB per
thread). For more on what async parsing allows one to do, see the Jena bug report [1].
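
To make the idea concrete, here is a minimal sketch of push-style parsing. It is
not the code of either parser linked in this message: the names Triple,
ChunkedNTriples, feed and finish are made up for illustration, and the line
handling is a rough simplification of NTriples (no escape handling, no validation
of terms). The caller hands over whatever characters have arrived, complete lines
are turned into triples immediately, and only the unfinished tail of the current
line is kept in memory.

```scala
// Hypothetical illustration only; simplified NTriples handling.
final case class Triple(subject: String, predicate: String, obj: String)

final class ChunkedNTriples(emit: Triple => Unit) {
  // Holds only the characters of the current, still incomplete line.
  private val pending = new java.lang.StringBuilder

  /** Feed whatever characters have arrived. Never waits for more input. */
  def feed(chunk: String): Unit = {
    pending.append(chunk)
    var nl = pending.indexOf("\n")
    while (nl >= 0) {
      val line = pending.substring(0, nl).trim
      pending.delete(0, nl + 1) // drop the consumed line and its newline
      parseLine(line).foreach(emit)
      nl = pending.indexOf("\n")
    }
  }

  /** Call once at end of stream to flush a last line without a newline. */
  def finish(): Unit = {
    val line = pending.toString.trim
    pending.setLength(0)
    parseLine(line).foreach(emit)
  }

  // Assumes subject and predicate contain no whitespace and that the
  // statement ends with "."; blank lines and comments are skipped.
  private def parseLine(line: String): Option[Triple] =
    if (line.isEmpty || line.startsWith("#") || !line.endsWith(".")) None
    else line.dropRight(1).trim.split("\\s+", 3) match {
      case Array(s, p, o) => Some(Triple(s, p, o))
      case _              => None
    }
}
```

A network layer would call feed from its read callback each time bytes arrive on a
connection, so no thread ever sits blocked waiting for the rest of the document,
and memory use per connection is bounded by the longest line rather than by the
file size.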

I got an async RDF/XML parser going last week using Jena, and wrote a full NTriples one too,
this one using a powerful Scala library called nomo. Then this week Alex Bertails published a
Scala library that should allow us to write code for both Jena and Sesame in Scala with very little
overhead. It's called "pimp-my-rdf" [2].

So here are some pointers:

- The RDF/XML parser uses the Jena parser, adapted to be non-blocking:
https://dvcs.w3.org/hg/read-write-web/file/d9c1f87eee55/src/main/scala/cache/WebFetcher.scala
- The NTriples parser, written from scratch, is here:
https://github.com/betehess/pimp-my-rdf/blob/master/n-triples-parser/src/main/scala/Parser.scala

It should not be that difficult to write a Turtle parser next. So hopefully I should have that
working soon too.

Henry

[1] More on the Jena bug report
https://issues.apache.org/jira/browse/JENA-203
[2] https://github.com/betehess/pimp-my-rdf
Btw, notice how simple the RDF model is when expressed in Scala:
https://github.com/betehess/pimp-my-rdf/blob/master/core/src/main/scala/RDF.scala

Social Web Architect
http://bblfish.net/

Received on Friday, 10 February 2012 23:06:28 UTC