Re: Fwd: Light-weight streaming PHP library for RDF serialization? from Michel Dumontier on 2013-01-29 (semantic-web@w3.org from January 2013)

From: Michel Dumontier <michel.dumontier@gmail.com>
Date: Tue, 29 Jan 2013 16:40:09 -0500
To: Nicholas J Humfrey <njh@aelius.com>
Cc: semantic-web@w3.org
Message-ID: <CALcEXf4JTLo2EpiQjb-5fM7b-d8dc6T+Jn2F4wfp-vwe65CGyw@mail.gmail.com>
Nicholas,

The Bio2RDF scripts [1] convert mixed media (flat files, tab-delimited
files, XML files) into RDF [2], coordinated through a common API [3]. Some
of these files are rather large, so it's important that we periodically
serialize the contents; otherwise we'll run out of space. Since we only
rarely want to query the model, we don't need to hold it in memory and just
keep a simple index if necessary. We also need support for generating
N-Quads. Although I've been thinking of redeveloping my API with EasyRdf
(while also taking advantage of Composer for dependency management and Sami
for documentation), streaming and N-Quads support are essential
requirements.
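
To make the requirement concrete, here is a minimal sketch of the pattern:
buffer statements, write them out as N-Quads every so often, and keep only
a small index of subjects rather than a full model. It is written in Python
purely for illustration and is not the Bio2RDF php-lib code; the names
StreamingNQuadsWriter, flush_every, literal and nquad are hypothetical, and
terms are assumed to arrive already formatted (e.g. '<http://...>').

```python
import sys


def literal(value):
    """Quote a plain string as an N-Quads literal, escaping special characters."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"'


def nquad(subject, predicate, obj, graph=None):
    """Format one statement as a single N-Quads line.

    The graph term is optional; without it the line is a plain N-Triples
    statement, which is also valid N-Quads.
    """
    terms = [subject, predicate, obj]
    if graph is not None:
        terms.append(graph)
    return " ".join(terms) + " .\n"


class StreamingNQuadsWriter:
    """Buffers statements and writes them out every `flush_every` additions."""

    def __init__(self, out, flush_every=1000):
        self.out = out                  # any object with writelines()
        self.flush_every = flush_every  # assumed tuning knob, not from the thread
        self.buffer = []
        self.seen_subjects = set()      # the "simple index": subjects only

    def add(self, subject, predicate, obj, graph=None):
        self.buffer.append(nquad(subject, predicate, obj, graph))
        self.seen_subjects.add(subject)
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        """Write buffered lines to the output and release the memory."""
        self.out.writelines(self.buffer)
        self.buffer.clear()

    def close(self):
        """Flush whatever is left; the caller still owns the output stream."""
        self.flush()


if __name__ == "__main__":
    writer = StreamingNQuadsWriter(sys.stdout, flush_every=2)
    writer.add("<http://example.org/s>", "<http://example.org/label>",
               literal("an example"), "<http://example.org/graph>")
    writer.close()
```

Because each statement is written as an independent line, nothing has to be
sorted or grouped before output, which is what keeps memory use bounded by
the buffer size and the index.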

m.

[1] https://github.com/bio2rdf/bio2rdf-scripts
[2] https://github.com/bio2rdf/bio2rdf-scripts/wiki
[3] https://github.com/micheldumontier/php-lib




On Tue, Jan 29, 2013 at 3:42 PM, Nicholas J Humfrey <njh@aelius.com> wrote:

> Hello Denny,
>
> Sorry, I am not on the semantic-web mailing list - too many emails for me
> to take in and not enough time. Stéphane kindly forwarded your email to
> me.
>
>
> No, it is not currently possible to serialise a triple stream with EasyRdf.
> I took this decision for a number of reasons:
>
> 1) EasyRdf was designed with the BBC's web platform in mind. This
> typically uses Java (and others) as a 'heavy lifting' service layer and
> PHP as a lightweight presentation layer. As such PHP should only have a
> single page worth of data to process at a time - thus streaming was not an
> important requirement.
>
> 2) At the core of EasyRdf is a graph model object. EasyRdf started
> off as an object model layer on top of ARC2 (and others). Since ARC2 has
> had less development work done on it, I have been expanding the number of
> native parsers and serialisers in EasyRdf. I want to avoid making it
> overly complex with multiple APIs for doing similar things (!)
>
> 3) The HTTP client API that I have been using (based on Zend_Http_Client,
> which is again what the BBC uses) doesn't support streaming - it loads the
> full response into memory. Therefore there is less benefit in EasyRdf
> being able to stream triples.
>
> 4) I have worked hard to try and make the RDF/XML and Turtle
> serialisations as pretty as possible - this involves collecting/sorting
> all the same resources and properties together, so that the document reads
> well. Otherwise you just end up with a triple-oriented document that reads
> like N-Triples or TriX. Some implementations (such as Redland) do this
> within the serialiser itself but that seemed like an extra overhead, when
> I already had the data organised like that inside the EasyRdf graph
> object.
>
>
> Having said all of that, some of the serialisers would be fairly easy to
> convert and I would be willing to look at changing the API in order to
> help you with your requirements (I am a big fan of Wikidata!). It would
> also make sense to not have multiple PHP libraries for serialising RDF,
> with varying quality and features - I think this is one of the reasons why
> the semantic web hasn't taken off faster.
>
>
> What is your streaming source of triples?
> Are you serialising direct from the database?
> Can the database pre-sort subjects and properties, so they are ready to be
> serialised?
> Is this for a bulk-export or individual API queries?
>
>
> nick.
>
>
> > ---------- Forwarded message ----------
> > From: Denny Vrandečić <denny.vrandecic@wikimedia.de>
> > Date: Tue, Jan 29, 2013 at 11:54 AM
> > Subject: Light-weight streaming PHP library for RDF serialization?
> > To: SW-forum <semantic-web@w3.org>
> >
> >
> > Hi,
> >
> > is there an actively maintained, open source, pure PHP library that can
> > be used to create an RDF serialization from a model?
> >
> > It should be able to stream a large number of triples.
> >
> > Bonus points if it has no parser or SPARQL processing library as a
> > dependency, in order to keep the library small (smaller library =
> > happier code reviewer, lower maintenance costs).
> >
> > Cheers,
> > Denny
> >
> > --
> > Project director Wikidata
> > Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
> > Tel. +49-30-219 158 26-0 | http://wikimedia.de
> >
> > Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
> > Registered in the register of associations of the Amtsgericht
> > Berlin-Charlottenburg under number 23855 B. Recognised as a charitable
> > organisation by the Finanzamt für Körperschaften I Berlin, tax number
> > 27/681/51985.
> >
> >
> >
> > --
> > Steph.
> >
>
>
>
>


-- 
Michel Dumontier
Associate Professor of Bioinformatics, Carleton University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
http://dumontierlab.com
Received on Tuesday, 29 January 2013 21:40:57 UTC