Re: Fwd: Light-weight streaming PHP library for RDF serialization? from Nicholas J Humfrey on 2013-01-29 (semantic-web@w3.org from January 2013)

From: Nicholas J Humfrey <njh@aelius.com>
Date: Tue, 29 Jan 2013 22:43:15 -0000
To: "Michel Dumontier" <michel.dumontier@gmail.com>
Cc: semantic-web@w3.org
Message-ID: <9ec44c73ab7cd86a54f1bdc6957f00f2.squirrel@www.aelius.com>

Hello Michel,

Hopefully we can work something out.

Which output formats do you need streaming for?
If you were able to stream N-Triples/N-Quads, would that be enough?

Would an API like this work for you?
$stream->writeTriple($subject, $predicate, $object, $graph);
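
As a rough illustration of what such an API could look like, here is a
minimal sketch of a writer that emits each statement as one N-Quads line
as soon as it is received, so nothing accumulates in memory. It is not
EasyRdf code: the class name NQuadsStreamWriter, the optional $graph
argument and the array shape used for literals are all assumptions made
for the example, and IRIs are assumed to be already valid (no IRI
escaping is attempted).

```php
<?php
// Hypothetical sketch of a streaming N-Quads writer (not part of EasyRdf).
class NQuadsStreamWriter
{
    /** @var resource Open, writable stream handle. */
    private $handle;

    public function __construct($handle)
    {
        if (!is_resource($handle)) {
            throw new InvalidArgumentException('Expected an open stream resource');
        }
        $this->handle = $handle;
    }

    /**
     * Write one statement immediately; no graph model is built up.
     *
     * Assumed conventions: $subject, $predicate and $graph are IRI strings
     * (or "_:label" for blank nodes); $object is either such a string or
     * array('value' => ..., 'lang' => ..., 'datatype' => ...) for a literal.
     */
    public function writeTriple($subject, $predicate, $object, $graph = null)
    {
        $line = $this->formatResource($subject) . ' '
              . $this->formatResource($predicate) . ' '
              . $this->formatObject($object);
        if ($graph !== null) {
            $line .= ' ' . $this->formatResource($graph);
        }
        fwrite($this->handle, $line . " .\n");
    }

    private function formatResource($term)
    {
        // Blank nodes pass through unchanged; anything else is treated as an IRI.
        if (strpos($term, '_:') === 0) {
            return $term;
        }
        return '<' . $term . '>';
    }

    private function formatObject($object)
    {
        if (!is_array($object)) {
            return $this->formatResource($object);
        }
        // Escape backslashes, double quotes and common control characters.
        $escaped = strtr($object['value'], array(
            "\\" => "\\\\",
            '"'  => '\\"',
            "\n" => '\\n',
            "\r" => '\\r',
            "\t" => '\\t',
        ));
        $literal = '"' . $escaped . '"';
        if (!empty($object['lang'])) {
            return $literal . '@' . $object['lang'];
        }
        if (!empty($object['datatype'])) {
            return $literal . '^^<' . $object['datatype'] . '>';
        }
        return $literal;
    }
}

// Usage: write to an in-memory stream here; a file handle or php://output
// would work the same way.
$out = fopen('php://memory', 'w+');
$stream = new NQuadsStreamWriter($out);
$stream->writeTriple(
    'http://example.org/s',
    'http://example.org/p',
    array('value' => 'Hello "world"', 'lang' => 'en'),
    'http://example.org/g'
);
$stream->writeTriple('http://example.org/s', 'http://example.org/q', 'http://example.org/o');
rewind($out);
echo stream_get_contents($out);
```

Leaving out $graph yields a plain N-Triples line, so under these
assumptions one method could cover both formats.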


nick.


> Nicholas,
> Bio2RDF scripts [1] process mixed media (flat files, tab files, XML
> files) into RDF [2], and this is coordinated with a common API [3]. Some
> of these files are rather large, so it's important that we periodically
> serialize the contents; otherwise we'll run out of space. Since we only
> very rarely want to query the model, we don't need it in memory, and
> just keep a simple index if necessary. We also need support for
> generating N-Quads. Although I've been thinking of redeveloping my API
> with EasyRdf (while also taking advantage of Composer for dependency
> management and Sami for documentation), streaming and N-Quads support
> are essential requirements.
>
> m.
>
> [1] https://github.com/bio2rdf/bio2rdf-scripts
> [2] https://github.com/bio2rdf/bio2rdf-scripts/wiki
> [3] https://github.com/micheldumontier/php-lib
>
> On Tue, Jan 29, 2013 at 3:42 PM, Nicholas J Humfrey <njh@aelius.com> wrote:
>
>> Hello Denny,
>>
>> Sorry, I am not on the semantic-web mailing list - too many emails for
>> me to take in and not enough time. Stéphane kindly forwarded your email
>> to me.
>>
>>
>> No, it is not currently possible to serialise a triple stream with
>> EasyRdf.
>> I took this decision for a number of reasons:
>>
>> 1) EasyRdf was designed with the BBC's web platform in mind. This
>> typically uses Java (and others) as a 'heavy lifting' service layer and
>> PHP as a lightweight presentation layer. As such, PHP should only have a
>> single page's worth of data to process at a time, so streaming was not
>> an important requirement.
>>
>> 2) At the core of EasyRdf is a graph model object. EasyRdf started off
>> as an object model layer on top of ARC2 (and others). Since ARC2 has
>> had less development work done on it, I have been expanding the number
>> of native parsers and serialisers in EasyRdf. I want to avoid making it
>> overly complex with multiple APIs for doing similar things (!)
>>
>> 3) The HTTP client API that I have been using (based on
>> Zend_Http_Client, which is again what the BBC uses) doesn't support
>> streaming - it loads the full response into memory. Therefore there are
>> fewer benefits in EasyRdf being able to stream triples.
>>
>> 4) I have worked hard to try and make the RDF/XML and Turtle
>> serialisations as pretty as possible - this involves collecting/sorting
>> all the same resources and properties together, so that the document
>> reads well. Otherwise you just end up with a triple-oriented document
>> that reads like N-Triples or TriX. Some implementations (such as
>> Redland) do this within the serialiser itself, but that seemed like an
>> extra overhead when I already had the data organised like that inside
>> the EasyRdf graph object.
>>
>>
>> Having said all of that, some of the serialisers would be fairly easy to
>> convert and I would be willing to look at changing the API in order to
>> help you with your requirements (I am a big fan of Wikidata!). It would
>> also make sense to not have multiple PHP libraries for serialising RDF,
>> with varying quality and features - I think this is one of the reasons
>> why the semantic web hasn't taken off faster.
>>
>>
>> What is your streaming source of triples?
>> Are you serialising direct from the database?
>> Can the database pre-sort subjects and properties, so they are ready to
>> be serialised?
>> Is this for a bulk-export or individual API queries?
>>
>>
>> nick.
>>
>>
>> > ---------- Forwarded message ----------
>> > From: Denny Vrandečić <denny.vrandecic@wikimedia.de>
>> > Date: Tue, Jan 29, 2013 at 11:54 AM
>> > Subject: Light-weight streaming PHP library for RDF serialization?
>> > To: SW-forum <semantic-web@w3.org>
>> >
>> >
>> > Hi,
>> >
>> > is there an actively maintained open source pure PHP library that can
>> > be used to create an RDF serialization from a model?
>> >
>> > It should be able to stream a large number of triples.
>> >
>> > Bonus points if it has no parser or SPARQL processing library as a
>> > dependency, in order to decrease the size of the library (smaller
>> > library = happier code reviewer, lower maintenance costs).
>> >
>> > Cheers,
>> > Denny
>> >
>> > --
>> > Project director Wikidata
>> > Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
>> > Tel. +49-30-219 158 26-0 | http://wikimedia.de
>> >
>> > Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
>> > Registered in the register of associations of the Amtsgericht
>> > Berlin-Charlottenburg under number 23855 B. Recognised as a
>> > non-profit organisation by the Finanzamt für Körperschaften I Berlin,
>> > tax number 27/681/51985.
>> >
>> >
>> >
>> > --
>> > Steph.
>> >
>>
>>
>>
>>
>
>
> --
> Michel Dumontier
> Associate Professor of Bioinformatics, Carleton University
> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest
> Group
> http://dumontierlab.com
>
Received on Tuesday, 29 January 2013 22:43:38 UTC