Use case

I'm not sure if my use case is ready for summarizing on the wiki, but I
wanted to share it before tomorrow's call. I've added some links to blog
posts and software packages at the end of the email.

The examples I've seen so far on the wiki assume that buckets of RDF
triples can be ordered in time. This is reasonable when working with
real-time data as it is (or was) generated. I'd like to see temporal ordering
as a specific case of a more general situation: processing buckets of RDF
triples in some asynchronous or batched fashion.

This would accommodate the existing use cases mentioned on the wiki while
also allowing more general stream-based processing of triples. We could relax
the requirement that the buckets be reproducibly sequenced if we don't
require reproducibility of intermediate results.

I'd like to see some way to have a SPARQL query stream results as they are
found (a push model) and/or a recommendation for representing sequences of
resources in some optimal way (a pull model). The push model makes more
sense for large numbers of small buckets that don't have individual
long-term identities. The pull model makes some sense when dealing with
scattered resources that have long-term identities.
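
To make the distinction concrete, here is a rough Scala sketch of the two
shapes. The trait names are mine, not anything proposed on the wiki, and
Bucket just stands in for whatever a "bucket of triples" ends up being:

    // Hypothetical interfaces only; nothing here is an existing API.
    trait Bucket

    // Push model: the query engine hands buckets to a callback as they are
    // found, which fits many small buckets with no long-term identity.
    trait PushedResults {
      def foreach(handler: Bucket => Unit): Unit
    }

    // Pull model: results are exposed as a sequence of identifiable
    // resources that the client walks at its own pace.
    trait PulledResults {
      def next(): Option[Bucket]   // None once the sequence is exhausted
    }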

My use cases involve small amounts of data spread across many possible
sources: a kind of diffuse data, as opposed to big data. I'd like
to process search results as they are found instead of waiting to collect
everything in some open-ended fashion. I'm reminded of the old Medline
service from a few decades ago where I could start a search and come back
the next day to see how many articles had been found thus far. A more
useful example might be a software agent that processes results as they are
found and provides some summary on which the user can act to further refine
their question before all of the data has been found.

My interest isn't so much in how to modify SPARQL to allow streaming
queries, but in how to model the parts around the stream processing: how
the stream source provides the buckets of triples, how the processor in the
middle can convert one stream into another stream, and how a stream sink
can "walk" the stream. As long as these can be done without requiring
time-based sequencing (while still allowing it), I think I can make stream
processing work for me.
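
For the sake of the discussion, here is a minimal Scala sketch of those
three pieces as I picture them. The names and interfaces are mine and purely
illustrative; the point is that they say nothing about time-based ordering:

    // Hypothetical component interfaces; buckets may arrive in any order.
    case class Triple(s: String, p: String, o: String)
    case class Bucket(triples: Seq[Triple])

    trait StreamSink {
      def receive(bucket: Bucket): Unit   // called once per bucket
      def end(): Unit                     // no more buckets will arrive
    }

    trait StreamSource {
      def subscribe(sink: StreamSink): Unit
    }

    // A processor is a sink on one side and a source on the other: it
    // consumes one stream of buckets and produces another.
    trait StreamProcessor extends StreamSink with StreamSource

    // Example: a processor that keeps only the triples matching a predicate.
    class FilterProcessor(p: Triple => Boolean) extends StreamProcessor {
      private var downstream: Option[StreamSink] = None
      def subscribe(sink: StreamSink): Unit = downstream = Some(sink)
      def receive(bucket: Bucket): Unit =
        downstream.foreach(_.receive(Bucket(bucket.triples.filter(p))))
      def end(): Unit = downstream.foreach(_.end())
    }

Nothing in these interfaces requires the buckets to be reproducibly
sequenced, but a time-ordered source satisfies them just as well.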

A concrete example of where we could use this in the digital humanities:
collecting annotations about a set of resources from a set of annotation
stores. Streaming the annotations allows for progressive rendering as they
are received.

At MITH, we are working on a digital facsimile edition of Mary Shelley's
*Frankenstein* notebooks. We will allow the public to page through the
notebooks by showing a digital scan of each page with an accompanying
transcription (targeting the end of October). We are using the Shared
Canvas data model (http://shared-canvas.org/) to piece everything together.
One of the benefits of this data model is that we can incorporate
additional sets of annotations that aren't painted onto the canvas but
instead can be scholarly commentary about the physical object or the text
in the work. The annotations can also be links to secondary scholarship.

It would be nice to be able to run queries along the lines of "all of the
annotations targeting canvas A, B, or C" and not have to wait for the full
collection to be found before being able to start receiving the results.

There are a number of ways this can be done: a SPARQL service can push data
back a bit at a time, or the results can be modeled as an RDF list and
pulled by the client as needed (the rdf:first / rdf:rest properties would
act as the backbone pointing to the URLs at which the next results will be
provided). Neither of these should require any significant retooling of the
RDF processing toolchain.
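
As a sketch of what the pull option might look like for the annotation case,
here is how I imagine a client walking a paged result list in Scala. The
query shape (including the oa:hasTarget property) and the fetchPage function
are assumptions of mine; the point is just the lazy walk along an
rdf:first / rdf:rest style backbone.

    // The kind of query whose results we'd like to receive incrementally.
    // (Query shape is illustrative; the canvas URIs are placeholders.)
    val query = """
      PREFIX oa: <http://www.w3.org/ns/oa#>
      SELECT ?annotation WHERE {
        VALUES ?canvas { <urn:canvas:A> <urn:canvas:B> <urn:canvas:C> }
        ?annotation oa:hasTarget ?canvas .
      }
    """

    // One page of results plus the URL of the next page, mirroring an
    // rdf:first / rdf:rest backbone (rdf:rest points at the next page,
    // rdf:nil means we're done).
    case class ResultPage(annotations: Seq[String], next: Option[String])

    // Hypothetical: dereference a page URL and parse it into a ResultPage.
    def fetchPage(url: String): ResultPage = ???

    // Walk the backbone lazily, so rendering can start as soon as the first
    // page arrives rather than after the whole collection is assembled.
    def pages(url: String): LazyList[ResultPage] = {
      val page = fetchPage(url)
      page #:: page.next.map(pages).getOrElse(LazyList.empty)
    }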


I mentioned on the last call that it might be useful to think of stream
processing components as falling into three categories: sources producing
streams, sinks consuming streams, and stream processors consuming one set
of streams and producing another.

I'm mainly interested in the middle processing space. I've written a few
blog posts in the past thinking through what it might look like:

http://www.jamesgottlieb.com/2012/10/car-cdr-cons-oh-my/
http://www.jamesgottlieb.com/2013/08/streams-part-ii/

My colleague, Travis Brown, has written a post connecting the above to
iteratees: http://meta.plasm.us/posts/2013/08/26/iteratees-are-easy/


SEASR (http://www.seasr.org/) is a software product that provides something
along the lines of Yahoo! Pipes; it is built around a stream concept.

I've written a Perl package that does some pipeline processing and could be
adapted to test interfaces:
https://github.com/jgsmith/perl-data-pipeline. I have similar code in
CoffeeScript for use in the browser. Both are proofs of concept rather than
code optimized for production use.

At MITH, we've been developing a triple-store lite (or simple graph
database) for use in the browser as a core part of an application
framework. It's based on MIT's Simile Exhibit code:
https://github.com/umd-mith/mithgrid. It might be useful as part of a
browser-based stream consumer/sink.

I'm also interested in exploring the use of Iteratees and similar
constructs to provide computation on RDF streams that is as transparent as
possible, though I'm more interested in trying this in Scala than Haskell
at the moment.
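
To give a flavor of what I mean, here is a very small iteratee-style sketch
in Scala. It is not the scalaz or Play iteratee API, just the minimal shape:
a consumer of a stream of chunks that can report a result without ever
seeing the whole stream at once.

    // A consumer is either finished with a result, or waiting for more input.
    sealed trait Iteratee[E, A]
    case class Done[E, A](result: A) extends Iteratee[E, A]
    case class Cont[E, A](step: Option[E] => Iteratee[E, A]) extends Iteratee[E, A]

    // Drive an iteratee with a sequence of chunks; None signals end of input.
    def run[E, A](it: Iteratee[E, A], input: Seq[E]): Option[A] = it match {
      case Done(a) => Some(a)
      case Cont(k) =>
        if (input.isEmpty) k(None) match {
          case Done(a) => Some(a)
          case _       => None          // the consumer wanted more input
        }
        else run(k(Some(input.head)), input.tail)
    }

    // Example: count the chunks seen, reporting the total at end of input.
    def counter[E](n: Int = 0): Iteratee[E, Int] = Cont[E, Int] {
      case Some(_) => counter[E](n + 1)
      case None    => Done[E, Int](n)
    }

    // run(counter[String](), Seq("a", "b", "c")) == Some(3)

The same shape should work whether the chunks are buckets of triples, query
result pages, or individual annotations.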

-- Jim
