- From: James Smith <jgsmith@gmail.com>
- Date: Tue, 24 Sep 2013 09:43:09 -0400
- To: "public-rsp@w3.org" <public-rsp@w3.org>
- Message-ID: <CA+yy6fx43_p=Y=kuBUbE9USQfF3CHokRTdhk4VSdZUSJR5a5oQ@mail.gmail.com>
I'm not sure if my use case is ready for summarizing on the wiki, but I wanted to share it before tomorrow's call. I've added some links to blog posts and software packages at the end of the email.

The examples I've seen so far on the wiki assume that buckets of RDF triples can be ordered in time. This is reasonable when working with real-time data as it is or was generated. I'd like to see temporal ordering treated as a specific case of a more general situation: processing buckets of RDF triples in some asynchronous or batched fashion. This would still cover the existing use cases mentioned on the wiki while also allowing more general stream-based processing of triples. We could relax the requirement that the buckets be reproducibly sequenced if we don't require reproducibility of intermediate results.

I'd like to see some way to have a SPARQL query stream results as they are found (a push model) and/or a recommendation for representing sequences of resources in some optimal way (a pull model). The push model makes more sense for large numbers of small buckets that don't have individual long-term identities. The pull model makes sense when dealing with scattered resources that do have long-term identities.

My use cases involve small amounts of data spread across many possible sources: a kind of diffuse data, as opposed to big data. I'd like to process search results as they are found instead of waiting to collect everything in some open-ended fashion. I'm reminded of the old Medline service from a few decades ago, where I could start a search and come back the next day to see how many articles had been found so far. A more useful example might be a software agent that processes results as they are found and provides a summary on which the user can act to refine their question before all of the data has been found.

My interest isn't so much in how to modify SPARQL to allow streaming queries, but in how to model the parts around the stream processing: how the stream source provides the buckets of triples, how the processor in the middle can convert one stream into another, and how a stream sink can "walk" the stream. As long as these can be done without requiring time-based sequencing (while still allowing it), I think I can make stream processing work for me.

A concrete example of where we could use this in the digital humanities: collecting annotations about a set of resources from a set of annotation stores. Streaming the annotations allows for progressive rendering as they are received. At MITH, we are working on a digital facsimile edition of Mary Shelley's *Frankenstein* notebooks. We will allow the public to page through the notebooks by showing a digital scan of each page with an accompanying transcription (targeting the end of October). We are using the Shared Canvas data model (http://shared-canvas.org/) to piece everything together. One of the benefits of this data model is that we can incorporate additional sets of annotations that aren't painted onto the canvas but instead can be scholarly commentary about the physical object or the text in the work. The annotations can also be links to secondary scholarship. It would be nice to be able to run queries along the lines of "all of the annotations targeting canvas A, B, or C" and not have to wait for the full collection to be found before being able to start receiving results.
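To make the push model a bit more concrete, here's a minimal Scala sketch. The AnnotationStream trait, the Annotation case class, and the urn:canvas: identifiers are illustrative placeholders rather than an existing API; the point is only that each annotation is handed to a callback as soon as it is found, so rendering can begin before the full result set exists.

    // Hypothetical push-style interface: annotations are delivered as they are
    // found rather than after the whole result set has been collected.
    case class Annotation(uri: String, targetCanvas: String, body: String)

    trait AnnotationStream {
      // onResult fires once per annotation found across the annotation stores;
      // onDone signals that every store has finished responding.
      def query(canvases: Seq[String],
                onResult: Annotation => Unit,
                onDone: () => Unit): Unit
    }

    object ProgressiveRenderer {
      def render(a: Annotation): Unit =
        println(s"annotation ${a.uri} targets ${a.targetCanvas}: ${a.body}")

      def run(store: AnnotationStream): Unit =
        store.query(
          Seq("urn:canvas:A", "urn:canvas:B", "urn:canvas:C"),
          onResult = render,        // paint each annotation as it arrives
          onDone   = () => println("all annotation stores have responded")
        )
    }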
There are a number of ways this could be done: a SPARQL service could push data back a bit at a time, or the results could be modeled as an RDF list and pulled by the client as needed (the rdf:first / rdf:rest properties would act as the backbone pointing to the URLs at which the next results will be provided; there's a rough sketch of walking such a list at the end of this message). Neither of these should require any significant retooling of the RDF processing toolchain.

I mentioned on the last call that it might be useful to think of stream processing components as falling into three categories: sources producing streams, sinks consuming streams, and stream processors consuming one set of streams and producing another. I'm mainly interested in the middle processing space.

I've written a few blog posts thinking through what this might look like:

http://www.jamesgottlieb.com/2012/10/car-cdr-cons-oh-my/
http://www.jamesgottlieb.com/2013/08/streams-part-ii/

My colleague, Travis Brown, has written a post connecting the above to iteratees:

http://meta.plasm.us/posts/2013/08/26/iteratees-are-easy/

SEASR (http://www.seasr.org/) is a software product that provides a kind of Yahoo! Pipes-style environment; it works on a stream concept. I've written a Perl package that does some pipeline processing and could be adapted to test interfaces: https://github.com/jgsmith/perl-data-pipeline. I have similar code in CoffeeScript for use in the browser. Both are proofs of concept rather than optimized for production use.

At MITH, we've been developing a triple-store lite (or simple graph database) for use in the browser as a core part of an application framework. It's based on MIT's Simile Exhibit code: https://github.com/umd-mith/mithgrid. It might be useful as part of a browser-based stream consumer/sink.

I'm also interested in exploring the use of iteratees and similar constructs to provide computation on RDF streams that is as transparent as possible, though I'm more interested in trying this in Scala than Haskell at the moment.

-- Jim
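A minimal Scala (2.13) sketch of the pull model mentioned above. ListNode and fetchNode are placeholders: fetchNode stands in for "dereference this URL and read the rdf:first / rdf:rest triples of one list node" and is not a real RDF library call.

    // Pull-style consumption of results published as an RDF list whose
    // rdf:rest points at the URL of the next list node.
    case class ListNode(first: String, rest: Option[String]) // rest = None at rdf:nil

    object RdfListWalker {
      // Walk the rdf:first / rdf:rest backbone lazily: each rdf:rest URL is
      // dereferenced only when the consumer asks for the next result.
      def walk(headUrl: String, fetchNode: String => ListNode): LazyList[String] =
        LazyList.unfold(Option(headUrl)) {
          case None => None
          case Some(url) =>
            val node = fetchNode(url)
            Some((node.first, node.rest))
        }
    }

    // Example (hypothetical URL): print results as they are pulled.
    // RdfListWalker.walk("http://example.org/results/head", fetch).foreach(println)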
Received on Tuesday, 24 September 2013 13:43:37 UTC