Re: Scale issues: streaming, latency, whole request validation

On 09/08/2013 06:23 AM, Andy Seaborne wrote:
> Streaming is valuable at scale because changes can start to be
> applied as the data arrive, rather than buffering the changes until
> the end of the change is seen and only then applying them.  For large
> changes, this also impacts latency.  Doing some or all of the changes
> as data arrives gives an overlap in processing between sender and
> receiver.
>

Hm.

Is this streaming visible to the outside world?  That is, might another
client see some evidence of the patch being processed before it's
completed?  If so, then of course we're giving up atomicity (and thus
also consistency and isolation; cf. ACID), which could be a problem for
some applications (to put it mildly).  Is that something we want to do?

(I'm not sure what you mean by "latency".  Perhaps you mean the delay 
before one sees the first evidence of the patch being processed; perhaps 
you mean the time until the patch processing is complete.)

Are there advantages to streaming that can't be obtained by using a 
sequence of smaller patches, perhaps aggregated into one file? (I've 
been thinking of this as a "multipatch".)    With a multipatch, each 
sub-patch is atomic, but the overall set is not, so the client can 
decide how much atomicity it needs.
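
To make that concrete, here's a sketch of what a multipatch file
might look like.  (Pure invention on my part: the BEGIN/END
delimiters are hypothetical, and I'm borrowing the A/D add/delete
row style from Andy's rdf-patch draft.)

    # Sub-patch 1: atomic on its own
    BEGIN
    D <http://example/book1> <http://purl.org/dc/terms/title> "Old title" .
    A <http://example/book1> <http://purl.org/dc/terms/title> "New title" .
    END
    # Sub-patch 2: independent of sub-patch 1
    BEGIN
    A <http://example/book2> <http://purl.org/dc/terms/creator> "Alice" .
    END

A receiver applies each BEGIN...END block atomically; if the stream
dies between blocks, the store is still in a consistent state, just
not fully up to date.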

Alternatively, a purely-streaming system could have explicit transaction 
controls, with a begin-transaction command and a commit command.
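
Something like this, perhaps (again just a sketch; the TB/TC row
names are made up here, not taken from any proposal):

    TB .                                            # begin transaction
    A <http://example/s> <http://example/p> "v1" .
    D <http://example/s> <http://example/q> "v0" .
    TC .                                            # commit

The receiver could still apply rows into an uncommitted transaction
as they arrive, and only make them visible at TC.  That keeps the
sender/receiver overlap without giving up isolation.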

I know SPARQL has avoided requiring ACID.   I see from 
http://answers.semanticweb.com/questions/11523/acid-in-triple-stores 
that you were testing it in Jena's TDB.  How did that work out?

> Several proposals need the complete patch request to be seen before it 
> can start to be processed.
>
> Any format (e.g. Talis ChangeSets) that is RDF or TriG can't make any
> assumptions about the order of triples received.  In practice, a
> changeset must be parsed into memory (standard parser), validated
> (patch-format-specific code) and applied (patch-format-specific code).
> There is some reuse of a common parser, but validation has to be done
> on top.
>
> These are limitations at scale, where "scale" means most of, or more
> than, the available RAM.
>
> This may be acceptable: for any format that is a restriction of
> SPARQL, it may be desirable to check that the whole request is in the
> required subset before proceeding with changes (e.g. when no true
> transaction abort is available).
>
> The bnodes issue and the scalability concerns are what motivated:
>
> http://afs.github.io/rdf-patch/

So, yeah, how is that different operationally from a sequence of 
1-triple patches in any of the other languages?
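
For comparison, here's the same two-row change expressed as (a) a
minimal SPARQL Update request and (b) rdf-patch rows, as I read the
draft (apologies if I've mangled the row syntax):

    # (a) SPARQL Update
    DELETE DATA { <http://example/book1> <http://purl.org/dc/terms/title> "Old" } ;
    INSERT DATA { <http://example/book1> <http://purl.org/dc/terms/title> "New" }

    # (b) rdf-patch
    D <http://example/book1> <http://purl.org/dc/terms/title> "Old" .
    A <http://example/book1> <http://purl.org/dc/terms/title> "New" .

Both are statement-at-a-time, so the operational difference seems to
lie in parsing cost and bnode handling rather than in the shape of
the change stream itself.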

        -- Sandro
>
>     Andy

Received on Sunday, 8 September 2013 13:27:53 UTC