Re: Scale issues : streaming, latency, whole request validation from Andy Seaborne on 2013-09-08 (public-ldp-patch@w3.org from September 2013)

From: Andy Seaborne <andy@apache.org>
Date: Sun, 08 Sep 2013 19:00:22 +0100
To: Sandro Hawke <sandro@w3.org>
CC: public-ldp-patch@w3.org
Message-ID: <522CBB36.2000803@apache.org>
On 08/09/13 14:27, Sandro Hawke wrote:
> On 09/08/2013 06:23 AM, Andy Seaborne wrote:
>> Streaming is valuable at scale because changes can start to be
>> applied as the data arrive, rather than buffering the changes until
>> the end of the change is seen and then applying them.  For large
>> changes, this also impacts latency. Doing some or all of the changes
>> as data arrives gets an overlap in processing between sender and
>> receiver.
>>
>
> Hm.
>
> Is this streaming visible to the outside world?   That is, might another
> client see some evidence of the patch being processed before it's
> completed?   If so, then of course we're giving up atomicity (and thus
> also consistency and isolation; cf ACID) which could be a problem for
> some applications (to put it mildly).   Is that something we want to do?

Only with a stopwatch.

> (I'm not sure what you mean by "latency".  Perhaps you mean the delay
> before one sees the first evidence of the patch being processed; perhaps
> you mean the time until the patch processing is complete.)

The total elapsed time for request from client-start to confirmation of 
success.
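
A back-of-the-envelope sketch of that overlap, under an idealised model
(the function names and the example figures are invented for
illustration; real systems will see less than full overlap):

```python
def buffered_latency(transfer_s: float, apply_s: float) -> float:
    """Receive the whole patch first, then apply it: the phases run back to back."""
    return transfer_s + apply_s


def streamed_latency(transfer_s: float, apply_s: float) -> float:
    """Apply changes as they arrive: with ideal overlap the slower phase dominates."""
    return max(transfer_s, apply_s)


# Example figures only: 60 s to send the patch, 40 s to apply it.
print(buffered_latency(60.0, 40.0))
print(streamed_latency(60.0, 40.0))
```

In this model the saving is at most the shorter of the two phases, which
is why it matters mainly for large changes.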

> Are there advantages to streaming that can't be obtained by using a
> sequence of smaller patches, perhaps aggregated into one file? (I've
> been thinking of this as a "multipatch".)    With a multipatch, each
> sub-patch is atomic, but the overall set is not, so the client can
> decide how much atomicity it needs.

Yes - 2 things:

1/ Natural units of update

The app wants to make a number of related changes and talk about them 
as one unit, e.g. attach a timestamp or a reason for the change.

2/ Boundaries

Having all the changes in one HTTP action means you can say "it SHOULD 
be atomic".  ETags do not extend to multiple updates.

POSTing in one go is natural. (Having seen what happens if it has to be 
many updates, I can say from experience at Talis that it can get quite 
messy.)

(atomic !=> transactions; transactions are one, high-quality, way of 
providing it)

> Alternatively, a purely-streaming system could have explicit transaction
> controls, with a begin-transaction command and a commit command.

Yes - if using RDF patch in a long stream, we do probably need to add 
in-stream markers but LDP is about streams (c.f. RDF Stream Processing 
Community Group).
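
A minimal sketch of what in-stream markers could look like to a
receiver.  The row format here ("BEGIN", "COMMIT", "A g s p o",
"D g s p o") is hypothetical and is NOT the RDF Patch syntax; terms are
assumed to contain no spaces, and the store is just a Python set.

```python
from typing import Iterable, List, Optional, Set, Tuple

Quad = Tuple[str, ...]  # (graph, subject, predicate, object)


def apply_stream(rows: Iterable[str], store: Set[Quad]) -> int:
    """Apply change rows to `store`, making changes visible only at COMMIT.

    Hypothetical row format (an assumption, not the RDF Patch syntax):
      BEGIN        start a transaction
      A g s p o    add a quad
      D g s p o    delete a quad
      COMMIT       apply the pending changes
    Returns the number of committed transactions.  A transaction still
    open when the stream ends is dropped.
    """
    pending: Optional[List[Tuple[str, Quad]]] = None
    committed = 0
    for row in rows:
        parts = row.split()
        if not parts:
            continue  # ignore blank rows
        op = parts[0]
        if op == "BEGIN":
            if pending is not None:
                raise ValueError("BEGIN inside an open transaction")
            pending = []
        elif op == "COMMIT":
            if pending is None:
                raise ValueError("COMMIT without BEGIN")
            for kind, quad in pending:
                if kind == "A":
                    store.add(quad)
                else:
                    store.discard(quad)
            pending = None
            committed += 1
        elif op in ("A", "D"):
            if pending is None:
                raise ValueError("change row outside a transaction")
            if len(parts) != 5:
                raise ValueError("expected 4 terms: " + row)
            pending.append((op, tuple(parts[1:])))
        else:
            raise ValueError("unknown row: " + row)
    return committed
```

Memory use is bounded by the largest single transaction rather than by
the whole stream, so the sender controls the granularity.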

> I know SPARQL has avoided requiring ACID.   I see from
> http://answers.semanticweb.com/questions/11523/acid-in-triple-stores
> that you were testing it in Jena's TDB.  How did that work out?

Great!  It's ACID down to the disk.  I kill running servers with 
"kill -9" or just shut down the OS without warning, i.e. I trust it!

The issue in the WG was whether to require it of all impls.

Lightweight impls simply aren't going to be interested.  It's a 
non-functional feature you may or may not want to be burdened with.

The discussions got into nested transactions and multiple-HTTP operation 
transactions.  Enterprise viewpoint, not very REST. For example, "POST 
large file" over HTTP is not guaranteed.  RFC 2616 sec 8.2.4 is all you 
get.  PUT is idempotent - a different approach to failure.

>> Several proposals need the complete patch request to be seen before it
>> can start to be processed.
>>
>> Any format (e.g. Talis ChangeSets) that is RDF or TriG can't make any
>> assumptions about the order of triples received. In practice, a changeset
>> must be parsed into memory (standard parser), validated (patch format
>> specific code) and applied (patch format specific code).  There is
>> some reuse of a common parser but validation has to be done on top.
>>
>> These are limitations at scale, where scale means most of, or more
>> than, available RAM.
>>
>> This may be acceptable - for any format that is a restriction of
>> SPARQL it may be desirable to check the whole request is in the
>> required subset before proceeding with changes (e.g. no true
>> transaction abort available).
>>
>> The bnodes issue and the scalability are what motivated:
>>
>> http://afs.github.io/rdf-patch/
>
> So, yeah, how is that different operationally from a sequence of
> 1-triple patches in any of the other languages?

1/ It's quads :-)

2/ BNode labels are real - it is not an RDF syntax because of that.

(from above)
3/ Being able to talk about one app-viewpoint change,

but, yes, logically it's recording 1-quad changes and replaying them.
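
A small sketch of that record-and-replay idea.  The class and function
names are made up for illustration and stores are plain Python sets;
note that the blank node label ("_:b0") is carried through unchanged
rather than being renamed on replay.

```python
from typing import List, Set, Tuple

Quad = Tuple[str, str, str, str]  # (graph, subject, predicate, object)
Change = Tuple[str, Quad]         # ("A", quad) or ("D", quad)


class RecordingStore:
    """A set of quads that logs every change that actually took effect."""

    def __init__(self) -> None:
        self.quads: Set[Quad] = set()
        self.log: List[Change] = []

    def add(self, quad: Quad) -> None:
        if quad not in self.quads:  # record real changes only
            self.quads.add(quad)
            self.log.append(("A", quad))

    def delete(self, quad: Quad) -> None:
        if quad in self.quads:
            self.quads.remove(quad)
            self.log.append(("D", quad))


def replay(log: List[Change], target: Set[Quad]) -> None:
    """Apply recorded 1-quad changes, in order, to another set of quads."""
    for op, quad in log:
        if op == "A":
            target.add(quad)
        else:
            target.discard(quad)


source = RecordingStore()
source.add(("g", "_:b0", "p", "o"))
source.add(("g", "s", "p", "o2"))
source.delete(("g", "s", "p", "o2"))

copy: Set[Quad] = set()
replay(source.log, copy)
```

After the replay, `copy` holds the same quads as `source.quads`, with
the same blank node label.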

>
>         -- Sandro
>>
>>     Andy
>>
>>
>
Received on Sunday, 8 September 2013 18:00:50 UTC