Re: Conditional Requests to resolve semaphore and confidentiality concerns from Kjetil Kjernsmo on 2020-01-27 (public-sparql-12@w3.org from January 2020)

From: Kjetil Kjernsmo <kjetil@kjernsmo.net>
Date: Tue, 28 Jan 2020 00:51:26 +0100
To: public-sparql-12@w3.org
Message-ID: <1599102.NIARNBpSgF@owl>

Hi Andy (and all)!

Many thanks for the response!

On Saturday 18 January 2020 18:26:59 CET Andy Seaborne wrote:
> On 16/01/2020 12:31, Kjetil Kjernsmo wrote:
> > 1) A semaphore mechanism for updates.
> 
> Observation:
> 
> If there is a semaphore being provided by atomically setting server
> state (triples in graph or something else), then it is Dekker
> semaphores/spinlocks.
> 
> I do wonder whether complex algorithms are a good idea. We can design
> complex, correct algorithms but that doesn't mean they are practical.
> They can be hard to get right and, where they affect the way clients
> interact, can have malicious effects.

Yes, indeed! I believe it is important to have relatively simple and 
practical algorithms in this field, that do not blow up in a possibly messy 
Web.


> 
> These mechanism require all the clients to "play nice" and especially to
> clean up properly.  Adding timeouts is obviously necessary for semaphore
> integrity but when breaking a lock, presumably you want to reverse
> changes in progress. If it's several change steps for one UX edit, then
> all the steps need undoing else exposing half a set of changes makes
> implementing clients very hard and accident prone.

Right, so it wasn't the intention to design an algorithm that requires locks 
across requests. On the contrary, the idea is to avoid cross-request locks, at 
the cost that a client may never be able to write.

I should be careful not to put words in Tim's mouth, because we have not 
discussed this in detail, so the following is largely my understanding.

In Dekker's terms, one client signals that it wants to enter by issuing a 
DELETE DATA, or DELETE ... WHERE. It is then that client's turn if there 
exists exactly one triple that matches (the triple or triple pattern 
respectively). If it is that client's turn, it is allowed to enter the critical 
section; if not, it will simply be rejected and the delete is rolled back. 
Thus, there isn't really a wait in Dekker's terms, AFAIU. The rejected client 
will not know when it can enter, and it is likely it must GET the resource 
again, before it can indicate that it wants to enter. Tough luck, but those 
are the breaks ;-)
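
To make that concrete, here is a minimal sketch in Python of how I 
understand it. It assumes an in-memory set of triples; the names Graph, 
Rejected and replace are made up for illustration and are not from any spec 
or implementation. The lock is only held inside a single request, never 
across requests.

```python
import threading


class Rejected(Exception):
    """The update was refused (a 409 in the scheme discussed here)."""


class Graph:
    def __init__(self, triples=()):
        self.triples = set(triples)
        self._lock = threading.Lock()  # held within one request only

    def match(self, pattern):
        # None in a pattern position acts as a variable (wildcard).
        return [t for t in self.triples
                if all(p is None or p == v for p, v in zip(pattern, t))]

    def replace(self, delete_pattern, insert_triple):
        """DELETE followed by INSERT, handled as one atomic request."""
        with self._lock:
            matches = self.match(delete_pattern)
            if len(matches) != 1:
                # Not this client's turn. Nothing has been changed yet,
                # so there is nothing to roll back.
                raise Rejected(f"expected 1 match, found {len(matches)}")
            self.triples.remove(matches[0])
            self.triples.add(insert_triple)


g = Graph({("foo", "baz", "Dahut")})

# The first client sees the triple it expects and enters.
g.replace(("foo", "baz", "Dahut"), ("foo", "baz", "Foobar"))

# A second client with stale knowledge is rejected and must GET again.
try:
    g.replace(("foo", "baz", "Dahut"), ("foo", "baz", "Barbaz"))
    second_client_entered = True
except Rejected:
    second_client_entered = False
```

Note that the rejected client never waits on anything; it just gets the 
refusal and has to start over.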

The main way clients may not "play nice" in this scheme is, I suppose, by 
entering complex or large queries, so that each individual request takes a 
long time. The server will need to protect against that: it must make sure 
that each update is small compared to the expected workload, so that clients 
aren't rejected. 

Other than that, it is the server's responsibility to roll back and to reject 
clients, so under what other conditions are clients required to play nice?

> 1/
> I understood that 409 happened when a WHERE matching returns zero or
> more than one result.
> https://github.com/solid/solid-spec/pull/193/files

Yeah, it does in that example, but...

> How does it happen in this example?

it happens if the triple is present in the graph.

So, I chose to use DELETE DATA because it is simpler, and because it serves 
to illustrate the point with the data leak (since a WHERE clause clearly 
requires Read anyway)

> <digression>
> It says it is a wilful violation but it isn't, strictly, a violation. It
> may be surprising (it is!). HTTP does not have a way to require certain
> behavior like 200 so the SPARQL spec can't either.

OK, I didn't quite parse that sentence, but the fact that we require a 
success/fail status from the query itself, doesn't that violate the spec?

> By the way, what happens if that semaphore 409 happens part way through
> the request?  Is the request atomic and the whole thing bounces, no
> changes?

Yes, absolutely. Within one request, this is a reasonable expectation, I 
think.


> 2/
> DELETE DATA can have two uses.
> 
>    "remove a triple (assumed to be present)"
>    "ensure a triple is not in the data"
> 
> Just looking at the requests, a system can't tell which is intended but
> in the first there is the 409 case and in the second it's fine.

First, the semaphore mechanism is needed only on updates, i.e. a DELETE 
followed by an INSERT. The first case can participate in such an operation, 
the latter would not.

But for the sake of the argument, it is also why I chose to rely on this 
projection mechanism and the conditional request header, since it gives the 
client an opportunity to say it.


> If you want write-only with no information leakage, I think that, except
> for specialized (data dependent) situations --
> partitions/non-overlapping subgraphs -- it'll have to be no information in
> the response.

Yes, indeed.

> 
> I think W-access will imply fairly broad R-access.  Some situations,
> like partitioning into non-overlapping subgraphs, look possible but as a
> general mechanism, if the request has a response, it can reveal
> information.  A response is a "read" (that said, for general SPARQL
> Update, a bad actor can arrange to update the graph and use that as a
> response channel with the 409).

I have advocated that any query with a WHERE clause should require Read; that 
should cover it, right?

I could imagine a class of queries where variables from one graph that you 
don't have access to could participate in the query, but not be projected, 
but in that case, I think we should have another access mode.

> There are a couple of things that came up in that issue: is the count the
> actual changes or the count of triples touched, especially in the WHERE case?
> 
> INSERT DATA { :s :p :o } ; INSERT DATA { :s :p :o }
> 
> Is that 0,1 (or in your non-atomic world, 2)
> or always 2?  There are use cases for all those cases - different use
> cases.

So, in my case, only an EBV is important, so I can dodge that question :-) 

> and?
> 
> INSERT { :s :p :o } WHERE { :s :p ?x }
> 
> In some implementations, testing whether add/delete makes an actual
> change to the graph is costly
> 
> c.f. LSM trees (RocksDB, LevelDB, ...). Adding the change to a log to be
> applied, and some in-memory view maintained, is less costly than
> checking several places in the data, let alone the case when a
> compaction is in progress.
> 

In the EBV case, can we simplify the requirement to accommodate that?


> > Then, what should we do on the protocol level to support our semaphore?
> > 
> > We should introduce another Conditional Request header, nominally
> > "If-Variable" into HTTP. This is orthogonal to SPARQL, but the idea is
> > that it names a variable, and if the Effective Boolean Value of that
> > variable is false, the request will fail atomically with a 412
> > Precondition Failed.
> 
> Or put an IF (ASK) in the front of the update request.  At least then it
> is all in the request body.

Actually, I toyed with the idea that we could introduce some bashisms: 

DELETE DATA { <foo> <baz> "Dahut" } &
INSERT DATA { <foo> <baz> "Foobar" }

which would mean only execute the second query if the first is successful... 
But then, I found that projection and protocol level mechanisms were more 
interesting.
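
For the protocol-level variant, a rough sketch in Python of the check I have 
in mind follows. It is only an illustration under assumptions: the function 
names, the variable name "deleted", passing the bare variable name in the 
header, and treating an unbound variable as a failed precondition are all 
mine, and the EBV function covers only a few plain Python stand-ins for RDF 
terms.

```python
import math


def ebv(value):
    """Effective Boolean Value for plain Python stand-ins for RDF terms."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        # Numerics are false when zero or NaN, true otherwise.
        if isinstance(value, float) and math.isnan(value):
            return False
        return value != 0
    if isinstance(value, str):
        # Plain strings are false only when empty.
        return len(value) > 0
    raise TypeError("no EBV defined for this value")


def check_if_variable(headers, bindings):
    """Return 412 if the variable named in If-Variable has a false EBV."""
    name = headers.get("If-Variable")
    if name is None:
        return 200  # no precondition was given
    if name not in bindings:
        return 412  # assumption: an unbound variable fails the precondition
    try:
        ok = ebv(bindings[name])
    except TypeError:
        ok = False  # assumption: a type error also fails the precondition
    return 200 if ok else 412


# Hypothetical projected variable "deleted" holding a count of removed triples.
status_hit = check_if_variable({"If-Variable": "deleted"}, {"deleted": 1})
status_miss = check_if_variable({"If-Variable": "deleted"}, {"deleted": 0})
```

On a 412 the server would leave the graph untouched, as with the 409 case 
above.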

Cheers,

Kjetil
Received on Monday, 27 January 2020 23:51:55 UTC