Re: Conditional Requests to resolve semaphore and confidentiality concerns from Andy Seaborne on 2020-01-18 (public-sparql-12@w3.org from January 2020)

From: Andy Seaborne <andy@apache.org>
Date: Sat, 18 Jan 2020 17:26:59 +0000
To: "SPARQL 1.2 Community Group" <public-sparql-12@w3.org>
Message-ID: <16e336fe-6ef3-263b-eb0d-2951d1561a39@apache.org>
Hi Kjetil,

Interesting - some points about the semaphore issues and information leakage.


Inline ...
On 16/01/2020 12:31, Kjetil Kjernsmo wrote:
> Hi all,
> 
> I'm working on the Solid project[1], where we use Semantic Web technologies
> intensively. For now, SPARQL is only used on the server side to update
> documents, and not using the SPARQL Protocol, a SPARQL 1.1 Update query is
> passed as the body of a PATCH request[2].
> 
> We have an open issue on the level of SPARQL 1.1 Update support Solid should
> require[3], and I have been working on two points where there are some
> tensions. I have a rather involved proposal to address them both in a
> backwards compatible way, that I want to air with you. The TL;DR is: We
> should support issue 63 [4] and introduce a conditional request header into
> HTTP.
> 
> These are the issues:
> 
> 1) A semaphore mechanism for updates.
>
Observation:

If a semaphore is provided by atomically setting server
state (triples in a graph or something else), then these are effectively
Dekker-style semaphores/spinlocks.

I do wonder whether complex algorithms are a good idea. We can design 
complex, correct algorithms but that doesn't mean they are practical. 
They can be hard to get right, and where they affect the way clients 
interact they can have malicious effects.

These mechanisms require all the clients to "play nice" and especially to
clean up properly.  Adding timeouts is obviously necessary for semaphore
integrity, but when breaking a lock, presumably you want to reverse
changes in progress. If it's several change steps for one UX edit, then 
all the steps need undoing; otherwise exposing half a set of changes makes 
implementing clients very hard and accident-prone.

> Imagine a room crowded by thousands of people who co-edit a document in real 
> time. Neither locking the document for every write nor using a simple ETag 
> for the entire document will be sufficiently scalable. We are obviously looking 
> into CRDTs, but let's not go there for now.
>
> Concretely, say that client 1 goes:
> 
> DELETE DATA { <foo> <baz> "Dahut" } ;
> INSERT DATA { <foo> <baz> "Bar" }
> 
> independently, client 2 goes
> 
> DELETE DATA { <foo> <baz> "Dahut" } ;
> INSERT DATA { <foo> <baz> "Foobar" }
> 
> before the first client has finished. In that case, the Solid implementation
> would return a 409 Conflict to the second client.

1/
I understood that 409 happened when a WHERE matching returns zero or 
more than one result.
https://github.com/solid/solid-spec/pull/193/files

How does it happen in this example?

<digression>
It says it is a wilful violation but it isn't, strictly, a violation. It 
may be surprising (it is!). HTTP does not have a way to require a 
particular status such as 200, so the SPARQL spec can't either.

By the way, what happens if that semaphore 409 happens part way through 
the request?  Is the request atomic and the whole thing bounces, no changes?
</digression>


2/
DELETE DATA can have two uses.

   "remove a triple (assumed to be present)"
   "ensure a triple is not in the data"

Just looking at the requests, a system can't tell which is intended but 
in the first there is the 409 case and in the second it's fine.
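
A minimal sketch of the two readings, using a toy in-memory set of triples 
rather than a real store. The function names and the choice of 200/409 are 
assumptions for illustration only, not what any Solid server does:

```python
# Toy graph: a Python set of (subject, predicate, object) tuples.
# Function names and status codes are illustrative assumptions.

def delete_strict(graph, triple):
    """'Remove a triple (assumed to be present)': conflict if it is absent."""
    if triple not in graph:
        return 409  # the response reveals that the triple was not there
    graph.remove(triple)
    return 200


def delete_ensure_absent(graph, triple):
    """'Ensure a triple is not in the data': same answer either way."""
    graph.discard(triple)
    return 200


triple = ("foo", "baz", "Dahut")

# Strict reading: the second identical request conflicts.
graph = {triple}
first = delete_strict(graph, triple)
second = delete_strict(graph, triple)

# Idempotent reading: the status is the same whether or not the triple existed.
idempotent = [
    delete_ensure_absent({triple}, triple),
    delete_ensure_absent(set(), triple),
]
```

The request body is the same in both cases; only the server's chosen 
reading decides whether the status code carries information.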

> I then came to realize that an ability to see if a DELETE fails or succeeds
> has other implications for Solid too, as we have a permission system with
> Read, Write, Append and Control.
> 
> Ideally, DELETE should only require Write permission, but if you can infer
> from the status code whether a triple existed, then arguably, it should
> require a Read permission.

If you want write-only with no information leakage, I think that, except 
for specialized (data dependent) situations -- 
partitions/non-overlapping subgraphs -- it'll have to be no information in 
the response.

I think W-access will imply fairly broad R-access.  Some situations, 
like partitioning into non-overlapping subgraphs, look possible but as a 
general mechanism, if the request has a response, it can reveal 
information.  A response is a "read" (that said, for general SPARQL 
Update, a bad actor can arrange to update the graph and use that as a 
response channel with the 409).

If this is limited to DELETE DATA and INSERT DATA, then one option is to 
treat changes to the target resource as a log of change requests.  You 
don't make a change, you ask that a change be made.
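
A rough sketch of that "log of change requests" idea, assuming a toy 
in-memory graph. The class name ChangeLog, its methods, and the use of 
202 as the uniform response are all hypothetical choices made for the 
example:

```python
class ChangeLog:
    """Write-only endpoint: clients append requests and learn nothing back."""

    def __init__(self, graph):
        self.graph = set(graph)
        self.pending = []

    def submit(self, op, triple):
        # The response is identical whatever the graph contains.
        self.pending.append((op, triple))
        return 202

    def apply_pending(self):
        # Runs server-side, decoupled from any client response.
        for op, triple in self.pending:
            if op == "insert":
                self.graph.add(triple)
            elif op == "delete":
                self.graph.discard(triple)
        self.pending.clear()


alice_age = ("alice/profile#me", "ex:age", 14)
with_triple = ChangeLog({alice_age})
without_triple = ChangeLog(set())

# A write-only client sees the same status in both worlds.
responses = [
    with_triple.submit("delete", alice_age),
    without_triple.submit("delete", alice_age),
]
with_triple.apply_pending()
```

Because the status is fixed at submission time, it cannot depend on 
whether the triple was present, which is the property a write-only 
permission needs.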

> 2) A mechanism to communicate status from write queries safely.
> 
> To put it into an example, imagine a malicious user "Mallory": Mallory is
> authorized to write, but not to read, and does not particularly care if he
> destroys things, he just wants to check if certain triples were there. In
> that case, he can send the query
> 
> DELETE DATA {
>    <alice/profile#me> ex:age 14 .
> }
> 
> In SPARQL 1.1, Mallory cannot tell whether the triple was there since it will
> always succeed, so he can't tell that Alice was in fact 14 years old. So,
> DELETE with Write is OK. With the semaphore mechanism we currently implement,
> Mallory can tell that Alice is 14, so it would be a breach of confidentiality
> to only require Write. It is therefore important to be careful not to reveal
> information when making updates.
> 
> Then, I found that Michael Rauch has a proposal around this in [4]. In
> particular, I liked Richard Cyganiak's take on this: Any such information
> should essentially be a projection. With that, we can ensure that Read
> permission is required to access any projected variable binding. To have that
> single point would be very useful. So, I strongly support that proposal.

There are a couple of things that came up in that issue: is the count the 
number of actual changes or the number of triples touched, especially in 
the WHERE case?

INSERT DATA { :s :p :o } ; INSERT DATA { :s :p :o }

Is that 0 or 1 (or, in your non-atomic world, 2),
or always 2?  There are use cases for all of those - different use 
cases.

and?

INSERT { :s :p :o } WHERE { :s :p ?x }

In some implementations, testing whether an add/delete makes an actual 
change to the graph is costly.

Cf. LSM trees (RocksDB, LevelDB, ...): adding the change to a log to be 
applied, with some in-memory view maintained, is less costly than 
checking several places in the data, let alone the case when a 
compaction is in progress.
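
To make the two counting choices concrete for the repeated INSERT DATA 
above, here is a small sketch over a toy in-memory set. The function name 
and the (touched, changed) pair are assumptions for illustration; a real 
store may not be able to compute "changed" cheaply, for the reason just 
given:

```python
def insert_data(graph, triples):
    """Return (touched, changed) for one INSERT DATA operation."""
    touched = len(triples)
    changed = 0
    for t in triples:
        # This membership test is the lookup that can be costly in a real store.
        if t not in graph:
            graph.add(t)
            changed += 1
    return touched, changed


t = (":s", ":p", ":o")

# INSERT DATA { :s :p :o } ; INSERT DATA { :s :p :o } on an empty graph.
graph = set()
results = [insert_data(graph, [t]), insert_data(graph, [t])]
total_touched = sum(r[0] for r in results)
total_changed = sum(r[1] for r in results)

# The same request when the triple is already present.
present = {t}
changed_when_present = sum(insert_data(present, [t])[1] for _ in range(2))
```

Counting triples touched gives 2 regardless of prior state, while 
counting actual changes gives 1 or 0 depending on whether the triple was 
already there.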

> Then, what should we do on the protocol level to support our semaphore?
> 
> We should introduce another Conditional Request header, nominally "If-Variable"
> into HTTP. This is orthogonal to SPARQL, but the idea is that it
> names a variable, and if the Effective Boolean Value of that variable is
> false, the request will fail atomically with a 412 Precondition Failed.

Or put an IF (ASK) in the front of the update request.  At least then it 
is all in the request body.
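
Note that an IF (ASK) prefix is not SPARQL 1.1 Update syntax; it is a 
hypothetical construct. A sketch of the intended behaviour over a toy 
in-memory set, where the function name, its parameters, and the 200/412 
statuses are assumptions for illustration:

```python
def conditional_update(graph, must_exist, deletes, inserts):
    """Apply the update only if every triple in must_exist is present.

    Returns (status, graph). On a failed precondition the original graph
    is returned untouched, so the request is all-or-nothing.
    """
    if not all(t in graph for t in must_exist):
        return 412, graph
    updated = (set(graph) - set(deletes)) | set(inserts)
    return 200, updated


old = ("foo", "baz", "Dahut")
graph = {old}

# Client 1 and client 2 from the example, each guarding on the old value.
s1, graph = conditional_update(graph, [old], [old], [("foo", "baz", "Bar")])
s2, graph = conditional_update(graph, [old], [old], [("foo", "baz", "Foobar")])
```

The check and the change travel together in one request, so no extra 
header is needed, but the status still tells the caller whether the 
guarded triple existed.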

> Whenever the semaphore mechanism is needed, the query needs to be formulated
> with the REPORT mechanism as suggested in [4], have Read as well as Write
> permission and set the If-Variable header.
> 
> [1] https://solidproject.org/
> [2] https://github.com/w3c/sparql-12/issues/104
> [3] https://github.com/solid/specification/issues/125
> [4] https://github.com/w3c/sparql-12/issues/63
> [5] https://www.w3.org/DesignIssues/ReadWriteLinkedData.html
> [6] https://github.com/w3c/sparql-12/issues/60
> [7] https://tools.ietf.org/html/rfc7232
> 
> What do you all think?
> 
> Cheers,
> 
> Kjetil

     Andy
Received on Saturday, 18 January 2020 17:27:06 UTC