Re: ldp wishlist for crosscloud from Sandro Hawke on 2014-11-10 (public-ldp-wg@w3.org from November 2014)

From: Sandro Hawke <sandro@w3.org>
Date: Sun, 09 Nov 2014 21:48:38 -0500
To: "henry.story@bblfish.net" <henry.story@bblfish.net>
CC: Linked Data Platform WG <public-ldp-wg@w3.org>
Message-ID: <54602786.4090003@w3.org>
On 11/09/2014 06:51 PM, henry.story@bblfish.net wrote:
>> On 9 Nov 2014, at 23:40, Sandro Hawke <sandro@w3.org> wrote:
>>
>> On 11/09/2014 02:27 PM, henry.story@bblfish.net wrote:
>>> Hi Sandro,
>>>
>>>     thanks for your very detailed feedback. As you know I have been working in the same space for
>>> a long time too. So we have been crossing the same problems too. Here's my feedback.
>>>
>>>> On 9 Nov 2014, at 18:06, Sandro Hawke <sandro@w3.org> wrote:
>>>>
>>>> As you may know, these days most of my time is no longer W3C-staff but is funded research toward building "Crosscloud", an architecture for software which allows users to control their data and which should encourage innovation by making it much easier to develop powerful, open multi-user (social) software.
>>>>
>>>> Back in January, we started off building on LDP, with Andrei creating cimba.co.  It's a microblogging app, intended to replicate some of early Twitter in a completely decentralized way using generic (non-application-specific) LDP.  To make that work, we had to extend LDP with Access Control; clients can tell the server who can do what with each resource.  We also made no use of Direct Containers, Indirect Containers, or Paging.  It's just Basic Containers, Access Control, WebID-TLS for client authentication, Turtle for data, and non-RDF resources for photos.  (Maybe I'm forgetting some details; for a demo see the 2 minute video at [1].)
>>> Currently that is all I have been using too.
>>>
>>> Earlier this year I was working with a startup to build a social networking platform built on
>>> that same architecture. In 3 months we got a dynamic user interface in the browser to work
>>> using just those tools you mentioned, with the client doing all the fetching of resources
>>> through a CORS proxy for remote ones.
>>>
>>> At the time we used Tim Berners-Lee's rdflib.js library, but this meant
>>>   1) the code on the server was not the same as on the client, leading to a duplication of work
>>>   2) if I wanted to switch to another library ( say RDFStore by Antonio Garrote ), I'd have to
>>>    rewrite a lot of the client code.
>>>   3) JS is not very good for building good abstractions, there being no real compiler support
>>>
>>> Banana-RDF solves 2) for Java libraries such as Jena and Sesame, allowing one to switch between either
>>> with one line of code and no loss of efficiency.  With the appearance of Scala-JS it became
>>> possible to foresee solving 1-3 too. Scala-JS now allows us to compile Scala to JavaScript
>>> and to use the same code on the client and on the server. Furthermore, it will allow us to write
>>> very nice abstractions much more quickly, type safe and checked by the compiler.
>>>
>>> See the code here:
>>>     https://github.com/w3c/banana-rdf/
>>>
>>> So with those tools available I have been thinking about how to solve the problems you
>>> have put forward.
>>>
>>>> While cimba basically works, it's painful in various ways and unable to do many things, showing us that we need much more support from the servers.    We've also started building several more apps which are showing other things that are important to have.
>>>>
>>>> We don't have it all figured out yet, let alone implemented, but here are a few of the things we probably need.   I'm providing this list to help with re-chartering, although most of these are not yet mature enough for standardization.   Maybe they will be in 6-12 months, though.    As you look at this list, one thing to figure out is how we will know when each module is ready for the WG to take up.
>>>>
>>>> == 1.  Queries
>>>>
>>>> This is a big one.   It's impractical to have the cimba WebApp, running in the browser, do all the GETs (hundreds, at least) every time it starts.  It needs to do a small number of queries, and have the server manage all the aggregation.   The server has to be able to query across other servers as well as itself.
>>>>
>>>> We're currently playing with forms of "Link-Following SPARQL", but also with a more restricted MongoDB-like query language, both for easier implementation and for response-time/load guarantees.
>>> I wonder if just adding a QUERY method  may not get us a long way towards the goal
>>> ( I proposed this http://lists.w3.org/Archives/Public/public-ldp/2014Oct/0003.html )
>>> where I wrote:
>>>
>>> [[
>>> - in addition to PATCH allow for a QUERY method. This was suggested in early HTTP specs
>>>         http://www.w3.org/Protocols/HTTP/Methods.html
>>>
>>>      The advantage is that just with the knowledge of a URI one would be able to query that URI
>>>      directly. One could imagine a query on an LDPC allowing the query of the contents of the LDPC
>>>      too.
>>> ]]
>> I doubt we could convince the IETF HTTP WG to allocate another verb.
>>
>> Fortunately, it's easy enough to just have a link to a query endpoint instead.
>>
>> You HEAD or GET a container, and get
>>
>>   Link: <queryService>; rel="sparql-query-over-contained-items"
>>
>> or something.
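
A minimal sketch of that discovery step on the client side. The relation name is only the placeholder used above (it is not a registered link relation), and the parser is deliberately simplified: it assumes no commas inside URLs or quoted parameter values.

```python
import re

# Placeholder relation name from the strawman above; not a registered link relation.
QUERY_REL = "sparql-query-over-contained-items"


def parse_link_header(value):
    """Parse a Link header value into a list of (target, rel) pairs.

    Simplified: assumes no commas inside URLs or quoted parameter values.
    """
    links = []
    for part in value.split(","):
        m = re.match(r'\s*<([^>]*)>(.*)', part)
        if not m:
            continue
        target, params = m.group(1), m.group(2)
        rel = re.search(r';\s*rel\s*=\s*"?([^";]+)"?', params)
        if rel:
            # A single rel parameter may carry several space-separated relations
            for name in rel.group(1).split():
                links.append((target, name))
    return links


def find_query_endpoint(link_header):
    """Return the target of the query link, or None if the container has none."""
    for target, rel in parse_link_header(link_header):
        if rel == QUERY_REL:
            return target
    return None
```

The client would HEAD or GET the container, pass the Link header value to `find_query_endpoint`, resolve the result against the container URL, and send its query there.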
> yes, of course, you can always do that. You can create a delete, patch, and put resource
> in that way.  You can also have all service calls go through a single URL like xmlrpc,
> and you can then invent SOAP. I know your intent is not to do that, but the argument "you can"
> also applies there.
>
> The advantage of a verb is that you reduce the number of resources down to a minimum,
> and it reduces epistemological dissonance.
> So if I want to query a particular resource I don't have to worry that one resource
> is pointing me to another one that is out of sync, that now for some reason does
> something very different from what I thought it was going to do: perhaps POSTing
> a QUERY there now archives it. How does one do conditional QUERY on another resource?
>      ( of course there is some way one can do it, but how much more complicated
>       is it over just doing the right thing )
> This would also be a nice way to make SPARQL fully RESTful.
>
> The HTTP verbs form part of what in philosophy of language are called speech acts.
> And in those one usually finds a few general types:
>    - declarative expression of something ( similar to GET )
>    - the making of something, eg. a marriage "You are now man and wife" (POST)
>    - asking a question ( QUERY )
>    - ordering something to happen
>    - taking something back ( PATCH ? )
>
> Anyway it feels like this is the right level to do something, and I'd say that
> given that it was considered very early on in HTTP may give it a lot more weight
> than other methods. That's why I pointed here:
>
>     http://www.w3.org/Protocols/HTTP/Methods.html
>
> One can of course do a band aid solution.

I hear you.   I'm just dubious about convincing the HTTP WG that it's 
now worthwhile to have something they haven't felt a pressing need for 
in the past 25 years.   If the only people clamoring for this new verb 
are us, I'm guessing that won't work.

>
>>>
>>> As you say: the problem is that the browser cannot load all the information.
>>>
>>> So what is needed is a way for the browser to get partial information for a remote graph,
>>> or rather for a LinkedPointedGraph. See my slides for last Wednesday's talk at SemWebPro for
>>> some intuition on that.
>>>
>>>      http://bblfish.net/tmp/2014/11/05/SemWebPro2014.pdf
>>>
>>> PointedGraphs give you an OO way of looking at a graph, that is easy to understand
>>> for JS devs. And with Scala you can build very nice DSLs for working with them, that
>>> are as easy to use as OO notation. LinkedPointedGraphs probably form a category.
>>> So it is easy with such a DSL for a Dev with a few weeks of RDF experience to start
>>> to be able to write very good code.
>> I'm working on a different way to make this palatable to non-RDF folks, myself.  Orthogonal to LDP, though, I think.   Or, I hope.
>>
>>> But we don't want to get all the information from the remote server in one go. We'd
>>> rather want to get just the information needed to build the current interface. And
>>> for that we can build a special type of RDFStore in the browser that is aware of
>>> what it has fetched on the server: a Partial Quad Store if you wish. It would fetch
>>> more information as soon as a query oversteps the limits of what it knows to have
>>> fetched.
>>>
>>> The best way to get partial information I know of would be to be able to query
>>> resources directly with a SPARQL query, especially a DESCRIBE one, and it is true
>>> that this does come with a form of paging. So if every resource on my server could
>>> be QUERIED directly then
>>>
>>>   1) I would be able to have the server fetch remote resources that did not have
>>>    this capability, and serve them ( potentially in a protected way ) to the client
>>>    with the QUERY capability
>>>   2) the server could aggregate this information in its local view and allow the client
>>>     developer to write code as if he had all the information available, reducing the
>>>     traffic that web2.0 apps tend to have, as they tend to have to create services for
>>>     all the different ways one would want to aggregate data.
>>>
>>>
>>> As it happens one could go a long way with SPARQL as the query format for the HTTP
>>> QUERY method. But with content negotiation other syntaxes could be developed.
>> Right.     (and no need for a query method)
>>
>>>
>>>> Queries make resource-paging obsolete, which is why I've lost interest in paging.
>>>>
>>>> == 2.  Change Notification to Web Servers
>>>>
>>>> If a server acting on behalf of the end-user is going to aggregate data from other servers, it needs to be able to keep its copy in sync. Traditional web cache + polling works only when it's okay to be seconds or minutes out of date; many multi-user apps require much more responsiveness than that, so we see a need for one server to be able to subscribe to change notification from another.
>>>>
>>>> One might want something like PATCH to make this more efficient, but at the moment it looks like we can keep the resources small enough that it doesn't matter.
>>> Change notification should be very easy to implement with LDP Direct or Indirect Containers.
>>>    1) a resource publishes in its header a link to a notification service
>>>    2) a client follows that link and POSTs a 'notification request' including data about where his notification Container is, and
>>>       gets bound to be notified of any changes on the resource ( or a subset of them )
>>>       ( of course WebAccessControl and WebID authentication help reduce spam to reasonable amounts )
>>>       [ binding is what the current spec terms "membership resources", a complete misnomer ]
>>>    3) if the resource changes the server can post a message to the notification container of the user. The server of course
>>>       can authenticate with WebID.
>> Yes, the fact that any of us could sketch out a system like this suggests maybe this part is ready for standardization.
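
The three steps sketched above can be modelled in a few lines. This is an in-memory stand-in only, assuming hypothetical `Server` and `Network` classes in place of real HTTP requests, Link headers, and WebID authentication; the message shape `{"changed": ...}` is also invented for illustration.

```python
class Network:
    """Routes a POST to whichever server owns the target container."""

    def __init__(self):
        self.owners = {}  # container URL -> Server that hosts it

    def post(self, container, message):
        self.owners[container].containers.setdefault(container, []).append(message)


class Server:
    """In-memory stand-in for an LDP server; no HTTP or authentication involved."""

    def __init__(self):
        self.resources = {}      # resource URL -> content
        self.subscriptions = {}  # resource URL -> list of notification container URLs
        self.containers = {}     # container URL -> list of messages posted to it

    def subscribe(self, resource, inbox):
        # Step 2: the client POSTs a notification request naming its own container
        self.subscriptions.setdefault(resource, []).append(inbox)

    def put(self, resource, content, network):
        # Step 3: on change, POST a message to every subscribed container
        self.resources[resource] = content
        for inbox in self.subscriptions.get(resource, []):
            network.post(inbox, {"changed": resource})


net = Network()
alice, bob = Server(), Server()
net.owners["http://bob.example/inbox/"] = bob

# Bob's server subscribes to a resource on Alice's server, then Alice's resource changes
alice.subscribe("http://alice.example/profile", "http://bob.example/inbox/")
alice.put("http://alice.example/profile", "v2", net)
```

After the last line, the notification sits in Bob's container, where his own apps can pick it up.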
>>
>>>> == 3.  Change Notification to Web Clients
>>>>
>>>> Similarly, Web Apps often need to know immediately when data has changed.   While it might be nice to have this be the same protocol as (2), our preliminary investigation suggests the engineering trade-offs make that impractical.   So, this needs to be its own protocol. Probably it's just a tweak to the query protocol where query results, rather than being a single response collecting all the results, are ongoing add-result and remove-result events.
>>> Web Sockets seems to be the thing to do here.
>> Right, but there are still bits to standardize.   Like how exactly to signal add-result and remove-result in SPARQL results format, or whatever.
>>
>> Also, I think one needs some flow control mechanisms, such as a max-frame-rate (for data that's changing in place, where missing some updates won't make anything wrong), and maybe something analogous to TCP's window size for when the client can't handle changes fast enough.
>>
>>>> == 4.  Operation over WebSockets
>>>>
>>>> It almost certainly makes sense to use WebSockets for (3), but it also makes sense to use them for all the current LDP operations for high performance. A modest client and server can probably process at least 1000 GETs per second, but in practice, without WebSockets, they'll be slowed an order of magnitude because of round trip delays.    That is, say RTT is 50ms, so we can do 20 round trips per second.    Most browsers allow at most 6 connections per hostname [2], so that's 120 round trips per second, max, no matter how much CPU and RAM and bandwidth you have.
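
The arithmetic above as a tiny calculation, assuming one request in flight per connection and nothing else limiting throughput (the function name is just for illustration):

```python
def max_round_trips_per_second(rtt_ms, connections):
    """Upper bound on requests/second when each connection waits a full RTT per request."""
    per_connection = 1000 / rtt_ms  # e.g. 50 ms RTT -> 20 round trips per second
    return per_connection * connections


# 50 ms RTT with the 6 connections per hostname that most browsers allow
limit = max_round_trips_per_second(50, 6)
```

With these numbers the bound is 120 requests per second, roughly an order of magnitude below the 1000 GETs per second the endpoints could otherwise process.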
>>>>
>>>> I'm still thinking about what this might look like.  A strawman is something like: each client-to-server message is a JSON object like { "verb": "GET", "resource":"http://example.org", "accept": "text/html", "seq":7 } and responses are like { "in-response-to": 7, "status": 200, "contentType": "text/html", "content": "<html>......</html>" }
>>>>
>>>> So the higher levels don't have to know it's not normal HTTP, *but* we can have hundreds or thousands of requests pipelined.     Also, we can have multiple responses, or something, for event notification.   This would also allow for more transactional operation, if desired.   (Maybe "partial-response-to" and "final-response-to".)
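
A rough sketch of the client side of that strawman, assuming the field names from the messages above and leaving the actual WebSocket transport abstract. `PipelinedClient` is a hypothetical name; the point is only that sequence numbers let many requests be in flight at once and let responses arrive in any order.

```python
import json


class PipelinedClient:
    """Correlates responses to requests by sequence number; the transport is left abstract."""

    def __init__(self):
        self.next_seq = 1
        self.pending = {}  # seq -> request still awaiting its final response

    def request(self, verb, resource, accept="text/turtle"):
        """Build the JSON frame for one request and remember it as pending."""
        msg = {"verb": verb, "resource": resource, "accept": accept, "seq": self.next_seq}
        self.pending[self.next_seq] = msg
        self.next_seq += 1
        return json.dumps(msg)  # the frame that would be sent over the socket

    def handle(self, frame):
        """Match an incoming frame to its request; returns (request, response, is_final)."""
        msg = json.loads(frame)
        for key, final in (("in-response-to", True),
                           ("final-response-to", True),
                           ("partial-response-to", False)):
            if key in msg:
                seq = msg[key]
                # A partial response keeps the request open for further events
                request = self.pending.pop(seq) if final else self.pending[seq]
                return request, msg, final
        raise ValueError("frame is not a response")
```

Higher layers can wrap `request` and `handle` behind an ordinary HTTP-like interface, so they need not know the requests are multiplexed.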
>>> Is that not what SPDY ( HTTP 2.0 ) promises to do?
>> Possibly.   I need something like this right now, but maybe this is a reason not to standardize in this space.
>>
>>>> == 5.  Non-Listing Containers
>>>>
>>>> I want end-points that I can POST to, and GET some information about, without being swamped by an enumeration of everything posted there.   I don't want to have to include a Prefer header to avoid that swamping.
>>>>
>>>> You might consider this a matter of taste, but I think it's an important usability issue.
>>>>
>>>> Again, with querying, you probably don't want to just be dumping the list of contained resources.   Querying also lets us control inlining, etc.   Basically, if querying is available, I think we can skip serializing membership/containment triples.
>>> so is that another good reason to have the QUERY HTTP verb?
>> Well, also to change the default to not listing those triples. Which means a new kind of container.   ldp:QueriableContainer or something.
> How do you then find out what the contents are?
>
> For example the interesting thing about an ldp:Container is that
> you could use SPARQL to query the named graphs for the LDPR Sources that are part of it.
> So your named graphs just are everything that your container has an ldp:contains relation to.
> Then suddenly SPARQL and LDP start to work together as if they had been designed to do so
> from the get go.
>
> QUERY /ldpc HTTP/1.0
> Content-Type: application/sparql-query
> Content-Length: 78
>
> SELECT ?ldpr
> WHERE {
>     <> ldp:contains ?ldpr .
>     GRAPH ?ldpr {
>        ?ldpr dc:author <http://example.org/joe#me> .
>     }
> }
>
> It also becomes really clear what a named graph is
> meant to be, and how it ties in with HTTP: you should
> be able to do a GET on it, and even do a QUERY with the internal
> version in it. Any result to the above query should be a URL
> to whose referent you could send a QUERY HTTP message with

How you've shown it matches what we have now, yes, but it has the 
problem that <> is uncomfortably large.

(I'm also not sure <> is properly defined to work this way in SPARQL, 
but of course I know what you mean.)

One alternative is:

    SELECT ?ldpr
    WHERE {
        <> ldp:enumeration ?enum.
        GRAPH ?enum { <> ldp:contains ?ldpr }
        GRAPH ?ldpr {
           ?ldpr dc:author <http://example.org/joe#me> .
        }
    }


A somewhat simpler but more radical alternative is:

    SELECT ?ldpr
    WHERE {
        GRAPH ?ldpr {
            ?ldpr dc:author <http://example.org/joe#me> .
        }
    }

which makes an even more stark claim about what an LDP Container (at 
least this type of LDP Container) really is, namely an RDF Dataset.    
That is, x is a resource in the container iff x is a named graph in the 
dataset.

Of course this relies on the notion of non-RDF-LDPRs being just LDP-RS's 
with content.

It's simple and elegant, but perhaps so simple it causes some problems.
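
To make the "container is an RDF Dataset" reading concrete, here is a toy model, assuming a dataset is just a mapping from graph name to a set of triples. The `dc:author` string is kept as the placeholder property from the queries above, and the example URLs are invented.

```python
# A dataset here is just a dict: graph name (URL) -> set of (subject, predicate, object)
dataset = {
    "http://example.org/c/post1": {
        ("http://example.org/c/post1", "dc:author", "http://example.org/joe#me"),
    },
    "http://example.org/c/post2": {
        ("http://example.org/c/post2", "dc:author", "http://example.org/ann#me"),
    },
}


def contained(dataset):
    """Under this model, x is in the container iff x names a graph in the dataset."""
    return set(dataset)


def select_by_author(dataset, author):
    """Mirrors: SELECT ?ldpr WHERE { GRAPH ?ldpr { ?ldpr dc:author <author> } }"""
    return sorted(
        graph for graph, triples in dataset.items()
        if (graph, "dc:author", author) in triples
    )


joes_posts = select_by_author(dataset, "http://example.org/joe#me")
```

Note that no containment triples appear anywhere: the set of graph names is the listing.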


> ASK { ?ldpr dc:author <http://example.org/joe#me> . }
>
> and get true ( on a conditional get ).
>
> Now it even becomes easy to see what kind of metadata you should put
> on a resource: a lot of the metadata stuff that you expect to find
> in the HTTP header. author, modification time, etag, ...
>
>>>> == 6.  PUT-to-Create
>>>>
>>>> There are situations where the client needs to lay out, on the server, an assortment of resources with carefully controlled URLs, such as a static website with interlinked html, css, js, images, etc.    This should be doable with PUT, where PUT creates the resource inside the container that owns that URL space.
>>> yes, and one way to do that is with intuitive Containers.
>>>    http://www.w3.org/2012/ldp/track/issues/50
>>> Otherwise you won't know where to PUT
>> Right, this is an old issue.
>>
>>>> == 7.  DELETE WHERE
>>>>
>>>> One of our current demo apps is a game that is likely to generate a dozen resources per second per user.   Asking for each of those resources to be individually deleted afterwards seems rather silly, even problematic, so a DELETE WHERE operation would be nice.
>>>>
>>>> Yes, one could put them all in a container in this case, and define it as a kind of container that deletes its contained resources when it's deleted, but there are situations where that won't work as well.  Maybe we want to delete the resources after about 60 seconds have gone by, for example.   Easy to do with a DELETE WHERE, hard to do otherwise.
>>> Would a  SPARQL Update used with the PATCH verb allow you to delete all the ldp:contains relations with certain metadata?
>>> Assuming you can only delete an ldp:contains if the server deletes the resource.
>> You don't need the PATCH verb to do a SPARQL UPDATE.  Just do a SPARQL UPDATE.
> SPARQL Update is a syntax. PATCH is an HTTP verb. Because of content negotiation you can
> of course use any language for PATCH including SPARQL Update.

Not sure if you really understood me, or are just being pedantic. When I 
say "do a SPARQL UPDATE" I mean "Use the SPARQL Protocol to do a SPARQL 
UPDATE".  Of course you're right it CAN be used with PATCH.

>> And yes, I guess you could remove ldp:contains as a way to delete resources.   I'd prefer to delete the named graphs corresponding to the resources, myself.
>
> PATCH /ldpc HTTP/1.0
> Content-Type: application/sparql-update
> Content-Length: 78
>
> DELETE { <> ldp:contains ?ldpr }
> WHERE {
>     <> ldp:contains ?ldpr .
>     GRAPH ?ldpr {
>        ?ldpr dc:author <http://example.org/joe#me> .
>     }
> }
>
> neat. You patch and you have a really powerful language to delete a number of
> resources.

So in this model, deleting the ldp:contains triple implicitly deletes 
all the triples in ?ldpr and does a DROP GRAPH on it?

It looks like DROP GRAPH cannot be used with a WHERE clause, but perhaps 
there's a way I'm missing.
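
One possible workaround, sketched under the assumption that the client first runs a SELECT to find the matching graph names and then sends a second request that drops each one explicitly. This is two steps rather than one atomic operation, and `drop_graphs_update` is a hypothetical helper, not part of any library.

```python
def drop_graphs_update(graph_iris):
    """Build one SPARQL Update request that drops each listed graph.

    The IRIs would typically come from a prior SELECT ?ldpr query.
    """
    operations = []
    for iri in graph_iris:
        # Reject characters that are not allowed inside <...> in SPARQL
        if any(ch in '<>"{}|^`\\' or ch <= " " for ch in iri):
            raise ValueError("not a valid IRI: %r" % iri)
        # SILENT avoids an error if a graph has already disappeared
        operations.append("DROP SILENT GRAPH <%s>" % iri)
    return " ;\n".join(operations)


update = drop_graphs_update([
    "http://example.org/c/post1",
    "http://example.org/c/post2",
])
```

Whether the server should also remove the corresponding containment triples when a graph is dropped is exactly the open question above; the sketch does not settle it.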

>
>>>> ==  8.  WebMention for Data, backlinks used in Queries
>>>>
>>>> The basics of WebMention are in-scope for the Social Web WG, but it's not clear they'll apply it to arbitrary raw data, or say how the back-links are made available for use in queries.   Like many of these, this might be joint work with SWWG.
>>>>
>>>> ==  9.  Client Authentication
>>>>
>>>> Arguably this is quite out of scope, and yet it's hard to operate without it.   Especially things like (2) are easier with some kind of authentication.
>>>>
>>>> For a strawman of how easy it could be: https://github.com/sandhawke/spot/blob/master/spec.md
>>> Well arguably you need Client Authentication to be distributed and global if you want to build a social web.
>>> WebID http://webid.info/spec works on current systems, and should easily be adaptable as protocols evolve.
>> I believe SPOT serves as well as WebID-TLS for the purposes I've considered, at least.
>>
>>  From where I'm sitting, WebID-TLS is dead, because the browser vendors are not willing to sufficiently support it.
> They don't do too bad a job of it considering it is so little used. What is needed is for it to be used a lot
> more for the browser vendors to support it more. Just like any other standard you try to put forward.

I think it's a dead end, and I don't intend to put any more effort into it.

> WebID-TLS does not require a browser. It will work well between servers in fact. Perhaps it is even easier to
> have it work correctly there.

Yes, indeed, in that context it's okay.

> In any case you'll always need cryptography in the browser to get a simple global authentication to work.

I believe SPOT ends up as secure as WebID-TLS using only normal server 
certificate TLS.

> Btw, WebID authentication does not require TLS. It could be done with any other cryptographic system.

Or no cryptosystem, as in SPOT.     (I'm not using the WebID brand on 
SPOT, but conceptually it's the same, with people/clients being 
identified by a dereferenceable URL.)

>>>> == 10.  Access Control
>>>>
>>>> Obviously.
>>>>
>>>> My current radical theory is that all I need is a flag that a page is owner-only, public, or group-read, and then a way to define the group of identities (see (9)) who can read it.    Most people imagine we need to control a lot more than read access, and perhaps we do, but I'm currently working with the theory that everyone makes their own contributions in their own space, notifying but never actually "writing" to anyone else's.
>>> It would be nice. But a notification system would work best by allowing one to write to other servers.
>> Sure, I'm happy to think of notification as writing.    My point is that's the ONLY kind of writing I want to support.
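
The three-state read flag described above is small enough to sketch in full. The dictionary keys (`owner`, `access`, `group`) and the group table are assumptions made for illustration; identities are whatever dereferenceable URLs the authentication layer hands over, with `None` standing for an unauthenticated request.

```python
def can_read(resource, requester, groups):
    """Decide read access from a single flag: owner-only, public, or group-read."""
    mode = resource["access"]
    if mode == "public":
        return True
    if requester is not None and requester == resource["owner"]:
        return True  # the owner can always read their own page
    if mode == "group-read":
        return requester in groups.get(resource["group"], set())
    return False  # owner-only, and the requester is not the owner


groups = {"friends": {"http://example.org/ann#me"}}
page = {"owner": "http://example.org/joe#me", "access": "group-read", "group": "friends"}
ann_allowed = can_read(page, "http://example.org/ann#me", groups)
```

There is no write check at all: in this model only the owner writes to their own space, and everyone else contributes by notifying.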
> That could cover a lot.
> I think there are other cases such as data wikis.

My radical theory is that people's contributions to *anything* should 
have their primary residence on their own server.   I don't know what a 
data wiki is, but why wouldn't I want the original copy of my 
contribution to be on my own server?

>> I know this is a radical view -- everyone else seems to want to make random resources writable by random people -- but in my view that complicates things unnecessarily.
> Well that's why you have access control and global authentication. Then you don't give access to random people,

by "random" I just mean "arbitrary" -- a potentially large and complex set

> but you are flexible about whom you give access to. Also if you are clever you then do versioning so you can
> come back to a previous state in case of an error.

the need to be clever is one of the things that steers me the other way

> So that brings up versioning as another topic that one could deal with.
> Is it enough to have link headers to previous versions of a resource?

Good question.    Maybe versioning should be on the wishlist, too.

      -- Sandro

>
>>>> == 11.  Combined Metadata and Content operations
>>>>
>>>> I don't think I can put this very crisply, but I've started thinking about resources as looking like this:
>>>>
>>>> { property1: value1,
>>>>    property2: value2,
>>>>    ...
>>>>    content: "<html>....</html>",
>>>>    contentType: "text/html"
>>>>    ...
>>>> }
>>>>
>>>> and it's so much nicer.   Basically, every resource is properties-value pairs, and some of that pv data is "content".    If you don't do something like this, queries and notifications and all that require us to bifurcate into a mechanism that's all about the content and another that's all about the metadata.
>>>>
>>>> LDP-RS's then become content-free resources, or null-content resources, but much less fundamentally different.   With the current LDP framing, what happens when you PUT an image to an LDP-RS or PUT rdf to what you created as an image?   This model clears that up nicely.
>>> yes, that's the way the web works now. The top stuff is called the headers :-)
>> Cute.   But of course that's not how LDP is defined, and the way HTTP headers are defined makes it a little awkward.   Also, HTTP headers are often limited in size as small as 4K, total of all headers.
>>
>> I'm not sure I'm looking for any change in protocol here, so much as a change in how we talk about it.
>>
>> Also, headers are metadata from the server, not from other applications, I think.....
>>
>>>> But this might only work in the face of other assumptions I'm making, like the only triples at <R> are in a graph rooted at <R>, so you can think of them all as properties of R.    Also I've resolved httpRange-14 by saying I'm only interested in proper information-resource-denoting URLs, and you can use indirect properties for talking about people, places, events, etc.    Maybe those radical assumptions are necessary for making this work.
>>>>
>>>> == 12.  Forwarding
>>>>
>>>> We need to be able to move resources, because it's very hard to pick a URL and stick to it for decades.   And if it's used as part of other apps, and you don't stick to it, you'll break them.   The fear of this will, I suspect, significantly impede adoption.
>>>>
>>>> I propose three mechanisms.   Any one of them might work; between the three I'm fairly confident.
>>>>
>>>> 1.  Servers SHOULD check all their outgoing links at least once every 30 days.   If they get a 301 response, they SHOULD update the link in place.   Valid reason not to change it is this is some kind of a frozen/static page that can't be changed.
>>>>
>>>> 2.  When a client gets a 301, following a link it got from server A, it should notify server A, so A can rewrite the link sooner.   This could use a .well-known end-point on A, or there could be a Report-Link-Issues-To header on every resource which A serves telling clients how to report any 301s (and 404s) it finds.
>>>>
>>>> 3.  The notification mechanism (2) above, should include move notifications, so when a page is being watched, if it moves the watcher will be immediately notified and able to change its link.
>>>>
>>>> All this works much better if in addition to 301 we have a way to say a whole tree has moved.    That is, all URLs starting http://foo.example/x/ should now be considered redirected to http://bar.example/y/, etc.
>>>>
>>>> With these mechanisms in place, links from compliant servers should start to transition quickly and drop off to zero after 30 days. Obviously links from hand-maintained resources, and printed on paper, etc, won't change, but those are usually consumed by humans who are better able to deal with a broken link anyway.
>>> interesting idea.
>> Yeah, I thought of this a few weeks ago, and as I run it by people I'm mostly getting slow nods.
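
A minimal sketch of the "whole tree has moved" rewrite, assuming a server keeps a table of moved prefixes it has learned about (how it learns them, whether by 301s, client reports, or move notifications, is left open). The function name and table shape are illustrative only.

```python
def rewrite(url, moved_prefixes):
    """Rewrite a stored link using the longest matching moved prefix, if any."""
    best = None
    for old in moved_prefixes:
        if url.startswith(old) and (best is None or len(old) > len(best)):
            best = old
    if best is None:
        return url  # nothing known to have moved; keep the link as it is
    return moved_prefixes[best] + url[len(best):]


moved = {"http://foo.example/x/": "http://bar.example/y/"}
new_link = rewrite("http://foo.example/x/photos/1.jpg", moved)
```

A server doing its periodic link check could run every outgoing link through `rewrite` and update in place whenever the result differs.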
>>
>>>> == More...
>>>>
>>>> I'm sure there's more, but this gives the general shape of things. Do we want the new charter to target some of these?   To allow for some of these?   And again: how do we assess when each of these is mature enough for a WG to begin looking at it?
>>> I think there should be a way to POST a tar.gz of a whole directory, and have all the files and all the relative links between the files
>>> work out correctly.
>> Yes, bulk post would be good.  I'd been thinking multipart, but .tar.gz is probably way easier.
>>
>>     -- Sandro
>>
>>>> Thanks for considering this.
>>>>
>>>>       -- Sandro
>>>>
>>>>
>>>> [1] https://www.youtube.com/watch?v=z0_XaJ97rF0
>>>> [2] http://www.browserscope.org/?category=network&v=top
>>>>
>>> Social Web Architect
>>> http://bblfish.net/
Received on Monday, 10 November 2014 02:48:49 UTC