ldp wishlist for crosscloud from Sandro Hawke on 2014-11-09 (public-ldp-wg@w3.org from November 2014)

From: Sandro Hawke <sandro@w3.org>
Date: Sun, 09 Nov 2014 12:06:52 -0500
To: Linked Data Platform WG <public-ldp-wg@w3.org>
Message-ID: <545F9F2C.7070608@w3.org>
As you may know, these days most of my time is no longer W3C-staff but 
is funded research toward building "Crosscloud", an architecture for 
software which allows users to control their data and which should 
encourage innovation by making it much easier to develop powerful, open 
multi-user (social) software.

Back in January, we started off building on LDP, with Andrei creating 
cimba.co.  It's a microblogging app, intending to replicate some of 
early Twitter, in a completely decentralized way using generic 
(non-application-specific) LDP.    To make that work, we had to extend 
LDP with Access Control; clients can tell the server who can do what 
with each resource.   We also made no use of Direct Containers, Indirect 
Containers, or Paging.   It's just Basic Containers, Access Control, 
WebID-TLS for client authentication, Turtle for data, and non-RDF 
resources for photos.     (Maybe I'm forgetting some details; for demo 
see 2 minute video at [1].)

While cimba basically works, it's painful in various ways and unable to 
do many things, showing us that we need much more support from the 
servers.    We've also started building several more apps which are 
showing other things that are important to have.

We don't have it all figured out yet, let alone implemented, but here 
are a few of the things we probably need.   I'm providing this list to 
help with re-chartering, although most of these are not yet mature 
enough for standardization.   Maybe they will be in 6-12 months, though. 
As you look at this list, one thing to figure out is how we will 
know when a given module is ready for the WG to take up.

== 1.  Queries

This is a big one.   It's impractical to have the cimba WebApp, running 
in the browser, do all the GETs (hundreds, at least) every time it 
starts.  It needs to do a small number of queries, and have the server 
manage all the aggregation.   The server has to be able to query across 
other servers as well as itself.

We're currently playing with forms of "Link-Following SPARQL", but also 
with a more restricted MongoDB-like query language, both for easier 
implementation and for response-time/load guarantees.

Queries make resource-paging obsolete, which is why I've lost interest 
in paging.

== 2.  Change Notification to Web Servers

If a server acting on behalf of the end-user is going to aggregate data 
from other servers, it needs to be able to keep its copy in sync. 
Traditional web cache + polling works only when it's okay to be seconds 
or minutes out of date; many multi-user apps require much more 
responsiveness than that, so we see a need for one server to be able to 
subscribe to change notification from another.

One might want something like PATCH to make this more efficient, but at 
the moment it looks like we can keep the resources small enough that it 
doesn't matter.

== 3.  Change Notification to Web Clients

Similarly, Web Apps often need to know immediately when data has 
changed.   While it might be nice to have this be the same protocol as 
(2), our preliminary investigation suggests the engineering trade-offs 
make that impractical.   So, this needs to be its own protocol. 
Probably it's just a tweak to the query protocol where query results, 
rather than being a single response collecting all the results, are 
ongoing add-result and remove-result events.
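
A minimal sketch of how a client might consume such a stream.  The 
event shape is an assumption: an "event" field holding add-result or 
remove-result, plus hypothetical "id" and "value" fields.  None of 
these names are settled.

```python
def apply_event(results, event):
    """Update the client's local copy of the query results in place."""
    kind = event["event"]
    if kind == "add-result":
        results[event["id"]] = event["value"]
    elif kind == "remove-result":
        # Ignore removals for results we never saw.
        results.pop(event["id"], None)
    else:
        raise ValueError("unknown event type: %r" % kind)


# Hypothetical event stream for a "recent posts" query.
events = [
    {"event": "add-result", "id": "post1", "value": "hello"},
    {"event": "add-result", "id": "post2", "value": "world"},
    {"event": "remove-result", "id": "post1"},
]

live_results = {}
for ev in events:
    apply_event(live_results, ev)
```

After the three events, the local result set holds only post2, without 
the client ever re-running the query.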

== 4.  Operation over WebSockets

It almost certainly makes sense to use WebSockets for (3), but it also 
makes sense to use them for all the current LDP operations for high 
performance.     A modest client and server can probably process at 
least 1000 GETs per second, but in practice, without WebSockets, they'll 
be slowed an order of magnitude because of round trip delays.    That 
is, say RTT is 50ms, so we can do 20 round trips per second.    Most 
browsers allow at most 6 connections per hostname [2], so that's 120 
round trips per second, max, no matter how much CPU and RAM and 
bandwidth you have.
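
The arithmetic above, written out as a small sketch.  It assumes each 
connection completes one request per round trip (no pipelining); the 
function name is just for illustration.

```python
def max_round_trips_per_second(rtt_ms, connections):
    """Upper bound when each connection completes one request per round trip."""
    return connections * 1000 / rtt_ms


single_connection = max_round_trips_per_second(50, 1)   # one connection, 50ms RTT
browser_limit = max_round_trips_per_second(50, 6)       # six connections per hostname

# How far below 1000 GETs per second the browser limit leaves us.
slowdown = 1000 / browser_limit
```

With these numbers the slowdown factor comes out between 8 and 9, 
which is the "order of magnitude" mentioned above.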

I'm still thinking about what this might look like.     Strawman is 
something like each client-to-server message is a JSON object like { 
"verb": "GET", "resource":"http://example.org", "accept": "text/html", 
"seq":7 } and responses are like { "in-response-to": 7, "status": 200, 
"contentType": "text/html", "content": "<html>......</html>" }

So the higher levels don't have to know it's not normal HTTP, *but* we 
can have hundreds or thousands of requests pipelined.     Also, we can 
have multiple responses, or something, for event notification.   This 
would also allow for more transactional operation, if desired.   (Maybe 
"partial-response-to" and "final-response-to".)

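A rough sketch of the bookkeeping the strawman implies on the client 
side, assuming the "seq" and "in-response-to" fields are what tie a 
response back to its request.  There is no real socket here; the class 
and method names are hypothetical, and the JSON strings stand in for 
WebSocket messages.

```python
import json


class PipelinedClient:
    """Matches out-of-order responses to requests by sequence number."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}    # seq -> request awaiting a response
        self.completed = {}  # seq -> response received

    def encode_request(self, verb, resource, accept):
        """Build one client-to-server message and remember it as pending."""
        seq = self.next_seq
        self.next_seq += 1
        message = {"verb": verb, "resource": resource,
                   "accept": accept, "seq": seq}
        self.pending[seq] = message
        return json.dumps(message)

    def handle_response(self, raw):
        """Pair a server message with its request; KeyError if unknown."""
        response = json.loads(raw)
        seq = response["in-response-to"]
        request = self.pending.pop(seq)
        self.completed[seq] = response
        return request, response


client = PipelinedClient()
# Three requests go out back to back, without waiting for replies.
for path in ("a", "b", "c"):
    client.encode_request("GET", "http://example.org/" + path, "text/turtle")

# Replies may arrive in any order; here the reply to request 2 comes first.
req, resp = client.handle_response(
    json.dumps({"in-response-to": 2, "status": 200,
                "contentType": "text/turtle", "content": ""}))
```

The higher layers only ever see (request, response) pairs, so they need 
not know the requests were pipelined.
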
== 5.  Non-Listing Containers

I want end-points that I can POST to, and GET some information about, 
without being swamped by an enumeration of everything posted there.   I 
don't want to have to include a Prefer header to avoid that swamping.

You might consider this a matter of taste, but I think it's an 
important usability issue.

Again, with querying, you probably don't want to just be dumping the 
list of contained resources.   Querying also lets us control inlining, 
etc.   Basically, if querying is available, I think we can skip 
serializing membership/containment triples.

== 6.  PUT-to-Create

There are situations where the client needs to lay out, on the server, 
an assortment of resources with carefully controlled URLs, such as a 
static website with interlinked html, css, js, images, etc.    This 
should be doable with PUT, where PUT creates the resource inside the 
container that owns that URL space.
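
One way a server might decide which container "owns" a URL, sketched 
under the assumption that ownership means longest matching URL prefix. 
The in-memory dictionaries and function names are hypothetical stand-ins 
for real server state.

```python
def owning_container(url, containers):
    """Return the longest container URL that is a proper prefix of url."""
    matches = [c for c in containers if url.startswith(c) and url != c]
    return max(matches, key=len) if matches else None


def put(url, body, containers, resources):
    """Store body at url and record it in the container that owns the URL."""
    parent = owning_container(url, containers)
    if parent is None:
        raise LookupError("no container owns " + url)
    resources[url] = body
    containers[parent].add(url)
    return parent


# Container URL -> set of contained resource URLs.
containers = {
    "http://example.org/site/": set(),
    "http://example.org/site/css/": set(),
}
resources = {}

css_parent = put("http://example.org/site/css/main.css", "body {}",
                 containers, resources)
html_parent = put("http://example.org/site/index.html", "<html></html>",
                  containers, resources)
```

The client picks every URL itself, so the interlinked files can refer 
to each other before any of them has been uploaded.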

== 7.  DELETE WHERE

One of our current demo apps is a game that is likely to generate a 
dozen resources per second per user.   Asking for each of those 
resources to be individually deleted afterwards seems rather silly, even 
problematic, so a DELETE WHERE operation would be nice.

Yes, one could put them all in a container in this case, and define it 
as a kind of container that deletes its contained resources when it's 
deleted, but there are situations where that won't work as well.  Maybe 
we want to delete the resources after about 60 seconds have gone by, for 
example.   Easy to do with a DELETE WHERE, hard to do otherwise.
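
A sketch of the 60-second case, assuming DELETE WHERE boils down to 
"delete every resource matching a predicate".  The in-memory store, the 
"created" property and the function name are all made up for 
illustration; the real operation would presumably reuse the query 
language from (1).

```python
def delete_where(store, predicate):
    """Delete every resource matching predicate; return how many went."""
    doomed = [url for url, resource in store.items() if predicate(resource)]
    for url in doomed:
        del store[url]
    return len(doomed)


now = 1000  # fixed clock value (seconds) to keep the example reproducible
store = {
    "http://example.org/game/move1": {"created": 900},
    "http://example.org/game/move2": {"created": 950},
    "http://example.org/game/move3": {"created": 990},
}

# One request instead of one DELETE per stale resource.
removed = delete_where(store, lambda r: now - r["created"] > 60)
```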

== 8.  WebMention for Data, backlinks used in Queries

The basics of WebMention are in-scope for the Social Web WG, but it's 
not clear they'll apply it to arbitrary raw data, or say how the 
back-links are made available for use in queries.   Like many of these, 
this might be joint work with SWWG.

== 9.  Client Authentication

Arguably this is quite out of scope, and yet it's hard to operate 
without it.   Especially things like (2) are easier with some kind of 
authentication.

For a strawman of how easy it could be: 
https://github.com/sandhawke/spot/blob/master/spec.md

== 10.  Access Control

Obviously.

My current radical theory is that all I need is a flag that a page is 
owner-only, public, or group-read, and then a way to define the group of 
identities (see (9)) who can read it.    Most people imagine we need to 
control a lot more than read access, and perhaps we do, but I'm 
currently working with the theory that everyone makes their own 
contributions in their own space, notifying but never actually "writing" 
to anyone else's.
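
A sketch of how small the read check could be under that theory.  The 
"access", "owner" and "group" property names and the groups table are 
assumptions, not anything we've implemented.

```python
def can_read(resource, identity, groups):
    """Decide read access from a single three-valued flag."""
    if identity == resource["owner"]:
        return True
    mode = resource["access"]
    if mode == "public":
        return True
    if mode == "group-read":
        return identity in groups.get(resource["group"], set())
    # "owner-only", or any flag we don't recognise: deny.
    return False


# Group name -> set of identities (see (9)) allowed to read.
groups = {"friends": {"https://alice.example/#me"}}

page = {"owner": "https://sandro.example/#me",
        "access": "group-read", "group": "friends"}

alice_ok = can_read(page, "https://alice.example/#me", groups)
bob_ok = can_read(page, "https://bob.example/#me", groups)
```

There is no write check at all, since in this model nobody writes into 
anyone else's space.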

== 11.  Combined Metadata and Content operations

I don't think I can put this very crisply, but I've started thinking 
about resources as looking like this:

{ property1: value1,
    property2: value2,
    ...
    content: "<html>....</html>",
    contentType: "text/html"
    ...
}

and it's so much nicer.   Basically, every resource is property-value 
pairs, and some of that pv data is "content".    If you don't do 
something like this, queries and notifications and all that require us 
to bifurcate into a mechanism that's all about the content and another 
that's all about the metadata.

LDP-RS's then become content-free resources, or null-content resources, 
but much less fundamentally different.   With the current LDP framing, 
what happens when you PUT an image to an LDP-RS or PUT rdf to what you 
created as an image?   This model clears that up nicely.

But this might only work in the face of other assumptions I'm making, 
like the only triples at <R> are in a graph rooted at <R>, so you can 
think of them all as properties of R.    Also I've resolved httpRange-14 
by saying I'm only interested in proper information-resource-denoting 
URLs, and you can use indirect properties for talking about people, 
places, events, etc.    Maybe those radical assumptions are necessary 
for making this work.

== 12.  Forwarding

We need to be able to move resources, because it's very hard to pick a 
URL and stick to it for decades.   And if it's used as part of other 
apps, and you don't stick to it, you'll break them.   The fear of this 
will, I suspect, significantly impede adoption.

I propose three mechanisms.   Any one of them might work; between the 
three I'm fairly confident.

1.  Servers SHOULD check all their outgoing links at least once every 30 
days.   If they get a 301 response, they SHOULD update the link in 
place.   A valid reason not to change it is that this is some kind of 
frozen/static page that can't be changed.

2.  When a client gets a 301, following a link it got from server A, it 
should notify server A, so A can rewrite the link sooner.   This could 
use a .well-known end-point on A, or there could be a 
Report-Link-Issues-To header on every resource which A serves telling 
clients how to report any 301s (and 404s) it finds.

3.  The notification mechanism, (2) above, should include move 
notifications, so when a page is being watched, if it moves the watcher 
will be immediately notified and able to change its link.

All this works much better if in addition to 301 we have a way to say a 
whole tree has moved.    That is, all URLs starting 
http://foo.example/x/ should now be considered redirected to 
http://bar.example/y/, etc.
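
A sketch of what a server could do with such a whole-tree move when 
rewriting its stored links.  How the move gets announced is left open; 
the moved_trees table and the function name are hypothetical.

```python
def rewrite_moved(url, moved_trees):
    """Rewrite url if it falls under a moved tree; otherwise return it as is."""
    # Try longer prefixes first so the most specific move wins.
    for old_prefix in sorted(moved_trees, key=len, reverse=True):
        if url.startswith(old_prefix):
            return moved_trees[old_prefix] + url[len(old_prefix):]
    return url


# Old URL prefix -> new URL prefix.
moved_trees = {"http://foo.example/x/": "http://bar.example/y/"}

moved = rewrite_moved("http://foo.example/x/photos/1.jpg", moved_trees)
untouched = rewrite_moved("http://foo.example/z/1.jpg", moved_trees)
```

One table entry covers every URL under the tree, so the server does not 
need to see a separate 301 for each link.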

With these mechanisms in place, links from compliant servers should 
start to transition quickly and drop off to zero after 30 days. 
Obviously links from hand-maintained resources, and printed on paper, 
etc, won't change, but those are usually consumed by humans who are 
better able to deal with a broken link anyway.

== More...

I'm sure there's more, but this gives the general shape of things. Do we 
want the new charter to target some of these?   To allow for some of 
these?   And again: how do we assess when each of these is mature enough 
for a WG to begin looking at it?

Thanks for considering this.

       -- Sandro


[1] https://www.youtube.com/watch?v=z0_XaJ97rF0
[2] http://www.browserscope.org/?category=network&v=top
Received on Sunday, 9 November 2014 17:07:00 UTC