Re: Requesting reviews of Provenance Access and Query document. from Graham Klyne on 2013-03-27 (public-ldp@w3.org from March 2013)

From: Graham Klyne <Graham.Klyne@zoo.ox.ac.uk>
Date: Wed, 27 Mar 2013 12:48:03 +0000
To: Erik Wilde <dret@berkeley.edu>
CC: LDP <public-ldp@w3.org>, W3C provenance WG <public-prov-wg@w3.org>
Message-ID: <5152EA83.7080600@zoo.ox.ac.uk>

Hi Erik,

Thanks for your comments.

I think the point of your message is concern for scalability because of resource 
requirements for handling unknown numbers of subscriptions in a pub/sub message 
push environment.

My short response is that the mechanism described is not a pub/sub mechanism, in 
that there is no subscription, so those concerns do not apply.

(I think it's entirely possible that pingback can be used *with* a pub-sub 
service, and the scalability issues could indeed be of concern for any pub-sub 
mechanism used, but that's a separate discussion.)

More details below.

On 26/03/2013 21:26, Erik Wilde wrote:
> hello graham.
>
> thanks for your email!
>
> On 2013-03-14 9:44 , Graham Klyne wrote:
>> The section on pingbacks
>> (http://www.w3.org/TR/2013/WD-prov-aq-20130312/#forward-provenance) is
>> intended to provide a way for a publisher to learn about additional
>> provenance related to a published resource.  We would be interested to
>> hear from web services experts if they have any experience of using HTTP
>> in this way, and if there are any known problems with the proposed
>> approach.  (The PROV WG has agreed to drop the implied directionality in
>> the name used and description.)
>
> if i understand this correctly, this is supposed to be some kind of push
> mechanism, instead of the usual pull model. there is little in terms of
> standardized/widely deployed technology on the web so far. browsers have been
> using "long pulls", but that's not very scalable and mostly because of some
> restrictions inherent to browsers.

If you mean "push mechanism" in the sense of being initiated by the provider of 
information, then yes.  But it is very different to techniques like long polling 
in that the provider of information is also the client (initiator) of the 
transaction.  In this, I think it's no different to any other HTTP POST or PUT 
operation.

For long-polling (and pub-sub), the recipient of information is assumed to have 
some a priori awareness of the provider.  The pingback mechanism is the other way 
round: the provider is assumed to have a priori awareness of the recipient.

Push mechanisms are often used as a way to avoid the inefficiencies (or 
effort/latency trade-off) of polling; pingback is different: it is designed to 
allow discovery of information that is not generally discoverable through polling.
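
To illustrate the direction of the exchange, here is a minimal sketch of the 
request a provenance provider might build.  It assumes the pingback body is a 
list of provenance URIs sent as text/uri-list; the function name and the 
example.org/example.com URIs are hypothetical and not taken from the draft.

```python
from urllib.request import Request


def build_pingback_request(pingback_uri, provenance_uris):
    """Build (but do not send) a POST that reports provenance URIs.

    The provider of the information is the HTTP client here: it already
    knows the recipient's pingback URI, and no subscription is involved.
    """
    # text/uri-list (RFC 2483): one URI per line, lines end with CRLF.
    body = "".join(uri + "\r\n" for uri in provenance_uris).encode("utf-8")
    return Request(
        pingback_uri,
        data=body,
        headers={"Content-Type": "text/uri-list"},  # assumed media type
        method="POST",
    )


# Hypothetical URIs, for illustration only.
req = build_pingback_request(
    "http://example.org/pingback/resource-1",
    ["http://example.com/provenance/a", "http://example.com/provenance/b"],
)
```

Passing the result to urllib.request.urlopen would send it; that step is 
omitted so the sketch has no network side effects.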

>
> the connection to LDP is a very interesting one, because there could be an
> interesting opportunity to leverage LDP's model. for this, i'll explain how this
> actually does work in Atom (which has a similar model of collections/entries).
> Atom provides feeds that most often are sorted by date. PuSH (PubSubHubbub, a
> now defunct google activity) defined a model that allowed people to subscribe
> to feeds by registering a callback URI. for any update in the feed, the PuSH
> server would package the update as an Atom entry and then POST it to the
> callback URI.
>
> this being a pubsub model, this means that the PuSH servers must maintain
> subscriber lists (of all callback URIs). in PuSH, this can be layered, because a
> feed can advertise a hub for it (where clients can go and subscribe). While PuSH
> worked, it never gained critical mass, and was hampered by the fact that there
> was no standardized protocol how to subscribe/unsubscribe, so that was left for
> implementers to figure out. a more promising protocol should probably cover this
> aspect as well.
>
> to summarize: when LDP is stable, it would be conceivable for LDP services to
> support a similar service: clients interested in updates would subscribe to a
> URI, and would get pushed updates in the form of LDP data (which would be
> exactly the same as they would have gotten when GETting the updates resource),
> thanks to the RESTful design of the protocol: URIs are the interaction points for
> resources, and we can build protocols (such as this LDP/PuSH design) on top of it.
>
> in fact, some PuSH implementations were even smart enough to batch push
> messages: when a client subscribed to multiple collections, or several updates
> happened, they would send "batch updates" that would be POSTed to the callback
> URI. the listening "client" would then act as if it had seen multiple updates
> getting published in the feed (had it used pull interactions).
>
> LDP is definitely pull only, allowing you to GET resources at well defined URIs
> (GET the collection and GET all updates, GET individual resources and GET all
> data about them), so we will provide the right foundation in terms of a RESTful
> design. layering LDPush on it actually would be a nice validation of the
> benefits of RESTful design, but would require additional protocol parts
> (probably) such as how to handle subscription and unsubscription.
>
> implementation issues also arise in terms of scalability: how to deal with
> millions of subscribers? many PuSH implementations chose to handle this
> pragmatically and just automatically cancel subscriptions (requiring clients to
> refresh periodically), thus making it easier for servers to deal with the
> problem of subscriptions piling up because clients subscribed and never bothered
> to unsubscribe.

A difference between what is proposed for provenance pingback and pub/sub 
mechanisms is that there is no "fan out" of data, and hence no requirement to 
record subscriptions.  It's more the reverse, declaring a kind of collection 
point for information, but (intentionally) being quite silent about what the 
recipient may do with that data.  As such, it's more of a discovery mechanism 
than a propagation mechanism.

So, while I can appreciate that there may be applications that use pingbacks in 
conjunction with pub/sub (or other distribution mechanisms), I don't think such 
considerations have any direct bearing on the pingback as described.  If LDP 
does, in due course, introduce frameworks that support pub-sub distribution, I 
would see the pingback as being complementary: some systems may choose to pass 
on information from incoming pingbacks to a set of subscribers using these 
mechanisms.

I also recognize that the pingback mechanism may be used as part of a larger 
pub-sub framework (in that subscriptions may be created for pingback resources), 
and any system that uses pub-sub in such a way will indeed need to be aware of 
subscription scaling issues.  But such use is not required by, and is outside 
the scope of, the mechanism specified.

Thus, I feel the only scaling concern for pingback services as described is 
whether they can deal with the potential numbers of incoming messages.  This is 
covered somewhat in the security considerations section.  In particular, there 
is no requirement on a server to do anything in particular with a pingback, so 
it is free to take steps to protect its resources from abuse.
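
As a rough sketch of that last point, a receiver might simply cap how much it 
keeps per resource.  Everything here is hypothetical (the class name, the 
limit, and the choice of 204 and 429 as status codes are assumptions, not 
requirements of the draft); it only shows that the receiver holds collected 
URIs and no subscriber list.

```python
class PingbackCollector:
    """Hypothetical receiver: collects URIs, keeps no subscriptions."""

    def __init__(self, max_uris_per_target=100):
        self.max_uris_per_target = max_uris_per_target
        self.received = {}  # target resource -> set of provenance URIs

    def accept(self, target, body):
        """Handle one pingback body (text/uri-list); return a status code."""
        # Skip blank lines and '#' comment lines, as text/uri-list allows.
        uris = {
            line.strip()
            for line in body.splitlines()
            if line.strip() and not line.startswith("#")
        }
        merged = self.received.get(target, set()) | uris
        if len(merged) > self.max_uris_per_target:
            return 429  # assumed choice: refuse rather than store more
        self.received[target] = merged
        return 204  # assumed choice: accepted, nothing to return


collector = PingbackCollector(max_uris_per_target=2)
first = collector.accept("resource-1", "http://a.example/p1\r\nhttp://a.example/p2\r\n")
second = collector.accept("resource-1", "http://b.example/p3\r\n")
```

With a limit of two, the first call is stored and the second is refused, so 
the cost to the receiver is bounded regardless of how many senders exist.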

#g
--
Received on Wednesday, 27 March 2013 12:53:35 UTC