Re: RDF Update Feeds

2009/11/21 Michael Hausenblas <michael.hausenblas@deri.org>:
> Georgi, Hugh,
>
>>> Could be very simple by expressing: "Pull our update-stream once
>>> per second/minute/hour in order to be *enough* up-to-date".
>
> Ah, Georgi, I see. You seem to emphasise the quantitative side,
> whereas I just want to flag what kind of source it is. I agree that
> "Pull our update-stream once per second/minute/hour in order to be
> *enough* up-to-date" should be available; however, I think the
> regular/irregular distinction should be made available alongside the
> update frequency. My main use case comes from the LOD
> application-writing area. I noticed that I have quite often written
> code that essentially does the same thing: based on the type of data
> source, it either fetches a live copy of the data or uses already
> locally available data. Now, if dataset publishers were to declare
> the dynamics of their datasets, one could, I guess, write such a LOD
> cache quite easily, abstracting the necessary steps and hence
> offering a reusable solution. I'll follow up on this soon via a blog
> post with a concrete example.

If you want to poll single resources at regular (i.e., less than
daily) intervals, then you are likely to flood the server just
looking for potential updates in cases where the server genuinely
doesn't know how often a particular resource will be updated.
DBpedia-live is an example: its update rate depends entirely on the
amount of activity on Wikipedia, which is likely to spike at certain
times, then even out, and possibly drop off for months at a time.

Using a change feed with clients polling once per period on a
sliding-window feed will break down whenever the transient update
rate is so fast that a full window scrolls past the feed between
consecutive client polls. There is no way to guarantee a maximum
update rate for DBpedia-live, for example, so the published poll rate
would simply have to be as often as the server can handle, given the
size of the RSS file required to announce which resources have
recently changed. The main reason RSS isn't useful for consistency,
IMO, is that it relies on clients polling very regularly; otherwise
they permanently miss information, and the RSS reader ends up holding
only a subset of what was actually published on the feed.
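
The break-down condition is simple arithmetic: a client polling every
P seconds against a feed that retains its last W entries is only
guaranteed lossless while the update rate stays below W/P entries per
second. A toy calculation (all numbers invented):

    # Illustrative numbers only, not measured DBpedia-live figures.
    window_entries = 100     # entries the sliding-window feed retains
    poll_interval = 300.0    # seconds between consecutive client polls

    # Highest sustained update rate a client can tolerate without loss:
    max_safe_rate = window_entries / poll_interval   # 0.33 updates/s

    # If a Wikipedia activity spike pushes the rate to 5 updates/s,
    # a full window scrolls past in:
    spike_rate = 5.0
    window_lifetime = window_entries / spike_rate    # 20 s, far below 300 s
    # ...so every client polling less often than once per 20 s
    # permanently misses the entries that scrolled out of the window.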

The mechanism that DBpedia-live uses to monitor Wikipedia might be a
candidate; however, it still suffers from clients dropping out for
periods of time and either missing updates or causing large spikes
when they come back online. If clients miss a day's worth of
DBpedia-live notifications, could they catch up without effectively
mounting a DOS on the server while polling for all the announcements
they missed?
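
One way to make catch-up safe would be archived feed pages in the
style of RFC 5005 (Feed Paging and Archiving), which a lapsed client
can walk backwards at its own pace instead of re-polling every
resource. A sketch of the client side; the feed URL and entry id are
hypothetical:

    import time
    import urllib.request
    import xml.etree.ElementTree as ET

    ATOM = "{http://www.w3.org/2005/Atom}"
    FEED_URL = "http://example.org/dbpedia-live/updates.atom"  # hypothetical
    last_seen_id = "urn:update:12345"  # last entry this client processed

    def fetch(url):
        return ET.fromstring(urllib.request.urlopen(url, timeout=10).read())

    url, missed = FEED_URL, []
    while url:
        feed = fetch(url)
        entries = feed.findall(ATOM + "entry")
        missed.extend(entries)  # may include some already-seen entries
        if last_seen_id in [e.findtext(ATOM + "id") for e in entries]:
            break  # caught up; everything older is already applied
        # Follow the RFC 5005 archive link back in time rather than
        # hammering per-resource URIs.
        prev = [l.get("href") for l in feed.findall(ATOM + "link")
                if l.get("rel") == "prev-archive"]
        url = prev[0] if prev else None
        time.sleep(1)  # pace requests so catch-up is not a de-facto DOS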

If this is going to work while minimising bandwidth usage, there
needs to be some mechanism for clients to check whether information
is newer than their cached copy without any actual RDF being
transferred. RDF databases don't currently support this, and it is
particularly hard to support where the GRAPH used in the database is
not meant to map to a single document, such as <http://dbpedia.org>
on DBpedia.
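
On the HTTP level the building blocks already exist as conditional
requests (If-None-Match / If-Modified-Since), which transfer no RDF
when nothing has changed; the missing piece is RDF stores computing
such validators per graph. A minimal client-side sketch, with the
resource URI and ETag value purely illustrative:

    import urllib.error
    import urllib.request

    uri = "http://dbpedia.org/data/Berlin.rdf"  # example resource
    cached_etag = '"abc123"'  # validator stored with the cached copy

    req = urllib.request.Request(uri)
    req.add_header("If-None-Match", cached_etag)
    try:
        resp = urllib.request.urlopen(req, timeout=10)
        rdf = resp.read()  # 200: graph changed, refresh the cache
        cached_etag = resp.headers.get("ETag", cached_etag)
    except urllib.error.HTTPError as e:
        if e.code == 304:
            pass  # 304 Not Modified: cache is fresh, nothing transferred
        else:
            raise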

Cheers,

Peter

Received on Saturday, 21 November 2009 21:01:11 UTC