
Re: RDF Update Feeds

From: Niklas Lindström <lindstream@gmail.com>
Date: Tue, 17 Nov 2009 18:44:45 +0100
Message-ID: <cf8107640911170944l3f61f3dcxda7446bc44fe5922@mail.gmail.com>
To: Georgi Kobilarov <georgi.kobilarov@gmx.de>
Cc: public-lod@w3.org
Hi Georgi, all,

I'm writing a long response here, partly because I already started
formulating one related to a previous post to this list:
<http://lists.w3.org/Archives/Public/public-lod/2009Nov/0003.html>.

I'm working with the Swedish legal information system, where we've
settled on using Atom feeds to incrementally collect resources from
about a hundred agencies (one of which is currently publishing in this
format; the rest are to follow over time).

This is done with a simple "timeline of resources" approach, including
deletes (no real diffs below the "representation" level).


== Approach ==

I said *resources* because we gather all representations of a given
resource. We require the agencies to publish RDF, plus at least one
alternate representation in PDF form (which is the state of the art
here; hopefully RDFa or similar will come to the table in the future).

To achieve this, we combine Feed Archives (RFC 5005:
<http://tools.ietf.org/html/rfc5005#section-4>) and Tombstones
(currently just an I-D:
<http://tools.ietf.org/html/draft-snell-atompub-tombstones-06>). The
method is simple:

1. Publish a subscription feed where all new and modified entries
appear. Add a tombstone if something has been mispublished (either a
malformed entry ID or genuinely *wrong* data); you can publish the
entry again once it's corrected.

2. If the feed becomes too long, cut it off into archives, preserving
the temporal order of publications/updates/deletes. It is technically
OK to remove outdated items; we leave it as a normative decision
whether older changes must be traceable.

(There are no hard requirements for preserving outdated
representations. That means outdated but preserved entries in archives
are merely historical; what is easier mostly depends on the underlying
implementation -- flat-file archives keep everything, while
non-versioned databases only need to track deletes and the min/max
"updated" scope of each archived page.)

3. To collect, follow the outline in RFC 5005. Basically: climb the
feed backwards, keeping track of id/updated pairs from entries (and
tombstones) and keeping the youngest for each ID; then collect every
entry (i.e. the linked content of every entry), preferably forwards in
time, to ensure that your copy of the collected dataset is built up in
the same relative temporal order (for each source, not necessarily
between sources).
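The collection step above can be sketched roughly as follows. This is
a minimal sketch in Python; the function names and the
(id, updated, deleted) tuple shape are my own illustration, not part
of RFC 5005 or the tombstones I-D:

```python
def latest_states(pages):
    """Given feed pages climbed backwards (current feed first, then
    prev-archives), return the youngest (updated, deleted) state for
    each entry ID.  Each page is a list of (entry_id, updated, deleted)
    tuples; tombstones are marked deleted=True.  ISO 8601 timestamps
    in UTC compare correctly as plain strings."""
    state = {}
    for page in pages:
        for entry_id, updated, deleted in page:
            if entry_id not in state or updated > state[entry_id][0]:
                state[entry_id] = (updated, deleted)
    return state


def collection_order(state):
    """IDs of surviving (non-deleted) entries, oldest first, so the
    local copy is rebuilt in the source's relative temporal order."""
    alive = sorted((updated, entry_id)
                   for entry_id, (updated, deleted) in state.items()
                   if not deleted)
    return [entry_id for updated, entry_id in alive]
```

A real collector would then fetch the content and alternate links of
each surviving entry in that order.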

There has been no scope in the project for a general approach of
standardizing and promoting this use, but I have personally set up
COURT ("Crafting Organization Using Resources and Time") at
<http://code.google.com/p/court/> as a possible means of doing this.
(As mentioned there, there are a number of more or less
similar/related efforts, some very large in scope, which I'm somewhat
wary of.)


== Details ==

We found the entry concept in Atom feeds perfect for constituting a
"resource manifest". The agencies publish entries whose entry ID is
the canonical URI of a legal document (with the common base URI being
our -- not yet live -- system). Entries contain content and alternate
links pointing to the different representations stored at the
agencies' websites.
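As a sketch, such an entry could look like the following (all URIs are
made up for illustration):

```xml
<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://rinfo.example.org/publ/sfs/1999:175</id>
  <updated>2009-11-17T12:00:00Z</updated>
  <title>SFS 1999:175</title>
  <!-- the RDF representation is the entry content -->
  <content type="application/rdf+xml"
           src="http://agency.example.org/sfs/1999:175/rdf"/>
  <!-- an alternate representation, here the PDF -->
  <link rel="alternate" type="application/pdf"
        href="http://agency.example.org/sfs/1999:175/pdf"/>
</entry>
```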

We "republish" every entry collected under their canonical URI:s, with
conneg etc. in place, and possible services (such as SPARQL
interfaces) built as indexers upon this data store. This can easily be
done since it itself publishes a subscription feed (plus archives)
with these entries (with timestamps corresponding to then we collected
the respective entries). This will be the official data source of
legal information, incrementally updated and persistent over time.

Using enclosure links, entries can also carry "attachments" such as
appendices and/or images -- a very simple way of carrying compound
documents (with any semantics relating to the composition being
described in the delivered RDF).
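For example, an appendix could be attached inside an entry like this
(hypothetical URI again):

```xml
<link rel="enclosure" type="application/pdf"
      length="123456"
      href="http://agency.example.org/sfs/1999:175/appendix-1.pdf"/>
```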

Another benefit I've found with this is that I can publish our own
tiny "datasets" (agency descriptions *including the agency source
feeds* and other supporting data) as just another source (the "admin
feed"). With entry enclosure links I can attach more RDF partitioned
to our needs/restrictions, and simply wipe entries if they become too
large, publishing new "repartitioned" resources carrying the RDF.

(In theory this also means that the "central system" can be replaced
with a PURL-like redirector, if the agency websites could be deemed
persistent over time (which they currently cannot).)


== Other approaches ==

* The Library of Congress has similar Atom feeds and tombstones for
their subject headings: <http://id.loc.gov/authorities/feed/> (paged
feeds; no explicit archives that I'm aware of, so I'm not sure about
the collectability of the entire dataset over time -- this can be
achieved with regular paging if you're sure you won't drop items while
climbing as the dataset is updated).

* The OAI-PMH <http://www.openarchives.org/pmh/> is an older effort
with good specifications (though not as RESTful as e.g. Atom, GData,
etc.). I'm interested in seeing if they'd be interested in something
like COURT as well, since they went for Atom (and RDF) in their
OAI-ORE specs <http://www.openarchives.org/ore/>.

* You can use Sitemap extensions
<http://sw.deri.org/2007/07/sitemapextension/> to expose lists of
archive dumps (e.g. <http://products.semweb.bestbuy.com/sitemap.xml>),
which could be crawled incrementally. But I don't know how to easily
handle deletes without recollecting it all.

* The "COURT" approach of our system has a rudimentary "ping" feature
so that sources can notify the collector of updated feeds. This could
of course be improved by using PubSubHubbub
<http://pubsubhubbub.googlecode.com/svn/trunk/pubsubhubbub-core-0.2.html>,
but that's currently not a priority for us.


Best regards,
Niklas Lindström

PS. Anyone interested in this COURT approach, *please* contact me; I
am looking for ways to formalize this for easy reuse, not least for
disseminating government and other open data in a uniform manner --
both on a specification/recommendation level, and by gathering
implementations (possibly built upon existing frameworks/content
repositories/CMSes).



On Tue, Nov 17, 2009 at 4:45 PM, Georgi Kobilarov
<georgi.kobilarov@gmx.de> wrote:
> Hi all,
>
> I'd like to start a discussion about a topic that I think is getting
> increasingly important: RDF update feeds.
>
> The linked data project is starting to move away from releases of large data
> dumps towards incremental updates. But how can services consuming rdf data
> from linked data sources get notified about changes? Is anyone aware of
> activities to standardize such rdf update feeds, or at least aware of
> projects already providing any kind of update feed at all? And related to
> that: How do we deal with RDF diffs?
>
> Cheers,
> Georgi
>
> --
> Georgi Kobilarov
> www.georgikobilarov.com
>
>
>
>
Received on Tuesday, 17 November 2009 17:50:47 UTC
