- From: Nathan <nathan@webr3.org>
- Date: Tue, 17 Nov 2009 18:27:21 +0000
- To: Niklas Lindström <lindstream@gmail.com>
- CC: Georgi Kobilarov <georgi.kobilarov@gmx.de>, public-lod@w3.org
very short, non-detailed reply from me!

pub/sub, atom feeds, RDF over XMPP were my initial thoughts on the matter
last week - essentially triple (update/publish) streams on a pub/sub basis,
suitably decentralized, [snip] then my thoughts switched to the fact that
RDF is not XML (or any other serialized format), so to keep it unrestricted
I guess the concept would need to be specified first and then implemented in
whatever formats/ways people saw fit, as has been the case with RDF.

this subject is probably not something that should be left for long though..
my (personal) biggest worry about 'linked data' is that junk data will be at
an all-time high, if not worse, and not nailing this down early on (as in
weeks/months at max) could contribute to the mess considerably.

regards!

Niklas Lindström wrote:
> Hi Georgi, all,
>
> I'm writing a long response here, partly because I already started
> formulating one related to a previous post to this list:
> <http://lists.w3.org/Archives/Public/public-lod/2009Nov/0003.html>.
>
> I'm working with the Swedish legal information system, where we've settled
> on using Atom feeds to incrementally collect resources from about a hundred
> agencies (one of which is currently publishing in this format; the rest are
> to follow over time).
>
> This is done with a simple "timeline of resources" approach, including
> deletes (no kind of real diffs below the "representation" level).
>
>
> == Approach ==
>
> I said *resources* because we gather all representations of a given
> resource. We require them to publish RDF, and then at least one alternate
> representation in PDF form (which is the state of the art here; hopefully
> RDFa or similar will come to the table in the future).
>
> To achieve this, we combine Feed Archives (an RFC:
> <http://tools.ietf.org/html/rfc5005#section-4>) and Tombstones (currently
> just an I-D:
> <http://tools.ietf.org/html/draft-snell-atompub-tombstones-06>). The
> method is simple:
>
> 1. Publish a subscription feed where all new and modified entries appear.
> Add a tombstone if a mispublication has been made (either a malformed
> entry ID or really *wrong* data; you can publish the entry again when it's
> ok).
>
> 2. If the feed becomes too long, cut it off into archives, preserving the
> order in time of publications/updates/deletes. It is technically ok to
> remove outdated items; we leave it as a normative decision whether older
> changes must be traceable.
>
> (There are no hard requirements for preserving outdated representations.
> That means outdated but preserved entries in archives are merely
> historical; what is easier mostly depends on the underlying implementation
> -- flat file archives keep everything, non-versioned databases only really
> need to keep track of deletes and the min-/max-updated scope for each
> archived page.)
>
> 3. To collect, follow the outline in RFC 5005, or basically: climb the
> feed backwards, keep track of id/updated pairs from entries (and
> tombstones), keeping the youngest for each ID, and then collect every
> entry (the linked content from every entry), preferably forwards in time
> (to ensure that your copy of the collected dataset is built up in the same
> relative temporal order (for each source, not necessarily between them)).
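[A rough, non-authoritative sketch of the collection loop described in step
3 above, assuming Python's standard library only (urllib,
xml.etree.ElementTree); helper names such as collect_dataset() are
illustrative and not part of the original thread.]

    # Minimal sketch: plain Atom archives over HTTP, RFC 5005
    # rel="prev-archive" links, and at:deleted-entry tombstones assumed.
    import urllib.request
    import xml.etree.ElementTree as ET

    ATOM = "{http://www.w3.org/2005/Atom}"
    AT = "{http://purl.org/atompub/tombstones/1.0}"  # tombstones namespace

    def fetch(url):
        with urllib.request.urlopen(url) as resp:
            return ET.fromstring(resp.read())

    def collect_dataset(subscription_url):
        """Climb the feed and its archives backwards, keep only the youngest
        event per entry id, then replay the survivors oldest-first."""
        latest = {}  # entry id -> (updated, entry element, or None if deleted)
        url = subscription_url
        while url:
            feed = fetch(url)
            for entry in feed.findall(ATOM + "entry"):
                eid = entry.findtext(ATOM + "id")
                updated = entry.findtext(ATOM + "updated")
                # lexical comparison is fine for uniformly formatted UTC stamps
                if eid not in latest or updated > latest[eid][0]:
                    latest[eid] = (updated, entry)
            for tomb in feed.findall(AT + "deleted-entry"):
                eid, when = tomb.get("ref"), tomb.get("when")
                if eid not in latest or when > latest[eid][0]:
                    latest[eid] = (when, None)  # a younger delete wins
            # follow rel="prev-archive" (RFC 5005) to the next-older page
            prev = [l.get("href") for l in feed.findall(ATOM + "link")
                    if l.get("rel") == "prev-archive"]
            url = prev[0] if prev else None
        # yield forwards in time, skipping ids whose youngest event is a delete
        for eid, (updated, entry) in sorted(latest.items(),
                                            key=lambda kv: kv[1][0]):
            if entry is not None:
                yield eid, updated, entry

Fetching each surviving entry's linked content in that forward order is what
keeps the local copy in the same relative temporal order as the source.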
>
> There has been no scope in the project for taking a general approach of
> standardizing and promoting this use, but I have personally set up COURT
> ("Crafting Organization Using Resources and Time") at
> <http://code.google.com/p/court/> as a possible means of doing this. (As
> mentioned there, there are a bunch of more or less similar/related efforts
> (some very large in scope, which I'm kinda wary of).)
>
>
> == Details ==
>
> We found the entry concept in Atom feeds perfect for constituting a
> "resource manifest". The agencies publish entries with the entry ID being
> the canonical URI of a legal document (with the common base URI being our
> -- not yet live -- system). Entries contain content and alternate links
> pointing to the different representations stored at their websites.
>
> We "republish" every entry collected under their canonical URIs, with
> conneg etc. in place, and possible services (such as SPARQL interfaces)
> built as indexers upon this data store. This can easily be done since the
> system itself publishes a subscription feed (plus archives) with these
> entries (with timestamps corresponding to when we collected the respective
> entries). This will be the official data source of legal information,
> incrementally updated and persistent over time.
>
> Using enclosure links, entries can also carry "attachments" such as
> appendices and/or images or the like -- a very simple way of carrying
> compound documents (but with any semantics relating to the composition
> being described in the delivered RDF).
>
> Another benefit I've found with this is that I can publish our own tiny
> "datasets" (agency descriptions *including the agency source feeds* and
> other supporting data) as just another source (the "admin feed"). With
> entry enclosure links I can attach more RDF partitioned to our
> needs/restrictions, and just wipe entries if they become too large and
> publish new "repartitioned" resources carrying RDF.
>
> (In theory this also means that the "central system" can be replaced with
> a PURL-like redirector, if the agency websites could be deemed persistent
> over time (which they currently cannot).)
>
>
> == Other approaches ==
>
> * The Library of Congress has similar Atom feeds and tombstones for its
> subject headings: <http://id.loc.gov/authorities/feed/> (paged feeds; no
> explicit archives that I'm aware of, so I'm not sure about the
> collectability of the entire dataset over time -- this can be achieved
> with regular paging if you're sure you won't drop items when climbing as
> the dataset is updated).
>
> * OAI-PMH <http://www.openarchives.org/pmh/> is an older effort with good
> specifications (though not as RESTful as e.g. Atom, GData etc.). I'm
> curious whether they'd be interested in something like COURT as well
> (since they went for Atom (and RDF) in their OAI-ORE specs
> <http://www.openarchives.org/ore/>).
>
> * You can use Sitemap extensions
> <http://sw.deri.org/2007/07/sitemapextension/> to expose lists of archive
> dumps (e.g. <http://products.semweb.bestbuy.com/sitemap.xml>), which could
> be crawled incrementally. But I don't know how to easily do deletes
> without recollecting it all..
>
> * The "COURT" approach of our system has a rudimentary "ping" feature so
> that sources can notify the collector of updated feeds. This could of
> course be improved by using PubSubHubbub
> <http://pubsubhubbub.googlecode.com/svn/trunk/pubsubhubbub-core-0.2.html>,
> but that's currently not a priority for us.
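[A hedged illustration of the "resource manifest" entry described under
Details above, again using Python's ElementTree; every URI below is a
placeholder, not an identifier from the actual Swedish system: the atom:id
is the document's canonical URI, the content points at the RDF, an alternate
link carries the PDF, and enclosures carry appendices.]

    # Sketch only: builds one manifest-style atom:entry with ElementTree.
    import xml.etree.ElementTree as ET

    ATOM_NS = "http://www.w3.org/2005/Atom"
    ET.register_namespace("", ATOM_NS)

    def manifest_entry(canonical_uri, updated, rdf_url, pdf_url, enclosures=()):
        A = lambda tag: "{%s}%s" % (ATOM_NS, tag)
        entry = ET.Element(A("entry"))
        ET.SubElement(entry, A("id")).text = canonical_uri
        ET.SubElement(entry, A("updated")).text = updated
        # the RDF description is the primary content
        ET.SubElement(entry, A("content"),
                      {"src": rdf_url, "type": "application/rdf+xml"})
        # alternate representation (PDF, the current state of the art)
        ET.SubElement(entry, A("link"), {"rel": "alternate",
                                         "href": pdf_url,
                                         "type": "application/pdf"})
        # appendices, images etc. travel as enclosure links
        for href, mediatype in enclosures:
            ET.SubElement(entry, A("link"), {"rel": "enclosure",
                                             "href": href,
                                             "type": mediatype})
        return entry

    # purely illustrative, placeholder URIs
    print(ET.tostring(manifest_entry(
        "http://example.org/publ/sfs/2009:123",
        "2009-11-17T18:27:21Z",
        "http://agency.example.org/sfs/2009:123.rdf",
        "http://agency.example.org/sfs/2009:123.pdf",
        enclosures=[("http://agency.example.org/sfs/2009:123/bilaga.pdf",
                     "application/pdf")]), encoding="unicode"))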
>
>
> Best regards,
> Niklas Lindström
>
> PS. Anyone interested in this COURT approach, *please* contact me; I am
> looking for ways to formalize this for easy reuse, not least for
> disseminating government and other open data in a uniform manner. Both on
> a specification/recommendation level, and for gathering implementations
> (possibly built upon existing frameworks/content repos/CMSes).
>
>
>
> On Tue, Nov 17, 2009 at 4:45 PM, Georgi Kobilarov
> <georgi.kobilarov@gmx.de> wrote:
>> Hi all,
>>
>> I'd like to start a discussion about a topic that I think is getting
>> increasingly important: RDF update feeds.
>>
>> The linked data project is starting to move away from releases of large
>> data dumps towards incremental updates. But how can services consuming
>> RDF data from linked data sources get notified about changes? Is anyone
>> aware of activities to standardize such RDF update feeds, or at least
>> aware of projects already providing any kind of update feed at all? And
>> related to that: How do we deal with RDF diffs?
>>
>> Cheers,
>> Georgi
>>
>> --
>> Georgi Kobilarov
>> www.georgikobilarov.com
>>
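[On Georgi's closing question about RDF diffs, one naive sketch, assuming
rdflib -- an assumption of this note, not something proposed in the thread:
compare two snapshots of a resource's description at the triple level,
letting rdflib.compare canonicalize blank nodes first.]

    from rdflib import Graph
    from rdflib.compare import to_isomorphic, graph_diff

    def triple_diff(old_rdf, new_rdf, fmt="xml"):
        """Return (removed, added) triple sets between two serialized graphs."""
        old_g = to_isomorphic(Graph().parse(data=old_rdf, format=fmt))
        new_g = to_isomorphic(Graph().parse(data=new_rdf, format=fmt))
        _in_both, only_old, only_new = graph_diff(old_g, new_g)
        return set(only_old), set(only_new)

Whether an update feed should carry such triple-level diffs or simply whole
updated descriptions (as the Atom approach above does) is exactly the open
question raised here.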
Received on Tuesday, 17 November 2009 18:28:27 UTC