- From: Leigh Dodds <leigh.dodds@talis.com>
- Date: Sun, 18 Apr 2010 16:40:41 +0100
- To: giovanni.tummarello@deri.org
- Cc: Linking Open Data <public-lod@w3.org>, "Benko, Tamas" <tamas.benko@deri.org>
Hi,

On 17 April 2010 12:22, Giovanni Tummarello <g.tummarello@gmail.com> wrote:
> i tell you what we're going to be supporting in Sindice very soon and
> it would be great if you could add it to the table:
>
> simple existing sitemaps :-). Sitemaps provide the list of URLs to
> crawl and, for each one, either a "last updated" field or an "update
> frequency".
>
> If the website takes care to update the last-updated field properly, then even
> huge datasets can be kept in sync on a daily (or shorter) basis.
>
> by publishing RDF in entity-based slices (HTML + RDFa) the mechanism
> simply works fine, and it is the same one large web publishers have been
> using for years to expose the deep web, so it is not difficult to
> explain etc.
>
> for large datasets which are large RDF files, the Semantic Sitemap
> extension does the job for us (DBpedia and many others are in Sindice
> because of that)
>
> What do you think?
> ...

Yes, directed and undirected crawling need to be included. I've split
the spreadsheet into two worksheets:

* Approaches for mirroring data, e.g. exports, crawling, etc.
* Approaches for syndicating notifications/changes

The latter is what I had originally, but the mirroring aspects are new.
I've included Semantic Sitemaps there, along with simple dataset
exports, BitTorrent, etc.

A system may choose to simply mirror a dataset regularly, using a dump
or a crawl. Or it may combine an initial mirror with synchronisation
via further update notifications.

Hopefully the new spreadsheet helps tease some of that out:

http://spreadsheets.google.com/pub?key=tLWdskoM-2--vLjUI05e7qQ&output=html

Cheers,

L.

-- 
Leigh Dodds
Programme Manager, Talis Platform
Talis
leigh.dodds@talis.com
http://www.talis.com
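[The sitemap mechanism discussed above can be sketched roughly as follows. This is an illustrative fragment only: the URLs and dataset details are placeholders, and the sc: elements follow the Semantic Sitemaps extension draft as I understand it, so check the spec before relying on the exact element names.]

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">

  <!-- Ordinary sitemap entry: one entity-based slice (HTML + RDFa).
       A crawler can re-fetch it when <lastmod> changes. -->
  <url>
    <loc>http://example.org/resource/alice</loc>
    <lastmod>2010-04-17</lastmod>
    <changefreq>daily</changefreq>
  </url>

  <!-- Semantic Sitemaps extension entry: advertises a whole RDF dump,
       the route by which large datasets such as DBpedia reach Sindice. -->
  <sc:dataset>
    <sc:datasetLabel>Example dataset</sc:datasetLabel>
    <sc:dataDumpLocation>http://example.org/dumps/all.rdf.gz</sc:dataDumpLocation>
    <sc:changefreq>weekly</sc:changefreq>
  </sc:dataset>
</urlset>
```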
Received on Sunday, 18 April 2010 15:46:45 UTC