Efficient Data discovery and Sync Support - proposed method and Sindice implementation from Giovanni Tummarello on 2010-07-09 (public-lod@w3.org from July 2010)

From: Giovanni Tummarello <giovanni.tummarello@deri.org>
Date: Fri, 9 Jul 2010 02:24:02 +0200
To: Linking Open Data <public-lod@w3.org>
Message-ID: <AANLkTimAU3pSE4MuwQPKMWpk1gswlOc9qoNByI8kypvO@mail.gmail.com>

Apologies for cross posting
---------

Dear all

So far, semantic web search engines and semantic aggregation services have
been inserting datasets by hand or have relied on "random walk"-like
crawls with no data completeness or freshness guarantees.

After quite some work, we are happy to announce that Sindice now
supports effective large-scale data acquisition with *efficient syncing*
capabilities based on existing standards (a specific use of the
sitemap protocol).

For example, if you publish 300000 products using RDFa or whatever you want
to use (microformats, 303s, etc.), then by complying with the proposed
method you get the following guarantees from Sindice:

a) your dataset will be crawled completely (this might take some time, since
we do it "politely")
b) ...but you will be crawled in full only once; after that we fetch just the
updated URLs on a daily basis (so you also get a timely data update guarantee)
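To make (b) concrete, here is a rough sketch of how a consumer could pick out
only the changed URLs from a sitemap. It assumes the publisher fills in the
standard <lastmod> element of the sitemap protocol; the example.com URLs, the
function names and the "refetch when <lastmod> is missing" rule are
illustrative assumptions, not a description of Sindice's actual code. The
authoritative requirements are the ones in the published specification.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sitemap; the URLs and dates are made up for illustration.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/product/1</loc>
    <lastmod>2010-07-01</lastmod>
  </url>
  <url>
    <loc>http://example.com/product/2</loc>
    <lastmod>2010-07-08T10:30:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/product/3</loc>
  </url>
</urlset>
"""


def parse_lastmod(text):
    """Parse a <lastmod> value (date or full timestamp) into a UTC datetime."""
    text = text.strip()
    if text.endswith("Z"):
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: values without a time zone are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def urls_to_fetch(sitemap_xml, last_sync):
    """Return the URLs that need fetching since last_sync.

    last_sync is an aware datetime, or None for the initial full crawl.
    Entries without <lastmod> are always returned, since we cannot tell
    whether they changed (an assumption of this sketch).
    """
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.iter(SITEMAP_NS + "url"):
        loc = url.findtext(SITEMAP_NS + "loc")
        if loc is None:
            continue
        lastmod = url.findtext(SITEMAP_NS + "lastmod")
        if last_sync is None or lastmod is None or parse_lastmod(lastmod) > last_sync:
            changed.append(loc.strip())
    return changed


if __name__ == "__main__":
    last_sync = datetime(2010, 7, 5, tzinfo=timezone.utc)
    for url in urls_to_fetch(SAMPLE_SITEMAP, last_sync):
        print(url)
```

With last_sync set to None every URL is returned (the initial complete crawl);
on later runs only entries modified after the previous sync come back, which
is what keeps the daily update cheap for both sides.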

So this is not "crawling" anymore, but rather a live, "DB-like" connection
between remote, diverse datasets, all based on HTTP. In our opinion this is a
*very* important step forward for semantic web data aggregation
infrastructures.

The specification we support (and how to make sure you're being properly
indexed) is published here (pretty simple stuff, actually!):

http://sindice.com/developers/publishing

Results can be seen for websites that already implement it
(you might indeed already be doing so without knowing it):

http://sindice.com/search?q=domain:www.scribd.com+date:last_week&qt=term

Why not make sure that your site can be effectively kept in sync today?

As always, we look forward to comments, suggestions and ideas on how to
better serve your data needs (e.g. yes, we'll also support the OpenLink dataset
sync proposal once the specs are finalized). Feel free to ask specific
questions about this or any other Sindice-related issue on our dev forum:
http://sindice.com/main/forum

Giovanni,
on behalf of the Sindice team http://sindice.com/main/about. Special credit
for this goes to Tamas Benko and Robert Fuller.

p.s. we're hiring

Received on Friday, 9 July 2010 00:24:30 UTC