Re: Keeping crawlers up-to-date

From: Yves Raimond <yves.raimond@gmail.com>
Date: Tue, 28 Apr 2009 15:05:49 +0100
Message-ID: <82593ac00904280705s47acf861l1f01f00104b3cda8@mail.gmail.com>
To: Kingsley Idehen <kidehen@openlinksw.com>
Cc: Linking Open Data <public-lod@w3.org>, Nicholas J Humfrey <njh@aelius.com>, Patrick Sinclair <metade@gmail.com>

On Tue, Apr 28, 2009 at 2:55 PM, Kingsley Idehen <kidehen@openlinksw.com> wrote:
> Yves Raimond wrote:
>> Hello!
>> I know this issue has been raised during the LOD BOF at WWW 2009, but
>> I don't know if any possible solutions emerged from there.
>> The problem we are facing is that data on BBC Programmes changes
>> approximately 50 000 times a day (new/updated
>> broadcasts/versions/programmes/segments etc.). As we'd like to keep a
>> set of RDF crawlers up-to-date with our information we were wondering
>> how best to ping these. pingthesemanticweb seems like a nice option,
>> but it needs the crawlers to ping it often enough to make sure they
>> didn't miss a change.
> What's wrong with that ? :-)
> If PTSW works then consumers should just ping it based on their solution
> change sensitivity thresholds.

Well, if you want to keep track of a massive amount of updates, that
might get quite scary... Especially as I don't think there is a way to
filter the feed to a particular set of RDF documents.

>> Another solution we were thinking of would be to
>> stick either Talis changesets [1] or SPARQL/Update statements in a
>> message queue, which would then be consumed by the crawlers.
> An additional option is for the HTML information resources to be crawled as
> per usual, with RDF-aware crawlers using RDF discovery patterns to locate RDF
> information resource representations via <link/>.

Yes, or just an RDF crawler. But unless you have massive resources, it
would be impossible to keep an image exactly in sync. And if you do,
that would hit BBC Programmes really hard.
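
To make the message-queue idea a bit more concrete, here is a rough sketch in Python. It is only an illustration under several assumptions: the in-process queue.Queue stands in for a real message broker, the ChangeSet class is a simplified stand-in for a changeset (it is not the Talis Changeset vocabulary itself), and the ex:/dc: identifiers are made up. The publisher pushes one changeset per update, and each crawler drains the queue and applies the removals and additions to its local copy, so it never has to re-crawl whole documents.

```python
import queue
from dataclasses import dataclass, field

# A triple is modelled as a plain (subject, predicate, object) tuple of strings.
Triple = tuple[str, str, str]


@dataclass
class ChangeSet:
    """Hypothetical, simplified changeset: triples to remove and to add."""
    subject: str
    removals: list[Triple] = field(default_factory=list)
    additions: list[Triple] = field(default_factory=list)


def publish(q: queue.Queue, change: ChangeSet) -> None:
    """Publisher side: enqueue one changeset per update."""
    q.put(change)


def consume(q: queue.Queue, store: set[Triple]) -> int:
    """Crawler side: drain the queue and apply each changeset in order.

    Returns the number of changesets applied.
    """
    applied = 0
    while True:
        try:
            change = q.get_nowait()
        except queue.Empty:
            break
        # Apply removals before additions so an update can replace a value.
        store.difference_update(change.removals)
        store.update(change.additions)
        applied += 1
    return applied


# Example: a crawler's local mirror holding one (made-up) triple.
updates: queue.Queue = queue.Queue()
mirror: set[Triple] = {("ex:episode1", "dc:title", "Old title")}

publish(updates, ChangeSet(
    subject="ex:episode1",
    removals=[("ex:episode1", "dc:title", "Old title")],
    additions=[("ex:episode1", "dc:title", "New title")],
))
publish(updates, ChangeSet(
    subject="ex:broadcast1",
    additions=[("ex:broadcast1", "ex:broadcast_of", "ex:episode1")],
))

applied = consume(updates, mirror)
```

With something like this, the load on the publisher scales with the number of changes rather than with the number of crawlers polling for them, which is the main attraction compared with re-crawling.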

Received on Tuesday, 28 April 2009 14:06:30 UTC