Re: Keeping crawlers up-to-date

Hi Yves,

I think the two main options are either to publish a feed containing
pointers to changes, or to use a messaging system to push out notifications.

Despite the recent discussion around the benefits of, say, Jabber or other
mechanisms for pushing out notifications, I think that a more RESTful
approach using RSS or Atom feeds might be nicer. Then we can focus on the
resource design, i.e. what kinds of changes we need to publish.

So, for example, for /programmes it may be sufficient to publish a set of
feeds of new items, e.g. one each for brands, episodes, versions, etc. These
could be RSS 1.0 and include additional RDF data as appropriate.

This has the added advantage that a crawler that only wanted to collect
certain information, e.g. about brands, could monitor just the resource(s)
it was interested in. Similarly, with careful resource design, the timing of
updates could also be under the control of the crawler, e.g. new versions in
the last 12 hours, 24 hours or 7 days (avoiding a massive firehose of
updates). The time window can simply be encoded in the feed URI, which avoids
having to build that into the messaging system.
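
To make that a bit more concrete, here is a rough Python sketch of the
crawler side. The URI layout (/changes/{kind}/{window}), the example.org
host and the function names are all made up for illustration; they are not
anything /programmes exposes. It only assumes the feeds are RSS 1.0, where
each item is identified by its rdf:about attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical URI layout, for illustration only.
BASE = "http://example.org/programmes/changes"
KINDS = {"brands", "episodes", "versions"}
WINDOWS = {"12h", "24h", "7d"}

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_NS = "http://purl.org/rss/1.0/"


def feed_uri(kind, window):
    """Build the feed URI for one kind of resource and one time window."""
    if kind not in KINDS or window not in WINDOWS:
        raise ValueError(f"unsupported feed: {kind}/{window}")
    return f"{BASE}/{kind}/{window}"


def changed_resources(feed_xml):
    """Return the rdf:about URI of every item in an RSS 1.0 feed."""
    root = ET.fromstring(feed_xml)
    # In RSS 1.0 the items are direct children of the rdf:RDF root.
    return [item.get(f"{{{RDF_NS}}}about")
            for item in root.findall(f"{{{RSS_NS}}}item")]


# A tiny hand-written feed standing in for a real HTTP response.
SAMPLE_FEED = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.org/programmes/changes/brands/24h">
    <title>New brands, last 24 hours</title>
    <link>http://example.org/programmes/changes/brands/24h</link>
    <description>Illustrative feed</description>
  </channel>
  <item rdf:about="http://example.org/programmes/b0000001">
    <title>Example brand</title>
    <link>http://example.org/programmes/b0000001</link>
  </item>
</rdf:RDF>"""

if __name__ == "__main__":
    print(feed_uri("brands", "24h"))
    for uri in changed_resources(SAMPLE_FEED):
        print("re-crawl:", uri)
```

A crawler that polls once a day would fetch the 24h feed, one that polls
weekly the 7d feed, and in both cases it just re-fetches the RDF for each
URI listed.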

Interested to know what you think.

Cheers,

L.

2009/4/28 Yves Raimond <yves.raimond@gmail.com>

> Hello!
>
> I know this issue has been raised during the LOD BOF at WWW 2009, but
> I don't know if any possible solutions emerged from there.
>
> The problem we are facing is that data on BBC Programmes changes
> approximately 50 000 times a day (new/updated
> broadcasts/versions/programmes/segments etc.). As we'd like to keep a
> set of RDF crawlers up-to-date with our information we were wondering
> how best to ping these. pingthesemanticweb seems like a nice option,
> but it needs the crawlers to ping it often enough to make sure they
> didn't miss a change. Another solution we were thinking of would be to
> stick either Talis changesets [1] or SPARQL/Update statements in a
> message queue, which would then be consumed by the crawlers.
>
> Has anyone tried to tackle this problem already?
>
> Cheers!
> y
>
>
> [1] http://n2.talis.com/wiki/Changeset



-- 
Leigh Dodds
Programme Manager, Talis Platform
Talis
leigh.dodds@talis.com
http://www.talis.com

Received on Tuesday, 28 April 2009 14:50:59 UTC