Re: Keeping crawlers up-to-date

Hello!

> I think the two main options are either to publish a feed containing
> pointers to changes, or to use a messaging system to push out notifications.
>
> Despite the recent discussion around the benefits of, say, Jabber or other
> mechanisms for pushing out notifications, I think that a more RESTful
> approach using RSS or Atom feeds might be nicer. Then we can focus on the
> resource design, i.e. what kinds of changes we need to publish.
>
> So, for example, for /programmes it may be sufficient to publish a set of
> feeds for new items, e.g. brands, episodes, versions, etc. These could be
> RSS 1.0 feeds that include additional RDF data as appropriate.

My only concern about this is that you need to limit the number of
items in the feed. If you have a sudden burst of activity and the
crawlers just ping the feed at regular intervals, they may miss some
updates. However, even with 1M updates in a day (roughly 12 a second),
a feed capped to 100 items would just need the crawlers to ping the
feed about every eight seconds. So that's not too bad.
(Just noticed that Soren's proposal includes pagination of feeds,
which might solve that problem.)

So yes, I guess it could be done using RDF feeds, e.g.
http://www.bbc.co.uk/programmes/updates/2009/04/28/brands.rdf etc.
We'd need to think carefully about which feeds we offer, though.
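
One nice property of date-partitioned URIs like that is that a crawler
which has been offline can catch up simply by enumerating the daily
feeds since its last run. A sketch, assuming the (entirely tentative)
URI pattern from the example above:

from datetime import date, timedelta

# Tentative date-partitioned pattern, after the example URI above.
PATTERN = "http://www.bbc.co.uk/programmes/updates/%Y/%m/%d/brands.rdf"

def catch_up_urls(last_crawled, today):
    # Yield one daily feed URL for each day since the last crawl.
    day = last_crawled + timedelta(days=1)
    while day <= today:
        yield day.strftime(PATTERN)
        day += timedelta(days=1)

for url in catch_up_urls(date(2009, 4, 25), date(2009, 4, 28)):
    print(url)  # fetch and parse each day's feed as above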

Cheers!
y

>
> This has the added advantage that a crawler that only wanted to collect
> certain information, e.g. about brands, could monitor just the resource(s)
> it was interested in. Similarly, with careful resource design, the timing of
> updates could also be under the control of the crawler, e.g. new versions in
> the last 12 hours, 24 hours, or 7 days (avoiding a massive firehose of
> updates). This could easily be done with URIs and avoids having to build
> that logic into the messaging system.
>
> Interested to know what you think.
>
> Cheers,
>
> L.
>
> 2009/4/28 Yves Raimond <yves.raimond@gmail.com>
>>
>> Hello!
>>
>> I know this issue has been raised during the LOD BOF at WWW 2009, but
>> I don't know if any possible solutions emerged from there.
>>
>> The problem we are facing is that data on BBC Programmes changes
>> approximately 50,000 times a day (new/updated
>> broadcasts/versions/programmes/segments etc.). As we'd like to keep a
>> set of RDF crawlers up to date with our information, we were wondering
>> how best to ping them. pingthesemanticweb seems like a nice option,
>> but it needs the crawlers to ping it often enough to make sure they
>> don't miss a change. Another solution we were thinking of would be to
>> stick either Talis changesets [1] or SPARQL/Update statements in a
>> message queue, which would then be consumed by the crawlers.
>>
>> Has anyone tried to tackle this problem already?
>>
>> Cheers!
>> y
>>
>>
>> [1] http://n2.talis.com/wiki/Changeset
>>
>
> --
> Leigh Dodds
> Programme Manager, Talis Platform
> Talis
> leigh.dodds@talis.com
> http://www.talis.com
>
