Re: Keeping crawlers up-to-date

Possibly relevant:
http://www.ietf.org/rfc/rfc5005.txt

Feed paging and archiving for Atom feeds. Paging is a nice solution to  
the "small window" problem with syndication feeds. The concept might  
be translatable to RSS 1.0.
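
To make the paging idea concrete, here is a minimal sketch of a crawler walking a paged Atom feed through its rel="next" links until it reaches an entry it has already seen. It assumes the first page holds the newest entries and that "next" leads to older ones. The example.org URLs, the entry ids, and the helper names are invented for illustration, and the fetch function is passed in so the example runs without a network.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def read_page(xml_text):
    """Return (entry ids, URL of the rel="next" link or None) for one feed page."""
    feed = ET.fromstring(xml_text)
    ids = [entry.findtext(ATOM + "id") for entry in feed.findall(ATOM + "entry")]
    next_url = None
    for link in feed.findall(ATOM + "link"):
        if link.get("rel") == "next":
            next_url = link.get("href")
    return ids, next_url


def collect_until_seen(start_url, fetch, last_seen_id):
    """Walk pages from the newest until last_seen_id turns up; return the new ids."""
    new_ids, url, visited = [], start_url, set()
    while url and url not in visited:  # the visited set guards against link loops
        visited.add(url)
        ids, url = read_page(fetch(url))
        for entry_id in ids:
            if entry_id == last_seen_id:
                return new_ids
            new_ids.append(entry_id)
    return new_ids


def page(ids, next_url=None):
    """Build a tiny Atom page for the demonstration (hypothetical data)."""
    link = '<link rel="next" href="%s"/>' % next_url if next_url else ""
    entries = "".join("<entry><id>%s</id></entry>" % i for i in ids)
    return '<feed xmlns="http://www.w3.org/2005/Atom">%s%s</feed>' % (link, entries)


# Three hypothetical pages, newest entries first.
PAGES = {
    "http://example.org/updates": page(
        ["urn:u:5", "urn:u:4"], "http://example.org/updates?page=2"),
    "http://example.org/updates?page=2": page(
        ["urn:u:3", "urn:u:2"], "http://example.org/updates?page=3"),
    "http://example.org/updates?page=3": page(["urn:u:1"]),
}

print(collect_until_seen("http://example.org/updates", PAGES.__getitem__, "urn:u:2"))
```

With paging, a crawler that polls too rarely can still catch up by following links back, rather than losing whatever scrolled off the first page.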

Although I have to say that I find the idea of pushing RDF updates via  
Atom quite appealing.

Richard


On 28 Apr 2009, at 17:01, Yves Raimond wrote:

> Hello!
>
>> I think the two main options are either to publish a feed containing
>> pointers to changes, or to use a messaging system to push out
>> notifications.
>>
>> Despite the recent discussion around the benefits of, say, Jabber or
>> other mechanisms for pushing out notifications, I think that a more
>> RESTful approach using RSS or Atom feeds might be nicer. Then we can
>> focus on the resource design, i.e. what kinds of changes we need to
>> publish.
>>
>> So for example for /programmes it may be sufficient to publish a set
>> of feeds for new items, e.g. brands, episodes, versions, etc. These
>> could be RSS 1.0 and then include additional RDF data as appropriate.
>
> My only concern about this is that you need to limit the number of
> items in the feed. If you have a sudden burst of activity and the
> crawler just pings the feed at regular intervals, it may miss some
> updates. With our 50 000 updates a day spread evenly, a feed capped
> at 100 items would need the crawlers to ping it about every three
> minutes; at 1M updates a day that drops to about every 9 seconds.
> (Just noticed that Soren's proposal includes pagination of feeds,
> which might solve that problem).
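
As a rough check on how often a crawler has to poll a capped feed, the interval can be worked out from the daily update count and the cap. This sketch assumes updates are spread evenly over the day (a burst would shorten the interval), and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_poll_interval_seconds(updates_per_day, feed_cap):
    """Longest gap between polls that still sees every item, at an even update rate."""
    if updates_per_day <= 0 or feed_cap <= 0:
        raise ValueError("both arguments must be positive")
    # The feed turns over completely once feed_cap new items have been published.
    return feed_cap * SECONDS_PER_DAY / updates_per_day


for rate in (50_000, 1_000_000):
    interval = max_poll_interval_seconds(rate, 100)
    print(rate, "updates/day ->", round(interval, 1), "seconds between polls")
```

The interval scales linearly with the cap, so raising the cap or paging the feed relaxes the polling requirement directly.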
>
> So yes, I guess it could be done, using RDF feeds e.g.
> http://www.bbc.co.uk/programmes/updates/2009/04/28/brands.rdf etc.
> We'd need to carefully think about the feeds we offer though.
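
A crawler catching up after some downtime could derive the feeds it needs from the dates alone. The sketch below builds one URL per day, using the pattern from the example URL above. That pattern is only an illustration from this thread, not a published endpoint, and the function name is hypothetical.

```python
from datetime import date, timedelta

# Base path taken from the example URL in the message; assumed, not a real service.
BASE = "http://www.bbc.co.uk/programmes/updates"


def daily_feed_urls(kind, first_day, last_day):
    """List one feed URL per day from first_day to last_day inclusive."""
    urls = []
    day = first_day
    while day <= last_day:
        urls.append("%s/%s/%s.rdf" % (BASE, day.strftime("%Y/%m/%d"), kind))
        day += timedelta(days=1)
    return urls


for url in daily_feed_urls("brands", date(2009, 4, 27), date(2009, 4, 28)):
    print(url)
```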
>
> Cheers!
> y
>
>>
>> This has the added advantage that a crawler that only wanted to
>> collect certain information, e.g. about brands, could monitor just
>> the resource(s) it was interested in. Similarly, with careful
>> resource design, the timing of updates could also be under the
>> control of the crawler, e.g. new versions in the last 12 hours, 24
>> hours, 7 days (avoiding a massive firehose of updates). This could
>> easily be done with URIs and avoids having to build that into the
>> messaging system.
>>
>> Interested to know what you think.
>>
>> Cheers,
>>
>> L.
>>
>> 2009/4/28 Yves Raimond <yves.raimond@gmail.com>
>>>
>>> Hello!
>>>
>>> I know this issue was raised during the LOD BOF at WWW 2009, but
>>> I don't know if any possible solutions emerged from there.
>>>
>>> The problem we are facing is that data on BBC Programmes changes
>>> approximately 50 000 times a day (new/updated
>>> broadcasts/versions/programmes/segments etc.). As we'd like to keep a
>>> set of RDF crawlers up-to-date with our information we were wondering
>>> how best to ping these. pingthesemanticweb seems like a nice option,
>>> but it needs the crawlers to ping it often enough to make sure they
>>> don't miss a change. Another solution we were thinking of would be to
>>> stick either Talis changesets [1] or SPARQL/Update statements in a
>>> message queue, which would then be consumed by the crawlers.
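
The message-queue option could look roughly like the sketch below: the publisher renders each change as an update statement and queues it, and a crawler drains the queue in order. The statements use the INSERT DATA / DELETE DATA form of SPARQL 1.1 Update. The change dictionary layout, the example.org resource, and the function names are assumptions for illustration, and an in-process queue.Queue stands in for a real message broker.

```python
import queue


def to_sparql_update(change):
    """Render one change as an INSERT DATA or DELETE DATA statement.

    The triple terms are expected to be serialized already,
    e.g. "<http://...>" for URIs and '"text"' for literals.
    """
    keyword = {"add": "INSERT DATA", "remove": "DELETE DATA"}[change["op"]]
    s, p, o = change["triple"]
    return "%s { %s %s %s . }" % (keyword, s, p, o)


def drain(q):
    """Consume every message currently queued, in publication order."""
    messages = []
    while True:
        try:
            messages.append(q.get_nowait())
        except queue.Empty:
            return messages


TITLE = "<http://purl.org/dc/elements/1.1/title>"
SUBJECT = "<http://example.org/p/1>"  # hypothetical resource

updates = queue.Queue()
# Publisher side: a retitled resource becomes one removal and one addition.
updates.put(to_sparql_update({"op": "remove", "triple": (SUBJECT, TITLE, '"Old title"')}))
updates.put(to_sparql_update({"op": "add", "triple": (SUBJECT, TITLE, '"New title"')}))

# Crawler side: apply the statements in the order they were published.
for statement in drain(updates):
    print(statement)
```

Unlike a capped feed, a queue holds every change until it is consumed, at the cost of running and scaling the messaging infrastructure.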
>>>
>>> Has anyone tried to tackle this problem already?
>>>
>>> Cheers!
>>> y
>>>
>>>
>>> [1] http://n2.talis.com/wiki/Changeset
>>>
>>
>>
>>
>> --
>> Leigh Dodds
>> Programme Manager, Talis Platform
>> Talis
>> leigh.dodds@talis.com
>> http://www.talis.com
>>
>

Received on Tuesday, 28 April 2009 18:28:54 UTC