Re: Indexing *.ac.uk

* Charles Christacopoulos <c.k.christacopoulos@DUNDEE.AC.UK> [2003-08-19 20:26+0100]
> Addressed to: web-support@jiscmail.ac.uk
>               website-info-mgt@jiscmail.ac.uk
> 
> Hi,
> 
> Apologies for crossposting.
> 
> In the next few days (late this week or early next week) we will start testing
> a robot on *.ac.uk to index whatever it finds. It has been tested at our domain
> but we need something a touch bigger than *.dundee.ac.uk. It will come under
> the name of Capek (Egothor's robot).  When the test are completed, the robot
> and its engine may become a more permanent feature ... until Google is
> defeated. :-)
> 
> I would appreciate if you could pass the above info. to any of your colleages or
> IT Directors who might be concerned about a sudden increase in traffic of your
> webservers.
> 
> --------------------------------------
> 
> The way the robot/engine works is that it tries to learn which are the pages
> which change more often than others so it visits them more frequently.
> In its current incarnation it visits those pages every 1/2 hour.  Other pages are
> visited less frequently and so on.

Would you consider making use of machine-readable metadata about site
update policies?

The RSS-DEV group developed an RDF vocabulary for use with RSS1.0 which
might be relevant. We could encourage folks to link to a robots.rdf from robots.txt
or with a LINK REL from homepages. See http://web.resource.org/rss/1.0/
-> http://web.resource.org/rss/1.0/modules/syndication/

This gives markup along lines of:

     <sy:updatePeriod>hourly</sy:updatePeriod>
     <sy:updateFrequency>2</sy:updateFrequency>
     <sy:updateBase>2000-01-01T12:00+00:00</sy:updateBase>
     
Something like this should be useful, though I guess you'd want a way to 
know the schedule for the entire site rather than for each page
separately. 

I'd also strongly encourage you to use RSS1 feeds (or any RSS feeds) to
gather info about new pages on the site, to allow your index to stay up
to date with info about recently changed pages on each site. There are 
LINK REL conventions for finding RSS feeds too...

cheers,

Dan

> Any comments as to what you think is an appropriate interval for (re)visiting
> pages which are likely to change would be appreciated.
> 
> TIA.
> Charles
> 
> PS.  The robot/engine are written in Java and they are Open Source.

cool :)

Received on Friday, 22 August 2003 03:06:51 UTC