Re: New response code

"MT" == Mirsad Todorovac spake thusly:
MT> 
MT> There are lots of documents, generated by CGI scripts or of other
MT> origin, not appropriate for indexing by WWW robots.  Or we may not want them
MT> to be indexed (like, you make a CGI to list files in a large database [in my
MT> experience, an RFC repository] and then a robot picks up every document and
MT> indexes it from the result of that CGI script, which was not what was meant
MT> to happen).

Nor do the authors of robots want them accessing such information, so there is
a standard robot exclusion protocol which works above the HTTP layer.  See
below.

MT> This leads to the following:
MT> 
MT> Shouldn't there be a way to specify which URLs we want to be indexed,
MT> and which we do not want to be indexed?
MT> 
MT> 
MT> Question(s):
MT> 
MT> How can the server know whether it is being accessed by a browser or
MT> by a robot?  It could analyze the User-agent: field in the header, but
MT> won't there be new robots which did not exist when the server was
MT> configured?

There are dozens of publicly announced robots, and new ones come out far too
frequently for a server-side list to keep up.  It's tough to distinguish
between a robot and a fast browser.  To
avoid this sort of problem, there is a robot exclusion protocol, which lets
you say things like "all robots should avoid this document tree."
Unfortunately, it relies on voluntary compliance by robot writers.  Still, it's
better than nothing.
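
As a rough illustration of "all robots should avoid this document tree", here
is a minimal sketch of the check a well-behaved robot might make before
fetching a URL.  It is not taken from any particular robot: the /cgi-bin/
path, the host name, and the robot name are made up for the example, and it
assumes Python's standard urllib.robotparser module as the parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical /robots.txt a site might publish: every robot ("*")
# is asked to stay out of the /cgi-bin/ tree.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the exclusion rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real robot would first fetch
    # /robots.txt from the server it is about to visit.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "AnyRobot/1.0" and www.example.com are placeholders, not real names.
    for url in ("http://www.example.com/cgi-bin/list-rfcs",
                "http://www.example.com/index.html"):
        print(url, allowed("AnyRobot/1.0", url))
```

Nothing here is enforced by the server; a robot that never performs this
check will simply fetch the pages anyway.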

For more info, see
http://info.webcrawler.com/mak/projects/robots/norobots.html and/or subscribe
to the robots mailing list (majordomo@webcrawler.com for more info).

-- 
			  Mordechai T. Abzug
http://umbc.edu/~mabzug1   mabzug1@umbc.edu     finger -l mabzug1@gl.umbc.edu
If you believe in telekinesis, raise my hand. 

Received on Monday, 19 February 1996 06:06:10 UTC