- From: Mordechai T. Abzug <mabzug1@gl.umbc.edu>
- Date: Mon, 19 Feb 1996 09:02:38 -0500 (EST)
- To: mirsad.todorovac@fer.hr
- Cc: mabzug1@gl.umbc.edu, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
"MT" == Mirsad Todorovac spake thusly: MT> MT> There are lots of documents, generated by cgi-scripts or of other MT> origin, not appropriate for indexing by WWW robots. Or we may not want them MT> to be indexed (like, you make an CGI to list files in a large database [my MT> experience RFC repository] and then robot picks avery documents and indexes MT> it from the result of that CGI script, which was not what was ment to happen). Nor do the authors of robots want them accessing such information, so there is a standard robot exclusion protocol which works above the http layer. See below. MT> This leads to following: MT> MT> Shouldn't there be a way to specify which URL's we want to be indexed, MT> and which we do not want to be indexed? MT> MT> MT> Question(s): MT> MT> how can the server now whether it is accessed by a browser or MT> by a robot? He could analyze <bold>User-agent:</bold> field in header, but MT> won't there be new robots which weren't existing while the server was MT> configured? There are dozens of publicly announced robots, and new ones come out far too frequently. It's tough to distinguish between a robot and a fast browser. To avoid this sort of problem, there is a robot exclusion protocol, which lets you say things like "all robots should avoid this document tree." Unfortuntely, it relies on voluntary compliance by robot writers. Still, it's better than nothing. For more info, see http://info.webcrawler.com/mak/projects/robots/norobots.html and/ or subscribe to the robots mailing list (majordomo@webcrawler.com for more info) -- Mordechai T. Abzug http://umbc.edu/~mabzug1 mabzug1@umbc.edu finger -l mabzug1@gl.umbc.edu If you believe in telekinesis, raise my hand.
Received on Monday, 19 February 1996 06:06:10 UTC