- From: Alexander Melkov <melkov@comptek.ru>
- Date: Sat, 18 Jan 2003 06:42:47 +0300
- To: <www-talk@w3.org>
Recently Martijn Koster (www.robotstxt.org) wrote to me that he is "no longer involved with robots", but W3C is. I have searched w3.org through, and it seems that no one here is involved either :) Probably this mailing list is the best place to start. Comments are welcome.

Our company runs a search engine, Yandex.ru (its scope of interest is the .ru domain and some other national/Russian-language resources). Besides per-document duplicate accounting, we maintain a database of full site mirrors (as many search engines do, I suppose). From this point of view, any two web sites with different host names/ports but the same content are mirrors (since HTTP/1.1 allows these addresses to serve different content).

There are two primary uses of the mirror database:
- the indexing robot visits only the main mirrors
- full mirrors are glued together without problems during link popularity calculations

When you have a number of mirrors, you have to choose one of them as the main one. The fact is that no automatic algorithm can guess which mirror any particular webmaster really considers the main mirror of his site. How can a webmaster point out the main mirror to the robots?

One of the intuitive answers is to tell the robots which mirrors they should NOT INDEX. Trivially, with robots.txt:

    User-agent: *
    Disallow: /

But new problems arise in this case:

a) A robot might never learn about a particular mirror at all, because it never visits it due to the robots.txt above.

b) A webmaster may have technical difficulties generating different robots.txt files when several names are aliased to a single IP. Indeed, only a few people in the .ru net have managed to implement even this simple example for the Apache web server:

    <!--#if expr=" \"${HTTP_HOST}\" != \"www.main_name.ru\" " -->
    User-Agent: *
    Disallow: /
    <!--#endif -->

=====================================

Our proposal: support for a Host directive in robots.txt.
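As an aside, the SSI trick for problem (b) could also be done with a small script that inspects the request's Host header. A minimal CGI sketch (not part of the proposal; the names `MAIN_HOST` and `robots_body`, and the choice to strip the port before comparing, are mine — the SSI example compares the raw HTTP_HOST string):

```python
#!/usr/bin/env python
# Hypothetical CGI sketch: serve a blocking robots.txt on every host
# name except the main mirror. Assumes the server maps /robots.txt
# to this script. MAIN_HOST is taken from the SSI example above.
import os

MAIN_HOST = "www.main_name.ru"

def robots_body(host):
    # Strip an optional :port before comparing host names.
    if host.split(":")[0].lower() != MAIN_HOST:
        return "User-Agent: *\nDisallow: /\n"
    return ""  # main mirror: no restrictions

if __name__ == "__main__":
    print("Content-Type: text/plain")
    print()
    print(robots_body(os.environ.get("HTTP_HOST", "")), end="")
```

This still leaves problem (a) unsolved, of course: a robot that obeys the blocking robots.txt never sees the mirror's pages at all.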
Example:

    User-Agent: *
    Disallow: /forum
    Disallow: /cgi-bin
    Host: www.myhost.ru

Placement: within a record, after the User-agent line(s), i.e. just where the Disallow lines are placed.

Formal action:
- a Host line with an incorrect argument is ignored, as if it did not exist
- no (correct) Host lines - no action
- any Host line whose argument matches the current host/port - no action
- correct Host lines present, but none of their arguments matched - imply "Disallow: /" at the end of the group.

That is, for the hosts www.host1.com and www.myhost.ru:8081 the example above ends up as

    User-Agent: *
    Disallow: /forum
    Disallow: /cgi-bin
    Disallow: /

while for www.myhost.ru:80 we have

    User-Agent: *
    Disallow: /forum
    Disallow: /cgi-bin

Expecting that some robots would not ignore 'Host' inside a record, we had two choices: to place it before the first 'User-agent' or after the last 'Disallow'. The first choice makes robots.txt two-dimensional (User-agent * Host), so we implemented the latter.

Enough for today,
Best regards,
Alexander Melkov
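The formal action described in the proposal can be sketched in code. A minimal illustration in Python (not part of the proposal; record parsing is simplified to a list of lines, the helper names are mine, the Host-argument validity check is a crude regex, and folding ":80" into the bare host name is my reading of the www.myhost.ru:80 case above):

```python
import re

# Hypothetical validity check for a Host argument: a bare host name
# with an optional :port. A real robot would validate more strictly.
HOST_RE = re.compile(r"^[a-z0-9.-]+(:\d+)?$", re.I)

def effective_record(lines, current_host):
    """Apply the proposed Host semantics to one robots.txt record.

    `lines` is the record as a list of lines (User-Agent, Disallow,
    Host, ...); `current_host` is the host[:port] being crawled.
    Returns the record with Host lines removed and, when correct Host
    lines exist but none matches, with "Disallow: /" appended.
    """
    def norm(h):
        # Treat "host" and "host:80" as the same address (default port).
        h = h.lower()
        return h[:-3] if h.endswith(":80") else h

    out, hosts = [], []
    for line in lines:
        field, _, value = line.partition(":")
        if field.strip().lower() == "host":
            value = value.strip()
            if HOST_RE.match(value):      # an incorrect argument is ignored
                hosts.append(norm(value))
        else:
            out.append(line)
    if hosts and norm(current_host) not in hosts:
        out.append("Disallow: /")         # implied at the end of the group
    return out
```

Running this on the example record yields the two outcomes shown above: www.host1.com and www.myhost.ru:8081 get the extra "Disallow: /", while www.myhost.ru:80 does not.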
Received on Friday, 17 January 2003 22:45:20 UTC