Site mirrors and aliases: 'Host' field proposed for robots.txt

Recently Martijn Koster (www.robotstxt.org) wrote me that he is "no longer involved with robots", but
W3C is. I've searched through w3.org and it seems that no one here is involved either :)
This mailing list is probably the best place to start. Any comments are welcome.

Our company runs a search engine, Yandex.ru (its scope of interest is the .ru domain and some other
national/Russian-language resources).

Besides per-document duplicate accounting, we maintain a database of full site mirrors (like many SEs
do, I suppose). From this point of view, any two web sites with different host names/ports but the
same content are mirrors (different host names and ports count as separate sites, since HTTP 1.1
allows these addresses to have different content).

There are two primary uses of the mirror database:
- the indexing robot visits only the main mirrors
- full mirrors are glued together without problems during link popularity calculations

When you have a number of mirrors, you have to choose one of them as the main one. The fact is that no
automatic selection algorithm can guess which mirror a particular webmaster really considers the main
mirror of his site.

How can a webmaster point out the main mirror to the robots? One of the intuitive answers is to tell
the robots what mirrors they should NOT INDEX. Trivially, with robots.txt:
User-agent: *
Disallow: /

But new problems arise in this case:
a) A robot might never learn about a particular mirror, because it never visits it due to the
robots.txt mentioned above.
b) A webmaster may have technical difficulties generating different robots.txt files in the case of
name aliasing on a single IP.
Yes, only a few people in the .ru-net managed to implement this simple example for the Apache web server:
 <!--#if expr=" \"${HTTP_HOST}\" != \"www.main_name.ru\" " -->
 User-Agent: *
 Disallow: /
 <!--#endif -->

=====================================
Our proposal:
Support for a Host directive in robots.txt.

Example:
User-Agent: *
Disallow: /forum
Disallow: /cgi-bin
Host: www.myhost.ru

Placement: in the record, after the User-agent line(s), i.e. just where Disallow lines are placed.
Formal action:
- a Host line with an incorrect argument is ignored, as if it did not exist
- no (correct) Host lines - no action
- any Host line whose argument matches the current host/port - no action
- there are correct Host lines, but none of their arguments match - imply "Disallow: /" at the end of
the group.

That is, for hosts www.host1.com and www.myhost.ru:8081 the above example results in
User-Agent: *
Disallow: /forum
Disallow: /cgi-bin
Disallow: /

While for www.myhost.ru:80 we have
User-Agent: *
Disallow: /forum
Disallow: /cgi-bin
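
The formal action can be sketched in a few lines of Python. This is only an illustration of the rule
as stated in this message, not the code our robot runs. The function names (parse_host_arg,
effective_disallows) are hypothetical, the test for an "incorrect" argument is an assumption (the
proposal does not define it), and a missing port is assumed to mean 80, as the www.myhost.ru:80
example suggests.

```python
DEFAULT_PORT = 80  # assumption: a Host argument without a port means port 80


def parse_host_arg(arg):
    """Return (hostname, port), or None if the argument looks incorrect.

    What counts as "incorrect" is an assumption made for this sketch:
    empty names, embedded whitespace or slashes, and non-numeric ports.
    """
    arg = arg.strip().lower()
    if not arg or "/" in arg or any(c.isspace() for c in arg):
        return None
    name, sep, port = arg.partition(":")
    if not name:
        return None
    if sep:
        if not port.isdigit():
            return None
        return name, int(port)
    return name, DEFAULT_PORT


def effective_disallows(record_lines, host, port=DEFAULT_PORT):
    """Apply the proposed Host rule to one record and return its Disallow paths."""
    disallows = []
    valid_hosts = []
    for line in record_lines:
        field, sep, value = line.partition(":")
        if not sep:
            continue  # not a "field: value" line
        field = field.strip().lower()
        value = value.strip()
        if field == "disallow":
            disallows.append(value)
        elif field == "host":
            parsed = parse_host_arg(value)
            if parsed is not None:  # incorrect Host lines are ignored
                valid_hosts.append(parsed)
    # Correct Host lines exist but none matches: imply "Disallow: /" at the end.
    if valid_hosts and (host.lower(), port) not in valid_hosts:
        disallows.append("/")
    return disallows


record = [
    "User-Agent: *",
    "Disallow: /forum",
    "Disallow: /cgi-bin",
    "Host: www.myhost.ru",
]
print(effective_disallows(record, "www.host1.com"))
print(effective_disallows(record, "www.myhost.ru", 8081))
print(effective_disallows(record, "www.myhost.ru", 80))
```

The first two calls correspond to the mirrors in the example above and gain the trailing "/", while
the third corresponds to the main mirror and keeps only the two original Disallow paths.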

Expecting that some robots would not ignore 'Host' inside the record, we had two choices: to place it
before the first 'User-agent' or after the last 'Disallow'. The first choice makes robots.txt
two-dimensional (User-agent * Host), so we've implemented the latter.

Enough for today,
Best regards, Alexander Melkov

Received on Friday, 17 January 2003 22:45:20 UTC