Re: Suggestion/Enhancement request: Please add to FAQ: list of User-agents for robots.txt files from Michael[tm] Smith on 2015-08-10 (www-validator@w3.org from August 2015)

From: Michael[tm] Smith <mike@w3.org>
Date: Mon, 10 Aug 2015 12:56:31 +0900
To: Andrew Avdeenko <rasprod@tyt.by>
Cc: www-validator@w3.org
Message-ID: <20150810035631.GC963@sideshowbarker.net>

Andrew Avdeenko <rasprod@tyt.by>, 2015-08-02 16:00 +0300:
> Archived-At: <http://www.w3.org/mid/op.x2qrq8zm278snb@microsof-c0ae01>
> 
> Is it possible to deny access to my website for W3C validators using
> robots.txt? If "yes", what user-agent(s) must be specified?

The W3C Link Checker https://validator.w3.org/checklink is the only one
that’s actually a crawler/robot, and so the only one that pays attention to
robots.txt files. You can block it by specifying “User-Agent: W3C-checklink”.
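
As a rough sketch of what that robots.txt rule looks like and how to check it, the Python below parses a two-line robots.txt with the standard-library `urllib.robotparser` and asks whether a given user-agent may fetch a URL. The `example.org` URL, the `is_allowed` helper name, and the "Googlebot" comparison agent are assumptions for illustration only; the one thing taken from the message above is the "W3C-checklink" token.

```python
from urllib.robotparser import RobotFileParser

# Minimal robots.txt that denies the whole site to the W3C Link Checker only.
# "W3C-checklink" is the token given in the message; everything else here
# (helper name, example.org URL) is illustrative.
ROBOTS_TXT = """\
User-agent: W3C-checklink
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("W3C-checklink", "Googlebot"):
        print(agent, is_allowed(agent, "https://example.org/page.html"))
```

Other robots are unaffected because the file has no `User-agent: *` entry, so agents that match no entry are allowed by default.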

All of the services also have “http://validator.w3.org/services” in their
user-agent strings, and run on hosts with IP addresses in the 128.30.52.0/24
subnet, so you can block them based on that user-agent substring or by IP
address. But because none of the services other than the link checker are
crawlers/robots in normal usage, they’re not among the types of tools that
robots.txt is intended for, and they don’t read it. To block them you’ll
need to use some other means (e.g., specific configuration in your firewall
or Web server).
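
For the "some other means" case, here is a minimal sketch of the two checks described above, written as a plain Python function you could call from whatever request-handling code you have. The function name, the constant names, and the sample addresses are hypothetical; the user-agent substring and the 128.30.52.0/24 subnet are the values quoted in this message as of 2015 and may have changed since, so verify them before relying on this.

```python
import ipaddress

# Values quoted in the message (2015); verify before relying on them.
W3C_SERVICES_MARKER = "http://validator.w3.org/services"
W3C_SUBNET = ipaddress.ip_network("128.30.52.0/24")


def is_w3c_validator_request(user_agent: str, remote_addr: str) -> bool:
    """Return True if the request looks like it comes from a W3C service.

    Matches on either the user-agent substring or the client IP address.
    """
    if W3C_SERVICES_MARKER in user_agent:
        return True
    try:
        return ipaddress.ip_address(remote_addr) in W3C_SUBNET
    except ValueError:
        # remote_addr was not a valid IP address string; treat as no match.
        return False


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(is_w3c_validator_request("Mozilla/5.0", "128.30.52.10"))
    print(is_w3c_validator_request("Mozilla/5.0", "203.0.113.5"))
```

A server or application would then return a 403 (or simply drop the request) whenever the function returns True. The same two conditions can usually be expressed directly in firewall or Web server configuration instead.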

There used to be a document at http://validator.w3.org/services that
explained all this, but it seems to have disappeared. I’ll get it
restored as soon as possible.

  —Mike

-- 
Michael[tm] Smith https://people.w3.org/mike

Received on Monday, 10 August 2015 03:56:56 UTC