bogus path info on feed validator reqs; robots.txt

Hi,

I noticed a lot of bogus requests in validator.w3.org's logs a la:

   3.14.159.265 - - [15/Oct/2008:20:19:29 -0400] "GET /feed/check.cgi/docs/
    docs/news/archives/2007/10/15/docs/news/archives/2007/10/15/news/archives/
    2007/10/15/news/archives/2007/10/15/docs/news/archives/2007/10/15/docs/
    docs/news/archives/2007/10/15/news/archives/2007/10/15/docs/news/archives/
    2007/10/15/docs/docs/news/archives/2007/10/15/news/archives/2007/10/15/
    news/archives/2007/10/15/docs/ HTTP/1.1" 200 2020 "-" "somebot/1.0"

This happens because web bots commonly add or remove trailing
slashes from URIs at will. When one of them requests something like
http://validator.w3.org/feed/check.cgi/ it gets a 200 response, and
the relative links in that page then resolve against the longer
URI, so the bot ends up in an infinite URI space like the above.

Could you please update the feed validator, and any other scripts
you can think of that have this same problem, so that they return
something other than HTTP 200 in this case (either a redirect or
an error)?
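
To illustrate the kind of check I mean, here is a minimal sketch in
Python. It assumes the script can see the standard CGI variables
PATH_INFO, SCRIPT_NAME and QUERY_STRING; the function name
path_info_response is hypothetical and not taken from the feed
validator's code, and how the result gets written out as a response
is left to the caller.

```python
def path_info_response(environ):
    """Decide how to answer a request that may carry extra path info.

    Returns None when there is no extra path info (handle the request
    normally), otherwise a (status, headers) pair for the caller to emit.
    Hypothetical helper: the name and return shape are assumptions.
    """
    path_info = environ.get("PATH_INFO", "")
    if not path_info:
        return None  # clean request such as /feed/check.cgi?url=...

    if path_info == "/":
        # A lone trailing slash: redirect to the canonical script URI.
        location = environ.get("SCRIPT_NAME", "")
        query = environ.get("QUERY_STRING", "")
        if query:
            location += "?" + query
        return ("301 Moved Permanently", [("Location", location)])

    # Anything longer is bogus path info: answer with an error, not 200.
    return ("404 Not Found", [("Content-Type", "text/plain")])
```

Whether a bare trailing slash deserves a redirect or should also be
an error is a judgment call; the point is only that neither case
ends in a 200.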

90% of the requests for /feed/check.cgi today have this type of
bogus path info, almost entirely from one well-known bot.

Also, please add this to validator.w3.org's robots.txt:

    Disallow: /feed/check.cgi

(That addition on its own would be enough to make the path info
problem go away for well-behaved bots, but I think scripts should
do the right thing with bogus path info in any case, because not
all bots are well-behaved.)
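
As a quick sanity check of what that rule covers, here is a small
sketch using Python's standard urllib.robotparser. It assumes the
Disallow line sits under a "User-agent: *" record, which is my
assumption about the file's layout and not a quote from the real
robots.txt. The helper name allowed is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: the proposed rule under a catch-all record.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /feed/check.cgi",
])

def allowed(url, agent="somebot/1.0"):
    """True if a bot honouring these rules may fetch the URL."""
    return rules.can_fetch(agent, url)

# The rule is a path prefix, so it covers the script itself and
# everything "below" it, including the bogus path info URIs.
blocked = "http://validator.w3.org/feed/check.cgi/docs/news/archives/2007/10/15/"
still_ok = "http://validator.w3.org/feed/"
```

Here allowed(blocked) is False and allowed(still_ok) is True, since
Disallow matches by prefix.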

This basic problem (scripts that return 200 OK when given extra
path info, leading to infinite URI spaces) exists for dozens of
other apps we have at W3C. We fix them as we notice them, but how
can we monitor for this in general? What do other sites do?
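
One rough heuristic, offered only as a starting point: scan the
access logs for request paths in which the same pair of adjacent
path segments shows up several times, as "news/archives" does in
the example above. The sketch below assumes one request per log
line in the usual combined log format; the names looks_recursive
and suspicious_paths and the threshold of 3 are made up for
illustration, and legitimate URIs with repetitive paths would be
false positives.

```python
import re
from collections import Counter

# Pull the request target out of a quoted request line such as
# "GET /some/path HTTP/1.1".
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+) HTTP/[\d.]+"')

def looks_recursive(path, min_repeats=3):
    """True if any pair of adjacent path segments occurs min_repeats times."""
    segments = [s for s in path.split("?", 1)[0].split("/") if s]
    pairs = Counter(zip(segments, segments[1:]))
    return any(count >= min_repeats for count in pairs.values())

def suspicious_paths(log_lines, min_repeats=3):
    """Yield request paths from log lines that look like infinite URI spaces."""
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match and looks_recursive(match.group(1), min_repeats):
            yield match.group(1)
```

Feeding it a day's log and grouping the results by script name
would give a list of candidates to check by hand.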

thanks!

-- 
Gerald Oskoboiny     http://www.w3.org/People/Gerald/
World Wide Web Consortium (W3C)    http://www.w3.org/
tel:+1-604-906-1232             mailto:gerald@w3.org

Received on Thursday, 16 October 2008 00:42:39 UTC