- From: Gerald Oskoboiny <gerald@w3.org>
- Date: Wed, 15 Oct 2008 17:42:27 -0700
- To: public-qa-dev@w3.org
Hi,

I noticed a lot of bogus requests in validator.w3.org's logs a la:

    3.14.159.265 - - [15/Oct/2008:20:19:29 -0400] "GET /feed/check.cgi/docs/
    docs/news/archives/2007/10/15/docs/news/archives/2007/10/15/news/archives/
    2007/10/15/news/archives/2007/10/15/docs/news/archives/2007/10/15/docs/
    docs/news/archives/2007/10/15/news/archives/2007/10/15/docs/news/archives/
    2007/10/15/docs/docs/news/archives/2007/10/15/news/archives/2007/10/15/
    news/archives/2007/10/15/docs/ HTTP/1.1" 200 2020 "-" "somebot/1.0"

This happens because web bots commonly add or remove trailing slashes from
URIs at will; when they come along and request something like

    http://validator.w3.org/feed/check.cgi/

they end up in infinite URI spaces like the above.

Could you please update the feed validator, and any other scripts you can
think of that have this same problem, to return something besides HTTP 200
in this case? (either a redirect or an error) 90% of the requests for
/feed/check.cgi today have this type of bogus path info. (almost entirely
from one well-known bot)

Also, please add this to validator.w3.org's robots.txt:

    Disallow: /feed/check.cgi

(That addition on its own would be enough to make the path info problem go
away for well-behaved bots, but I think scripts should do the right thing
with bogus path info in any case, because not all bots are well-behaved.)

This basic problem (scripts that return 200 OK when given extra path info,
leading to infinite URI spaces) exists for dozens of other apps we have at
W3C, and we fix them as we notice them, but I wonder how to monitor for
this in general? What do other sites do?

thanks!

--
Gerald Oskoboiny <gerald@w3.org>   http://www.w3.org/People/Gerald/
World Wide Web Consortium (W3C)    http://www.w3.org/
tel:+1-604-906-1232                mailto:gerald@w3.org
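[Editorial note: the fix the message asks for — refusing requests that carry extra path info instead of answering 200 OK — can be sketched as below. This is a minimal illustration in Python, not the feed validator's actual code; `check_path_info` is a hypothetical helper, and the CGI convention assumed is that the extra path segments appear in the `PATH_INFO` environment variable.]

```python
def check_path_info(environ):
    """Return an HTTP status line for a CGI-style request.

    Hypothetical guard: a script such as /feed/check.cgi should reject
    any extra path segments (e.g. /feed/check.cgi/docs/news/...) rather
    than serving 200 OK, so crawlers that mangle trailing slashes don't
    wander into an infinite URI space.
    """
    path_info = environ.get("PATH_INFO", "")
    if path_info not in ("", "/"):
        # Bogus trailing path: refuse instead of echoing the page back.
        # A "301 Moved Permanently" redirect to the bare script URI
        # would also break the loop, as the message suggests.
        return "404 Not Found"
    return "200 OK"


# A bare request for the script itself is fine...
print(check_path_info({"PATH_INFO": ""}))
# ...but a bot-mangled URI with extra path segments gets an error.
print(check_path_info({"PATH_INFO": "/docs/news/archives/2007/10/15/"}))
```

Well-behaved bots would already be stopped by the robots.txt line above; the in-script check covers everything else.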
Received on Thursday, 16 October 2008 00:42:39 UTC