Re: whenToUseGet-7 counter-proposal

Joshua Allen wrote:

> I wasn't claiming that crawlers *can't* crawl querystrings, but any
> crawlers I have used require you to deliberately turn this on or specify
> in a filter which querystrings are "safe".  I run a crawler internally
> at Microsoft which crawls pages with querystrings, in fact.  But I
> deliberately configured it to do so, and only with pages that I know to
> be "safe".  I could show you search results that index URLs with
> querystrings, but that certainly doesn't mean that I consider *all* URLs
> with querystrings to be "safe" to GET.  

I have written two very large-scale high-performance web crawlers that 
were deployed in production, processing hundreds of millions of web 
pages.  Yes, any such beast has a bunch of heuristics for staying away 
from dangerous pages.  But the existence of a '?' just isn't good 
enough.  When you run a large public robot you get two classes of 
complaint:

1. "You moron, your robot went in my off-limits area and now I'm going 
   to get fired and they'll turn off my child's iron lung."
2. "You moron, why aren't you indexing my pages, because if I don't get 
   more traffic to my website I'll go bankrupt and they'll turn off my 
   child's iron lung."

The Robot Exclusion Protocol helps.  Intelligent self-defense helps.  
But robots really do live & die on the assumption that if it's a URI 
and there's no keep-off sign, you can do a GET on it.
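For the record, the two safeguards described above can be sketched in a few 
lines of Python using the standard library's robots.txt parser.  This is a 
hypothetical illustration, not the logic of any crawler mentioned here; the 
robots.txt content, the user-agent "*", and the opt-in querystring flag are 
all assumptions:

```python
# Sketch of two crawler safeguards: honoring robots.txt (the keep-off
# sign) and treating a '?' as a reason for caution, not a hard ban.
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch
# http://example.com/robots.txt before crawling the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
""".splitlines()

parser = RobotFileParser()
parser.parse(ROBOTS_TXT)

def may_get(url: str, allow_querystrings: bool = False) -> bool:
    """Return True if a polite robot may GET this URL."""
    # The keep-off sign: robots.txt is authoritative.
    if not parser.can_fetch("*", url):
        return False
    # Heuristic self-defense: a querystring alone isn't proof of
    # danger, but a cautious crawler can require opting in to them.
    if urlsplit(url).query and not allow_querystrings:
        return False
    return True

print(may_get("http://example.com/page.html"))              # True
print(may_get("http://example.com/admin/panel"))            # False
print(may_get("http://example.com/search?q=x"))             # False
print(may_get("http://example.com/search?q=x", True))       # True
```

The point of the sketch is the ordering: the explicit keep-off sign always 
wins, and the '?' check is merely a configurable heuristic on top of it.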

> There is no way to guarantee that all URLs will be free of GET
> side-effects, and it would be misleading to tell people that such a
> guarantee exists.

No, but if someone posts a URL for which doing a GET produces a 
side-effect, you can legitimately (and, I believe, in a court of law) 
tell 'em to take a flying leap if they come after you for the 
consequences of doing a GET.  -Tim

Received on Wednesday, 24 April 2002 02:57:31 UTC