- From: Tim Bray <tbray@textuality.com>
- Date: Tue, 23 Apr 2002 23:57:16 -0700
- To: Joshua Allen <joshuaa@microsoft.com>
- Cc: Mark Nottingham <mnot@mnot.net>, www-tag@w3.org
Joshua Allen wrote:

> I wasn't claiming that crawlers *can't* crawl querystrings, but any
> crawlers I have used require you to deliberately turn this on or specify
> in a filter which querystrings are "safe". I run a crawler internally
> at Microsoft which crawls pages with querystrings, in fact. But I
> deliberately configured it to do so, and only with pages that I know to
> be "safe". I could show you search results that index URLs with
> querystrings, but that certainly doesn't mean that I consider *all* URLs
> with querystrings to be "safe" to GET.

I have written two very large-scale high-performance web crawlers that were deployed in production, processing hundreds of millions of web pages. Yes, any such beast has a bunch of heuristics for staying away from dangerous pages. But the existence of a '?' just isn't good enough.

When you run a large public robot you get 2 classes of complaint:

1. "you moron, your robot went in my off-limits area and now I'm going to get fired and they'll turn off my child's iron lung"

2. "you moron, why aren't you indexing my pages, because if I don't get more traffic to my website I'll go bankrupt and they'll turn off my child's iron lung."

The Robot Exclusion Protocol helps. Intelligent self-defense helps. But robots really do live & die on the assumption that if it's a URI and there's no keep-off sign, you can do a GET on it.

> There is no way to guarantee that all URLs will be free of GET
> side-effects, and it would be misleading to tell people that such a
> guarantee exists.

No, but if someone posts a URL for which doing a GET produces a side-effect you can legitimately (and I believe in a court of law) tell 'em to take a flying leap if they come after you for the consequences of doing a GET.

-Tim
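[Editor's note: a minimal sketch of the "keep-off sign" check described above, using Python's standard `urllib.robotparser`. The function name, user-agent string, and URL handling are illustrative assumptions, not details of the crawlers discussed in the thread; a production robot would layer many more heuristics (rate limiting, trap detection, querystring filters) on top of this.]

```python
import urllib.parse
import urllib.request
import urllib.robotparser

def polite_get(url, user_agent="ExampleBot"):
    """Fetch a URL with GET only if the site's robots.txt allows it."""
    parts = urllib.parse.urlsplit(url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"

    # Fetch and parse the site's Robot Exclusion Protocol file.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()

    # No keep-off sign means a GET is assumed to be safe; an explicit
    # Disallow rule means we stay away.
    if not rp.can_fetch(user_agent, url):
        return None

    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```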
Received on Wednesday, 24 April 2002 02:57:31 UTC