
RE: whenToUseGet-7 counter-proposal

From: Joshua Allen <joshuaa@microsoft.com>
Date: Tue, 23 Apr 2002 23:20:55 -0700
Message-ID: <4F4182C71C1FDD4BA0937A7EB7B8B4C104F058C9@red-msg-08.redmond.corp.microsoft.com>
To: "Mark Nottingham" <mnot@mnot.net>
Cc: <www-tag@w3.org>
> Interesting. My experience is completely different, and I wouldn't
> refer to that as an arcane bug at all.

Oh, it was arcane all right.  It involved a combination of pages with
Cache-Control: private set and JPEGs in those pages carrying no cache
hints, and the images' (lack of) cache hints being erroneously applied
to the pages.
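To make the ingredients concrete (a hypothetical sketch of the header combination described above, not the actual server configuration): the HTML pages explicitly opted out of shared caching, while the embedded JPEGs gave caches no guidance at all, which is exactly the situation where a buggy intermediary can mix the two policies up.

```python
# Hypothetical sketch of the two response types involved.
page_headers = {
    "Content-Type": "text/html",
    "Cache-Control": "private",   # shared caches must not store this
}
image_headers = {
    "Content-Type": "image/jpeg", # no Cache-Control, no Expires:
}                                 # caches fall back to heuristics

def shared_cache_may_store(headers):
    # Simplified rule of thumb: a shared cache must not store responses
    # marked "private"; absent any hints it may cache heuristically.
    return headers.get("Cache-Control") != "private"
```

The bug amounted to the cache treating the page as if it had the image's (absent) hints, rather than its own `private` directive.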

> > And any crawlers I have used are deliberately designed to ignore
> > URIs with querystrings.
> 
> See Paul's reference re: Google. I'd seen the same behaviour, but
> didn't have an example so handy. (Thanks, Paul!)

I wasn't claiming that crawlers *can't* crawl querystrings, but any
crawlers I have used require you to deliberately turn this on or specify
in a filter which querystrings are "safe".  I run a crawler internally
at Microsoft which crawls pages with querystrings, in fact.  But I
deliberately configured it to do so, and only with pages that I know to
be "safe".  I could show you search results that index URLs with
querystrings, but that certainly doesn't mean that I consider *all* URLs
with querystrings to be "safe" to GET.  
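The kind of filtering described above might look like this (a minimal sketch; the URL patterns are invented for illustration and are not the actual Microsoft crawler configuration):

```python
# Hypothetical crawler filter: skip querystring URLs by default, and
# crawl them only when the path matches an explicit allow-list of
# patterns known (by a human) to be side-effect free.
import re
from urllib.parse import urlparse

SAFE_QUERY_PATHS = [               # assumed patterns, illustration only
    re.compile(r"^/archive\b"),
    re.compile(r"^/search\b"),
]

def should_crawl(url):
    parts = urlparse(url)
    if not parts.query:
        return True                # plain URLs are crawled normally
    # Querystring URLs are skipped unless their path is allow-listed.
    return any(p.search(parts.path) for p in SAFE_QUERY_PATHS)
```

The point is that "safe" is a human judgment encoded in configuration, not a property the crawler can infer from the URL alone.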

There is no way to guarantee that all URLs will be free of GET
side-effects, and it would be misleading to tell people that such a
guarantee exists.
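A hypothetical illustration of why no such guarantee can exist: nothing about a URL advertises that its GET handler mutates state.

```python
# Hypothetical handler with a side effect hidden behind GET.
from urllib.parse import urlparse, parse_qs

def handle_get(url, records):
    query = parse_qs(urlparse(url).query)
    if "delete" in query:          # a destructive action on a GET
        records.discard(query["delete"][0])
    return "200 OK"
```

A crawler that blindly followed `http://example.com/admin?delete=42` would silently remove record 42, with no way to know in advance that the URL was unsafe.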

(I also would be shocked to hear that Google's implementation is blindly
crawling querystrings with no heuristics to determine safety.  Without
input from the Google guys, I guess we just have to speculate.)
Received on Wednesday, 24 April 2002 02:22:02 GMT
