Re: Redirection from jon@hackcraft.net on 2003-11-12 (w3c-wai-ig@w3.org from October to December 2003)

From: <jon@hackcraft.net>
Date: Wed, 12 Nov 2003 10:45:37 +0000
To: "w3c-wai-ig@w3.org" <w3c-wai-ig@w3.org>
Message-ID: <1068633937.3fb20f511f94f@217.114.163.71>
> > I suspect it doesn't show up that often because not that many people
> 
> I suspect the reasons include:
> 
> - most ? references are still implied through forms;
> - many sites have their CGI areas blocked in robots.txt;
> - most sites using non-trivial ? references in href attributes have broken
>   URLs because they don't escape & properly or use the alternative of ; 
>   suggested in an HTML specification appendix;
> - there may be a tendency to use meta to inhibit caching on such pages.
> 
> I consider this particular, Cold Fusion, technique as an abuse of URLs
> which, by confusing the mechanics of creating the HTML with the naming
> of the resource, causes misoperation of things like proxies (Squid's
> default rules are to send ? URLs direct to the origin, rather than to
> an upstream cache, as it expects not to get a cachable resource back).
> 
I consider that proxy behaviour, if not an abuse of HTTP, then at least a 
failure to implement it as well as is possible. Admittedly it's probably a 
failure that comes from adapting to practical experience with sites which in 
turn fail to implement HTTP as well as is possible.

A URI containing a query string is of equal status to any other URI, though it 
may be weaker in terms of human-readable qualities. A GET to such a URI is just 
as capable of returning a cachable resource, and developers should strive to 
assist this caching (setting Last-Modified, reacting appropriately to 
If-Modified-Since). I've used query strings on numerous occasions (sometimes, 
in the case of searches, this was the most sensible way to go; sometimes it was 
due to the relative difficulty of generating data-driven sites any other way 
with certain tools). While I do avoid query strings where I can, I have not 
found their use to get in the way of caching. In particular there are some 
cases where the data-driven nature, combined with a knowledge of the mechanics 
producing that data, offers a reliable way of determining expiry dates, with a 
tremendous gain to caching efficiency.
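
To make the Last-Modified / If-Modified-Since point concrete, here is a minimal 
sketch in Python of how a data-driven page behind a query-string URI might 
answer a conditional GET. The function name conditional_get and its parameters 
are assumptions made up for illustration, not part of any particular server or 
framework; it assumes the application already knows when the underlying data 
last changed and, optionally, when it will next change.

```python
from datetime import timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get(last_modified, if_modified_since=None, expires=None):
    """Return (status, headers) for a GET to a data-driven resource.

    last_modified: timezone-aware datetime of the data's last change.
    if_modified_since: raw If-Modified-Since request header, or None.
    expires: optional timezone-aware datetime when the data is known to
             change next (hypothetical; only some applications know this).
    """
    # HTTP dates have one-second resolution, so drop microseconds.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if expires is not None:
        expires = expires.astimezone(timezone.utc).replace(microsecond=0)
        headers["Expires"] = format_datetime(expires, usegmt=True)

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # Unparseable header: ignore it, send the full body.
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if last_modified <= since:
                return 304, headers  # Not Modified: the cached copy is still good.
    return 200, headers
```

A caller would send the full body only on a 200; on a 304 the cache (or 
browser) reuses what it already holds, whether or not the URI contains a "?".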

As for search engines, the reason that URIs with query strings are less likely 
to get indexed is that search engine people don't want their spiders to spend 
eternity indexing a site that is produced on the fly, for which there may be an 
infinite number of URIs to be found in the generated pages.

As an example of such a page, I once wrote a joke version of RSS as a satire of 
the version numbers used by rival versions. Each page would claim to have been 
obsoleted by another version which it linked to, with version 10234.0 pointing 
to version 10235.0 and so on. This would go on until version 2147483647.0, after 
which it would trigger an overflow error I couldn't be bothered checking for on 
what was after all a joke. If Google had started indexing that page it would 
still be there :)
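
A rough sketch in Python of how such a self-perpetuating chain of links can be 
produced, to show the scale a spider would face. The base URI, the "version" 
parameter name and both function names are hypothetical, not taken from the 
original page, and unlike the original this sketch stops cleanly at the last 
version instead of overflowing.

```python
INT32_MAX = 2_147_483_647  # last version number before the overflow described above


def successor_link(version, base="http://example.org/rss"):
    """Return the URI of the version that 'obsoletes' this one, or None at the end."""
    if version >= INT32_MAX:
        return None  # the original simply overflowed here; this sketch stops
    return f"{base}?version={version + 1}.0"


def pages_remaining(version):
    """How many further pages a spider would find by following the chain."""
    return INT32_MAX - version
```

Starting from version 10234, pages_remaining gives 2147473413 further pages, 
each reachable only by fetching the one before it.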

Of course this can also happen with URIs that don't contain query strings, but 
such cases are rarer.

Google will generally list a page with a query string in its URI if it is 
linked to from a page which doesn't have one (I'd guess that includes HTTP 
redirects from such a page, but I'm not sure). Having few parameters helps as 
well.

--
Jon Hanna
<http://www.hackcraft.net/>
*Thought provoking quote goes here*
Received on Wednesday, 12 November 2003 05:45:38 UTC