Re: Some data related to the frequency of cache-busting

Jeffrey Mogul wrote:
> 
> If someone would like to propose a *feasible* filter on URLs
> and/or response headers (i.e., something that I could implement
> in a few dozen lines of C) that would exclude other CGI
> output (i.e., besides URLs containing "?" or "cgi-bin", which
> I already exclude), then I am happy to re-run my analysis.

You can check for everything that ends with ".cgi" or ".nph", as well as
everything that starts with "nph-". Don't forget that CGI scripts can have
trailing path info, so the script name isn't necessarily the last
component of the URL.
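
Something along these lines would fit in a few dozen lines of C, I think.
This is only a rough sketch (the function name is mine, and you'd hook it
into whatever your trace analysis already uses for the "?" and "cgi-bin"
checks):

#include <string.h>

/* Return 1 if the URL looks like CGI output, 0 otherwise.  A path
 * component counts if it ends in ".cgi" or ".nph" or starts with
 * "nph-", so a script with trailing path info is still caught.
 * The "?" and "cgi-bin" cases are assumed to be handled elsewhere. */
static int looks_like_cgi(const char *url)
{
    const char *p = url;

    for (;;) {
        const char *end = strchr(p, '/');
        size_t len = end ? (size_t)(end - p) : strlen(p);

        if (len >= 4 && (strncmp(p + len - 4, ".cgi", 4) == 0 ||
                         strncmp(p + len - 4, ".nph", 4) == 0))
            return 1;
        if (strncmp(p, "nph-", 4) == 0)
            return 1;
        if (!end)
            return 0;
        p = end + 1;
    }
}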

> Drazen Kacar pointed out that I should probably have
> excluded .shtml URLs from this category, as well, because
> they are essentially the same thing as CGI output.  I checked
> and found that 354 of the references in the trace were to .shtml
> URLs, and hence 10075, instead of 10429, of the references
> should have been categorized as possibly cache-busted.  (This
> is a net change of less than 4%.)

There is a short (3 character) extension as well, but I don't know which
one. I think it's ".shm", but I'm not sure. You'll probably get an
additional percent or two if you include all of these.

>     I would say the only *confirmable* deliberate cache busting done
>     are the 28 pre-expired responses. And they are an insignificant
>     (almost unmeasurable) percentage of the responses.

Some of them are probably due to the HTTP/1.0 protocol and could have been
cacheable if the server could count on the Vary header being recognized by
the client.
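
Just to illustrate what I mean (the dates below are made up): a pre-expired
response is simply one whose Expires is not later than its Date, e.g.

HTTP/1.0 200 OK
Date: Tue, 03 Dec 1996 00:00:00 GMT
Expires: Tue, 03 Dec 1996 00:00:00 GMT

Under HTTP/1.1 the same server could leave the response cacheable and just
name the request header its variants depend on (Vary: Accept-Language,
Vary: Accept-Charset, whatever the case may be), if it could trust that to
be honoured.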

> In short, if we are looking for a rigorous, scientific *proof*
> that cache-busting is either prevalent or negligible, I don't
> think we are going to find it in traces, and I can't think of
> where else one might look.

I can. On-line advertising mailing lists. I'm subscribed to one of those,
not because it's my job, but to stay in touch with what's happening on the
web. I'm just a lurker there (OK, I'm a lurker here as well, but not
because I want to be; I can't find time to read the drafts, and I'm at
least two versions behind on those I did read).

People on the list are professionals and experts in their field, but not
in HTML or HTTP. A month ago somebody posted "a neat trick" which had
these constructs in its HTML source:

<FONT FACE="New Times Roman" "Times Roman" "Times" SIZE=-1>...</FONT>
<A HREF=...><TABLE>...</TABLE></A>

Then somebody else pointed out that Netscape won't make the whole table
clickable if it's contained in an anchor. The answer from the original
author started with "For some reason (and I don't know why) it seems that
Netscape can't...". I let that one pass to see if anyone would mention
DTDs, syntax, validators or anything at all. No one did. This is viewed
as a lack of functionality in NSN, and not as truly horrible HTML.
To be fair, I must mention that most of them know a thing or two about
the ALT attribute and are actively fighting for its use. They probably
don't know it's required in AREA, but IMG is a start. My eternal
gratitude to the people who are fighting on comp.infosystems.www.html.
I stopped years ago.

Another example is HTTP-related. There was talk about search engines, and
one person posted that cheating them is called "hard working". Then there
was a rush of posts saying that this is not ethical, and that a page whose
text repeats keywords over and over could come up at the top of the
result list, but would look horrible when the customer actually requests
the page. No one mentioned that you can deliver one thing to the search
engine and another to the browser.

To conclude: marketing people are clueless about HTML and (even more so)
HTTP, and they can't participate on this list. It's not that they wouldn't
want to. They have needs, and if those needs are not met by HTTP, responses
will be made uncacheable as soon as they figure out how to do it.
I'm doing the same thing because of charset problems. It's much more
important to the information provider that users get the right code page
than to let a proxy cache the wrong one. OK, I'm checking for the HTTP/1.1
things which indicate that I can let the entity body be cached, but
requests with those are not coming in right now and (reading the wording
in the HTTP/1.1 spec) I doubt they ever will.
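
The kind of check I mean looks roughly like this (only a sketch; the
function and variable names are made up, and a real check would have to
parse q-values and "*" instead of doing a crude substring match):

#include <stdio.h>
#include <string.h>

/* Decide whether a response in the given code page may be left cacheable.
 * "accept_charset" is the value of the request's Accept-Charset header,
 * or NULL if the request didn't carry one; "charset" is the code page we
 * are actually sending. */
static void emit_cache_headers(const char *accept_charset, const char *charset)
{
    if (accept_charset != NULL && strstr(accept_charset, charset) != NULL) {
        /* The client said what it accepts and we send exactly that, so a
         * proxy may cache the entity as one variant. */
        printf("Vary: Accept-Charset\r\n");
    } else {
        /* No way to negotiate -- better an uncacheable response than a
         * cached page in the wrong code page. */
        printf("Expires: Thu, 01 Jan 1970 00:00:00 GMT\r\n");
    }
    printf("Content-Type: text/html; charset=%s\r\n", charset);
}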

A few examples of what's needed...

Suppose I need high-quality graphics for a page, but they're not mandatory.
I'll make two versions of the pictures; one will have small files and the
other will (I can't do anything about it) have big files. I can determine
via feature negotiation whether the user's hardware and software can
display high-quality pictures, but not whether the user wants them, i.e.
whether the bandwidth is big enough or the user is prepared to wait.
So I'll display low-res pictures by default and put a link to the same
page with high-res graphics. The user's preference will be sent back to
him in a cookie. It's really, really hard and painful to maintain two
versions of the pages just for this, and I'd want my server to select the
appropriate picture based on the URL and the particular cookie. What
happens with the proxy? I can send "Vary: Cookie", but this is not enough.
There'll be other cookies. On a really commercial site there'll be one
cookie for each user. People are trying to gather information about their
visitors. I can't blame them, although I have some ideas about preventing
this. (I'll have to read the state management draft, it seems.) Anyway,
this must be made non-cacheable. Counting on LOWSRC is not good enough.
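
To make it concrete (the cookie name and values are made up): what I would
want is for a proxy to treat a request like

GET /page.html HTTP/1.1
Cookie: quality=high

answered with

HTTP/1.1 200 OK
Vary: Cookie

as one cacheable variant per cookie value. But as soon as the same Cookie
header also carries a per-user ID (Cookie: quality=high; visitor=user12345),
every request becomes its own "variant", Vary: Cookie caches nothing, and
the only safe thing left is to make the response uncacheable.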

Another thing is ad banners. Some people are trying not to display the
same banner more than 5 or 6 times to a particular user. The information
about visits is stored in (surprise, surprise) a cookie. The same thing
applies, again.

I think that technical experts should ask the masses what's needed. Don't
expect the response in the form of an Internet Draft, though.

-- 
Life is a sexually transmitted disease.

dave@fly.cc.fer.hr
dave@zemris.fer.hr

Received on Tuesday, 3 December 1996 00:55:20 UTC