- From: Jeffrey Mogul <mogul@pa.dec.com>
- Date: Fri, 29 Dec 95 10:52:44 PST
- To: http-caching@pa.dec.com

Koen writes:

    On a related note, I recently discovered that the Netscape client cache, if configured to `verify document: every time', will indeed do a conditional GET for every new request on a resource that lacks an Expires header. Eek. I thought that `verify document' applied to conditional GETs on expired documents only, so I had enabled this option on my Netscape copy.

    I am a bit disturbed by Netscape having this cache configuration option at all. If only 10% of Netscape users enable it, they will cause an enormous increase in the number of conditional GETs going over the net.

I think this (Netscape's "verify document always") feature may be a symptom of the relatively poor coverage of "Expires:" in the current Web.

Most of you know about http://altavista.digital.com, the Web crawler and search engine developed by several of my colleagues in Digital's research labs. It turns out that the crawler logs all of the response headers for all of the pages it has retrieved. So I decided to survey those headers to see how Expires: is currently used.

Actually, I looked at the results of a test crawl that was done several months ago, not the one that was used to populate the existing database. For various reasons (such as the enormous amount of data involved, other loads on the machine in question, and a power failure during my log analysis), I was only able to analyze about 3 million headers. It's possible that these are not an accurate sample of the entire crawlable Web, but I have no a priori reason to believe otherwise. However, I suspect that some parts of the Web not accessible to crawlers (for example, stock quote services) are more dynamic and may make more use of Expires: headers.

Anyway, the results: of 3094665 responses that I analyzed, 7031 had Expires: headers. That's about 0.23%.
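
As a quick check of that percentage, the arithmetic can be reproduced in a few lines. This is only a sketch: the two counts come from the message above, while the variable names are made up for illustration.

```python
# Counts reported in the survey above; variable names are illustrative only.
total_responses = 3094665
with_expires = 7031

# Fraction of responses that carried an Expires: header.
fraction = with_expires / total_responses

# Format as a percentage with two decimal places.
print(f"{fraction:.2%}")
```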

Since the logs are broken down into chunks of about 90K responses, I was able to determine that in no group of 90K responses were Expires: headers used more than about 0.35% of the time, or less than about 0.13% of the time. In other words, the fraction seems relatively stable across large numbers of URLs.

I also looked at the individual Expires: values, and found some interesting things. First of all, servers are not consistent about the date format they use. I found:

    Mon Sep 18 19:11:16 1995
    Mon, 18 Sep 1995 00:31:15 GMT
    Mon, 18-Sep-95 04:22:18 GMT

I also found these values:

    0
    1 Jan 1970 00:00:00 UT
    now
    Mon, 01 Jan 1900 00:00:00 GMT
    Mon, 01-Jan-1990 00:00:00 GMT

which are different ways of encoding "already expired".

I found a few values far in the future:

    Fri, 31 Dec 1999 23:59:59 GMT

(someone still thinks the world will end before 1/1/2000).

This value looks a little dubious, both because the 1.1 draft is quite specific about using GMT only, and because the asctime date is not supposed to include a timezone anyway:

    Mon Sep 18 00:30:00 EDT 1995

Finally, I found these definitely bogus values:

    , GMT
    , 16--95 16:13:58 GMT
    16:08:57 GMT/3.0
    , 16--95 16:58:14 GMT
    16:53:14 GMT
    180unday, 17-Sep-95 17:38:22 GMT
    Mon, 18 Sep 1995 0-18:08:00 GMT

So in summary I would say that it might well be sadly reasonable to ignore "Expires:" today, since it's almost never used, and when it is used, it is often clearly bogus.

-Jeff
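
To illustrate how a cache might sort Expires: values like those listed above, here is a minimal sketch using Python's standard `datetime.strptime`. It is not the analysis used for the survey. The layout labels and the function name `classify_expires` are made up for this example, and the sketch assumes an English (C) locale for day and month names.

```python
from datetime import datetime

# The three date layouts observed in the survey. The labels are
# hypothetical names chosen for this sketch, not terms from the message.
LAYOUTS = {
    "asctime": "%a %b %d %H:%M:%S %Y",                # Mon Sep 18 19:11:16 1995
    "four-digit-year": "%a, %d %b %Y %H:%M:%S GMT",   # Mon, 18 Sep 1995 00:31:15 GMT
    "dashed-two-digit-year": "%a, %d-%b-%y %H:%M:%S GMT",  # Mon, 18-Sep-95 04:22:18 GMT
}


def classify_expires(value: str) -> str:
    """Return the label of the first layout that parses, else 'unrecognized'."""
    text = value.strip()
    for label, layout in LAYOUTS.items():
        try:
            datetime.strptime(text, layout)
            return label
        except ValueError:
            continue  # try the next layout
    return "unrecognized"


if __name__ == "__main__":
    samples = [
        "Mon Sep 18 19:11:16 1995",
        "Mon, 18 Sep 1995 00:31:15 GMT",
        "Mon, 18-Sep-95 04:22:18 GMT",
        "0",
        "now",
        "Mon Sep 18 00:30:00 EDT 1995",
        "180unday, 17-Sep-95 17:38:22 GMT",
        "Mon, 18 Sep 1995 0-18:08:00 GMT",
    ]
    for sample in samples:
        print(f"{sample!r}: {classify_expires(sample)}")
```

A strict parser of this kind rejects the "0", "now", timezone-bearing, and garbled values alike. What a cache should then do with a rejected value is a separate policy decision that the sketch does not address.
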
Received on Friday, 29 December 1995 18:57:32 UTC