Interesting stats on the prevalence of bogus Last-Modified times

For a paper that I'm working on (with several other people), I needed
to extract the "Last-Modified" values from a proxy trace that I made
last December.  The trace contains 504736 records, representing the
activity of 7411 distinct client hosts, accessing 22034 distinct
servers, referencing 238663 distinct resources (URLs).  I.e., it's a
significant slice of the Web.

Anyway, we were surprised to find that a significant fraction of
the Last-Modified values appeared to be in the future; i.e., the
Last-Modified time was actually newer than the time at which
the request was completed.  (We timestamped our log entries
on a system synchronized with NTP to a nearby GPS clock.)

One part of the problem turned out to be a bug in the date-parsing
code that we borrowed from the CERN httpd program.  If you are
using a routine called parse_http_time() from this code, you might
want to check that it gives the right values in all cases.  In
particular, if daylight saving time is in effect when you parse
the date, but not at the time specified by the date (or vice versa),
the result may be wrong by an hour.
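
To illustrate the class of bug (this is not the CERN code, just a
minimal sketch in Python): the error arises when GMT date fields are
converted through a local-time routine, so that the local DST rules
leak into the result. The function name http_date_to_epoch is my own
invention for this example; parsedate_tz and mktime_tz are from the
Python standard library.

```python
from email.utils import parsedate_tz, mktime_tz


def http_date_to_epoch(value):
    """Convert an HTTP date string to seconds since the epoch (UTC).

    Hypothetical helper for illustration only.
    """
    parsed = parsedate_tz(value)
    if parsed is None:
        raise ValueError("unparseable HTTP date: %r" % (value,))
    # When the string carries a zone (e.g. "GMT"), mktime_tz() does the
    # conversion in UTC, so the local timezone and its DST rules never
    # enter the calculation.  A string with no zone at all would fall
    # back to local time, which is exactly the case to be wary of.
    return mktime_tz(parsed)
```

A quick sanity check for any such routine is to parse one winter date
and one summer date with the same time of day: the results should
differ by a whole number of days, never by a stray hour.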

Anyway, after fixing that bug, we still found that somewhat over
1% of the traced responses had a "future" Last-Modified value
or a "future" Date value.
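
For anyone who wants to run the same check on their own logs, here is
a rough sketch of the comparison. It is not our analysis code; the
record layout (a pair of completion timestamp and raw Last-Modified
header) and the name future_fraction are assumptions made up for this
example.

```python
from email.utils import parsedate_tz, mktime_tz


def future_fraction(records):
    """Return the fraction of parseable records whose Last-Modified
    time is later than the time the request completed.

    records: iterable of (completed_at_epoch, last_modified_header)
    pairs.  This layout is hypothetical, not our trace format.
    """
    total = 0
    future = 0
    for completed_at, header in records:
        parsed = parsedate_tz(header)
        if parsed is None:
            continue  # unparseable date: skip it rather than guess
        total += 1
        # Strictly later than the logged completion time counts as "future".
        if mktime_tz(parsed) > completed_at:
            future += 1
    return future / total if total else 0.0
```

The comparison is only meaningful if the logging host's own clock is
trustworthy, which is why the NTP synchronization matters.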

These tended to fall into two apparent categories:
	(1) servers that probably had their clocks set wrong
	(2) servers that sent non-GMT Last-Modified values.

For example, a large fraction of the "future" values were just
a little bit in the future.  A suspiciously large spike in
the distribution of errors appears at around 60 seconds;
it looks like some people set their clocks to the right second,
but the wrong minute.

Other, smaller spikes appear near multiples of 3600 seconds
(one hour); these may come from servers sending times in non-GMT
timezones, or from people who have set their clocks
to the right minute, but the wrong hour.  For some reason,
there is a spike near 3.5 hours; maybe this is from sites
in one of the places where the timezone offset is not an
integral number of hours.  Finally, there are a few sites
that seem to be off by exactly one day.
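
The spikes described above could be picked out mechanically with
something like the following sketch. The bucket names, the function
name classify_skew, and the 5-second tolerance are all assumptions
chosen for illustration; we did not use these exact thresholds.

```python
def classify_skew(seconds, tolerance=5):
    """Label a "future" offset (in seconds) by the spike it falls near.

    tolerance is an arbitrary illustrative value, not a measured one.
    """
    if abs(seconds - 60) <= tolerance:
        return "one minute"
    # Check one day before whole hours, since 86400 is also a
    # multiple of 3600.
    if abs(seconds - 86400) <= tolerance:
        return "one day"
    # Find the nearest multiple of 30 minutes; even multiples are
    # whole hours, odd multiples are half-hour timezone offsets
    # (e.g. 3.5 hours).
    half_hours = round(seconds / 1800)
    if half_hours >= 1 and abs(seconds - half_hours * 1800) <= tolerance:
        return "whole hours" if half_hours % 2 == 0 else "half-hour offset"
    return "other"
```

Feeding every positive offset through such a function and counting
the labels gives a crude version of the distribution I described.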

Of course, some of the Last-Modified dates might be set into
the future for some bizarre caching-related reason, but this
seems rather unlikely to be of actual benefit.

-Jeff

Received on Thursday, 3 July 1997 18:08:14 UTC