Clock skew and the importance of clock synchronization in HTTP servers.

The attached is exerpted from an Internet draft Jeff Mogul is submitting
to the ID editor today, with some very interesting trace data. 

As you know, bad dates in documents will badly affect caching behavior in 
the Web, up to and including serving documents long after they should have 
expired to unsuspecting users (with no way in HTTP/1.0 to ever force a reload 
on the cache, this problem will be with us for a long time until most 1.0 
proxies are gone)...

The situation is much worse than I believe most of us or all of us have 
realized. More than 1/5 of the servers are wrong by more than a minute.  
Ugh...  Shudder... Median errors are in the two minute range.

While I will be adding some text to the 1.1 spec encouraging clock 
synchronization for reliable caching operation, there are some concrete 
things that can/should be done by those who have influence over HTTP 
implementations and documentation.

1) installation directions and scripts for Web servers/prxies should strongly 
encourage the use of clock synchronization (e.g. use of NTP or equivalent).  
In server installation directions I've seen, there has never been any mention 
of this topic (not that I've installed a server recently).

2) server implementors might consider some "sanity checks" in their code 
to warn operators that their systems are likely running badly synchronized. 
I can think of some heuristics that might work. I can think of ugly hacks 
like looking for the existance of an NNTP server running.  It may be 
that the system call interfaces to adjusting clocks might or might not be 
useful to warn operators (it's been too long since I looked at how NTP is 
commonly implemented, and whether those system call interfaces provide 
applications useful information on whether the clock is running within the 
phase lock capture range)....  Exactly what might/should be done
here is not completely clear and maybe worth discussion.

In any case, I think at a minimum installation directions for Web servers 
and proxies should get some work to encourage better practice, even if not 
a line of code changes in the software itself.  This is a call for us to 
go poke our respective documentation folks on this topic... 				- Jim


				- Jim Gettys



>From Jeff Mogul's draft on the Age computation problem, being submitted
later today....

   Is clock skew a real problem?  Unfortunately, I know of no systematic
   study of HTTP client clock skews.  This is difficult, in part,
   because HTTP requests generally do not include a Date header.

   However, since I do have access to a trace of the headers flowing
   through a proxy whose clock, at the time of the trace, was carefully
   synchronized using NTP, I was able to look at the clock-skew
   distribution of a large set of HTTP servers.  (The trace covers 22034
   distinct server IP addresses.)  While this is not the same as a
   population of HTTP clients, one might actually expect a set of HTTP
   servers to have better clock synchronization characteristics than a
   set of HTTP clients.  After all, many HTTP clients run on personal
   computers or workstations, and are managed by non-experts; most Web
   servers on the Internet have at least some semblance of
   administration (e.g., someone at least had to obtain a DNS name).  In
   other words, whatever the situation with Web server clocks, one would
   expect the situation among clients to be worse.

   For each response in the trace, I compared the Date header field
   value (if any) to the proxy's NTP-synchronized timestamps for the
   start of the connection and the end of the connection.  If the
   server's clock is accurate, the Date value ought to be between those
   two timestamps.  If the server's clock is slow, the Date value would
   be lower than the start-timestamp; if the server's clock is fast, the
   Date value would be higher than the end-timestamp.

   Because of the 1-second granularity of Date, I treated as "valid" any
   values less than 1 second in error.  I also treated as "obviously
   bogus" any Date where the server's clock appeared to be more than 1
   day wrong, since one could assume that such a badly skewed server
   clock would be abnormal.

   The trace contained 503969 responses with parsable response headers.
   Of these, only 286779 actually had Date headers (most of the rest
   appear to be PointCast responses).  1087 of these had Date values
   that were clearly bogus (by the "1-day-wrong" test).  Of the others,
   116966 (41%) showed a server with a "slow" clock (by at least one
   second), and 83782 (29%) showed a "fast" clock.  Only 84944 (30%) had
   apparently-synchronized clocks.

   What if we set the threshold for an OK clock at +/- 60 seconds
   (which, by the earlier analysis, is somewhat larger than the
   Error_C_bound for N = 6 and Max_RTT = 2)?  In this case, we still
   find 79443 (27%) responses indicating "slow" clocks, and 56429
   responses (20%) indicating "fast" clocks.  In other words, a lot of
   the clocks are off by a lot of time.

   Using the 1-second threshold, the mean error in the slow clocks is
   1287 seconds, with a median error of 113 seconds.  For the fast
   clocks, the mean error is 1383 seconds, with a median of 97 seconds.

   Using the 60-second threshold, the mean error in the slow clocks is
   1884 seconds, with a median error of 198 seconds.  For the fast
   clocks, the mean error is 2039 seconds, with a median of 152 seconds.
   (We're removing the small-error samples from these sets, so we're
   left with sets biased towards high-error samples.)

   In summary, clock skew seems to be prevalent among HTTP servers, and
   the skews seem to be fairly large.  One might be justified in
   guessing that the situation is worse among HTTP clients.

      NOTE: I should reanalyze this data, breaking it down by server
      address, rather than by response, but that will have to wait
      for another draft of this document.

Received on Friday, 12 September 1997 06:13:17 UTC