Automatic Inspection of Caches from huw@dcs.gla.ac.uk on 1995-09-16 (www-lib@w3.org from July to September 1995)

From: <huw@dcs.gla.ac.uk>
Date: Sat, 16 Sep 95 18:10:01 BST
To: www-proxy@www0.cern.ch
Cc: www-lib@www0.cern.ch, huw@dcs.gla.ac.uk
Message-Id: <9509161710.AA09627@barren.dcs.gla.ac.uk>
At the moment, clients cannot tell whether the page they are
looking at in a cache has been updated at its source since the
copy was cached.  This e-mail proposes a protocol for cache
coherency of Web pages.

I have cc'd this to www-lib as I'm not sure if www-proxy is still
a valid list.

Huw Evans
Research Assistant
Department of Computing Science
Glasgow University
Glasgow
Scotland

My proposal is for the client to query the server to see whether the
page has changed since it was last retrieved.  When the page is first
retrieved, it is tagged with the time at the originating server
(called St).  This information is stored along with the page at the
local cache server.  When anybody requests the file at a site which
holds a cached copy, a query is made to the originating server asking
whether the page has been updated since St.  This has the advantage
that both of the times compared are local to the originating server,
so clock skew between machines is not an issue (if the server's system
clock has been put back, that's up to them and we can do nothing
about it).  The query and the reply are extremely small messages and
the time comparison is trivial.
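
The exchange described above can be sketched roughly as follows.  This
is only an illustration, not a definitive implementation: the names
OriginServer, CacheServer, fetch and changed_since are hypothetical,
the server clock is modelled as a simple counter, and the network is
replaced by direct method calls.

```python
from dataclasses import dataclass, field


@dataclass
class OriginServer:
    """Hypothetical originating server; its clock is a plain counter."""
    clock: int = 0
    pages: dict = field(default_factory=dict)  # url -> (body, modified_at)

    def publish(self, url, body):
        # Record the server-local time at which the page was last changed.
        self.clock += 1
        self.pages[url] = (body, self.clock)

    def fetch(self, url):
        # Return the page tagged with St, the server's time of retrieval.
        self.clock += 1
        body, _ = self.pages[url]
        return body, self.clock

    def changed_since(self, url, st):
        # Both times come from this server's clock, so skew is irrelevant.
        _, modified_at = self.pages[url]
        return modified_at > st


@dataclass
class CacheServer:
    """Hypothetical local cache that stores each page along with its St."""
    origin: OriginServer
    store: dict = field(default_factory=dict)  # url -> (body, St)

    def get(self, url):
        if url in self.store:
            body, st = self.store[url]
            if not self.origin.changed_since(url, st):
                return body, "cached"
        # First retrieval, or the page changed: fetch and keep the new St.
        body, st = self.origin.fetch(url)
        self.store[url] = (body, st)
        return body, "fetched"
```

With one page published, the first get() fetches it, a second get()
is answered from the cache, and a get() after the page is republished
fetches the new copy.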

Doing this Across the Internet
------------------------------

If the page has not been updated, an appropriate message is sent back
and the cached copy is used.

If the page has been updated a new copy (with the new St) is sent back
and is cached locally.

I would assume that pages are changed relatively infrequently, so
the majority of the time the reply to the query will be
Control-cache: use-cached-copy.

All of this has to work in the face of errors.  The query takes place
across the Internet, and a reply may never arrive (for example, there
is no route to the server, or the server is down) or may take too
long to arrive (for example, the network or the server is very slow).
When no reply is forthcoming within a certain period of time, the
server is treated as unreachable and the local copy must be used.
The user should, however, be informed in a window, as they need to be
aware that they are potentially looking at old data.
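
A minimal sketch of this fallback is given below.  The function name
get_page and the query_origin callable are assumptions made for the
example: query_origin stands in for the real network query and is
taken to return None when the page is unchanged, to return the new
body when it has changed, and to raise OSError (of which Python's
built-in TimeoutError is a subclass) when the server cannot be reached
in time.

```python
def get_page(cached_body, query_origin):
    """Return (body, warning); warning is None when the copy is known fresh."""
    try:
        new_body = query_origin()
    except OSError as exc:
        # No reply: fall back to the local copy, but tell the user.
        warning = (
            f"Could not reach the originating server ({exc}); "
            "this cached copy may be out of date."
        )
        return cached_body, warning
    if new_body is None:
        # Server confirmed that the cached copy is still current.
        return cached_body, None
    return new_body, None
```

The warning string is what a browser would show in its window; how
long to wait before giving up is left to whatever implements
query_origin.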

Another reason for the query to fail is that the document has moved or
been removed.  If the document has been moved, its new location has
to be contacted to see whether the document has changed, and the
steps above are repeated there, possibly leading on to yet another
location.  If the document has been removed, the user should be
informed, as they are looking at a cached copy of data that the author
has removed for some reason and they should be aware of this.
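
As a rough sketch of the moved/removed cases, assume a hypothetical
query function ask(url, st) that returns one of four made-up replies:
("unchanged", None), ("changed", new_body), ("moved", new_url) or
("removed", None).  None of these reply names come from an existing
protocol.  The sketch also reuses St unchanged at the new location,
which is only safe if the new location keeps the same clock; the
limit on the number of moves followed is likewise an assumption.

```python
def check_cached(url, st, ask, max_hops=5):
    """Follow 'moved' replies; return (status, value, final_url)."""
    for _ in range(max_hops):
        status, value = ask(url, st)
        if status != "moved":
            # "unchanged", "changed" or "removed": nothing more to follow.
            return status, value, url
        url = value  # repeat the query at the new location
    # Give up rather than chase a long or circular chain of moves.
    return "too-many-moves", None, url
```

A caller would use the cached copy on "unchanged", replace it on
"changed", and warn the user on "removed" or "too-many-moves".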

The issue of finding a document that has moved should be treated in a
separate discussion as it is a major piece of work in its own right.

Composite Pages
---------------

As pages are made up of a number of different underlying files
(e.g. HTML, GIFs, audio samples), a page is deemed to have changed if
any one of its constituent parts has changed.  The challenge is to
send only a minimal amount of data.  For example, a page may consist
of some HTML, a GIF and an audio sample.  If only the GIF has been
changed then, ideally, only the GIF should be sent.  The minimum may
not be achievable for all pages, but aiming for it should reduce the
amount of data that has to be served on average.  I would assume that
HTML changes more frequently than any other constituent part of a
page, which is favourable, as HTML is ASCII text and so is fast to
transfer.
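
One way to picture this is to keep a separate St for every
constituent part and to query each part on its own.  The sketch below
assumes exactly that; refresh_page, changed_since and fetch are
hypothetical names, and the two callables stand in for queries to the
originating server.

```python
def refresh_page(cached_parts, changed_since, fetch):
    """cached_parts maps part name -> (body, St).

    Returns (updated parts, list of part names that had to be re-sent).
    """
    updated = {}
    resent = []
    for name, (body, st) in cached_parts.items():
        if changed_since(name, st):
            # Only this part is transferred again, with its new St.
            updated[name] = fetch(name)
            resent.append(name)
        else:
            updated[name] = (body, st)
    return updated, resent
```

The page as a whole counts as changed whenever the list of re-sent
parts is non-empty, yet only those parts cross the network.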
Received on Saturday, 16 September 1995 13:11:45 UTC