Re: ACTION-541: Jeni to help Dan pull together terminology on Deep Linking

Larry,

I'm happy to use a term other than 'cache' (anyone have suggestions?), but I think we need to have an overarching term for 'serving a copy of a representation originally hosted on another server'.

Your descriptions of caches and archives identify several properties that can be used to differentiate between them:

  * purpose: improving network performance vs providing long-term accessibility
  * value: cached material is less valuable over time vs archived material more so
  * access control: follows original access controls vs imposes new access controls

to which we can add, as Chris pointed out:

  * uri: uses original URI vs uses new URI

My feeling is, first, that having several dimensions on which we might classify a server as a cache or an archive makes the distinction fuzzy and therefore hard to use (what if it falls into one category on one dimension and into the other on another?); and second, that it is easier for people to classify a particular server if we focus on behaviour rather than on concepts such as "purpose" and "value", which can be harder to pin down.

For example, take a search engine that stores copies of the pages that its crawlers retrieve and then provides users with access to these stored copies at a separate URI. Is this a cache or an archive?

You could view the purpose of the copy as being a means of improving network performance (in its general sense) because it enables users to continue to access information even when the origin server is inaccessible. Certainly, a search engine will only keep one copy of the page and will endeavour to keep that copy up to date, and the more out of date the page, the less valuable it is for the search engine's primary task of providing people with up-to-date relevant information.

On the other hand, the search engine will probably keep its copy around for a little while even if the original page disappears (and provide benefit to its users by doing so). When someone accesses the search engine's copy (at its separate URI), it won't obey any of the Cache-Control headers that were present when the page was retrieved: it is explicitly providing access to a copy, just as an archive does.
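
To make the fuzziness concrete, here is a rough sketch (in Python, purely illustrative) of what classifying a server along the four dimensions listed above might look like. The names Server and classify are made up for this example, and the values given for the search engine are just one possible reading of the description above; access control is left unset because nothing above pins it down.

```python
from dataclasses import dataclass
from typing import Optional

CACHE, ARCHIVE = "cache", "archive"


@dataclass
class Server:
    """Hypothetical record: one verdict per dimension, or None if undetermined."""
    purpose: Optional[str] = None         # network performance vs long-term accessibility
    value: Optional[str] = None           # less valuable over time vs more valuable
    access_control: Optional[str] = None  # follows original controls vs imposes new ones
    uri: Optional[str] = None             # original URI vs new URI


def classify(server: Server) -> str:
    """Return 'cache' or 'archive' only if every known dimension agrees."""
    verdicts = {
        v
        for v in (server.purpose, server.value, server.access_control, server.uri)
        if v is not None
    }
    if verdicts == {CACHE}:
        return CACHE
    if verdicts == {ARCHIVE}:
        return ARCHIVE
    return "unclear"  # dimensions disagree, or nothing is known


# One reading of the search engine example (an assumption, not a fact):
# cache-like in purpose and value, archive-like in its use of a separate URI.
search_engine_copy = Server(purpose=CACHE, value=CACHE, uri=ARCHIVE)
print(classify(search_engine_copy))
```

With these assumed values the dimensions disagree, so the function cannot settle on either label, which is exactly the difficulty with a multi-dimensional definition.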

Another example is the UK Government Web Continuity project, which you would generally classify as an archive (it provides copies at separate URIs and keeps them indefinitely) but which has the additional purpose of providing continuing access to material that is removed from department websites -- this being particularly important after a change of administration. When used for *that* purpose, the value of the pages in the archive diminishes over time, as the pages become older and out of date.

So let's see if we can find a different term than 'cache' for the general concept, explore these different features, and try to pin down the definitions of 'cache' and 'archive' more tightly. Perhaps there are existing definitions we can reuse?

Cheers,

Jeni
-- 
Jeni Tennison
http://www.jenitennison.com

Received on Friday, 8 April 2011 07:58:21 UTC