RE: ACTION-541: Jeni to help Dan pull together terminology on Deep Linking

Well, sure: "server with copies of web content".

> My feeling is that first, having several dimensions on which we might classify a server as cache or archive

Classification is hard when there are hybrid services that operate as both, as you show. Why do we need to 'categorize' them vs 'describe' them?

> .... makes the distinction fuzzy and therefore hard to use (what if it falls into one category on one dimension and into another on another dimension?); and second, that it is easier for people to classify a particular server if we focus on behaviour rather than concepts such as "purpose" and "value" which can be harder to pin down.

I thought "purpose" and "value" were interesting, but I agree they're not evaluable.

> For example, take a search engine that stores copies of the pages that its crawlers retrieve and then provides users with access to these stored copies at a separate URI. Is this a cache or an archive?

Interesting hybrid case. Clearly not a "cache" in the classical sense; not clearly an "archive", though.
 
> You could view the purpose of the copy as being a means of improving network performance (in its general sense)

Oh no, I didn't mean that a "cache" improves performance in its general sense. I meant specifically the latency of retrieval: lessening the time it takes to fetch something, without interfering significantly with the operation.

> ... because it enables users to continue to access information even when the origin server is inaccessible. 

That's an archival function, not a cache function.
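
To pin that down with a sketch (the URIs are hypothetical; the headers are standard HTTP/1.1): a cache answers at the *original* URI and is governed by freshness, e.g.

    GET /page HTTP/1.1
    Host: example.org

    HTTP/1.1 200 OK
    Cache-Control: max-age=3600
    Age: 120

The cache may answer from its stored copy only while that copy is fresh (here, less than an hour old); after that it must revalidate with the origin. An archive exposes its copy at a *separate* URI -- say http://archive.example.net/http://example.org/page -- and serves it whether or not example.org is reachable, however stale the copy has become.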

> Certainly, a search engine will only keep one copy of the page and will endeavour to keep that copy up to date, and the more out of date the page, the less valuable it is for the search engine's primary task of providing people with up-to-date relevant information.

Good point, demonstrating that the "value" dimension isn't criterial.

> On the other hand, the search engine will probably keep its copy around for a little while even if the original page disappears (and provides benefit to its users by doing so). It won't obey any of the Cache-control headers that were present when the page was retrieved when someone accesses the search engine's copy (at its separate URI): it is explicitly providing access to a copy just as an archive does.

I'm not sure of the relationship between Cache-Control headers and robots.txt. I would think that a page marked "do not cache, ever" should also not be retained by a search engine.
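
As far as I can tell the mechanisms are disjoint today: robots.txt only controls what may be crawled, Cache-Control speaks to HTTP caches, and the nearest thing to "do not expose a stored copy" is the separate noarchive robots directive. A sketch (the directive names are real; the values are only illustrative):

    # robots.txt: controls crawling, says nothing about stored copies
    User-agent: *
    Disallow: /private/

    HTTP/1.1 200 OK
    Cache-Control: no-store
    X-Robots-Tag: noarchive

Cache-Control: no-store asks HTTP caches not to retain a copy; X-Robots-Tag: noarchive (or <meta name="robots" content="noarchive"> in the markup) asks search engines not to expose their stored copy. Nothing formally ties the two together, which is exactly the gap here.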

> Another example is the UK Government Web Continuity project, which you would generally classify as an archive (it provides copies at separate URIs, keeps them indefinitely) but has an additional purpose of providing continuing access to material which is removed from department websites -- this being particularly important after a change of administration. When used for *that* purpose, the value of the pages in the archive diminishes over time, as the pages become older and out of date.

Hmmm, this sounds like a pure archive, and I just disagree with your "value ... diminishes over time ...": the value of continued access to removed material doesn't depend on the pages staying current.
 
> So let's see if we can find a different term than 'cache' for the general concept, explore these different features, and try to pin down the definitions of 'cache' and 'archive' more tightly. Perhaps there are existing definitions we can reuse?

"server with copy of stuff"

covers both "cache" and "archive"; do we need anything fancier?

Received on Friday, 8 April 2011 19:10:02 UTC