Mapping versions to URLs from Jim Whitehead on 1997-09-23 (www-push@w3.org from July to September 1997)

From: Jim Whitehead <ejw@ics.uci.edu>
Date: Tue, 23 Sep 1997 01:41:15 -0700
To: "'Arthur van Hoff'" <avh@marimba.com>
Cc: Push Workshop <www-push@w3.org>, DRP Mailing List <drp@marimba.com>
Message-ID: <01BCC7C1.C861CF00.ejw@ics.uci.edu>
Hi,

I've been thinking about Content-IDs recently, but in a different sense 
from the recent "what's the best syntax for the URIs" discussion.  I've 
been more concerned about the overall mapping between resource versions and 
the URL namespace.

Two assumptions underlying the use of Content-IDs in the DRP specification
are:
a) the same URL is used to identify multiple versions of the same resource
(with each version further identified by a Content-ID)
b) use of the Content-ID and Differential-ID headers with GET is the best
way to retrieve a difference between two versions of the same resource.

These assumptions seem to be predicated on a third assumption that
c) use of DRP should not modify the URL namespace.

I think the URL+Content-ID method of identifying individual versions of a 
resource is far from the best solution for DRP (or WebDAV).  A significant 
drawback to using the Content-ID header to identify a particular version of 
a resource is that it does not provide support for browsing the old 
versions.  If I wanted to make an HTML page (e.g., using the <A HREF=... 
construct) which lists a version history for a particular resource, there 
is no way I can use HTML to link to the individual versions of the resource 
using the URL+Content-ID header identification scheme.

Alternatives to the URL+Content-ID scheme do exist.  One solution is to 
assign a separate URI to each version of a resource.  Once each version has 
a separate URI, these versions can be linked from an HTML page.  These URIs
can be placed into an index file even more easily than a URI + Content-ID
pair.  Plus, existing HTTP/1.0 and HTTP/1.1 caches can cache each version
with no modification, since each version is a separate resource.
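
To illustrate the linking point, here is a minimal Python sketch that
builds a version history page from a list of per-version URIs.  The
function name and the example URIs are hypothetical and not part of DRP
or of this proposal; the only point is that once each version has its own
URI, an ordinary <A HREF=...> link is enough.

```python
from html import escape


def version_history_page(title, version_uris):
    """Return an HTML page with one ordinary link per version URI."""
    items = "\n".join(
        # Escape each URI for use in the attribute and in the link text.
        f'<LI><A HREF="{escape(uri, quote=True)}">{escape(uri)}</A></LI>'
        for uri in version_uris
    )
    return (
        f"<HTML><HEAD><TITLE>{escape(title)}</TITLE></HEAD>\n"
        f"<BODY><UL>\n{items}\n</UL></BODY></HTML>"
    )


if __name__ == "__main__":
    # Hypothetical per-version URIs, oldest first.
    print(version_history_page(
        "Version history of foo.html",
        ["foo.html/server_id_h4z26", "foo.html/server_id_h67z4"],
    ))
```

No equivalent page can be written for the URL+Content-ID scheme, because
an HTML link cannot carry a request header.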

However, using a separate URI for each version of a resource requires a
different mechanism than the Differential-ID header for retrieving
differences, and also affects the namespace.  Both problems are minor, and
I address each in turn.

Diff Mechanism.  Once each version has a separate URI, a DIFF method can be
used to retrieve a difference between any two resources, in particular the
special case of two resources which are different versions of the same
resource.  DIFF has significant performance advantages over GET.  Since the
Content-ID (as defined by DRP) and Differential-ID headers are unknown to
current servers, these servers will need to be modified to understand them.
When these headers are part of GET processing, this requires coding
extensions to GET, the most frequently used method on any Web server.  No
matter how you implement Content-ID and Differential-ID, they will slow down
processing of GET.  However, if the DIFF method is used, the slowdown to
GET processing (if any) is much, much smaller than the cost of modifying GET
to understand new headers, which may require a call to an external library.
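
The separation argued for above can be sketched in Python as follows.
This is only an illustration under stated assumptions: DIFF is a proposed
method, not an existing HTTP method, and how a DIFF request would name its
two resources is not specified here, so the two-argument handler, the
in-memory RESOURCES table, and the choice of unified diff output are all
hypothetical.

```python
import difflib

# Hypothetical store: every version is an ordinary resource with its own URI.
RESOURCES = {
    "foo.html/server_id_h4z26": "<p>Hello</p>\n",
    "foo.html/server_id_h67z4": "<p>Hello, world</p>\n",
}


def handle_get(uri):
    """GET stays a plain lookup: no version headers to parse or act on."""
    return RESOURCES[uri]


def handle_diff(base_uri, target_uri):
    """Hypothetical DIFF handler: a unified diff taking base to target."""
    base = RESOURCES[base_uri].splitlines(keepends=True)
    target = RESOURCES[target_uri].splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(base, target, fromfile=base_uri, tofile=target_uri)
    )


if __name__ == "__main__":
    print(handle_diff("foo.html/server_id_h4z26", "foo.html/server_id_h67z4"))
```

All difference computation lives in handle_diff, so the GET path carries
no extra work, and the two arguments need not be versions of the same
resource.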

I can just hear the rebuttal now, "but Jim, we *really* want to retrieve
differences with GET, since it works with the existing cache and server
infrastructure."  That is true, but only in a very limited context, since
the existing server and proxy infrastructure does not understand the
Content-ID or Differential-ID headers proposed for DRP.  Given that
DRP-capable proxies would need to store several versions of a resource
(adding new columns to their cache lookup tables), and would also possibly
need to compute differences between these versions (going beyond mere Vary
header support), I wonder whether supporting DRP as proposed is in fact more
difficult than supporting a new method (DIFF) and using all of the existing
cache infrastructure as-is (since each version is a separate resource, and
is cached separately, using its specific entity tag).

Namespace.  Returning to the effects on the URL namespace of having a
separate URL for each version of a resource, let me address this concern by
showing that there are simple solutions to this problem.  One possible
solution (there are many) is to make each leaf node of the original
namespace hierarchy tree into an internal node (i.e., to make it a
collection).  The versions of each resource can then be located under its
respective collection.  So, for example, suppose I had a resource foo.html
with 2 versions, each with its own Content-ID, as follows:

foo.html; Content-ID: xyz1
foo.html; Content-ID: 3456

This resource can be made into a collection, and the versions of foo.html 
can be placed under the collection:

foo.html/server_id_h4z26    (maps to old C-ID: xyz1)
foo.html/server_id_h67z4    (maps to old C-ID: 3456)

A GET on foo.html can be defined to return the latest version, identical to
the DRP default functionality.  Note that this is only one of many possible
ways a mapping between resource versions and URLs can be accomplished.
Note also that the same underlying delta storage technique can be used for
the original DRP proposal as for this proposal -- this is just suggesting a
different mapping of the storage to the namespace.  In this namespace
scheme, the URIs are opaque, and completely under the control of the
server.
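
As a rough Python sketch of this mapping: each resource path is treated as
a collection, each stored version gets a server-assigned opaque identifier
under it, and resolving the bare path returns the latest version.  The
class and method names, the in-memory storage, and the identifier format
are assumptions made for illustration only; a real server could just as
well keep deltas underneath.

```python
import secrets


class VersionedNamespace:
    """Hypothetical store: each resource path is a collection of versions."""

    def __init__(self):
        # path -> list of (opaque_id, body), oldest first
        self._versions = {}

    def put(self, path, body):
        """Store a new version and return its server-assigned URI."""
        versions = self._versions.setdefault(path, [])
        taken = {vid for vid, _ in versions}
        opaque_id = "server_id_" + secrets.token_hex(3)
        while opaque_id in taken:  # retry on the rare collision
            opaque_id = "server_id_" + secrets.token_hex(3)
        versions.append((opaque_id, body))
        return f"{path}/{opaque_id}"

    def get(self, uri):
        """Bare path: latest version.  path/opaque_id: that exact version."""
        if uri in self._versions:
            return self._versions[uri][-1][1]
        path, _, opaque_id = uri.rpartition("/")
        for vid, body in self._versions.get(path, []):
            if vid == opaque_id:
                return body
        raise KeyError(uri)

    def history(self, path):
        """List the URIs of all versions of a resource, oldest first."""
        return [f"{path}/{vid}" for vid, _ in self._versions[path]]


if __name__ == "__main__":
    ns = VersionedNamespace()
    first = ns.put("foo.html", "first version")
    ns.put("foo.html", "second version")
    print(ns.get("foo.html"))  # latest version
    print(ns.get(first))       # the older version, by its own URI
```

Clients only ever see the URIs returned by put and history; they never
need to parse or construct the identifiers themselves.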

To summarize:

The URL+Content-ID scheme proposed in DRP is a suboptimal solution to the
problem of mapping resource versions into the URL space.  This solution 
prevents linking to arbitrary resource versions (no browse/bookmark 
support).  It uses GET to retrieve differences between two resource 
versions, which results in a performance loss for GET processing.

There exists a viable alternative to the use of Content-IDs and GET which
allows linking to arbitrary versions, and which does not impose a
performance loss on GET processing.  This alternative works better with
existing caches than the URL+Content-ID scheme.  It also provides a better
separation of concerns between retrieval of resources and retrieval of
differences between resources.

- Jim Whitehead <ejw@ics.uci.edu>
Received on Monday, 22 September 1997 16:49:40 UTC