- From: Jim Whitehead <ejw@ics.uci.edu>
- Date: Tue, 23 Sep 1997 01:41:15 -0700
- To: "'Arthur van Hoff'" <avh@marimba.com>
- Cc: Push Workshop <www-push@w3.org>, DRP Mailing List <drp@marimba.com>
Hi,

I've been thinking about Content-IDs recently, but in a different sense from the recent "what's the best syntax for the URIs" discussion. I've been more concerned about the overall mapping between resource versions and the URL namespace.

Two assumptions underlying the use of Content-IDs in the DRP specification are:

a) the same URL is used to identify multiple versions of the same resource (with each version further identified by a content id)

b) use of the Content-ID and Differential-ID headers with GET is the best way to retrieve a difference between two versions of the same resource.

These assumptions seem to be predicated on a third assumption:

c) use of DRP should not modify the URL namespace.

I think the URL+Content-ID method of identifying individual versions of a resource is far from the best solution for DRP (or WebDAV).

A significant drawback to using the Content-ID header to identify a particular version of a resource is that it does not provide support for browsing the old versions. If I wanted to make an HTML page (e.g., using the <A HREF=... construct) which lists a version history for a particular resource, there is no way I can use HTML to link to the individual versions of the resource using the URL+Content-ID header identification scheme.

Alternatives to the URL+Content-ID scheme do exist. One solution is to assign a separate URI to each version of a resource. Once each version has a separate URI, these versions can be linked from an HTML page. These URIs can be placed into an index file even more easily than a URI + Content-ID pair. Plus, existing HTTP/1.0 and HTTP/1.1 caches can cache these URIs with no modification to the caches.

However, using a separate URI for each version of a resource requires a different mechanism than the Differential-ID header for retrieving differences, and also affects the namespace. Both of these problems are minor, and I address each of them below.

Diff Mechanism.
Once each version has a separate URI, a DIFF method can be used to retrieve a difference between any two resources, especially the special case of two resources which are different versions of a resource.

DIFF has significant performance advantages over GET. Since the Content-ID (as defined by DRP) and Differential-ID headers are unknown to current servers, these servers will need to be modified to understand them. When these headers are part of GET processing, this requires coding extensions to GET, the most frequently used method on any Web server. No matter how you implement Content-ID and Differential-ID, it will slow down processing of GET. However, if the DIFF method is used, the slowdown to GET processing (if any) is much, much smaller than that caused by modifying GET to understand new headers, possibly requiring a call to an external library.

I can just hear the rebuttal now, "but Jim, we *really* want to retrieve differences with GET, since it works with the existing cache and server infrastructure." Which is true, but in a very limited context, since the existing server and proxy infrastructure does not understand the Content-ID or Differential-ID headers proposed for DRP.

Given that DRP-capable proxies would need to store several versions of a resource (adding new columns to their cache lookup tables), and would also possibly need to compute differences between these versions (going beyond mere Vary header support), I wonder whether supporting DRP as proposed is more difficult than supporting a new method (DIFF) and using all of the existing cache infrastructure as-is (since each version is a separate resource, and is cached separately, using its specific entity tag).

Namespace.

Returning to the issue of the effects on the URL namespace of having a separate URL for each version of a resource, let me address this concern by showing that there are simple solutions to this problem.
One possible solution to the namespace issue (there are many possible solutions to this problem) is to make each leaf node of the original namespace hierarchy tree into an internal node (i.e., to make it a collection). The versions of each resource can then be located under its respective collection.

So, for example, if I had a resource foo.html with 2 versions, each with its own Content-ID, as follows:

    foo.html; Content-ID: xyz1
    foo.html; Content-ID: 3456

This resource can be made into a collection, and the versions of foo.html can be placed under the collection:

    foo.html/server_id_h4z26 (maps to old C-ID: xyz1)
    foo.html/server_id_h67z4 (maps to old C-ID: 3456)

A GET on foo.html can be defined to return the latest version, identical to the DRP default functionality.

Note that this is only one of many possible ways a mapping between resource versions and URLs can be accomplished. Note also that the same underlying delta storage technique can be used for the original DRP proposal as for this proposal -- this is just suggesting a different mapping of the storage to the namespace. In this namespace scheme, the URIs are opaque, and completely under the control of the server.

To summarize:

The URL+Content-ID scheme proposed in DRP is a suboptimal solution to the problem of mapping resource versions into the URL space. This solution prevents linking to arbitrary resource versions (no browse/bookmark support). It uses GET to retrieve differences between two resource versions, which results in a performance loss for GET processing.

There exists a viable alternative to the use of Content-IDs and GET which allows linking to arbitrary versions, and which does not suffer the performance loss of GET. This alternative works better with existing caches than the URI+Content-ID scheme. This alternative provides a better separation of concerns between retrieval of resources and retrieval of differences between resources.

- Jim Whitehead <ejw@ics.uci.edu>
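
The collection-per-resource scheme described above can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration: the class VersionedResource, the functions diff and history_page, and the in-memory storage are hypothetical, and the message does not define what a DIFF response looks like, so a unified text diff from the standard difflib module stands in for it.

```python
import difflib
import html


class VersionedResource:
    """A leaf URL (e.g. foo.html) treated as a collection of versions.

    Hypothetical in-memory model; a real server would choose its own
    storage (possibly deltas) and its own opaque version identifiers.
    """

    def __init__(self, url):
        self.url = url
        self._versions = {}  # opaque server id -> body, oldest first

    def add_version(self, server_id, body):
        """Store a new version and return its own URL."""
        self._versions[server_id] = body
        return f"{self.url}/{server_id}"

    def version_urls(self):
        """One URL per version, oldest first."""
        return [f"{self.url}/{sid}" for sid in self._versions]

    def get(self, server_id=None):
        """GET on the collection URL returns the latest version;
        GET on a member URL returns that specific version."""
        if server_id is None:
            server_id = list(self._versions)[-1]
        return self._versions[server_id]


def diff(resource, old_id, new_id):
    """Stand-in for the proposed DIFF method: a unified text diff
    between two versions, each named by its own URL."""
    old = resource.get(old_id).splitlines(keepends=True)
    new = resource.get(new_id).splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            old,
            new,
            fromfile=f"{resource.url}/{old_id}",
            tofile=f"{resource.url}/{new_id}",
        )
    )


def history_page(resource):
    """An HTML version history: plain links, one per version URL."""
    items = "\n".join(
        f'<li><a href="{html.escape(u, quote=True)}">{html.escape(u)}</a></li>'
        for u in resource.version_urls()
    )
    return f"<ul>\n{items}\n</ul>"


if __name__ == "__main__":
    foo = VersionedResource("foo.html")
    foo.add_version("server_id_h4z26", "<p>hello</p>\n")
    foo.add_version("server_id_h67z4", "<p>hello, world</p>\n")
    print(history_page(foo))
    print(diff(foo, "server_id_h4z26", "server_id_h67z4"))
```

Because each version has its own URL in this sketch, the history page needs nothing beyond ordinary links, and the difference is computed outside the plain retrieval path, which is the separation of concerns the message argues for.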
Received on Monday, 22 September 1997 16:49:40 UTC