HTTP Extensions for a Content-Addressable Web

Hello all,

Some of you on the decentralization@yahoogroups list may have already 
seen this proposal, but I wanted to get some feedback from this group.

This technology is born out of the peer-to-peer space, specifically 
Swarmcast and Gnutella.  A nice backgrounder on Swarmcast can be found 
at http://www.openp2p.com/pub/a/p2p/2001/05/24/swarmcast_beta.html

An overview of the CAW can be found at http://onionnetworks.com/caw/ and 
I have attached a copy of "HTTP Extensions for a Content-Addressable Web"

I am open to any and all suggestions and don't feel particularly 
strongly about the header names I chose in the document.

So without further ado...rip it to shreds.


-- 
Justin Chapweske, Onion Networks
http://onionnetworks.com/
HTTP Extensions for a Content-Addressable Web

Justin Chapweske, Onion Networks (justin@onionnetworks.com)

October 24, 2001

Abstract

The goal of the Content-Addressable Web (CAW) is to enable
the creation of advanced content location and distribution
services over HTTP. The use of content addressing allows
advanced caching techniques to be employed, and sets the
foundation for creating ad hoc Content Distribution Networks
(CDNs). This document specifies HTTP extensions that bridge
the current location-based Web with the Content-Addressable
Web. 

Table of Contents

1 Introduction
    1.1 Optimal Replicas
    1.2 Untrusted Caches
    1.3 Transient Web
2 Scope
3 Content Addressing
4 HTTP Extensions
    4.1 X-Content-URN
    4.2 X-URI-RES
        4.2.1 N2R
        4.2.2 N2L and N2Ls
5 Proxies and Redirectors
6 An Example Application
7 Replica Advertisement
8 Acknowledgments



1 Introduction

Content Distribution Networks (CDNs), such as Akamai, have
shown that significant improvements can be made in throughput,
latency, and scalability when content is distributed throughout
the network and delivered from the edge. Likewise, peer-to-peer
systems such as Napster and Gnutella have shown that normal
desktop PCs can serve up enormous amounts of content with
zero administration. And more recently, systems like Swarmcast
have been introduced that combine the CDN and peer-to-peer
concepts to gain the benefits of both. The goal of the Content-Addressable
Web is to enable these advanced content location and distribution
services with standard web servers, caches, and browsers. 

There are a number of shortcomings in the current web architecture
that the Content-Addressable Web aims to overcome. These
include discovering optimal replicas, downloading from untrusted
caches, and distributing content across the Transient Web.

1.1 Optimal Replicas

There are currently no mechanisms within HTTP that allow
a user-agent to discover an optimal replica for a piece
of content. This problem is due to the fact that HTTP caching
practice assumes a hierarchical caching structure where
each user has a single parent cache. Thus while one can
discover an object's source URI from a cached copy, there
is no mechanism to discover a list of replica locations
from the source. This problem is evidenced by the fact that
users must manually select the closest mirrors when downloading
from Tucows, FilePlanet, or the various Linux distributions.
The CAW solves this problem by providing distributed URI
resolvers that user-agents can query to find an optimal
replica.

1.2 Untrusted Caches

It is currently unsafe to download web objects from an untrusted
cache or mirror because it can modify or corrupt the content
at will. This becomes particularly problematic when trying
to create public cooperative caching systems. This isn't
a problem for private CDNs, like Akamai, where all of their
servers are under Akamai's control and are assumed to be
secure. But for a public CDN, the goal is to allow user-agents
to retrieve content from completely untrusted hosts but
be assured that they are receiving the content intact. The
CAW solves this problem by using content addressing that
includes integrity checking information.

1.3 Transient Web

The Transient Web is a relatively new phenomenon that is
growing in size and importance. It is embodied by peer-to-peer
systems such as Gnutella, and is characterized by unreliable
hosts with rapidly changing locations and content. These
characteristics make location-based addresses within the
Transient Web quite brittle. Even if traditional HTTP caching
were widely used within the Transient Web, the situation
would not improve much. This is because a single piece
of content will often be available under many different
URIs, which creates disjoint and inefficient caching hierarchies. 

This multiplicity of URIs occurs for a couple of reasons:

* The original source for a piece of content will often cease
  to exist or the source's URI will change.

* Multiple independent sources often introduce the same content
  into the network.

* Most applications and file manipulation tools will tend
  to "forget" the source URI of a piece
  of content.

This URI multiplicity can also occur in the normal web, although
it is RECOMMENDED that caching semantics be used when an
authoritative source is known. The CAW solves the above
problems by providing content-specific URIs that are location-independent
and can be independently generated by any host. Additionally,
various URI resolution services work in coordination to
resolve issues associated with having multiple URIs for
a web object.

2 Scope

The HTTP extensions for CAW are intended to be used in
the above scenarios where HTTP is currently lacking. This
technology is focused on mostly static content that can
benefit from advanced content distribution services. The
extensions are intended to be hidden under the hood of web
servers, caches, and browsers and should change nothing
as far as end users are concerned. So even though a new
URN scheme is introduced, there are very few situations
where a human will ever interact with those URNs. 

One of the more interesting applications of the Content-Addressable
Web is the creation of ad hoc Content Distribution Networks.
In such networks, receivers can crawl across the network,
searching for optimal replicas, and then downloading content
from multiple replicas in parallel. After a host has downloaded
the content, it then advertises itself as a replica, automatically
becoming a part of the CDN. 

3 Content Addressing

This specification introduces a URI scheme with many interesting
capabilities for solving the problems discussed earlier.
A particularly useful class of URI schemes is "Self-Verifiable
URIs". These are URIs where the URI itself can
be used to verify that the content has been received intact.
We also want URIs that are content-specific and can be independently
generated by any host with the content. Finally, to show
the intent that these addresses are location-independent,
a URN scheme will be used. 

Cryptographic hashes of the content provide the capabilities
that we are looking for. For example we can take the SHA-1
hash of a piece of content and then encode it using Base32
to provide the following URN:

urn:sha1:RMUVHIRSGUU3VU7FJWRAKW3YWG2S2RFB

* Implementations MUST support SHA-1 URNs at minimum. (A future
  version of this document will also specify a URN format for
  performing streaming and random-access verification using
  Merkle Hash Trees.)

* Receivers MUST verify self-verifiable URIs if any part
  of the content is retrieved from a potentially untrusted
  source.
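
As a rough illustration of the two requirements above, here is a
minimal Python sketch that builds a urn:sha1 URN from raw content
and checks received content against one. The function names
sha1_urn and verify are hypothetical and not part of this
specification; the sketch assumes the whole object fits in memory.

```python
import base64
import hashlib


def sha1_urn(content: bytes) -> str:
    """Return a urn:sha1 URN: the Base32-encoded SHA-1 digest of the content."""
    digest = hashlib.sha1(content).digest()  # 20 raw bytes
    # 20 bytes = 160 bits = exactly 32 Base32 characters, so no '=' padding appears.
    return "urn:sha1:" + base64.b32encode(digest).decode("ascii")


def verify(content: bytes, urn: str) -> bool:
    """Check received content against a self-verifiable urn:sha1 URN."""
    scheme, _, encoded = urn.rpartition(":")
    if scheme.lower() != "urn:sha1":
        raise ValueError("unsupported URN scheme: " + urn)
    # Base32 is compared case-insensitively by normalizing to upper case.
    return sha1_urn(content) == "urn:sha1:" + encoded.upper()
```

A receiver that fetched bytes from an untrusted mirror would call
verify(received_bytes, urn) and discard the data if it returns
False.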

4 HTTP Extensions

In order to provide a bridge between the location-based Web
and the Content-Addressable Web, a few HTTP extensions must
be introduced. The nature of these extensions is that they
need not be widely deployed in order to be useful. They
are specifically designed to allow for proxying for hosts
that are not CAW-aware.

* The following HTTP extensions are based on the conventions
  defined in RFC 2169. It is RECOMMENDED that implementers
  of this specification also implement RFC 2169.

* The HTTP headers defined in this specification are all
  response headers. No additional request headers are specified
  by this document.

* It is RECOMMENDED that implementers of this specification
  use an HTTP/1.1 implementation compliant with RFC 2616.

This specification uses the "X-"
header prefix convention to denote that these are not W3C/IETF
standard headers. If and when this specification becomes
a standard, the prefix will either be simply removed or
replaced with an appropriate header extension mechanism.

4.1 X-Content-URN

The X-Content-URN entity-header field provides one or more
URNs that uniquely identify the entity-body. The URN is
based on the content of the entity-body and any content-coding
that has been applied, but not including any transfer-encoding
applied to the message-body. For example:

X-Content-URN: urn:sha1:RMUVHIRSGUU3VU7FJWRAKW3YWG2S2RFB
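
The point about content-coding can be easy to miss, so here is a
small Python sketch of it. It assumes a server that may send the
same resource with "Content-Encoding: gzip"; the helper name
x_content_urn and the sample data are made up for illustration.
The hash is taken over the entity-body bytes as coded, so the
gzip-coded and uncoded variants get different URNs.

```python
import base64
import gzip
import hashlib


def x_content_urn(entity_body: bytes) -> str:
    """Build an X-Content-URN header line for the entity-body as sent."""
    digest = hashlib.sha1(entity_body).digest()
    return "X-Content-URN: urn:sha1:" + base64.b32encode(digest).decode("ascii")


original = b"hello, content-addressable web\n" * 100

# With "Content-Encoding: gzip", the entity-body is the compressed bytes,
# so the URN covers those bytes rather than the uncompressed original.
gzip_entity = gzip.compress(original)

identity_header = x_content_urn(original)
gzip_header = x_content_urn(gzip_entity)

# Any transfer-encoding (e.g. chunked) is removed before hashing, so it
# never influences either value.
```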

4.2 X-URI-RES

The X-URI-RES header is based on conventions defined
in RFC 2169 and provides a number of flexible URI resolution
services. These headers provide various ways of locating
other content replicas, including additional sources for
a multiple-source download. One can also build an application
that crawls across the resolution services searching for
an optimal replica. Many other uses can be imagined beyond
those given in this specification. The general form of the
header is as follows:

X-URI-RES: <service uri> ; <service type> [; target uri]

* The service URI specifies the URI of the resolution service.
  It is not necessary for the service URI to conform to the
  "/uri-res/<service>?<uri>"
  convention specified in RFC 2169.

* The service type identifies what type of resolution is
  being performed and how to interpret the results from
  the service URI. The types are those defined in RFC 2169
  and include "N2L", "N2Ls",
  "N2R", "N2Rs", "N2C",
  "N2Cs", "N2Ns", "L2Ns",
  "L2Ls", and "L2C". 

* The target URI is the URI upon which the resolution service
  will be performed. The target URI can be any URI and is
  specifically not limited to the URI specified by the X-Content-URN
  header. If there is only a single X-Content-URN value,
  the target URI can be left off to imply that the X-Content-URN
  value is to be resolved.

* It is RECOMMENDED that receivers assume that the URI resolver
  services are potentially untrusted and verify all
  content retrieved using a resolver's services. 
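
To make the general form concrete, the following Python sketch
parses one X-URI-RES value into its three parts and applies the
default-target rule described above. The names UriRes and
parse_uri_res are hypothetical. The sketch assumes the service URI
itself contains no ";" character, which this document does not
guarantee.

```python
from typing import NamedTuple, Optional, Sequence

# Service types listed in this section (taken from RFC 2169).
SERVICE_TYPES = {"N2L", "N2Ls", "N2R", "N2Rs", "N2C",
                 "N2Cs", "N2Ns", "L2Ns", "L2Ls", "L2C"}


class UriRes(NamedTuple):
    service_uri: str
    service_type: str
    target_uri: Optional[str]


def parse_uri_res(value: str, content_urns: Sequence[str]) -> UriRes:
    """Parse '<service uri> ; <service type> [; target uri]'.

    Assumption: the service URI contains no ';' of its own.
    """
    parts = [p.strip() for p in value.split(";")]
    if len(parts) not in (2, 3) or not parts[0]:
        raise ValueError("malformed X-URI-RES value: " + value)
    service_uri, service_type = parts[0], parts[1]
    if service_type not in SERVICE_TYPES:
        raise ValueError("unknown service type: " + service_type)
    target = parts[2] if len(parts) == 3 and parts[2] else None
    # With exactly one X-Content-URN value, an omitted target implies that URN.
    if target is None and len(content_urns) == 1:
        target = content_urns[0]
    return UriRes(service_uri, service_type, target)
```

For example, parse_uri_res("http://untrustedmirror.com/pub/file.zip; N2R",
[urn]) yields the mirror URL, the type "N2R", and urn as the target.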

It is believed that N2R, N2L, and N2Ls will be the most useful
services for the Content-Addressable Web, so we will cover
examples of those explicitly.

4.2.1 N2R

The N2R URIs directly specify mirrors for the content addressed
by the URN and can be useful for multi-source downloads.
For example: 

X-URI-RES: http://urnresolver.com/uri-res/N2R?urn:sha1:<base32>; N2R

or

X-URI-RES: http://untrustedmirror.com/pub/file.zip; N2R

The key difference between these headers and something like
the Location header is that the URIs specified by this header
should be assumed to be untrusted.

4.2.2 N2L and N2Ls

These headers are used when other hosts provide URLs where
the content is mirrored. This is most useful in ad hoc CDNs
where mirrors may maintain lists of other mirrors. Browsers
can simply crawl across the networks, recursively dereferencing
N2L(s). For example:

X-URI-RES: http://urnresolver.com/uri-res/N2L?urn:sha1:<base32>; N2L

and

X-URI-RES: http://untrustedmirror.com/pub/file-mirrors.list; N2Ls; urn:sha1:<base32>

For the N2Ls service, it is RECOMMENDED that the result conform
to the text/uri-list media type specified in RFC 2169.
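
A text/uri-list body is one URI per line, with lines starting
with "#" treated as comments. A minimal Python sketch of reading
such an N2Ls result follows; the function name parse_uri_list is
hypothetical, and every returned URL should still be treated as an
untrusted mirror.

```python
from typing import List


def parse_uri_list(body: str) -> List[str]:
    """Extract URIs from a text/uri-list body, skipping comments and blanks."""
    uris = []
    for line in body.splitlines():  # handles both CRLF and bare LF endings
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        uris.append(line)
    return uris
```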

5 Proxies and Redirectors

It is useful to allow CAW-aware proxies that provide content-addressing
information without modifying the original web server. This
allows CAW-aware user-agents to take advantage of the headers,
while simply redirecting user-agents that don't understand
the Content-Addressable Web. It would be inappropriate to
return an X-Content-URN header during a redirect, because
HTTP 3xx responses often still include a message-body that
explains that a redirect is taking place. Instead it is
preferred to return a result of the text/uri-list media
type that includes one or more URNs that would normally
reside in the X-Content-URN header.

6 An Example Application

The above HTTP extensions are deceptively simple and it may
not be readily apparent how powerful they are. We will discuss
an example application that will take advantage of a few
of the features provided by the extensions. 

In this example we will look at how the CAW could help
at linuxiso.org where ISO CD-ROM images of the various Linux
distributions are kept. The first step will be to issue
a GET request for the content:

GET /pub/Redhat-7.1-i386-disc1.iso HTTP/1.1
Host: www.linuxiso.org 


The abbreviated response:

HTTP/1.1 200 OK
Content-Type: Application/octet-stream
Content-Length: 662072345
X-Content-URN: urn:sha1:RMUVHIRSGUU3VU7FJWRAKW3YWG2S2RFB
X-URI-RES: http://www.linuxmirrors.com/pub/Redhat-7.1i386-disc1.iso; N2R
X-URI-RES: http://123.24.24.21:8080/uri-res/N2R?urn:sha1:<base32>; N2R
X-URI-RES: http://123.24.24.21:8080/uri-res/N2Ls?urn:sha1:<base32>; N2Ls


With this response, a CAW-aware browser can immediately begin
downloading the content from www.linuxiso.org, linuxmirrors.com,
and 123.24.24.21 all in parallel. At the same time the browser
can be dereferencing the N2Ls service at 123.24.24.21 to
discover more mirrors for the content.

The 123.24.24.21 host is meant to represent
a member of an ad hoc CDN, perhaps the personal computer
of a Linux advocate who just downloaded the ISO and wants
to share their bandwidth with others. By dereferencing the
N2Ls, even more ad hoc nodes could be discovered.
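
This document does not say how a browser should divide the work
among the sources. One possible approach, sketched below in
Python, is to give each source a contiguous byte range and request
it with the standard HTTP/1.1 Range header. The function name
plan_ranges is hypothetical, and the sketch assumes every source
honors range requests. After reassembly, the complete content
would still need to be verified against the X-Content-URN value.

```python
from typing import List, Sequence, Tuple


def plan_ranges(content_length: int, sources: Sequence[str]) -> List[Tuple[str, str]]:
    """Assign each source one contiguous byte range covering the whole content.

    Returns (source, Range header value) pairs.
    """
    if content_length <= 0 or not sources:
        raise ValueError("need a positive length and at least one source")
    n = min(len(sources), content_length)  # never hand out empty ranges
    base, extra = divmod(content_length, n)
    plan = []
    start = 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        end = start + size - 1                 # Range bounds are inclusive
        plan.append((sources[i], "bytes=%d-%d" % (start, end)))
        start = end + 1
    return plan
```

With the Content-Length of 662072345 from the response above and
the three sources it names, each source would be asked for roughly
220 MB.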

7 Replica Advertisement

The URI-RES framework provides a significant amount of flexibility
in how replica advertisement and discovery can be implemented.
One example implementation will be provided in a future
specification.

8 Acknowledgments

Gordon Mohr (gojomo@bitzi.com), Tony Kimball (alk@pobox.com),
Mark Baker (distobj@acm.org)

Received on Thursday, 29 November 2001 01:54:45 UTC