Re: Forward proxies and CDN/mirrors

On 19/05/2012, at 5:52 PM, Jack Bates wrote:

> Hello, I am curious to know the current thinking on HTTP forward proxies and content distribution networks, or download mirrors. What techniques are used to help forward proxies and content distribution networks play well together? What facilities are available in the HTTP protocol for this? What resources are available from the broader community of standards and best practices?

There was a lot of activity around this about ten years ago, both before and after CDNs came around. Look for papers from the late '90s / early '00s about hierarchical caching, content distribution, replication, cache peering, etc. A good place to start: <http://web.cs.wpi.edu/~webbib/webbib-date.html>.


> The approach that I am currently pursuing is to use RFC 6249, Metalink/HTTP: Mirrors and Hashes. For those content distribution networks that support it, our forward proxy listens for responses that are an HTTP redirect and have "Link: <...>; rel=duplicate" headers. If the URL in the "Location: ..." header is not already cached, then we scan the "Link: <...>; rel=duplicate" headers for a URL that is already cached and, if one is found, we rewrite the "Location: ..." header with this URL.

Hmm. Security is the first thing that comes to mind. That seems to assume that the client will check the hashes to make sure the response actually is the thing it requested. Do you do anything to verify the stored copy from the other server?

Really, I'd want to see the combination of both the Link header and the digest before I did any rewriting, offhand.
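
A minimal sketch of that stricter check, in Python and for illustration only: the Location header is rewritten only when a "rel=duplicate" URL is already cached and a digest from the redirect's Digest header matches a digest recorded for the cached copy. Everything named here is an assumption rather than part of the plugin under discussion: the `cache` mapping (cached URL to digests the proxy computed itself over the stored bytes), the header dictionary layout, the set of redirect status codes, and the restriction to SHA-256/SHA-512. The Link parser is simplified and does not handle commas or semicolons inside quoted parameter values.

```python
import re

# Matches one link-value, e.g. <http://example.org/f.iso>; rel=duplicate; pri=1
# Simplified: assumes no ',' or ';' inside quoted parameter values.
LINK_RE = re.compile(r'<([^>]*)>\s*((?:;\s*[^;,]+)*)')

# Assumption: only these algorithms are trusted for matching.
STRONG_ALGORITHMS = {'sha-256', 'sha-512'}

# Assumption: only these redirect statuses are considered.
REDIRECT_STATUSES = (301, 302, 303, 307)


def duplicate_urls(link_headers):
    """Return URLs of all rel=duplicate links in a list of Link header values."""
    urls = []
    for header in link_headers:
        for url, params in LINK_RE.findall(header):
            for param in params.split(';'):
                name, _, value = param.strip().partition('=')
                rels = value.strip('"').lower().split()
                if name.lower() == 'rel' and 'duplicate' in rels:
                    urls.append(url)
    return urls


def parse_digest(digest_header):
    """Parse a Digest header value into {lowercased algorithm: encoded value}."""
    digests = {}
    for item in digest_header.split(','):
        # Split on the first '=' only; base64 values may end in '=' padding.
        algorithm, sep, value = item.strip().partition('=')
        if sep:
            digests[algorithm.lower()] = value
    return digests


def rewrite_location(status, headers, cache):
    """Return the Location value to send on to the client.

    headers: {'Location': str, 'Link': [str, ...], 'Digest': str} (hypothetical layout)
    cache:   {cached_url: {algorithm: digest}} (hypothetical; digests computed
             by the proxy over the bytes it actually stored)
    """
    location = headers.get('Location')
    if status not in REDIRECT_STATUSES or location is None:
        return location
    if location in cache:
        return location  # the redirect target is already cached
    advertised = parse_digest(headers.get('Digest', ''))
    for url in duplicate_urls(headers.get('Link', [])):
        stored = cache.get(url)
        if stored is None:
            continue  # this mirror is not cached
        # Require BOTH the rel=duplicate link and a matching strong digest.
        for algorithm, value in advertised.items():
            if algorithm in STRONG_ALGORITHMS and stored.get(algorithm) == value:
                return url
    return location  # nothing verifiable found; leave the redirect alone
```

With a check like this, a redirect that carries "Link: <...>; rel=duplicate" headers but no Digest header, or a digest that does not match the stored copy, passes through unchanged.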


> We are also thinking of using RFC 3230, Instance Digests in HTTP. Our proxy would listen for HTTP redirect responses that had "Digest: ..." headers. If the URL in the "Location: ..." header were not already cached, then we would check whether other content with the same digest were already cached. If so, we would rewrite the "Location: ..." header with the corresponding URL.

To be clear, you're not talking about implementing its companion, RFC 3229 (Delta Encoding), correct?

Assuming that to be the case, I think it's not a great idea to do this based solely on the content of the Digest header; again for security reasons, but also because you're creating a second-layer identifier for the Web, which is something we should do carefully.

Also, it's not clear offhand what Digest means on a 3xx response; at best it refers to an "anonymous" representation, rather than the thing at the other end of the Location header.


> The issue of forward proxies and content distribution networks is important to us because we run a caching proxy here at a rural village in Rwanda. Many web sites that distribute files present users with a simple download button that redirects to a download mirror, but they do not predictably redirect to the same mirror, or to a mirror that we have already cached. As a result, users can't predict whether a download will take seconds or hours, which is frustrating.

Great use case - thanks for sharing that.


> Here is a proof-of-concept plugin [1] for the Apache Traffic Server open source caching proxy. It works just enough that, given a response with a "Location: ..." header whose URL is not already cached and a "Link: <...>; rel=duplicate" header whose URL is already cached, it will replace the URL in the "Location: ..." header with the cached URL.
> 
> I am working on this as part of the Google Summer of Code
> 
> [1] https://github.com/jablko/dedup

Fantastic!


--
Mark Nottingham   http://www.mnot.net/

Received on Monday, 21 May 2012 07:14:19 UTC