Forward proxies and CDN/mirrors

Hello, I am curious to know the current thinking on HTTP forward proxies 
and content distribution networks, or download mirrors. What techniques 
are used to help forward proxies and content distribution networks play 
well together? What facilities are available in the HTTP protocol for 
this? What resources are available from the broader community of 
standards and best practices?

The approach that I am currently pursuing is to use RFC 6249, 
Metalink/HTTP: Mirrors and Hashes. For those content distribution 
networks that support it, our forward proxy watches for HTTP redirect 
responses that carry "Link: <...>; rel=duplicate" headers. If the URL in 
the "Location: ..." header is not already cached, we scan the 
"Link: <...>; rel=duplicate" headers for a URL that is already cached 
and, if one is found, rewrite the "Location: ..." header with that URL.
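
To make the rewrite step concrete, here is a rough sketch in Python. It 
is not the plugin's code, only an illustration of the logic described 
above. The function names, the list of (name, value) header pairs, the 
is_cached callback and the set of redirect status codes are all 
assumptions of mine, and the Link parsing is simplified (it does not 
handle commas or semicolons inside quoted parameter values).

```python
import re

# One link-value: "<URL>" followed by zero or more ";param" items.
# Simplification: quoted values containing "," or ";" are not handled.
LINK_VALUE = re.compile(r'<([^>]*)>((?:\s*;\s*[^;,]+)*)')

# Assumed set of status codes treated as redirects.
REDIRECT_STATUSES = {301, 302, 303, 307}


def duplicate_urls(headers):
    """Yield URLs from Link headers whose rel includes "duplicate"."""
    for name, value in headers:
        if name.lower() != "link":
            continue
        for url, params in LINK_VALUE.findall(value):
            for param in params.split(";"):
                key, _, val = param.strip().partition("=")
                rels = val.strip('"').lower().split()
                if key.lower() == "rel" and "duplicate" in rels:
                    yield url


def rewrite_location(status, headers, is_cached):
    """Return headers, with Location pointed at a cached duplicate if any.

    is_cached is a hypothetical callback: URL -> bool.
    """
    if status not in REDIRECT_STATUSES:
        return headers
    location = next((v for n, v in headers if n.lower() == "location"), None)
    if location is None or is_cached(location):
        return headers  # nothing to do: no redirect target, or already cached
    for url in duplicate_urls(headers):
        if is_cached(url):
            return [(n, url if n.lower() == "location" else v)
                    for n, v in headers]
    return headers  # no cached duplicate: leave the response untouched


if __name__ == "__main__":
    cache = {"http://mirror-b.example/file.iso"}
    response_headers = [
        ("Location", "http://mirror-a.example/file.iso"),
        ("Link", "<http://mirror-b.example/file.iso>; rel=duplicate"),
    ]
    print(rewrite_location(302, response_headers, cache.__contains__))
```

If no duplicate is cached, the response is passed through unchanged, so 
the client still follows the redirect that the origin chose.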

I would be very grateful for any feedback on this approach. What are the 
problems with this strategy? What are the alternatives? How does it 
relate to the letter or spirit of web architecture?

We are also thinking of using RFC 3230, Instance Digests in HTTP. Our 
proxy would listen for HTTP redirect responses that had "Digest: ..." 
headers. If the URL in the "Location: ..." header were not already 
cached then we would check if other content with the same digest were 
already cached. If so, we would rewrite the "Location: ..." header 
with the corresponding URL.
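
The digest variant could be sketched in Python along the same lines. 
Again this is only an illustration under assumptions of mine: the 
caller has already established that the response is a redirect, 
url_by_digest is a hypothetical index that the cache maintains from 
(algorithm, encoded digest) pairs to a cached URL, and algorithm names 
are compared case-insensitively.

```python
def parse_digest(value):
    """Split a Digest header value into (algorithm, encoded digest) pairs."""
    pairs = []
    for item in value.split(","):
        # Split on the first "=" only, so base64 "=" padding is preserved.
        algorithm, sep, encoded = item.strip().partition("=")
        if sep and encoded:
            pairs.append((algorithm.lower(), encoded))
    return pairs


def rewrite_location_by_digest(headers, is_cached, url_by_digest):
    """Return headers, with Location pointed at cached content that has
    the same digest, if there is any.

    is_cached:     hypothetical callback, URL -> bool
    url_by_digest: hypothetical dict, (algorithm, digest) -> cached URL
    """
    location = next((v for n, v in headers if n.lower() == "location"), None)
    if location is None or is_cached(location):
        return headers
    for name, value in headers:
        if name.lower() != "digest":
            continue
        for key in parse_digest(value):
            cached_url = url_by_digest.get(key)
            if cached_url is not None:
                return [(n, cached_url if n.lower() == "location" else v)
                        for n, v in headers]
    return headers  # no cached content with a matching digest


if __name__ == "__main__":
    index = {("sha-256", "abc123="): "http://mirror-b.example/file.iso"}
    response_headers = [
        ("Location", "http://mirror-a.example/file.iso"),
        ("Digest", "SHA-256=abc123="),
    ]
    print(rewrite_location_by_digest(response_headers, lambda u: False, index))
```

Unlike the "Link: <...>; rel=duplicate" case, this does not need the 
origin to list its mirrors, only to state the digest of the content.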

The issue of forward proxies and content distribution networks is 
important to us because we run a caching proxy here in a rural village 
in Rwanda. Many web sites that distribute files present users with a 
simple download button that redirects to a download mirror, but they do 
not predictably redirect to the same mirror, or to a mirror that we have 
already cached. As a result, users can't predict whether a download will 
take seconds or hours, which is frustrating.

Here is a proof-of-concept plugin [1] for the Apache Traffic Server open 
source caching proxy. So far it handles only the basic case: given a 
response whose "Location: ..." URL is not already cached and which has a 
"Link: <...>; rel=duplicate" header whose URL is already cached, it 
replaces the URL in the "Location: ..." header with the cached URL.

I am working on this as part of the Google Summer of Code.

[1] https://github.com/jablko/dedup

Received on Saturday, 19 May 2012 07:49:12 UTC