Re: Forward proxies and CDN/mirrors from Jack Bates on 2012-06-14 (ietf-http-wg@w3.org from April to June 2012)

From: Jack Bates <jzej8k@nottheoilrig.com>
Date: Thu, 14 Jun 2012 00:16:50 -0700
To: Mark Nottingham <mnot@mnot.net>
CC: HTTP Working Group <ietf-http-wg@w3.org>, Anthony Bryan <anthonybryan@gmail.com>, Leif Hedstrom <zwoop@apache.org>
Message-ID: <4FD98FE2.8070602@nottheoilrig.com>
Thank you very much for your time and for your detailed response to this 
question, Mark. Thank you for being so helpful.

On 21/05/12 12:13 AM, Mark Nottingham wrote:
>
> On 19/05/2012, at 5:52 PM, Jack Bates wrote:
>
>> Hello, I am curious to know the current thinking on HTTP forward proxies and content distribution networks, or download mirrors. What techniques are used to help forward proxies and content distribution networks play well together? What facilities are available in the HTTP protocol for this? What resources are available from the broader community of standards and best practices?
>
> There was a lot of activity around this about ten years ago, both before and after CDNs came around. Look for papers in the late 90's / early '00s about hierarchical caching, content distribution, replication, cache peering, etc. Good place to start: <http://web.cs.wpi.edu/~webbib/webbib-date.html>.

Thanks for this advice and for this comprehensive list of references


>> The approach that I am currently pursuing is to use RFC 6249, Metalink/HTTP: Mirrors and Hashes. For those content distribution networks that support it, our forward proxy listens for responses that are an HTTP redirect and have "Link:<...>; rel=duplicate" headers. If the URL in the "Location: ..." header is not already cached then we scan "Link:<...>; rel=duplicate" headers for a URL that is already cached and if found, we rewrite the "Location: ..." header with this URL
>
> Hmm. Security is the first thing that comes to mind. That seems to be assuming that the client will be checking the hashes to make sure it actually is the thing they requested. Do you do anything to affect the stored copy from the other server?
>
> Really, I'd want to see the combination of both the Link header and the digest before I did any rewriting, offhand.

"I'd want to see the combination of both the Link header and the digest 
before I did any rewriting, offhand." I think you mean that the proxy 
should check both that the URL from the Link header already exists in 
the cache and that the digest of the cached content matches the Digest 
header, before rewriting the Location header with this URL. Good point, 
I will add a check that the digest matches.
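
A minimal Python sketch of the check described above, not the plugin's actual code. It assumes an in-memory cache that maps URLs to body bytes, one link per Link header value, and a SHA-256 Digest value encoded in base64; the names rewrite_location, parse_duplicates and digest_matches are hypothetical.

```python
import base64
import hashlib
import re


def parse_duplicates(link_values):
    """Return the URLs of Link header values that carry rel=duplicate.

    Assumption: each value holds a single link, e.g.
    '<http://mirror.example/file>; rel=duplicate; pri=1'.
    """
    urls = []
    for value in link_values:
        match = re.match(r'\s*<([^>]*)>(.*)', value)
        if not match:
            continue
        params = match.group(2)
        if re.search(r';\s*rel="?duplicate"?(?=\s*(;|$))', params, re.IGNORECASE):
            urls.append(match.group(1))
    return urls


def digest_matches(body, digest_value):
    """True if the Digest header value lists the SHA-256 digest of body."""
    expected = base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")
    for item in digest_value.split(","):
        # Split on the first "=" only, because base64 padding also uses "=".
        algorithm, _, value = item.strip().partition("=")
        if algorithm.lower() == "sha-256" and value == expected:
            return True
    return False


def rewrite_location(location, link_values, digest_value, cache):
    """Return the URL to put in the Location header.

    The original Location is kept unless a rel=duplicate URL is already
    cached AND the cached body matches the Digest header.
    """
    if location in cache or not digest_value:
        return location
    for url in parse_duplicates(link_values):
        body = cache.get(url)
        if body is not None and digest_matches(body, digest_value):
            return url
    return location
```

With no Digest header, or with a digest that does not match the cached body, the sketch leaves the Location header untouched.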

"Do you do anything to affect the stored copy from the other server?" I 
think you mean, does the proxy in any way alter the cached copy of the 
content from the URL in the Link header. It does not, and I guess this 
is important because if it did, it would interfere with the client 
checking the hashes?


>> We are also thinking of using RFC 3230, Instance Digests in HTTP. Our proxy would listen for HTTP redirect responses that had "Digest: ..." headers. If the URL in the "Location: ..." header were not already cached then we would check if other content with the same digest were already cached. If so then we would rewrite the "Location: ..." header with the corresponding URL
>
> To be clear, you're not talking about implementing its companion, RFC3229 (Delta Encoding), correct?

Correct

> Assuming that to be the case, I think it's not a great idea to do this solely on the content of the Digest header; again for security issues, but also because you're creating a second-layer identifier for the Web, which is something we should do carefully.
>
> Also, it's not clear offhand what Digest means on a 3xx response; at best it refers to an "anonymous" representation, rather than the thing at the other end of the Location header.

I think I understand what you mean by second-layer identifier. The 
primary identifier of the web is the URI, so looking up content by 
digest adds a second layer. But I'm not familiar with the implications of 
a second-layer identifier. To my novice understanding, looking up cached 
content by digest seems innocuous and helpful.
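
To make "looking up cached content by digest" concrete, here is a minimal Python sketch, under the assumption that the cache keeps a secondary index from digest to URL next to its normal URL index. The class DigestIndexedCache and its methods are hypothetical and do not reflect how Apache Traffic Server stores objects.

```python
import base64
import hashlib


def sha256_digest(body):
    """Format a body's SHA-256 digest as a Digest header value (base64)."""
    return "SHA-256=" + base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")


class DigestIndexedCache:
    """Toy cache with a primary index by URL and a secondary index by digest."""

    def __init__(self):
        self._body_by_url = {}
        self._url_by_digest = {}

    def store(self, url, body):
        self._body_by_url[url] = body
        # Keep the first URL seen for a given digest.
        self._url_by_digest.setdefault(sha256_digest(body), url)

    def has_url(self, url):
        return url in self._body_by_url

    def url_for_digest(self, digest_value):
        """Return a cached URL whose body has this digest, or None."""
        return self._url_by_digest.get(digest_value.strip())


def rewrite_by_digest(location, digest_value, cache):
    """Pick a cached URL with the same digest when Location is not cached."""
    if cache.has_url(location) or not digest_value:
        return location
    return cache.url_for_digest(digest_value) or location
```

The secondary index is exactly the second-layer identifier in question: the lookup trusts that the Digest header on the redirect describes the content at the Location URL, which is the point that is unclear for 3xx responses.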

I also follow when you say that it's not clear what Digest means on a 
3xx response: it may not describe the thing at the other end of the 
Location header. But I don't understand well what an "anonymous" 
representation is.

I will continue reading to better understand second-layer identifiers and 
"anonymous" representations. Any pointers are welcome.


>> The issue of forward proxies and content distribution networks is important to us because we run a caching proxy here at a rural village in Rwanda. Many web sites that distribute files present users with a simple download button that redirects to a download mirror, but they do not predictably redirect to the same mirror, or to a mirror that we already cached, so users can't predict whether a download will take seconds or hours, which is frustrating
>
> Great use case - thanks for sharing that.
>
>
>> Here is a proof of concept plugin [1] for the Apache Traffic Server open source caching proxy. It works just enough that given a response with a "Location: ..." header that is not already cached and a "Link:<...>; rel=duplicate" header that is already cached, it will replace the URL in the "Location: ..." header with the cached URL
>>
>> I am working on this as part of the Google Summer of Code
>>
>> [1] https://github.com/jablko/dedup
>
> Fantastic!
>
>
> --
> Mark Nottingham   http://www.mnot.net/
>
Received on Thursday, 14 June 2012 07:13:09 UTC