Re: improved caching in HTTP: new draft from Chris Drechsler on 2014-05-23 (ietf-http-wg@w3.org from April to June 2014)

From: Chris Drechsler <chris.drechsler@etit.tu-chemnitz.de>
Date: Fri, 23 May 2014 10:06:04 +0200
To: Guille -bisho- <bishillo@gmail.com>
CC: ietf-http-wg@w3.org
Message-ID: <537F016C.3030407@etit.tu-chemnitz.de>

Hi Guille,

thank you for reading my draft and giving me feedback - this is very 
helpful! My answers are below:

On 21.05.2014 18:10, Guille -bisho- wrote:
> My 2 cents after reading the draft:
>
> Etag and If-None-Match already give a conditional get feature with
> hash that does not need a reset of the connection.

You are right, ETag + If-None-Match is an option, but there are some
drawbacks:

1) The ETag is only consistent within one domain. The SHA-256 hash value
in the Cache-NT header identifies the transferred representation
independently of the URLs used (and therefore across domains).

2) Caching via ETag + If-None-Match in [Part6] can only be used in
combination with the URL. If content providers use varying URLs for one
specific resource (e.g. due to load balancing/CDNs or session IDs in the
query string), then the cache system stores several copies of the same
resource.
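
To illustrate point 2, here is a minimal sketch (not taken from the
draft) that compares a cache keyed by URL with one keyed by the SHA-256
of the body. The function names and the example URLs are made up for
illustration only.

```python
import hashlib


def store_by_url(responses):
    """Classic behaviour: one cache entry per distinct URL."""
    return {url: body for url, body in responses}


def store_by_hash(responses):
    """Hash-keyed behaviour: one cache entry per distinct body."""
    return {hashlib.sha256(body).hexdigest(): body for _, body in responses}


# Hypothetical example: the same bytes served under two different URLs.
responses = [
    ("http://cdn1.example.com/video.mp4?session=abc", b"same bytes"),
    ("http://cdn2.example.net/video.mp4?session=xyz", b"same bytes"),
]

url_cache = store_by_url(responses)    # two entries, identical content
hash_cache = store_by_hash(responses)  # a single entry
```

The URL-keyed dictionary ends up holding the same bytes twice, while the
hash-keyed one holds them once, regardless of host name or query string.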

> Your proposal adds very little caching for bad practices. Nobody
> should be using two urls for the same resource, if you need load
> balancing between two cdns, you should be consistent and js1 always be
> requested from site1 and js2 from site2. And this is being improved by
> HTTP2 that will eliminate the need for sharding among domains to
> overcome limits of parallel requests and head-of-line blocking.

I agree with you: nobody should be using two URLs for the same resource.
But reality looks different:

1) Due to load balancing and the use of CDNs, one specific resource is
available via different URLs. Larger ISPs in particular, which connect
to several Internet exchange points and/or have several transit links,
are redirected to different server locations for the same resource. This
can change within minutes due to changing BGP routes and/or the
load-balancing mechanisms of the content producer/CDN provider.

2) URLs can change for another reason: changing parameters in the query
string. For example, if content producers use personalization via session
IDs or implement access mechanisms via parameters in the query string,
then the cache system would store several copies of the same content.
In this use case the content producer usually disables caching
(e.g. via the Cache-Control header).

With the caching mechanism proposed in my draft, all headers of the
request and response messages are still exchanged, so all information,
such as parameters in the query string, is exchanged as well. There is
no need to disable caching. The SHA-256 hash value identifies resources
independently of the URL used, so varying URLs don't matter.

> The Cache-NT header can only be applied  within a domain, and even
> there is risky. A malicious user could inject malicious content with a
> Cache-NT header that matches other resource to poison the cache. Even
> if intermediate caches check the hash, there is still pre-image
> attacks, won't be hard to find a collision and append malicious code
> to a js file.

I don't see how the cache can be poisoned. Can you please explain it in 
more detail?

I see the following: SHA-256, which the draft uses, has strong collision
resistance, so it's practically impossible to find two different inputs
that result in the same hash value. When the cache system receives a
response with a specific hash value in the Cache-NT header for the first
time, it computes the SHA-256 value of the received representation in
the body. If both hash values are equal, the cache system stores a copy
of the representation and uses it for subsequent requests. If they are
not equal, nothing is stored (but the response is still forwarded to the
client). So the cache stores and uses only validated content.
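
The validation step described above can be sketched roughly as follows.
This is only an illustration, not code from the draft: the class and
method names are hypothetical, and I assume here that the hash announced
in the Cache-NT header is available as a hex string (the actual header
syntax is defined in the draft and is not reproduced here).

```python
import hashlib


class HashCache:
    """Toy cache keyed by the SHA-256 of the representation, not by URL."""

    def __init__(self):
        self._store = {}  # hex digest -> body bytes

    def on_response(self, announced_hash, body):
        """Handle a full response whose header announced `announced_hash`.

        The body is always forwarded to the client. It is stored only
        if its own SHA-256 digest equals the announced value.
        Returns (body_to_forward, was_stored).
        """
        actual = hashlib.sha256(body).hexdigest()
        if actual == announced_hash.lower():
            self._store[actual] = body
            return body, True
        # Mismatch: forward the response unchanged, store nothing.
        return body, False

    def lookup(self, announced_hash):
        """Return the validated body for this hash, or None on a miss."""
        return self._store.get(announced_hash.lower())
```

A body that does not hash to the announced value never enters the store,
so a later lookup for that hash cannot return it.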

One security concern is that an origin server sends a hash value that
does not match the representation in the body of the response message
(by mistake or on purpose). The client will then get a different body if
the cache system has a cache item that matches the hash value in the
Cache-NT header of the response from the origin server. I don't think
this is a drawback of my proposed caching mechanism - I think it is a
problem that we already have today: if the origin server (or an
intermediary in between) is compromised, the clients get malicious
content already today.

What do you think?

> With Cache-NT you are only avoiding the transfer of the content, but
> still incurring in the request to the backend server. Most of the
> times that is the expensive part, and before you reset the connection
> the backend would have probably sent you another 8 packets of
> information (the recommended initcwnd is 9 these days). If the
> request should be cached, better get the provider to configure cache
> properly to avoid doing the request altogether than this opportunistic
> but dangerous way of avoiding some extra transfers over the wire.

You are right, INITCWND can be 9 or larger, and if an HTTP transfer is
stopped, some KB will already have gone over the wire. Therefore the
caching mechanism proposed in my draft should only be applied to larger
representations (significantly larger than 20KB), e.g. larger images
or videos.
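
As a rough back-of-the-envelope check of these numbers, the sketch below
estimates how much of a body is already in flight when the transfer is
stopped. It is an optimistic best case: it assumes that only the initial
window is sent before the abort takes effect, and that the segment size
is 1460 bytes (a typical value on Ethernet with IPv4, not a figure from
the draft). The function names are made up for illustration.

```python
def initial_window_bytes(initcwnd_segments=9, mss_bytes=1460):
    """Bytes the server may send before it receives any feedback."""
    return initcwnd_segments * mss_bytes


def max_fraction_saved(content_length, initcwnd_segments=9, mss_bytes=1460):
    """Upper bound on the share of the body that is not transferred.

    Assumes the abort arrives before anything beyond the initial
    window has been sent; in practice more data may be in flight.
    """
    if content_length <= 0:
        raise ValueError("content_length must be positive")
    already_sent = min(content_length,
                       initial_window_bytes(initcwnd_segments, mss_bytes))
    return 1 - already_sent / content_length
```

With these assumptions the initial window is 9 * 1460 = 13140 bytes, so
a body of that size or smaller saves nothing, while for bodies of several
megabytes almost the whole transfer is avoided.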

Many content providers disable caching for several reasons:
implementation of access mechanisms (e.g. via cookies or session IDs),
user tracking, statistics (to evaluate the usage of a service or to
account for advertising), or transferring client-specific information in
the query string (as YouTube does, for example). They all want to
receive the client request and therefore disable caching (because in
[Part6] the client request terminates at the cache in case of a cache
hit). My draft builds a bridge: all headers are exchanged and caching is
still possible.

> Guille -bisho-
> <bisho@freedreams.org|fb.com>
> :wq
>

Thank you again for reading my draft and taking the time! I'm really
looking forward to your answer.

Chris
Received on Friday, 23 May 2014 08:06:38 UTC