Re: improved caching in HTTP: new draft from Chris Drechsler on 2014-05-28 (ietf-http-wg@w3.org from April to June 2014)

From: Chris Drechsler <chris.drechsler@etit.tu-chemnitz.de>
Date: Wed, 28 May 2014 13:51:08 +0200
To: Guille -bisho- <bishillo@gmail.com>
CC: ietf-http-wg@w3.org
Message-ID: <5385CDAC.1070101@etit.tu-chemnitz.de>
On 23.05.2014 19:19, Guille -bisho- wrote:
>>> Etag and If-None-Match already give a conditional get feature with
>>> hash that does not need a reset of the connection.
>>
>> You are right, Etag + If-None-Match is an option but there are some
>> drawbacks:
>>
>> 1) The Etag is only consistent within one domain. The SHA-256 hash value in
>> the Cache-NT header identifies the transferred representation absolutely
>> independent of the used URLs (and therefore across domains).
>
> I know, just saying that an If-Modified-Since is the way to go, not
> closing the connection prematurely, leaving packets in flight
> belonging to a closed connection.
>
>> 2) Caching via Etag + If-None-Match in [Part6] can only be used in
>> combination with the URL. If content providers use varying URLs for one
>> specific resource (e.g. due to load balancing/CDNs or session IDs in the
>> query string) then the cache system stores several copies of the same
>> resource.
>
> The cache system can do whatever they want. They can freely
> de-duplicate content if they wish to do so. They will of course need
> to download the content before being able to cache, that's true.
>
>>> Your proposal adds very little caching for bad practices. Nobody
>>> should be using two urls for the same resource, if you need load
>>> balancing between two cdns, you should be consistent and js1 always be
>>> requested from site1 and js2 from site2. And this is being improved by
>>> HTTP2 that will eliminate the need for sharding among domains to
>>> overcome limits of parallel requests and head-of-line blocking.
>>
>> I agree with you, nobody should be using two URLs for the same resource. But
>> reality looks different:
>>
>> 1) Due to load balancing/use of CDNs one specific resource is available via
>> different URLs. Especially larger ISPs with connects to several Internet
>> exchange points and/or several transit links are redirected to different
>> server locations of the same resource. This can change within minutes due to
>> changing BGP routes and/or due to load balance mechanisms of the content
>> producer/CDN provider.
>
> A single domain can be served from multiple locations without having a
> different url for each one. The sites that can afford having different
> datacenters to deliver content close to users can really solve that
> issue. Simply each user will resolve to a different ip depending on
> location, or single ip but with multiple bgp routes.
>
> Say you live in California. You will use google.com, not
> california.google.com to grab your content, yet you can bet you are
> connecting to servers very close to California.

This mechanism is often called DNS redirection and is/was also used in
the context of CDNs. It is actually very coarse grained in selecting a
server near the client because it only works at the domain/subdomain
level. From the point of view of HTTP and caching it has one big
advantage: it is absolutely transparent to the application layer.

Today the trend is to use URL rewriting/dynamic request redirection
(e.g. have a look at how YouTube redirects requests to servers in [1])
because it is finer grained. Unfortunately this is not transparent
to the application layer and results in different URLs for the same
content.

[1]
Adhikari, Vijay Kumar, et al. "Reverse engineering the youtube video 
delivery cloud." Proc. of IEEE Hot Topics in Media Delivery Workshop. 2011.
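
To illustrate the effect of varying URLs on a cache, here is a minimal
sketch. The URLs and the body are made up for the example, and the
hex-encoded SHA-256 digest as cache key is only an assumption about how
a hash-keyed cache could index its entries; it is not taken from the
draft.

```python
import hashlib

# Hypothetical example: one representation reachable via two URLs
# (different server name, extra session parameter in the query string).
body = b"same video bytes"
urls = [
    "http://server1.example.com/video?id=42",
    "http://server2.example.com/video?id=42&session=abc",
]

url_keyed = {}   # conventional cache: the key is the URL
hash_keyed = {}  # hash-based cache: the key is the SHA-256 of the body

for url in urls:
    url_keyed[url] = body
    hash_keyed[hashlib.sha256(body).hexdigest()] = body

# The URL-keyed cache holds two copies, the hash-keyed cache only one.
print(len(url_keyed), len(hash_keyed))
```

With the URL as key every new URL variant creates another copy, while
the hash key collapses them into a single entry.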

>> 2) URLs can change for another reason: changing parameters in the query
>> string. For example if content producers use personalization via session IDs
>> or implement access mechanisms via parameters in the query string then the
>> cache system would store several copies of the same content. Mostly caching
>> is disabled in this use case by the content producer (e.g. via cache-control
>> header).
>
> And if someone is doing that kind of things, you think you will be
> able to convince them to add a new http header to facilitate content
> caching?
>
> And anyway, the cache system can do de-duplication on its own, you
> can freely choose to hash the content and try to combine several cache
> entries.

There are several benefits: If caching within an ISP is used, then the
content can be located much closer to the clients than is possible from
outside the ISP. On the client side this means faster downloads
and better QoE. ISPs can profit from reduced interdomain traffic (which
is often costly), and content producers can improve their QoS and make
use of a natural load balancing mechanism.

>> The proposed caching mechanism in my draft exchanges all headers of request
>> and response messages so all information like parameters in the query string
>> are exchanged. There is no need to disable caching. The SHA-256 hash value
>> identifies resources independent of the used URL so varying URLs don't
>> matter.
>
> Caching is disabled on resources that are subject to be private for a
> user, not because any technical limitation.
>
> So if caching is not possible is because: a) There is some reason for
> it (logging, then with minimal payload and no benefits from your
> proposal) b) bad usage
>
> I agree that for bad usages your proposal might help, but the people
> doing things wrong are the least likely to start adding a new http
> header to their responses.
>
>>> The Cache-NT header can only be applied  within a domain, and even
>>> there is risky. A malicious user could inject malicious content with a
>>> Cache-NT header that matches other resource to poison the cache. Even
>>> if intermediate caches check the hash, there is still pre-image
>>> attacks, won't be hard to find a collision and append malicious code
>>> to a js file.
>>
>> I don't see how the cache can be poisoned. Can you please explain it in more
>> detail?
>>
>> I see the following: The used SHA-256 has a strong collision resistance so
>> it's nearly impossible to find two different inputs that result in the same
>> hash value. When the cache system receives a response with a specific hash
>> value in the Cache-NT header for the first time it computes the SHA-256
>> value on the received representation in the body. If both hash values are
>> equal the cache system stores a copy of the representation and uses it for
>> following requests. If they are not equal then nothing is stored (but the
>> response is still forwarded to the client). So the cache stores and uses
>> only validated content.
>
> Preimage attacks are the easier ones, and here we have hashes of file
> contents. I can grab a js file from google, and try to find a preimage
> collision suffixing js code until I found one. There has been a lot of
> advances on this: https://eprint.iacr.org/2009/479.pdf

As the authors conclude in this document, SHA-256 is secure and you will
currently not find preimage collisions (in acceptable time). I've
checked the literature for other publications (e.g. have a look at [2]),
but the result is always the same: no preimage collisions in acceptable
time.

[2]
Dmitry Khovratovich, Christian Rechberger and Alexandra Savelieva 
(2011). "Bicliques for Preimages: Attacks on Skein-512 and the SHA-2 
family". IACR Cryptology ePrint Archive. 2011:286

> So if you want to safely use a hash cross-domain you will need to
> avoid preimage, because even sha256 that is currently highly resistant
> will become eventually weaker.

If SHA-256 becomes weaker, then a new hash algorithm must be used for
the Cache-NT header (e.g. one of the SHA-3 family).

>> One security concern is that an origin server sends a hash value that does
>> not fit to the representation in the body of the response message (by
>> mistake or intention). Then the client will get a different body, if the
>> cache system has a cache item which fits to the hash value in the Cache-NT
>> header of the response from the origin server. I think this isn't a drawback
>> of my proposed caching mechanism - I think this is a problem that we have
>> already today: If the origin server is compromised (or intermediates in
>> between) the clients would get malicious content already today.
>
> The cache boxes will need to check the validity of the hash to be able
> to serve safely.

As long as SHA-256 fulfills the requirements of a cryptographic hash
algorithm, there is no need to check the validity of the hash.
Otherwise a new hash algorithm must be used (see above).

> Imagine:
> - google.com/some.js sha256=xxxx
> - You could trick the user to visit first badserver.com/malicious.js
> and serve it with sha256=xxx (same as google)
> - The cache system stores malicious.js as content for sha256 xxx.
> - user goes to google.com, but the some.js will contain malicious.js instead.

No, that will not work. The cache system computes the hash value again
and compares it to the hash value in the Cache-NT header. Only when both
are equal will the cache system reuse the content for subsequent requests.
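
A minimal sketch of this check, under some assumptions: the class and
method names are hypothetical, and the hash value is assumed to be
available as a hex string (the actual encoding in the Cache-NT header
may differ). It only shows the principle of storing validated content.

```python
import hashlib


class HashKeyedCache:
    """Stores a body only if it matches the advertised SHA-256 value."""

    def __init__(self):
        self._store = {}

    def offer(self, advertised_hex, body):
        # Recompute the hash over the received body; never trust the header.
        actual = hashlib.sha256(body).hexdigest()
        if actual != advertised_hex.lower():
            # Mismatch: nothing is stored (the response would still be
            # forwarded to the client, which is not modelled here).
            return False
        self._store[actual] = body
        return True

    def lookup(self, advertised_hex):
        # Returns the validated body, or None if there is no entry.
        return self._store.get(advertised_hex.lower())


# Poisoning attempt from the example above: malicious body, borrowed hash.
good = b"console.log('ok');"
bad = b"alert('evil');"
digest = hashlib.sha256(good).hexdigest()

cache = HashKeyedCache()
print(cache.offer(digest, bad))   # rejected, hash does not match the body
print(cache.offer(digest, good))  # accepted, hash matches the body
```

Because the entry is always stored under the recomputed value, a
response from badserver.com cannot occupy the slot for a hash that its
body does not produce.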

>>> With Cache-NT you are only avoiding the transfer of the content, but
>>> still incurring in the request to the backend server. Most of the
>>> times that is the expensive part, and before you reset the connection
>>> the backend would have probably sent you another 8 packets of
>>> information (the recommended initcwnd is 9 these days). If the
>>> request should be cached, better get the provider to configure cache
>>> properly to avoid doing the request altogether than this opportunistic
>>> but dangerous way of avoiding some extra transfers over the wire.
>>
>>
>> You are right, INITCWND can be 9 or larger and if an HTTP transfer is stopped
>> then some KB will go over the wire. Therefore the proposed caching mechanism
>> in my draft should only be applied for larger representations (significantly
>> larger than 20KB) e.g. like larger images or videos.
>
> People serving big files a lot of times will get caching right, believe me.

I don't think so. Have a look at YouTube traffic and how they exclude
shared caches in operator networks.

>
>> Many content providers disable caching for several reasons: implementation
>> of access mechanisms (e.g. via cookies or session IDs), user tracking,
>> statistics (to evaluate the usage of a service or to account advertisement),
>> transferring client specific information in the query string (e.g. like
>> youtube does). They all want to get the client request and disable caching
>> (as in [Part6] the client request terminates at the cache in case of a cache
>> hit). My draft is building a bridge: all headers are exchanged and caching
>> is possible.
>
> The way this is working currently with providers is that content
> providers with enough needs place their own boxes in the internet
> providers, close to the users and deliver content from there.

The problem is not that they serve the content from their boxes, the
problem is that they disable caching. It's better to serve popular
content from within the ISP network (reduced interdomain traffic,
faster downloads, better QoS/QoE).

Thanks again for the discussion!

Chris
Received on Wednesday, 28 May 2014 11:51:34 UTC