
Re: improved caching in HTTP: new draft

From: Amos Jeffries <squid3@treenet.co.nz>
Date: Sun, 01 Jun 2014 00:05:46 +1200
Message-ID: <5389C59A.9060909@treenet.co.nz>
To: ietf-http-wg@w3.org
On 28/05/2014 11:51 p.m., Chris Drechsler wrote:
> Am 23.05.2014 19:19, schrieb Guille -bisho-:
>>>> Etag and If-None-Match already give a conditional get feature with
>>>> hash that does not need a reset of the connection.
>>>
>>> You are right, Etag + If-None-Match is an option but there are some
>>> drawbacks:
>>>
>>> 1) The Etag is only consistent within one domain. The SHA-256 hash
>>> value in the Cache-NT header identifies the transferred representation
>>> absolutely independent of the used URLs (and therefore across domains).
>>
>> I know, just saying that an If-Modified-Since is the way to go, not
>> closing the connection prematurely, leaving packets in flight that
>> belong to a closed connection.
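
The revalidation flow being discussed can be sketched as a toy origin-side
handler (the function names and the SHA-256-derived ETag are illustrative
choices, not anything specified in the draft): on a match the origin answers
304 and the body is never re-sent, without any connection reset.

```python
import hashlib

def make_etag(body: bytes) -> str:
    # One common choice: a strong ETag derived from the representation.
    return '"%s"' % hashlib.sha256(body).hexdigest()

def handle_get(body: bytes, if_none_match):
    """Toy origin handler for a (conditional) GET."""
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, b""      # Not Modified: body is not re-sent
    return 200, body         # full response; client caches body + ETag

body = b"console.log('app');"
status1, payload1 = handle_get(body, None)             # first fetch
status2, payload2 = handle_get(body, make_etag(body))  # revalidation
```

The client pays one round trip for the revalidation, but no body bytes and no
torn-down connection.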
>>
>>> 2) Caching via Etag + If-None-Match in [Part6] can only be used in
>>> combination with the URL. If content providers use varying URLs for one
>>> specific resource (e.g. due to load balancing/CDNs or session IDs in the
>>> query string) then the cache system stores several copies of the same
>>> resource.
>>
>> The cache system can do whatever they want. They can freely
>> de-duplicate content if they wish to do so. They will of course need
>> to download the content before being able to cache, that's true.
>>
>>>> Your proposal adds very little caching for bad practices. Nobody
>>>> should be using two urls for the same resource, if you need load
>>>> balancing between two cdns, you should be consistent and js1 always be
>>>> requested from site1 and js2 from site2. And this is being improved by
>>>> HTTP2, which will eliminate the need for sharding among domains to
>>>> overcome limits on parallel requests and head-of-line blocking.
>>>
>>> I agree with you, nobody should be using two URLs for the same
>>> resource. But reality looks different:
>>>
>>> 1) Due to load balancing/use of CDNs one specific resource is
>>> available via different URLs. Especially in larger ISPs with
>>> connections to several Internet exchange points and/or several
>>> transit links, clients are redirected to different server locations
>>> of the same resource. This can change within minutes due to changing
>>> BGP routes and/or due to load balancing mechanisms of the content
>>> producer/CDN provider.
>>
>> A single domain can be served from multiple locations without having a
>> different url for each one. The sites that can afford having different
>> datacenters to deliver content close to users can really solve that
>> issue. Simply, each user will resolve to a different ip depending on
>> location, or a single ip but with multiple bgp routes.
>>
>> Say you live in California. You will use google.com, not
>> california.google.com, to grab your content, yet you can bet you are
>> connecting to servers very close to California.
> 
> This mechanism is often called DNS redirection and is/was also used in
> the context of CDNs. Actually it is very coarse-grained in selecting a
> server near the client because it only works at the domain/subdomain
> level. From the point of view of HTTP and caching it has one big
> advantage: it is absolutely transparent to the application layer.
> 
> Today the trend is to use URL rewriting/dynamic request redirection
> (e.g. have a look at how youtube redirects requests to servers in [1])
> because it is more fine-grained. Unfortunately this is not transparent
> to the application layer and results in different URLs for one specific
> piece of content.

YouTube is a good case to use here. Numerous people have been working on
ways to cache their content for years. Every time a method is found and
published the YT systems mysteriously change in ways which happen to
break just that published caching mechanism.

In fact the first caching method I know of having been broken explicitly
by their changes was to identify a SHA hash of the video file in the URL
and cache the file under that hash until another URL was fetched with
the same hash. Sound familiar? These days the videos are VBR-encoded
live to avoid hashes being duplicated, even for consecutive requests to
the same server with the same URL and HTTP request headers.

What does that tell you about that service's (and similar services')
willingness to pick up your proposed mechanism?

> 
> [1]
> Adhikari, Vijay Kumar, et al. "Reverse engineering the youtube video
> delivery cloud." Proc. of IEEE Hot Topics in Media Delivery Workshop. 2011.

That paper is 3 years old. YT have changed mechanisms at least 3 times
since it was published.


> 
>>> 2) URLs can change for another reason: changing parameters in the query
>>> string. For example if content producers use personalization via
>>> session IDs
>>> or implement access mechanisms via parameters in the query string
>>> then the
>>> cache system would store several copies of the same content. Mostly
>>> caching
>>> is disabled in this use case by the content producer (e.g. via
>>> cache-control
>>> header).
>>
>> And if someone is doing that kind of thing, do you think you will be
>> able to convince them to add a new http header to facilitate content
>> caching?
>>
>> And anyway, the cache system can do de-duplication on its own; you
>> can freely choose to hash the content and try to combine several cache
>> entries.
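
That cache-side de-duplication needs no new header at all. A minimal sketch
(class and URL names are made up for illustration): bodies are stored once,
keyed by their SHA-256, and URLs are just aliases into that store.

```python
import hashlib

class DedupCache:
    """Toy shared cache that de-duplicates bodies by their SHA-256."""
    def __init__(self):
        self.by_hash = {}   # sha256 hex -> body (each body stored once)
        self.by_url = {}    # url -> sha256 hex (alias table)

    def store(self, url, body):
        digest = hashlib.sha256(body).hexdigest()
        self.by_hash.setdefault(digest, body)   # keep only one copy
        self.by_url[url] = digest

    def lookup(self, url):
        digest = self.by_url.get(url)
        return None if digest is None else self.by_hash[digest]

cache = DedupCache()
# The same bytes fetched under two different URLs (e.g. CDN variants):
cache.store("http://cdn1.example/lib.js", b"var x = 1;")
cache.store("http://cdn2.example/lib.js?session=42", b"var x = 1;")
```

The cost, as noted above, is that the cache must download the content once per
URL before it can recognize the duplicate.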
> 
> There are several benefits: If caching within an ISP is used then the
> content can be located much closer to the clients than is possible from
> outside the ISP. On the side of the clients this means faster downloads
> and better QoE. ISPs can profit from reduced interdomain traffic (which
> is often costly) and content producers can improve their QoS and can
> make use of a natural load balancing mechanism.

Those are the same benefits the existing HTTP mechanisms provide. Your
proposal offers no gain over them in any of those respects.

> 
>>> The proposed caching mechanism in my draft exchanges all headers of
>>> request
>>> and response messages so all information like parameters in the query
>>> string
>>> are exchanged. There is no need to disable caching. The SHA-256 hash
>>> value
>>> identifies resources independent of the used URL so varying URLs don't
>>> matter.
>>
>> Caching is disabled on resources that are subject to be private for a
>> user, not because any technical limitation.
>>
>> So if caching is not possible it is because: a) there is some reason
>> for it (logging, then with minimal payload and no benefit from your
>> proposal), or b) bad usage.
>>
>> I agree that for bad usages your proposal might help, but the people
>> doing things wrong are the least likely to start adding a new http
>> header to their responses.
>>
>>>> The Cache-NT header can only be applied within a domain, and even
>>>> there it is risky. A malicious user could inject malicious content
>>>> with a Cache-NT header that matches another resource to poison the
>>>> cache. Even if intermediate caches check the hash, there are still
>>>> pre-image attacks; it won't be hard to find a collision and append
>>>> malicious code to a js file.
>>>
>>> I don't see how the cache can be poisoned. Can you please explain it
>>> in more
>>> detail?
>>>
>>> I see the following: SHA-256 has strong collision resistance, so it's
>>> nearly impossible to find two different inputs that result in the
>>> same hash value. When the cache system receives a response with a
>>> specific hash value in the Cache-NT header for the first time, it
>>> computes the SHA-256 value on the received representation in the
>>> body. If both hash values are equal the cache system stores a copy of
>>> the representation and uses it for following requests. If they are
>>> not equal then nothing is stored (but the response is still forwarded
>>> to the client). So the cache stores and uses only validated content.
>>
>> Preimage attacks are the easier ones, and here we have hashes of file
>> contents. I can grab a js file from google and try to find a preimage
>> collision, suffixing js code until I find one. There has been a lot of
>> progress on this: https://eprint.iacr.org/2009/479.pdf
> 
> As the authors conclude in this document SHA256 is secure and you will
> currently not find preimages (in acceptable time). I've checked the
> literature for other publications (e.g. have a look at [2]) but the
> result is always the same: no preimages in acceptable time.
> 
> [2]
> Dmitry Khovratovich, Christian Rechberger and Alexandra Savelieva
> (2011). "Bicliques for Preimages: Attacks on Skein-512 and the SHA-2
> family". IACR Cryptology ePrint Archive. 2011:286
> 
>> So if you want to safely use a hash cross-domain you will need to
>> resist preimage attacks, because even sha256, which is currently
>> highly resistant, will eventually become weaker.
> 
> If SHA256 becomes weaker then a new hash algorithm must be used for
> the Cache-NT header (e.g. one of the SHA3 family).

None of this matters. The attacker is free to deliver an arbitrary
compressed (or Range) version of an object while stating that its hash
in the identity form is XYZ. The cache is not able to verify the hash's
accuracy, yet is expected to deliver that object to all clients who
request *other* URLs which have valid hash XYZ.

Finding a collision is not required to generate a false "collision".

All an attacker needs to do is:
step 1) request the object they want to attack, see its hash header, and
replay that header on their attack payload object,
step 2) get the cache expiry details (in HTTP headers) of any valid
content at the victim URL,
step 3) fetch the attack URL repeatedly at the time of expiry, until the
attack payload is confirmed cached under the given hash.

Voila! Arbitrary content corruption from time X until the cache has had
enough "slow down period" to validate the hash independently.
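
The content-coding gap can be illustrated with a toy admission check
(hypothetical function; "Cache-NT" is the draft's proposed header): the cache
can only verify the claimed hash if it can recover the identity form of the
body. For any coding it cannot decode, trusting the sender's claim is exactly
the poisoning vector above, so the only safe answer is to refuse to cache.

```python
import gzip
import hashlib

def may_cache(headers: dict, raw_body: bytes) -> bool:
    """Admit an object under headers['Cache-NT'] only if the cache can
    independently recompute that hash over the identity form."""
    claimed = headers["Cache-NT"]
    coding = headers.get("Content-Encoding", "identity")
    if coding == "identity":
        identity = raw_body
    elif coding == "gzip":
        identity = gzip.decompress(raw_body)  # this cache knows gzip
    else:
        return False  # cannot verify -> must not cache under that hash
    return hashlib.sha256(identity).hexdigest() == claimed

victim = b"alert('genuine');"
victim_hash = hashlib.sha256(victim).hexdigest()

# Attacker replays the victim's hash on a compressed attack payload:
attack = gzip.compress(b"alert('evil');")
admitted = may_cache({"Cache-NT": victim_hash,
                      "Content-Encoding": "gzip"}, attack)
```

A cache that skips the decode-and-recompute step (or caches anyway when it
cannot decode) serves the attack payload to every client requesting the
victim URL.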


> 
>>> One security concern is that an origin server sends a hash value that
>>> does
>>> not fit to the representation in the body of the response message (by
>>> mistake or intention). Then the client will get a different body if
>>> the cache system has a cache item which fits the hash value in the
>>> Cache-NT header of the response from the origin server. I think this
>>> isn't a drawback of my proposed caching mechanism - I think this is a
>>> problem that we have already today: If the origin server is
>>> compromised (or intermediates in between) the clients would get
>>> malicious content already today.
>>
>> The cache boxes will need to check the validity of the hash to be able
>> to serve safely.
> 
> As long as SHA256 fulfills the requirements of a cryptographic hash
> algorithm then there is no need to check the validity of the hash.
> Otherwise a new hash algorithm must be used (see above).

Indeed, see above. If the cache does not validate the hash independently
of the sender, then the attacker does not even have to go to the bother
of hiding its attack object inside a content-encoding compression the
cache does not understand.


> 
>> Imagine:
>> - google.com/some.js sha256=xxxx
>> - You could trick the user to visit first badserver.com/malicious.js
>> and serve it with sha256=xxx (same as google)
>> - The cache system stores malicious.js as content for sha256 xxx.
>> - user goes to google.com, but the some.js will contain malicious.js
>> instead.
> 
> No, that will not work. The cache system computes the hash value again
> and compares it to the hash value in the Cache-NT header. Only when
> both are equal will the cache system reuse the content for following
> requests.
> 
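
That validate-then-store step, taken in isolation, can be sketched as follows
(toy code; the function name and sample values are made up). It does reject
the badserver.com scenario as described, for the identity form:

```python
import hashlib

def store_if_valid(cache: dict, cache_nt: str, body: bytes) -> bool:
    """Recompute SHA-256 over the received representation and admit the
    body under the Cache-NT value only on a match."""
    if hashlib.sha256(body).hexdigest() != cache_nt:
        return False            # response forwarded, but not cached
    cache[cache_nt] = body
    return True

cache = {}
genuine = b"var a = 1;"
genuine_hash = hashlib.sha256(genuine).hexdigest()

# badserver.com replays google's hash on malicious.js -> rejected:
poisoned = store_if_valid(cache, genuine_hash, b"var a = 666;")
stored = store_if_valid(cache, genuine_hash, genuine)
```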

"As long as SHA256 fulfills the requirements of a cryptographic hash
algorithm then there is no need to check the validity of the hash".

So surely the cache does not need to re-compute and check the validity
of the hash sent to it?

You contradict yourself.


>>>> With Cache-NT you are only avoiding the transfer of the content, but
>>>> still incurring the request to the backend server. Most of the time
>>>> that is the expensive part, and before you reset the connection the
>>>> backend would probably have sent you another 8 packets of
>>>> information (the recommended initcwnd is 9 these days). If the
>>>> request should be cached, better get the provider to configure
>>>> caching properly to avoid doing the request altogether than this
>>>> opportunistic but dangerous way of avoiding some extra transfers
>>>> over the wire.
>>>
>>>
>>> You are right, INITCWND can be 9 or larger and if an HTTP transfer
>>> is stopped then some KB will go over the wire. Therefore the proposed
>>> caching mechanism in my draft should only be applied for larger
>>> representations (significantly larger than 20KB), e.g. larger images
>>> or videos.
>>
>> People serving big files a lot of times will get caching right,
>> believe me.
> 
> I don't think so. Have a look at youtube traffic and how they exclude
> shared caches in operator networks.

YT are a special case. They are *actively* changing things in ways that
prevent caching. For unexplained reasons.


>>
>>> Many content providers disable caching for several reasons:
>>> implementation of access mechanisms (e.g. via cookies or session
>>> IDs), user tracking, statistics (to evaluate the usage of a service
>>> or to account for advertisement), transferring client-specific
>>> information in the query string (e.g. like youtube does). They all
>>> want to get the client request, so they disable caching (as in
>>> [Part6] the client request terminates at the cache in case of a
>>> cache hit). My draft is building a bridge: all headers are exchanged
>>> and caching is possible.
>>
>> The way this is working currently with providers is that content
>> providers with enough needs place their own boxes in the internet
>> providers, close to the users and deliver content from there.
> 
> The problem is not that they serve the content from their boxes, the
> problem is that they disable caching. It's better to serve popular
> content from within the ISP network (reduced interdomain traffic,
> faster downloads, better QoS/QoE).

The answer then is to get them to stop actively disabling caching and
start using the available caching mechanisms properly, for both their
benefit and everyone else's. This is not a new issue, and the problem is
generally one of developer understanding and/or business policy - not a
lack of caching mechanisms.
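
For instance, a provider that actually wants shared caches to participate
could mark responses explicitly cacheable with the existing headers
(illustrative values only, not a recommendation from this thread):

```
Cache-Control: public, max-age=3600, s-maxage=86400
ETag: "v1-8f3a2c"
Vary: Accept-Encoding
```

"public" plus "s-maxage" lets shared (ISP) caches hold the object, and the
ETag gives them a cheap revalidation path when it expires.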

Amos
Received on Saturday, 31 May 2014 12:06:19 UTC
