Re: improved caching in HTTP: new draft from Guille -bisho- on 2014-05-23 (ietf-http-wg@w3.org from April to June 2014)

From: Guille -bisho- <bishillo@gmail.com>
Date: Fri, 23 May 2014 10:19:36 -0700
To: Chris Drechsler <chris.drechsler@etit.tu-chemnitz.de>
Cc: ietf-http-wg@w3.org
Message-ID: <CAMSE37u06y24795N4AVWf7wwAdkrGgFWmfA5KthSQVzBaK35qQ@mail.gmail.com>
>> Etag and If-None-Match already give a conditional get feature with
>> hash that does not need a reset of the connection.
>
> You are right, Etag + If-None-Match is an option but there are some
> drawbacks:
>
> 1) The Etag is only consistent within one domain. The SHA-256 hash value in
> the Cache-NT header identifies the transferred representation absolutely
> independent of the used URLs (and therefore across domains).

I know. I'm just saying that a conditional request such as
If-Modified-Since is the way to go, rather than closing the connection
prematurely and leaving packets in flight that belong to a closed
connection.
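
For reference, the conditional GET mentioned above can be sketched as
follows. This is only a minimal illustration: the function names and the
way the ETag is derived (a truncated SHA-256 of the body) are assumptions
for the example, since a server is free to pick any ETag scheme.

```python
import hashlib


def make_etag(body: bytes) -> str:
    # Assumption: a strong ETag derived from the body contents.
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(request_headers: dict, body: bytes):
    """Return (status, headers, payload) for a GET, honoring If-None-Match."""
    etag = make_etag(body)
    raw = request_headers.get("If-None-Match", "")
    candidates = [tag.strip() for tag in raw.split(",")]
    if etag in candidates or "*" in candidates:
        # The client already has this representation: no body is sent
        # and the connection stays open for further requests.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body
```

The revalidation costs one round trip and no body bytes, and nothing has
to be reset.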

> 2) Caching via Etag + If-None-Match in [Part6] can only be used in
> combination with the URL. If content providers use varying URLs for one
> specific resource (e.g. due to load balancing/CDNs or session IDs in the
> query string) then the cache system stores several copies of the same
> resource.

The cache system can do whatever it wants. It can freely
de-duplicate content if it wishes to do so. It will of course need
to download the content before being able to cache it, that's true.

>> Your proposal adds very little caching for bad practices. Nobody
>> should be using two URLs for the same resource. If you need load
>> balancing between two CDNs, you should be consistent: js1 should always
>> be requested from site1 and js2 from site2. And this is being improved
>> by HTTP2, which will eliminate the need for sharding among domains to
>> overcome limits on parallel requests and head-of-line blocking.
>
> I agree with you, nobody should be using two URLs for the same resource.
> But reality looks different:
>
> 1) Due to load balancing/use of CDNs one specific resource is available via
> different URLs. Especially larger ISPs with connections to several Internet
> exchange points and/or several transit links are redirected to different
> server locations of the same resource. This can change within minutes due to
> changing BGP routes and/or due to load balance mechanisms of the content
> producer/CDN provider.

A single domain can be served from multiple locations without having a
different URL for each one. The sites that can afford to have different
datacenters to deliver content close to users can really solve that
issue. Each user will simply resolve to a different IP depending on
location, or to a single IP with multiple BGP routes.

Say you live in California. You will use google.com, not
california.google.com, to grab your content, yet you can bet you are
connecting to servers very close to California.

> 2) URLs can change for another reason: changing parameters in the query
> string. For example if content producers use personalization via session IDs
> or implement access mechanisms via parameters in the query string then the
> cache system would store several copies of the same content. Mostly caching
> is disabled in this use case by the content producer (e.g. via cache-control
> header).

And if someone is doing that kind of thing, do you think you will be
able to convince them to add a new HTTP header to facilitate content
caching?

And anyway, the cache system can do de-duplication on its own: you
can freely choose to hash the content and try to combine several cache
entries.
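
A rough sketch of that kind of cache-side de-duplication, with no new
header involved. The class and method names are made up for the example,
and a real cache would also have to respect freshness and Vary rules,
which are left out here.

```python
import hashlib


class DedupCache:
    """Stores each distinct body once, however many URLs point at it."""

    def __init__(self):
        self.url_to_digest = {}  # URL -> SHA-256 hex digest of its body
        self.blobs = {}          # digest -> body bytes (one copy per digest)

    def store(self, url: str, body: bytes) -> str:
        # The body must already be downloaded before it can be hashed.
        digest = hashlib.sha256(body).hexdigest()
        self.blobs.setdefault(digest, body)
        self.url_to_digest[url] = digest
        return digest

    def lookup(self, url: str):
        digest = self.url_to_digest.get(url)
        return None if digest is None else self.blobs[digest]
```

Two URLs with identical content end up sharing a single stored copy, but
lookups are still keyed by URL, so nothing is served across domains on
the strength of a hash alone.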

> The proposed caching mechanism in my draft exchanges all headers of request
> and response messages so all information like parameters in the query string
> are exchanged. There is no need to disable caching. The SHA-256 hash value
> identifies resources independent of the used URL so varying URLs don't
> matter.

Caching is disabled on resources that are meant to be private to a
user, not because of any technical limitation.

So if caching is not possible, it is because of either: a) some reason
for it (e.g. logging, in which case the payload is minimal and gets no
benefit from your proposal), or b) bad usage.

I agree that for bad usages your proposal might help, but the people
doing things wrong are the least likely to start adding a new HTTP
header to their responses.

>> The Cache-NT header can only be applied within a domain, and even
>> there it is risky. A malicious user could inject malicious content with
>> a Cache-NT header that matches another resource to poison the cache.
>> Even if intermediate caches check the hash, there are still pre-image
>> attacks: it won't be hard to find a collision and append malicious code
>> to a js file.
>
> I don't see how the cache can be poisoned. Can you please explain it in more
> detail?
>
> I see the following: The used SHA-256 has a strong collision resistance so
> it's nearly impossible to find two different inputs that result in the same
> hash value. When the cache system receives a response with a specific hash
> value in the Cache-NT header for the first time it computes the SHA-256
> value on the received representation in the body. If both hash values are
> equal the cache system stores a copy of the representation and uses it for
> following requests. If they are not equal then nothing is stored (but the
> response is still forwarded to the client). So the cache stores and uses
> only validated content.

Preimage attacks are the easier ones, and here we have hashes of file
contents. I can grab a js file from Google and try to find a preimage
collision by appending js code until I find one. There have been a lot
of advances on this: https://eprint.iacr.org/2009/479.pdf

So if you want to safely use a hash cross-domain you will need to
guard against preimage attacks, because even SHA-256, which is currently
highly resistant, will eventually become weaker.

> One security concern is that an origin server sends a hash value that does
> not match the representation in the body of the response message (by
> mistake or intention). Then the client will get a different body if the
> cache system has a cache item that matches the hash value in the Cache-NT
> header of the response from the origin server. I think this isn't a drawback
> of my proposed caching mechanism - I think this is a problem that we have
> already today: If the origin server is compromised (or intermediates in
> between) the clients would get malicious content already today.

The cache boxes will need to check the validity of the hash to be able
to serve safely.

Imagine:
- google.com/some.js is served with sha256=xxxx
- You could trick the user into visiting badserver.com/malicious.js first
  and serve it with sha256=xxxx (same as google)
- The cache system stores malicious.js as the content for sha256 xxxx.
- The user goes to google.com, but some.js will contain malicious.js instead.
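
The scenario above can be replayed with a toy hash-keyed cache. This is
only a sketch under simplifying assumptions: the class, the `verify`
switch and the sample bodies are invented for the example, and the
Cache-NT exchange itself is reduced to a single method call.

```python
import hashlib


class HashKeyedCache:
    """Toy cache keyed only by the hash value claimed in the response."""

    def __init__(self, verify: bool):
        self.verify = verify  # recompute SHA-256 before storing?
        self.by_hash = {}

    def on_response(self, claimed_sha256: str, body: bytes) -> bytes:
        """Return the body that ends up being served to the client."""
        if claimed_sha256 in self.by_hash:
            # Cache hit: the stored copy replaces the origin's body.
            return self.by_hash[claimed_sha256]
        computed = hashlib.sha256(body).hexdigest()
        if not self.verify or computed == claimed_sha256:
            self.by_hash[claimed_sha256] = body
        return body


good = b"/* some.js from google.com */"
evil = b"/* malicious.js from badserver.com */"
good_hash = hashlib.sha256(good).hexdigest()


def replay(cache: HashKeyedCache) -> bytes:
    cache.on_response(good_hash, evil)         # badserver.com lies about the hash
    return cache.on_response(good_hash, good)  # later visit to google.com
```

With `verify=False` the second call hands back the malicious body; with
`verify=True` the mismatching body is never stored and the genuine one
is served. Verification only protects for as long as finding a second
input with the same hash stays infeasible.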

>> With Cache-NT you are only avoiding the transfer of the content, but
>> still incurring the request to the backend server. Most of the
>> time that is the expensive part, and before you reset the connection
>> the backend would probably have sent you another 8 packets of
>> information (the recommended initcwnd is 9 these days). If the
>> request should be cached, better to get the provider to configure caching
>> properly and avoid the request altogether than to use this opportunistic
>> but dangerous way of avoiding some extra transfers over the wire.
>
>
> You are right, INITCWND can be 9 or larger and if an HTTP transfer is stopped
> then some KB will go over the wire. Therefore the proposed caching mechanism
> in my draft should only be applied for larger representations (significantly
> larger than 20KB) e.g. like larger images or videos.

People serving big files will get caching right most of the time, believe me.
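
As a back-of-the-envelope check on the figures quoted above: the 1460-byte
segment size below is an assumption (a common value on Ethernet paths),
not something stated in the thread, and the 9-segment window is the figure
used in the discussion.

```python
MSS_BYTES = 1460        # assumed TCP payload per segment
INITCWND_SEGMENTS = 9   # initial congestion window quoted in the thread


def bytes_in_flight(segments: int, mss: int = MSS_BYTES) -> int:
    """Upper bound on the data sent before the sender sees any feedback."""
    return segments * mss


wasted = bytes_in_flight(INITCWND_SEGMENTS)
print(wasted)  # 13140 bytes, roughly 13 KB
```

So an aborted transfer can already have put about 13 KB on the wire,
which is a large share of a 20 KB representation.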

> Many content providers disable caching for several reasons: implementation
> of access mechanisms (e.g. via cookies or session IDs), user tracking,
> statistics (to evaluate the usage of a service or to account for advertising),
> transferring client specific information in the query string (e.g. like
> youtube does). They all want to get the client request and disable caching
> (as in [Part6] the client request terminates at the cache in case of a cache
> hit). My draft is building a bridge: all headers are exchanged and caching
> is possible.

The way this currently works is that content providers with enough
need place their own boxes inside the internet providers' networks,
close to the users, and deliver content from there.

-- 
Guille -ℬḭṩḩø- <bishillo@gmail.com>
:wq
Received on Friday, 23 May 2014 17:20:24 UTC