Re: improved caching in HTTP: new draft from Guille -bisho- on 2014-06-02 (ietf-http-wg@w3.org from April to June 2014)

From: Guille -bisho- <bishillo@gmail.com>
Date: Mon, 2 Jun 2014 12:17:42 -0700
To: Chris Drechsler <chris.drechsler@etit.tu-chemnitz.de>
Cc: ietf-http-wg@w3.org
Message-ID: <CAMSE37tiUy10XMPjPEF8bTSqBE9sJkv+p++X5CBP_5wd5zFzqQ@mail.gmail.com>
> This mechanism is often called DNS redirection and is/was also used in the
> context of CDNs. Actually it is very coarse grained in selecting a server
> near the client because it only works on domain/subdomain level. From the
> point of HTTP and caching it has one big advantage: it is absolutely
> transparent to the application layer.
>
> Today the trend is to use URL rewriting/dynamic request redirection (e.g.
> have a look at how youtube redirects requests to servers in [1]) because it
> is more fine grained. Unfortunately this is not transparent to the
> application layer and results in different URLs for one specific content.

Because for a youtube video the cost of a redirect is minimal compared
to the savings of serving the content from a chosen location. But by
doing this, they benefit much less from caching, because they already
have the content available from many locations close to the users.

For other content, such as images, where low latency matters much
more, those redirects/multiple URLs are not used.

>> And if someone is doing that kind of things, you think you will be
>> able to convince them to add a new http header to facilitate content
>> caching?
>>
>> And anyway, the cache system can do de-duplication on it's own, you
>> can freely choose to hash the content and try to combine several cache
>> entries.
>
> There are several benefits: If caching within an ISP is used then the
> content can be located much closer to the clients as it is possible from
> outside the ISP. On the side of the clients this means faster downloads and
> better QoE. ISPs can profit from reduced Interdomain traffic (which is often
> costly) and content producer can improve their QoS and can make use of a
> natural load balancing mechanism.

I know what you are proposing, I'm just skeptical that this will work.
It is only useful for big files, the performance and behavior of
ISP-provided caches has been terrible in my experience, and big
players will very much prefer to control their delivery on their own.

For youtube, for example, it is not uncommon for users to skip video
chunks, and they probably measure that.

>> Preimage attacks are the easier ones, and here we have hashes of file
>> contents. I can grab a js file from google, and try to find a preimage
>> collision suffixing js code until I found one. There has been a lot of
>> advances on this: https://eprint.iacr.org/2009/479.pdf
>
> As the authors conclude in this document SHA256 is secure and you will
> currently not find preimage collisions (in acceptable time). I've checked
> the literature for other publications (e.g. have a look at [2]) but the
> result is always the same: no preimage collisions in acceptable time.
>
> [2]
> Dmitry Khovratovich, Christian Rechberger and Alexandra Savelieva (2011).
> "Bicliques for Preimages: Attacks on Skein-512 and the SHA-2 family". IACR
> Cryptology ePrint Archive. 2011:286

For now... if you want to propose this as a standard you must provide
a mechanism for changing and adapting the hash in use in the future.

Some years back, MD5 was a fine hash; imagine how silly an RFC would
look today if it required specifically MD5 for hashing contents.

>> So if you want to safely use a hash cross-domain you will need to
>> avoid preimage, because even sha256 that is currently highly resistant
>> will become eventually weaker.
>
> If SHA256 will become weaker then a new hash algorithm must be used for the
> Cache-NT header (e.g. one of the SHA3 family).

So, as I said, provide that mechanism. You can't hardcode a hash
in a standard. When sha256 becomes broken, will we need Cache-NT2?
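
To make the point concrete, here is a minimal sketch of what an
algorithm-agile value could look like. This is only an illustration:
the "label=hexdigest" syntax, the labels, and the function names are
my assumptions, not something taken from the draft.

```python
import hashlib

# Hypothetical registry: label carried in the header -> hashlib name.
# New algorithms are added here without inventing a new header.
SUPPORTED = {"sha-256": "sha256", "sha3-256": "sha3_256"}

def parse_labeled_digest(value):
    """Split 'label=hexdigest' into (label, hexdigest); reject unknown labels."""
    label, sep, hexdigest = value.strip().partition("=")
    label = label.lower()
    if not sep or label not in SUPPORTED:
        raise ValueError("unsupported or malformed digest: %r" % value)
    return label, hexdigest.lower()

def matches(value, body):
    """Return True if body hashes to the digest named in value."""
    label, expected = parse_labeled_digest(value)
    actual = hashlib.new(SUPPORTED[label], body).hexdigest()
    return actual == expected
```

With something like this, retiring a broken hash means dropping one
entry from the table, and a cache that does not know a label simply
refuses to use the value.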

>>> One security concern is that an origin server sends a hash value that
>>> does
>>> not fit to the representation in the body of the response message (by
>>> mistake or intention). Then the client will get a different body, if the
>>> cache system has an cache item which fits to the hash value in the
>>> Cache-NT
>>> header of the response from the origin server. I think this isn't a
>>> drawback
>>> of my proposed caching mechanism - I think this is a problem that we have
>>> already today: If the origin server is compromised (or intermediates in
>>> between) the clients would get malicious content already today.
>>
>> The cache boxes will need to check the validity of the hash to be able
>> to serve safely.
>
> As long as SHA256 fulfill the requirements of an cryptographic hash
> algorithm then there is no need to check the validity of the hash. Otherwise
> a new hash algorithm must be used (see above).

Yes, the cache boxes need to check the hashes the first time as you
mention below.

>> Imagine:
>> - google.com/some.js sha256=xxxx
>> - You could trick the user to visit first badserver.com/malicious.js
>> and serve it with sha256=xxx (same as google)
>> - The cache system stores malicious.js as content for sha256 xxx.
>> - user goes to google.com, but the some.js will contain malicious.js
>> instead.
>
> No, that will not work. The cache system computes the hash value again and
> compares it to the hash value in the Cache-NT header. Only when both are
> equal the cache system will reuse the content for following requests.

That's what I was saying. The cache system needs to compute the hash.
Then, as long as you trust the hash, the cache system does not need
to download the file again from a new source, but again I see little
benefit in all this complexity. The latency is still there because
you go back to origin to ask for the hash, so it doesn't help for
low-latency needs and will only help for big files.
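
A rough sketch of the check we are both describing, assuming a plain
hex SHA-256 value and an in-memory store. The class and method names
are made up for illustration; a real cache box will look different.

```python
import hashlib

class VerifyingCache:
    """Toy content-addressed cache that never trusts an advertised hash."""

    def __init__(self):
        self._store = {}  # hex digest -> body bytes

    def lookup(self, advertised):
        """Return the cached body for this digest, or None on a miss."""
        return self._store.get(advertised.lower())

    def offer(self, advertised, body):
        """Store body only if it really hashes to the advertised digest."""
        actual = hashlib.sha256(body).hexdigest()
        if actual != advertised.lower():
            return False  # mismatch: do not poison the shared entry
        self._store[actual] = body
        return True

cache = VerifyingCache()
good = b"console.log('ok');"
good_hash = hashlib.sha256(good).hexdigest()

# badserver.com advertises the good hash but sends a different body.
poisoned = cache.offer(good_hash, b"alert('evil');")
miss = cache.lookup(good_hash)

# The genuine response verifies and is stored for later requests.
stored = cache.offer(good_hash, good)
hit = cache.lookup(good_hash)
```

Note that the cache still has to read the whole body once to verify
it, and every later request still needs a response from the origin
carrying the hash before lookup() can be used.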

>> People serving big files a lot of times will get caching right, believe
>> me.
>
> I don't think so. Have look at youtube traffic and how they exclude shared
> caches in operator networks.

That's probably because they choose to do so for content-policy
reasons. They have to pay royalties (so they need to measure what
users watch, even which parts of a video), and they already have
their own distribution network, so they can deliver the content at
low cost. Why trust infrastructure they can't control and tune?

>> The way this is working currently with providers is that content
>> providers with enough needs place their own boxes in the internet
>> providers, close to the users and deliver content from there.
>
> The problem is not that they serve the content from their boxes, the problem
> is that they disable caching. It's better to serve popular contents from
> within the ISP network (reduced Interdomain traffic, faster downloads,
> better QoS/QoE)

Again, they do that on purpose. I highly doubt they will add this
Cache-NT header when they are explicitly disabling caches for their
content.

--
Guille -bisho-
<bisho@freedreams.org|fb.com>
:wq
Received on Monday, 2 June 2014 19:18:30 UTC