[whatwg] Subresource Integrity-based caching from Alex Jordan on 2017-03-02 (public-whatwg-archive@w3.org from March 2017)

From: Alex Jordan <alex@strugee.net>
Date: Wed, 1 Mar 2017 20:59:32 -0500
To: whatwg@whatwg.org
Message-ID: <20170302015932.GG12470@Alexs-MacBook-Pro>

Heya!

So recently I've been thinking about caching on the web and think I've
come up with a pretty neat trick to improve things. However, before I
go file a bunch of bugs against browsers I thought it prudent to get
feedback from spec folks.

Here's the basic problem: say I want to include jQuery in a page. I
have two options: host it myself, or use a CDN. If I host it myself,
then I don't get caching benefits for first-time visitors because they
(obviously) haven't visited my page and requested jQuery from my
domain before. Using a sufficiently widespread CDN will fix this for
me, because the more widespread the CDN is, the more likely the user
is to have encountered a page using that CDN. However, this is
somewhat problematic because it leaks data to the CDN operator.

The fundamental issue is that there isn't a direct correspondence
between a resource's _address_ and the resource _itself_. In
other words, jQuery 2.0.0 on my domain and jQuery 2.0.0 on the Google
CDN are the exact same resource in terms of content, but are
considered different because they have different addresses.

Here's the proposal: when browsers encounter a <script> tag, etc. with
an integrity= attribute, they try to find a resource in their cache
that matches the specified hash. If one is found, it is used
regardless of the domain it originated from (and thus was cached
for).
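
To make the lookup concrete, here's a rough sketch in Python of what I
have in mind. Every name in it (HashKeyedCache, load, the fetch callback)
is made up for illustration and doesn't correspond to any real browser
API. It also simplifies how integrity= is parsed: it just splits on
whitespace and accepts any matching hash, which is looser than what the
SRI spec actually does.

```python
import base64
import hashlib
from typing import Callable, Dict, Optional

# Hypothetical sketch; none of these names exist in any browser.
SUPPORTED_ALGORITHMS = ("sha256", "sha384", "sha512")


def integrity_value(body: bytes, algorithm: str = "sha384") -> str:
    """Return an SRI-style string of the form '<algorithm>-<base64 digest>'."""
    digest = hashlib.new(algorithm, body).digest()
    return algorithm + "-" + base64.b64encode(digest).decode("ascii")


class HashKeyedCache:
    """A cache indexed by content hash instead of by URL."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, bytes] = {}

    def store(self, body: bytes) -> None:
        # Index under every supported algorithm so any integrity= value can hit.
        for algorithm in SUPPORTED_ALGORITHMS:
            self._by_hash[integrity_value(body, algorithm)] = body

    def lookup(self, integrity: str) -> Optional[bytes]:
        # Simplification: accept a match on any whitespace-separated hash.
        for candidate in integrity.split():
            body = self._by_hash.get(candidate)
            if body is not None:
                return body
        return None


def load(url: str, integrity: str, cache: HashKeyedCache,
         fetch: Callable[[str], bytes]) -> bytes:
    """Load a subresource, preferring a hash match over the network."""
    cached = cache.lookup(integrity)
    if cached is not None:
        return cached  # hit: the origin of the cached copy is irrelevant

    body = fetch(url)
    for candidate in integrity.split():
        algorithm = candidate.split("-", 1)[0]
        if algorithm in SUPPORTED_ALGORITHMS and \
                integrity_value(body, algorithm) == candidate:
            cache.store(body)
            return body
    raise ValueError("integrity check failed for " + url)
```

With this, loading jQuery from a CDN URL and then from my own domain with
the same integrity= value would only hit the network once.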

Some notes:

1. This is very similar to existing cache-busting techniques employed
by websites today, just baked into the browser.

2. This could potentially be a carrot used to encourage adoption of
Subresource Integrity, because it confers a significant performance
benefit.

3. This sidesteps existing HTTP caching and will probably
ignore/violate some HTTP caching semantics. That's okay, though,
because the fact that it's based on a hash guarantees that the cache
matches what would've been sent over the network - if these were
different, the hash wouldn't match and the mechanism wouldn't kick in.

4. In cases where the integrity= attribute matches some resource in
the user's cache, but not what would normally be returned from the
server, the request will succeed where it otherwise would have
failed. I don't _think_ this is a problem but it *is* technically a
possible fingerprinting vector. The risks are similar to those
associated with intermediate CA certificate caching, which is already
shipping and AFAIK is considered an acceptable risk.
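
To illustrate note 4, here's a small Python sketch of how a page could
probe the cache. It's a toy model under my own assumptions: the hash cache
is just a set of integrity strings, load_script stands in for a <script>
load that either fires onload or onerror, and prober.example is a made-up
host whose URL always fails.

```python
import base64
import hashlib
from typing import Callable, Set


def sri(body: bytes) -> str:
    """Return a 'sha384-<base64 digest>' integrity string for body."""
    digest = hashlib.sha384(body).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


def load_script(url: str, integrity: str, hash_cache: Set[str],
                fetch: Callable[[str], bytes]) -> bool:
    """Return True if the script would load (onload), False otherwise (onerror)."""
    if integrity in hash_cache:
        return True  # served from the hash cache; the network is never asked
    try:
        body = fetch(url)
    except OSError:
        return False
    return sri(body) == integrity


def always_fails(url: str) -> bytes:
    # Stand-in for a server that never serves the requested resource.
    raise OSError("404 Not Found: " + url)


def user_has_resource(integrity: str, hash_cache: Set[str]) -> bool:
    # The probing page points at a URL that cannot succeed over the network,
    # so a successful load can only mean the hash was already cached.
    return load_script("https://prober.example/missing.js",
                       integrity, hash_cache, always_fails)
```

A page that knows the hash of some site-specific script could call
something like user_has_resource to learn whether the user has visited
that site, without ever serving the script itself.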

Anyway, this email is long enough already but I'd love to hear
thoughts about things I've missed, etc.

Cheers!

AJ

Received on Thursday, 2 March 2017 02:00:09 UTC