W3C home > Mailing lists > Public > public-html@w3.org > August 2013

Re: [Feature Proposal] New attributes "library" and "version" on script tags

From: Nathanael D. Jones <nathanael.jones@gmail.com>
Date: Sun, 11 Aug 2013 17:41:54 -0400
Message-ID: <CAG3DbfVtKz0psS7f_wFQ-r7y4Xr1J7xGmuh-OK3o7L+3HPkzwQ@mail.gmail.com>
To: François REMY <francois.remy.dev@outlook.com>
Cc: "Patrick H. Lauke" <redux@splintered.co.uk>, HTML WG LIST <public-html@w3.org>, Glenn Adams <glenn@skynav.com>
Francois, please do *research* before spreading FUD about hashes; they're
already poorly understood by the general public.

*TL;DR: We'd have to store 1 trillion petabytes per atom on earth to have a
1 in a trillion chance of a random collision in a 512-bit space. Any other
identifier is redundant, pointless, and error-prone.* A 256-bit SHA-2 hash
is also *just fine*, and should be safe from cache-poisoning attempts for a
very, very long time. As a 256-bit hash is 32 bytes instead of 64, it may
make sense to support both.

Since this is an opt-in feature, high-security sites can always choose not
to use it... but they'd have to be rather superstitious to believe that
the first SHA-2 hash collision would be used to exploit their site instead
of achieving worldwide fame.

If a fundamental weakness is ever found in SHA-2, browsers can simply
disable the optimization. Not hard, and easily versioned.

There are a lot of good syntaxes available:

  script sha2-256="value" sha2-512="value" sha3-512="value"

  script hash2="(256-bit SHA-2 hash)" hash3="(512-bit SHA-2 hash)"
hash4="(512-bit SHA-3 hash)"

  script hash="sha2-256:base64uhash;sha2-512:base64uhash"
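As a sketch of how cheap these values would be to produce in a build step
(the helper name, and the reading of "base64uhash" as URL-safe base64 with
padding stripped, are my assumptions, not part of the proposal):

```python
# Hypothetical build-step sketch emitting the third syntax above.
import base64
import hashlib

def script_hash_attr(source: bytes) -> str:
    """Build a hash="..." value listing SHA2-256 and SHA2-512 digests."""
    parts = []
    for label, algo in (("sha2-256", hashlib.sha256),
                        ("sha2-512", hashlib.sha512)):
        digest = algo(source).digest()
        # Assumed "base64uhash" encoding: URL-safe base64, no padding.
        b64 = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
        parts.append(label + ":" + b64)
    return ";".join(parts)

js = b"console.log('hello');"
print('<script hash="%s" src="app.js"></script>' % script_hash_attr(js))
```

A text-editor or templating plugin could rewrite the attribute whenever the
referenced file changes, which is the automation argued for below.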

-----

>> Wait, what? This is absurd. Every hash system has conflicts, by
>> definition. You can even calculate how many conflicts there exist: you
>> simply divide the number of possible files of a certain length by the
>> number of possible hashes. Let's suppose all JS files are exactly 64
>> kilobytes long, there are exactly 1000 files that share the same hash.
>> Arguably, the probability that those files are javascript files is very
>> low (most of them will be garbage) but you cannot base your system on
>> something that's only based on an hash, it doesn't make sense.

*False*. There are not 1,000 files that share that hash; there are
potentially 1,000 64 KB sequences of binary data that may produce the same
hash, but those sequences do *NOT* exist in the real world, and *cannot be
intentionally discovered in a trillion years of linear computation, even
with massive resources.*

Let's review how many files can be uniquely identified by a 512-bit SHA-2
hash (2^512 ≈ 1.3 x 10^154):

13407807929942597099574024998205846127479365820592393377723561443721764030073546976801874298166903427690031858186486050853753882811946569946433649006084096

And by a 256-bit SHA-2 hash (2^256 ≈ 1.2 x 10^77):

115792089237316195423570985008687907853269984665640564039457584007913129639936
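Those two counts are simply 2^512 and 2^256, which any arbitrary-precision
tool reproduces digit for digit:

```python
# The number of distinct values for each digest size.
print(2 ** 512)  # the 155-digit count quoted above
print(2 ** 256)  # the 78-digit count quoted above
```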

Google has crawled 1 trillion unique URLs (10^12):

1000000000000

Even if all of those URLs had 1,000 sub-resources each (10^15):

1000000000000000
Let's try to prove your point with the birthday paradox
(http://en.wikipedia.org/wiki/Birthday_attack).
We can estimate the number of files (n) that can safely be handled by a
hash of a given size in bits (b) and an acceptable collision probability
(p) using, with H = 2^b:

  p(n;H) ≈ 1 - e^(-n(n-1)/(2H)) ≈ 1 - e^(-n^2/(2H))

http://www.wolframalpha.com/input/?i=p%3D+1-e%5E%28-n%5E2%2F%282%282%5Eb%29%29%29%3B+b%3D512%3B+p%3D10%5E-12

The result for a 512-bit hash is 1.6 x 10^71:

163617000000000000000000000000000000000000000000000000000000000000000000

Even in this best/worst-case scenario, you're wrong by some 56 orders of
magnitude; plenty of room for the internet to grow. *We'd have to store 1
trillion petabytes per atom on earth to have a 1 in a trillion chance of a
random collision.*

For a 256-bit hash:
http://www.wolframalpha.com/input/?i=p%3D+1-e%5E%28-n%5E2%2F%282%282%5Eb%29%29%29%3B+b%3D256%3B+p%3D10%5E-12

We get 4.80826 x 10^32:

480826000000000000000000000000000

Leaving enough room for every current page on the internet to be replaced
by 5 zettabytes of hashed content.
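These figures are easy to check without Wolfram Alpha: for small p the
birthday approximation inverts to n ≈ sqrt(2 · 2^b · ln(1/(1-p))). A quick
sketch (the function name is mine):

```python
# Inverting p ≈ 1 - e^(-n^2 / (2 * 2^b)) for n, with H = 2^b:
# n ≈ sqrt(2 * H * ln(1 / (1 - p))), which for tiny p is ~sqrt(2 * H * p).
import math

def safe_file_count(bits: int, p: float) -> float:
    """Largest n keeping the random-collision probability near p."""
    return math.sqrt(2 * (2.0 ** bits) * -math.log1p(-p))

print(f"{safe_file_count(512, 1e-12):.3e}")  # ~1.6 x 10^71, as above
print(f"{safe_file_count(256, 1e-12):.3e}")  # ~4.8 x 10^32, as above
```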

----

We can further explore what size of hash could safely be used across the
internet while keeping the risk of a single random collision under 1 in a
trillion:
http://www.wolframalpha.com/input/?i=p%3D+1-e%5E%28-n%5E2%2F%282%282%5Eb%29%29%29%3B+n%3D10%5E15%3B+p%3D10%5E-12

Which we find is 139 bits. So if our only enemy is random chance (and not an
intentional attack), a 140-bit hash is just fine.
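Inverting the same birthday approximation for the hash width instead of the
file count gives that 139-bit figure directly (again a sketch; the function
name is mine):

```python
# Smallest b with 2^b >= n^2 / (2 * ln(1 / (1 - p))): the hash width
# keeping the random-collision risk for n files below about p.
import math

def bits_needed(n: float, p: float) -> int:
    return math.ceil(math.log2(n * n / (2 * -math.log1p(-p))))

print(bits_needed(1e15, 1e-12))  # 139
```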

Our only reason to use a 512-bit hash is to prevent intentional cache
poisoning for the next billion years. Nobody has found a collision in
SHA-2, even at 224 bits. Nor has anyone found one in 160-bit SHA-1, despite
a theoretical attack that lowers its collision search to roughly 2^60
operations. http://en.wikipedia.org/wiki/SHA-2

> The point of the hash is not to identify the resource, but to make sure
> no attacker could actually poison the cache by sending a fake file with
> the same name as a popular library to execute code on other websites
> (hashes are secure in the sense that it's hard to find another of the 999
> other files having the same hash, let alone to create another file that
> actually is valid javascript and has the same length and hash; this is
> generally impossible).

Why on earth would you need something other than a hash for identification
of a resource? How could something else possibly be more reliable?

>> Filename matching would make the feature unreliable, adding another
>> point of failure.

> I don't see how it makes the feature unreliable. DLLs are loaded based on
> filename on all OSes, and I don't think it has ever been an issue... By
> the way, .NET uses exactly the combination of filename+guid to identify
> DLLs and versions.

Don't let humans mess up something computers can handle. If you don't think
DLL hell exists, you should talk to someone in software. And .NET uses
hashes, not GUIDs; GUIDs are used in COM interop.

>> There's also no point to this feature if you move it to the HTTP layer;
>> it only provides a benefit if it (a) eliminates network traffic and (b)
>> securely identifies a file across multiple websites so that (c) browsers
>> can implement higher-order caching.

> Not true. Moving the feature to HTTP does not remove benefits (b) and
> (c), and conserves most of benefit (a). It also adds the possibility of
> accepting multiple versions of the same file on the server side, and of
> updating this in real time as new versions get supported, without
> modifying all the pages.

If the URIs differ, it can't go into HTTP.

>> Think about how *trivial* it is to implement a text editor plugin that
>> updates these hashes.

> Not an argument.

Reducing room for error and enabling automation is *absolutely* a good
argument for doing anything.



On Sun, Aug 11, 2013 at 8:19 AM, Glenn Adams <glenn@skynav.com> wrote:

> I haven't noticed anyone mention CSP 1.1 and the nonce attribute yet [1].
>  Sounds like folks (talking about security matters) should be looking there
> first instead of inventing something else. Though I admit that CSP is not
> explicitly dealing with caching semantics as such, but fetching semantics.
>
> [1]
> https://dvcs.w3.org/hg/content-security-policy/raw-file/tip/csp-specification.dev.html#usage-4
>
>
> On Sun, Aug 11, 2013 at 1:07 AM, Patrick H. Lauke <redux@splintered.co.uk>wrote:
>
>> On 11/08/2013 07:38, Andrew Herrington wrote:
>>
>>> Could this also be used with CSS (twitter bootstrap) and web component
>>> imports (twitter / Facebook feeds)?
>>>
>>
>> And bitmap images, SVG files, anything really?
>>
>> On first reading, there are many aspects I'm personally not a fan of
>> (browsers shipping with JS libraries pre-seeded, and the emphasis on just
>> JS), but as a more generalised principle, I could imagine something along
>> the lines of:
>>
>> - no pre-seeding
>> - every asset downloaded and cached by the browser gets some form of
>> hash/checksum/digital fingerprint (leaving the discussion of how to do this
>> effectively without clashes aside for a minute)
>> - as part of the request to the server, the browser also receives a
>> hash/checksum for the file being sent as part of the initial connection
>> negotiation and/or head request
>> - if the browser thinks that a file that is about to be downloaded is
>> already present in its own cache (hash/checksum matches, expire headers all
>> ok, etc), it uses its cached version rather than carry on with downloading
>> (regardless of origin?)
>>
>> P
>> --
>> Patrick H. Lauke
>> ______________________________________________________________
>> re·dux (adj.): brought back; returned. used postpositively
>> [latin : re-, re- + dux, leader; see duke.]
>>
>> www.splintered.co.uk | www.photographia.co.uk
>> http://redux.deviantart.com | http://flickr.com/photos/redux/
>> ______________________________________________________________
>> twitter: @patrick_h_lauke | skype: patrick_h_lauke
>> ______________________________________________________________
>>
>>
>
Received on Sunday, 11 August 2013 21:42:43 UTC

This archive was generated by hypermail 2.3.1 : Thursday, 29 October 2015 10:16:34 UTC