- From: Nathanael D. Jones <nathanael.jones@gmail.com>
- Date: Sun, 11 Aug 2013 17:41:54 -0400
- To: François REMY <francois.remy.dev@outlook.com>
- Cc: "Patrick H. Lauke" <redux@splintered.co.uk>, HTML WG LIST <public-html@w3.org>, Glenn Adams <glenn@skynav.com>
- Message-ID: <CAG3DbfVtKz0psS7f_wFQ-r7y4Xr1J7xGmuh-OK3o7L+3HPkzwQ@mail.gmail.com>
Francois, please do *research* before spreading FUD about hashes; they're already poorly understood by the general public.

* * *

*TLDR: We'd have to store 1 trillion petabytes per atom on earth to have a 1 in a trillion chance at a random collision in a 512-bit space. Any other identifier is redundant, pointless, and error-prone.*

A 256-bit SHA-2 hash is also *just fine*, and should be safe from cache-poisoning attempts for a very, very long time. As a 256-bit hash is 32 bytes instead of 64, it may make sense to support both.

Since this is an opt-in feature, high-security sites can always choose not to use it... but they'd have to be rather superstitious to believe that the first SHA-2 hash collision would be used to exploit their site instead of achieving worldwide fame. If a fundamental weakness is ever found in SHA-2, browsers can simply disable the optimization. Not hard, and easily versioned.

There are a lot of good syntaxes available:

  script sha2-256="value" sha2-512="value" sha3-512="value"
  script hash2="(256-bit SHA-2 hash)" hash3="(512-bit SHA-2 hash)" hash4="(512-bit SHA-3 hash)"
  script hash="sha2-256:base64u-hash;sha2-512:base64u-hash"

-----

> Wait, what? This is absurd. Every hash system has conflicts, by definition.
> You can even calculate how many conflicts there exist: you simply divide the
> number of possible files of a certain length by the number of possible hashes.
> Let's suppose all JS files are exactly 64 kilobytes long; there are exactly
> 1000 files that share the same hash. Arguably, the probability that those
> files are javascript files is very low (most of them will be garbage), but you
> cannot base your system on something that's only based on a hash; it doesn't
> make sense.

*False*. There are not 1000 files that share that hash; there are potentially 1,000 64 KB sequences of binary data that may produce the same hash, but those sequences do *NOT* exist in the real world, and *cannot be intentionally discovered in a trillion years of linear computation, even with massive resources*.

Let's review how many files can be uniquely identified by a 512-bit SHA-2 hash (about 1.3 × 10^154 possible values):

13407807929942597099574024998205846127479365820592393377723561443721764030073546976801874298166903427690031858186486050853753882811946569946433649006084096

And by a 256-bit SHA-2 hash (about 1.2 × 10^77):

115792089237316195423570985008687907853269984665640564039457584007913129639936

Google has crawled 1 trillion unique URLs (10^12):

1000000000000

Even if all of those URLs had 1,000 sub-resources each, that's only 10^15:

1000000000000000

Let's try to prove your point with the birthday paradox (http://en.wikipedia.org/wiki/Birthday_attack). We can estimate the number of files (n) that can safely be handled for a given hash size in bits (b), with H = 2^b possible hash values, and an acceptable collision probability (p) using:

p(n; H) ≈ 1 - e^(-n(n-1)/(2H)) ≈ 1 - e^(-n^2/(2H))

http://www.wolframalpha.com/input/?i=p%3D+1-e%5E%28-n%5E2%2F%282%282%5Eb%29%29%29%3B+b%3D512%3B+p%3D10%5E-12

The result for a 512-bit hash and p = 10^-12 is about 1.6 × 10^71:

163617000000000000000000000000000000000000000000000000000000000000000000

Even under these worst-case assumptions, you're off by more than 55 orders of magnitude; plenty of room for the internet to grow.
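If you want to check these birthday-bound numbers yourself without WolframAlpha, here is a minimal Python sketch of the same approximation, solved for n; the helper name is purely illustrative, and it also reproduces the 256-bit and 139-bit figures I use further down:

    # Back-of-the-envelope check of the birthday-bound figures in this message.
    # Solves p ~= 1 - e^(-n^2 / (2H)) for n, with H = 2^bits.
    import math

    def safe_file_count(bits, p):
        """Approximate number of randomly hashed files that can exist before
        the probability that ANY two of them collide reaches p."""
        H = 2 ** bits                              # size of the hash output space
        return math.sqrt(-2 * H * math.log1p(-p))  # n ~= sqrt(2 * H * ln(1/(1-p)))

    print("%.3e" % safe_file_count(512, 1e-12))    # ~1.6e71 (the 512-bit figure above)
    print("%.3e" % safe_file_count(256, 1e-12))    # ~4.8e32 (the 256-bit figure below)

    # Smallest hash size that keeps 10^15 files under a 1-in-a-trillion collision risk:
    print(int(math.ceil(math.log2((10 ** 15) ** 2 / (2 * 1e-12)))))  # 139 bits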
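Separately, to make the authoring side concrete (I come back to the text-editor-plugin point further down): here is a rough sketch of how a build step or editor plugin could fill in the SHA-2 values for the first syntax above. The attribute names and the unpadded base64url encoding are just assumptions taken from the examples in this message, not any agreed format, and the file name is only an example.

    # Sketch of computing hypothetical sha2-256 / sha2-512 attribute values for
    # a local script file. The attribute names and the unpadded base64url
    # encoding are assumptions based on the syntax examples above, not a spec.
    import base64
    import hashlib

    def script_hash_attributes(path):
        with open(path, "rb") as f:
            data = f.read()

        def b64u(digest):
            # base64url without padding, to keep the value attribute-friendly
            return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

        return 'sha2-256="%s" sha2-512="%s"' % (
            b64u(hashlib.sha256(data).digest()),
            b64u(hashlib.sha512(data).digest()),
        )

    # Example: the string an editor plugin would splice into the script tag.
    print(script_hash_attributes("jquery.min.js"))   # file name is just an example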
*We'd have to store 1 trillion petabytes per atom on earth to have a 1 in a trillion chance at a random collision.*

For a 256-bit hash:

http://www.wolframalpha.com/input/?i=p%3D+1-e%5E%28-n%5E2%2F%282%282%5Eb%29%29%29%3B+b%3D256%3B+p%3D10%5E-12

we get about 4.80826×10^32:

480826000000000000000000000000000

leaving enough room for every current page on the internet to be replaced by 5 zettabytes of hashed content.

----

We can explore further and ask what size hash could safely be used on the internet while keeping the risk of a single random collision under 1 in a trillion:

http://www.wolframalpha.com/input/?i=p%3D+1-e%5E%28-n%5E2%2F%282%282%5Eb%29%29%29%3B+n%3D10%5E15%3B+p%3D10%5E-12

which works out to 139 bits. So if our only enemy is random chance (and not an intentional attack), a 140-bit hash is just fine. Our only reason to use a 512-bit hash is to prevent intentional cache poisoning for the next billion years.

Nobody has found a collision in SHA-2, even at 224 bits. Nor has anyone found one in SHA-1 (a 160-bit hash), although theoretical collision attacks of roughly 2^60 work are known.
http://en.wikipedia.org/wiki/SHA-2

> The point of the hash is not to identify the resource, but to make sure no
> attacker could actually poison the cache by sending a fake file with the same
> name as a popular library to execute code on other websites (hashes are
> secure in the sense that it's hard to find another of the 999 other files
> having the same hash, let alone to create another file that actually is valid
> javascript and has the same length and hash; this is generally impossible).

Why on earth would you need something other than a hash for identification of a resource? How could something else possibly be more reliable?

>> Filename matching would make the feature unreliable, adding another point of failure.
>
> I don't see how it makes the feature unreliable. DLLs are loaded based on
> filename on all OSes and I don't think it has ever been an issue... By the
> way, .NET uses exactly the combination of filename+guid to identify DLLs and
> versions.

Don't let humans mess up something computers can handle. If you don't think DLL hell exists, you should talk to someone in software. And .NET uses hashes, not GUIDs; GUIDs are used in COM interop.

>> There's also no point to this feature if you move it to the HTTP layer; it only
>> provides a benefit if it (a) eliminates network traffic and (b) securely
>> identifies a file across multiple websites so that (c) browsers can implement
>> higher-order caching.
>
> Not true. Moving the feature to HTTP does not remove the (b) and (c) benefits,
> and conserves most of the (a) benefits. It also adds the possibility to accept
> multiple versions of a same file on the server side, and to update this in
> real time as new versions get supported, without modifying all the pages.

If the URIs differ, it can't go into HTTP.

>> Think about how *trivial* it is to implement a text editor plugin that updates these hashes.
>
> Not an argument.

Reducing room for error and enabling automation is *absolutely* a good argument for doing anything.

On Sun, Aug 11, 2013 at 8:19 AM, Glenn Adams <glenn@skynav.com> wrote:

> I haven't noticed anyone mention CSP 1.1 and the nonce attribute yet [1].
> Sounds like folks (talking about security matters) should be looking there
> first instead of inventing something else. Though I admit that CSP is not
> explicitly dealing with caching semantics as such, but fetching semantics.
>
> [1]
> https://dvcs.w3.org/hg/content-security-policy/raw-file/tip/csp-specification.dev.html#usage-4
>
> On Sun, Aug 11, 2013 at 1:07 AM, Patrick H.
> Lauke <redux@splintered.co.uk> wrote:
>
>> On 11/08/2013 07:38, Andrew Herrington wrote:
>>
>>> Could this also be used with CSS (twitter bootstrap) and web component
>>> imports (twitter / Facebook feeds)?
>>
>> And bitmap images, SVG files, anything really?
>>
>> On first reading, there are many aspects I'm personally not a fan of
>> (browsers shipping with JS libraries pre-seeded, and the emphasis on just
>> JS), but as a more generalised principle, I could imagine something along
>> the lines of:
>>
>> - no pre-seeding
>> - every asset downloaded and cached by the browser gets some form of
>> hash/checksum/digital fingerprint (leaving the discussion of how to do this
>> effectively without clashes aside for a minute)
>> - as part of the request to the server, the browser also receives a
>> hash/checksum for the file being sent as part of the initial connection
>> negotiation and/or head request
>> - if the browser thinks that a file that is about to be downloaded is
>> already present in its own cache (hash/checksum matches, expire headers all
>> ok, etc), it uses its cached version rather than carry on with downloading
>> (regardless of origin?)
>>
>> P
>> --
>> Patrick H. Lauke
>> ________________________________________________________________
>> re·dux (adj.): brought back; returned. used postpositively
>> [latin : re-, re- + dux, leader; see duke.]
>>
>> www.splintered.co.uk | www.photographia.co.uk
>> http://redux.deviantart.com | http://flickr.com/photos/redux/
>> ________________________________________________________________
>> twitter: @patrick_h_lauke | skype: patrick_h_lauke
>> ________________________________________________________________
Received on Sunday, 11 August 2013 21:42:43 UTC