- From: Chris Bojarski <chris@cbojar.net>
- Date: Sat, 10 Aug 2013 21:26:35 -0400
- To: public-html@w3.org
I will try to address some of these objections/alternatives one by one.

First, there really cannot be a system where a file can be downloaded from one arbitrary site and then trusted for use on any other site. That is a textbook cache-poisoning scenario. To really trust something, it (or a verification of it) would need to come from a trusted source. I understand that, if we could do this, it might seem to offer some benefits, but the downsides heavily outweigh them. Some of those benefits are also merely illusory. For example, François' major objection is that browsers would be picking winners and losers under my proposed system, but in truth a high-priority cross-site cache would produce the same effect. Because the most popular libraries _are_ popular, they would be encountered most often anyway and would get the same speed-up after the first download as under my proposal. Less popular libraries would be penalized just as much under either system, since they would be encountered less frequently. For popular libraries, the first site would still be penalized with the initial download, and you would introduce significant security concerns on top of that. Conversely, winners and losers could still be chosen by popular home pages (e.g. google.com) by loading those scripts in the background as soon as the browser opens that home page, but this would happen less transparently and could be more easily bent toward a particular commercial or otherwise non-neutral bias. (For example, google.com could silently load all the scripts for Google, Gmail, and GDrive, making those sites look much faster, while Yahoo! Mail would get no such benefit simply because it is not the home page.)

Second, Nathaneal's suggestion is interesting, but it brings its own challenges. By mandating any particular cryptographic algorithm, one places an immediate expiration date on the resulting standard. Today's uncrackable codes are tomorrow's ROT13s, so even with an algorithm that is cryptographically strong today, there would still be enormous security concerns. We would also have to define "decompressed, encoding-normalized file contents." I'll start: tabs or spaces? 4 spaces or 2? CRLF, CR, or LF? I'm not saying such a standard would be impossible to reach eventually, but it would be a significant process full of holy wars that really don't need to be fought. Browsers would then have to implement a system for normalizing downloaded files before hashing them, introducing more (and more cumbersome) code and overhead, and possibly new vectors of attack on the browsers themselves. (This is where the input of a browser vendor would be helpful to confirm or refute that.) Library developers would then have to understand how to produce a "decompressed, encoding-normalized" version of their files. The authors of large libraries might have a grasp of how to do this, but smaller library authors might find it too complicated or cumbersome. They would then have to communicate all of this to web developers, telling them that to (maybe) speed up their pages they have to include an incomprehensible chunk of code gibberish. And the "maybe" comes from the fact that web developers have to assume their page is not the first one the browser encounters with that script; otherwise there is no benefit to them, because the file would still have to be downloaded just as before. This gets back to the other objection about favoritism of popular libraries.
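To make the normalization concern concrete, here is a minimal sketch (my own illustration in Python, not part of any proposal; the script contents are invented): the "same" one-line script, saved with different line endings or indentation, yields completely different SHA-512 digests unless a normalization step is defined and applied identically by everyone first.

    # Illustration only: byte-level differences that a human would call
    # "the same file" produce unrelated SHA-512 digests.
    import hashlib

    variants = {
        "lf + spaces":   b"function hi() {\n    return 'hi';\n}\n",
        "crlf + spaces": b"function hi() {\r\n    return 'hi';\r\n}\r\n",
        "lf + tab":      b"function hi() {\n\treturn 'hi';\n}\n",
    }

    for name, source in variants.items():
        print(name, hashlib.sha512(source).hexdigest()[:16] + "...")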
I see no reason to fill my pages with pseudo-random base-64 strings in the hope that someone else takes the download hit before I do.

There was also talk of ensuring universal file names and of pushing some aspects of this into HTTP. As for the HTTP part, if that were done it would offer almost no benefit over standard caching, while bringing with it the normalizing and hashing overhead mentioned above, so the point would be moot. It would also involve a whole other set of standards changes that would become complicated, and it would require changes to HTTP servers, something system administrators would be loath to make. As for universal file names, I've seen jquery.js, jquery.min.js, jquery-1.8.1.js, jquery-1.8.1.min.js, and the ever-popular script.js all used to refer to the same file. People don't like that kind of constriction; if they did, we could already be using file names to accomplish this.

I appreciate all the feedback from you guys, and I'm glad that you're looking at this from all sorts of angles I hadn't considered. I see two major concerns emerging: security (which I expected, though it is being taken in directions I didn't expect) and favoritism/neutrality (which I also expected, and want more feedback about, since this is a particularly hard wall). Security will be an ongoing concern any time anything is shared between two parties, and favoritism can be hard to negate, as fighting it can sometimes result in more or different favoritism, or other unintended consequences. Transparency, of course, can help address both issues, but we need to make sure we build a good house on a good foundation. And the only way we get there is to have smarter people than I keep talking. :)

-Chris.

On 08/10/2013 07:46 PM, François REMY wrote:
>> A 512-bit SHA-2 hash cannot have conflicts.
> Wait, what? This is absurd. Every hash system has conflicts, by definition. You can even calculate how many conflicts exist: you simply divide the number of possible files of a certain length by the number of possible hashes. Suppose all JS files are exactly 64 kilobytes long; then an enormous number of files must share each hash. Arguably, the probability that those files are JavaScript files is very low (most of them will be garbage), but you cannot base your system on a hash alone; it doesn't make sense.
>
> The point of the hash is not to identify the resource, but to make sure no attacker could poison the cache by sending a fake file with the same name as a popular library in order to execute code on other websites (hashes are secure in the sense that it is hard to find another file with the same hash, let alone to create one that is also valid JavaScript with the same length and hash; that is generally impossible).
>
>> Filename matching would make the feature unreliable, adding another point of failure.
> I don't see how it makes the feature unreliable. DLLs are loaded by filename on all OSes, and I don't think that has ever been an issue... By the way, .NET uses exactly the combination of filename+GUID to identify DLLs and their versions.
>
>> There's also no point to this feature if you move it to the HTTP layer; it only provides a benefit if it (a) eliminates network traffic and (b) securely identifies a file across multiple websites so that (c) browsers can implement higher-order caching.
> Not true. Moving the feature to HTTP does not remove benefits (b) and (c), and it conserves most of benefit (a). It also adds the possibility of accepting multiple versions of the same file on the server side, and of updating this in real time as new versions become supported, without modifying all the pages.
>
>> Think about how *trivial* it is to implement a text editor plugin that updates these hashes.
> Not an argument.
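As a rough, purely illustrative sketch of the pigeonhole counting in the quoted reply above (the 64 KiB file size and 512-bit digest length are just the figures from that example, not part of any proposal):

    # Back-of-the-envelope pigeonhole count: on average, how many distinct
    # 64 KiB files map onto each 512-bit digest. Illustrative figures only.
    FILE_BITS = 64 * 1024 * 8    # 64 KiB = 524,288 bits -> 2**524288 possible files
    DIGEST_BITS = 512            # SHA-512 -> 2**512 possible digest values

    # Dividing possible files by possible digests leaves, on average,
    # 2**(524288 - 512) same-length preimages per digest. Collisions exist
    # in abundance; deliberately finding one is what remains infeasible.
    print(f"~2**{FILE_BITS - DIGEST_BITS} possible 64 KiB files per SHA-512 digest")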
Received on Sunday, 11 August 2013 01:27:03 UTC