RE: CSP script hashes

This sounds good, but the point Mountie raised about UTF-8 not being suitable or common for some East Asian languages is important.

My main concern in suggesting a UTF-8-only requirement was to avoid any issues (security, performance, etc.) around the character-encoding sniffing and re-parsing rules. Perhaps this could be adequately addressed by just requiring an explicit charset in the Content-Type HTTP header or (slightly weaker against injections) as a <meta> in the <head>.
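
For example, either of these declarations would make the encoding explicit (illustrative only):

    Content-Type: text/html; charset=utf-8

or, in the markup itself:

    <meta charset="utf-8">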

-Brad

From: Bryan McQuade [mailto:bmcquade@google.com]
Sent: Tuesday, February 12, 2013 4:56 PM
To: Hill, Brad
Cc: Ian Melven; Jacob Hoffman-Andrews; Eric Chen; Nicholas Green; public-webappsec@w3.org; Yoav Weiss
Subject: Re: CSP script hashes

Thanks Brad. I very much agree with your summary and your points, especially the caution against designing something brittle with respect to computing hashes in the client. My intent is for us to come up with a basic proposal, then speak with browser implementors to get feedback on the feasibility of implementing it.

I do not think we can realistically expect each UA to be able to compute hashes of inline script blocks in the document with the document in its original encoding. The tokenization, tree-construction, and related subsystems almost certainly all expect the document to have been converted to a single well-known character encoding (likely UTF-8 or UTF-16/UCS-2).

I like your suggestion to restrict analysis to UTF-8, but perhaps instead of requiring the document to be UTF-8 encoded when served from the origin, we instead require that the server-side process for computing hashes of inline blocks, as part of constructing the contents of the CSP header, go something like this (a rough sketch follows the list):
1. identify the allowed inline blocks in the document
2. convert each inline block's contents to UTF-8
3. compute the hash of the UTF-8 encoded block
4. serve the original response in its native encoding, whatever the content author chose, but send the content hashes of the UTF-8 encoded blocks
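
Concretely, here is a rough sketch of steps 1-4 in Python. The helper names, the choice of SHA-256, the base64 output, and the 'sha256-...' source syntax are my assumptions for illustration, not settled details of the proposal:

    # Sketch only: assumes SHA-256 and base64-encoded digests; the spec
    # would have to pin down both choices.
    import base64
    import hashlib

    def hash_inline_block(block_text):
        # block_text is the block's contents as a decoded character
        # string. Step 2: encode it as UTF-8 regardless of the
        # document's own encoding; step 3: hash those bytes.
        utf8_bytes = block_text.encode("utf-8")
        digest = hashlib.sha256(utf8_bytes).digest()
        return base64.b64encode(digest).decode("ascii")

    def csp_header_value(inline_blocks):
        # Step 1 happens elsewhere: the server identifies the allowed
        # inline blocks while generating the page. Step 4: the document
        # is still served in its native encoding; only the hashes are
        # computed over the UTF-8 form.
        hashes = " ".join("'sha256-%s'" % hash_inline_block(b)
                          for b in inline_blocks)
        return "script-src %s" % hashes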

I believe we can expect the client to be capable of converting to UTF-8 and computing the hashes in the same way.

This does violate the priority of constituencies, as you note (we are putting implementors before authors here, adding complexity to the work authors must do to generate these hashes), but I think it is the right tradeoff given the constraints of this specific problem. For authors already serving UTF-8, no additional work is required.

My biggest open concern with this approach is verifying that there is a single canonical way to convert any given character stream into a UTF-8 byte stream. If there is more than one way to encode a given character stream as UTF-8, and no clear canonical encoding, then this approach is clearly problematic. I'm going to speak with some encoding experts about this, but if anyone on the list happens to know, that would save me some time. This page suggests that there is one canonical way to represent any given character stream as a UTF-8 byte stream, which is promising: http://stackoverflow.com/questions/4166094/can-i-get-a-single-canonical-utf-8-string-from-a-unicode-string.
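
For what it's worth, a small Python illustration of where ambiguity can actually creep in: UTF-8 encoding is one-to-one for a given code-point sequence, but visually identical text can be two different code-point sequences (Unicode normalization forms), which then hash differently:

    import hashlib
    import unicodedata

    nfc = "\u00e9"         # 'e-acute' as one precomposed code point
    nfd = "\u0065\u0301"   # 'e' followed by a combining acute accent

    assert nfc != nfd                                 # different code points
    assert unicodedata.normalize("NFC", nfd) == nfc   # same text to a reader

    print(nfc.encode("utf-8").hex())  # c3a9
    print(nfd.encode("utf-8").hex())  # 65cc81 -> different bytes
    print(hashlib.sha256(nfc.encode("utf-8")).hexdigest() ==
          hashlib.sha256(nfd.encode("utf-8")).hexdigest())  # False

So the encoding step itself is deterministic; the open question is whether client and server are guaranteed to see the same character stream in the first place.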

What do you think of this potential approach? I believe it does not introduce brittleness in user agent implementations as it should be very reasonable to expect each UA to be capable of converting the contents of script blocks to UTF-8. This conversion would only be necessary if the CSP header includes one or more hashes for inline scripts/styles.

On Tue, Feb 12, 2013 at 3:20 PM, Hill, Brad <bhill@paypal-inc.com> wrote:
> what is the rationale for preventing this beyond difficulty of implementation?
[Hill, Brad] I'm always the first one to invoke the priority of constituencies, but I think there's a real sense in which difficulty of implementation is the only interesting problem here, and directly related to the use-case goals of the feature.

How do we create a canonical set of bytes to represent script content inline in an HTML document that is unambiguous and yet not brittle across multiple implementations and (importantly) future implementations?

We're taking dependencies on a core and complex part of HTML here. We should expect HTML to continue to evolve, and for the pressures on it to be stronger than any back-pressure we can put on it on behalf of script-hash.

If we design something that is brittle, constrictive or otherwise problematic in the face of the evolution of core document parsing, we should expect script-nonce will fail and get left behind.
