Interaction of CSP and IRIs from Boris Zbarsky on 2012-09-06 (public-webappsec@w3.org from September 2012)

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Thu, 06 Sep 2012 17:39:55 -0400
To: public-webappsec@w3.org
Message-ID: <5049182B.5030700@mit.edu>

Dear all,

I was just reading through the CSP draft, and I'm very concerned by the 
handling of non-ASCII characters in CSP.  Specifically, I'm concerned 
about four things:

A)  Lack of description for how one goes from an IRI or partial IRI to a
     host-source expression.
B)  Lack of description for how one compares a source expression to an
     IRI.
C)  Lack of description for how one goes from a Unicode string to
     policy.
D)  The fact that the current setup is likely to cause interop problems.

As far as I can tell, the current setup is as follows:

1)  All CSP policies are made up of bytes in the ASCII range (and in 
particular, a subset of that range).  Non-ASCII hostnames are expected 
to be encoded as punycode, I guess (though this is not actually stated 
anywhere; see concern A above).  Non-ASCII characters in paths are 
presumably expected to be %-encoded, but the specification doesn't say 
what encoding should be used for this (concern A again).  In practice, 
by the way, at least one implementation allows non-ASCII bytes in paths, 
though I think the spec is pretty clear that as things stand this is not 
allowed.
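
To illustrate the hostname half of point 1, here is a minimal Python 
sketch of the punycode step an author would have to perform by hand.  It 
assumes Python's built-in "idna" codec (which implements IDNA 2003) is an 
acceptable stand-in for whatever mapping the spec would eventually 
require; the function name and the hostname are made up for the example.

```python
# Sketch only: convert a non-ASCII hostname to its ASCII ("xn--") form.
# Assumption: IDNA 2003 via Python's built-in "idna" codec; the CSP draft
# does not say which mapping to use.
def host_to_ascii(host: str) -> str:
    # Each dot-separated label is converted on its own by the codec.
    return host.encode("idna").decode("ascii")


# Hypothetical hostname, used only for illustration.
print(host_to_ascii("bücher.example"))  # xn--bcher-kva.example
```

A policy author would then write the "xn--" form in the host-source 
expression, which is exactly the step the draft leaves unstated.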

2)  When comparing a source expression to an IRI, the IRI needs to first 
be converted to a URI, presumably per RFC 3987.  If the presumption is 
correct, this should probably be explicitly called out (concern B above).
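
As a rough sketch of what point 2 implies, the following Python 
approximates the IRI-to-URI mapping: punycode for the host and 
%-encoding of the UTF-8 bytes for everything else.  This is a 
simplification, not a full RFC 3987 implementation; iri_to_uri is a 
hypothetical helper, userinfo is deliberately dropped, and the sample IRI 
is invented.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched: existing %-escapes plus URI reserved
# characters that are legal in a path or query.
_SAFE = "/%:@!$&'()*+,;="


def iri_to_uri(iri: str) -> str:
    """Approximate IRI -> URI mapping (sketch, not a full RFC 3987 one)."""
    parts = urlsplit(iri)
    # Host: IDNA / punycode (assumption: Python's IDNA 2003 codec).
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Path, query, fragment: %-encode the UTF-8 bytes of non-ASCII characters.
    path = quote(parts.path, safe=_SAFE, encoding="utf-8")
    query = quote(parts.query, safe=_SAFE + "?", encoding="utf-8")
    fragment = quote(parts.fragment, safe=_SAFE + "?", encoding="utf-8")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("https://bücher.example/straße?q=é"))
```

Only after a step like this can the result be compared against an 
ASCII-only source expression.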

3)  When converting a Unicode string to a policy, presumably one does it 
by taking the numeric value of each codepoint and treating it as an 
ASCII character index?  If so, this should be explicitly called out 
(concern C above).
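
The conversion guessed at in point 3 could look like the Python sketch 
below.  It is my reading of the presumed behavior, not something the 
draft states, and policy_to_ascii is a hypothetical name.  Note that the 
mapping is simply undefined for any codepoint above 0x7F.

```python
def policy_to_ascii(policy: str) -> bytes:
    """Map each codepoint to the ASCII byte with the same numeric value."""
    out = bytearray()
    for ch in policy:
        cp = ord(ch)
        if cp > 0x7F:
            # No ASCII character has this index, so there is nothing to map to.
            raise ValueError(f"U+{cp:04X} has no ASCII equivalent")
        out.append(cp)
    return bytes(out)


print(policy_to_ascii("img-src example.com"))
```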

In practice, I expect people to just call their favorite escape() method 
on their strings if they have to shoehorn them into an ASCII format, 
which means that we'll get a mix of %-encoding as ISO-8859-1 and 
UTF-8 at the very least, and very possibly others.  The result will be 
lack of interop (concern D).
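
To make the interop risk concrete, this short Python sketch %-encodes 
the same path under the two encodings mentioned above.  The path is an 
invented example, and urllib.parse.quote stands in for whatever escape 
routine an author happens to reach for.

```python
from urllib.parse import quote, unquote

path = "/café"  # hypothetical path containing one non-ASCII character

as_utf8 = quote(path, encoding="utf-8")
as_latin1 = quote(path, encoding="iso-8859-1")

print(as_utf8)    # /caf%C3%A9
print(as_latin1)  # /caf%E9

# The two escaped forms differ, so a byte-wise comparison of policy and
# request path fails if author and user agent picked different encodings.
print(as_utf8 == as_latin1)  # False

# Decoding the ISO-8859-1 form as UTF-8 does not give the original back.
print(unquote(as_latin1, encoding="utf-8") == path)  # False
```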

It seems to me that a lot of these problems would be alleviated if CSP 
policies were defined as sequences of Unicode codepoints, with a 
comparison function to IRIs.  The spec would also need to define how to 
construct such a sequence of Unicode codepoints from a 
Content-Security-Policy HTTP header or a 
Content-Security-Policy-Report-Only HTTP header, but the result would be 
to allow authors to use strings that actually make sense to them in CSP 
policies instead of shoehorning them into an ASCII-only format in 
likely-broken ways.

Thank you for taking the time to read all that,
Boris

Received on Thursday, 6 September 2012 21:40:24 UTC