Re: Interaction of CSP and IRIs

On 9/6/12 6:57 PM, Adam Barth wrote:
> CSP 1.0 operates in terms of URIs, not IRIs

Yes, but the rest of the world operates on IRIs, which gives you an 
impedance mismatch.

>> A)  Lack of description for how one goes from an IRI or partial IRI to a
>>      host-source expression.

Really?  Authors never have to go from something that makes sense to 
them (an IRI, which has their non-ASCII hostname and non-ASCII path in 
it) to a host-source expression?

I have a really hard time believing that.  Do only authors of sites with 
ASCII hostnames have to write host-source expressions?  Do authors never 
have to write host-source expressions?  Am I just totally missing something?

>> B)  Lack of description for how one compares a source expression to an
>>      IRI.
>
> That never occurs.

Why not?  Everything else a browser has lying around (e.g. document 
locations) is IRIs.  Are host-source expressions never compared to 
document locations?

>> C)  Lack of description for how one goes from a Unicode string to
>>      policy.
>
> That never occurs.

It sure will for policies in <meta> tags, which I've seen people trying 
to implement.  Sounds like that's 1.1?

>> Non-ASCII hostnames are expected to be
>> encoded as punycode, I guess (though this is not actually stated anywhere;
>> see concern A above).
>
> We're operating in terms of URIs, so the notion of "punycode" doesn't occur.

It occurs as soon as someone has to produce a URI.
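
To make that concrete, here is a minimal Python sketch of the step an author has to take before a non-ASCII hostname can appear in a URI at all. The hostname is made up for illustration, and Python's built-in "idna" codec implements IDNA 2003, which may differ from what a given UA does:

```python
# Hypothetical hostname ("bücher.example"), not taken from any real policy.
host = "b\u00fccher.example"

# The built-in "idna" codec (IDNA 2003) applies ToASCII to each label.
ascii_host = host.encode("idna").decode("ascii")

print(ascii_host)  # xn--bcher-kva.example
```

Nothing in the draft tells the author that this is the form a host-source expression is supposed to take.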

>> Non-ASCII characters in paths are presumably expected to
>> be %-encoded, but the specification doesn't say what encoding should be used
>> for this (concern A again).
>
> Are your comments about CSP 1.0 or 1.1?  We don't do anything with
> paths in CSP 1.0.

I was reading the editor's draft Google found, which seems to be 1.1. 
For 1.0, looks like there's no path stuff in the syntax, indeed.

>> In practice, by the way, at least one
>> implementation allows non-ASCII bytes in paths, though I think the spec is
>> pretty clear that as things stand this is not allowed.
>
> Would you be willing to contribute a test case to that effect so we
> can catch these sorts of bugs?

I can try.  What format do you take tests in?

>> 2)  When comparing a source expression to an IRI, the IRI needs to first be
>> converted to a URI, presumably per RFC 3987.
>
> We never compare a source expression to an IRI.  We only compare
> source expressions to URIs.

"You" don't, but UAs have to.  Which brings me back to what I said about 
having to convert the IRI to a URI.
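
A rough sketch of what that conversion involves, in Python. This is a simplification under stated assumptions, not the full RFC 3987 mapping: the function name `iri_to_uri` and the example IRI are hypothetical, the input is assumed to be an absolute IRI with a host, userinfo is dropped, and the host is handled with Python's IDNA 2003 codec:

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left alone when escaping: RFC 3986 reserved characters plus "%"
# so that existing percent-escapes are not double-encoded.
SAFE = "/:@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Simplified IRI-to-URI mapping (hypothetical helper, not from the spec)."""
    parts = urlsplit(iri)
    # Host: non-ASCII labels become punycode ("xn--...") labels.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Everything else: non-ASCII characters are %-encoded as UTF-8 bytes.
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE + "?")
    fragment = quote(parts.fragment, safe=SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


# Hypothetical IRI: "http://bücher.example/straße"
print(iri_to_uri("http://b\u00fccher.example/stra\u00dfe"))
# http://xn--bcher-kva.example/stra%C3%9Fe
```

The host and the path go through two entirely different encodings, which is the sort of thing I would expect some specification to spell out.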

>> If the presumption is correct,
>> this should probably be explicitly called out (concern B above).
>
> Defining how a user agent ought to translate an IRI to a URI is
> outside the scope of this document.

OK, so any specification that actually wants to hook into CSP needs to 
define this, right?

I can live with that as long as those other specifications actually do.

>> 3)  When converting a Unicode string to a policy, presumably one does it by
>> taking the numeric value of each codepoint and treating it as an ASCII
>> character index?  If so, this should be explicitly called out (concern C
>> above).
>
> Are your comments about CSP 1.0 or 1.1?  We don't ever convert a
> Unicode string to a policy in CSP 1.0.  We do that in CSP 1.1, and I
> agree that we should add some further explanation of how to do that to
> 1.1.

See above about which draft I was reading.

>> In practice, I expect people to just call their favorite escape() method on
>> their strings if they have to shoehorn them into an ASCII format, which
>> means that we'll get a mix of %-encoding as ISO-8859-1 and as UTF-8 at the
>> very least, and very possibly others.  The result will be lack of interop
>> (concern D).
>
> I'm not sure how to respond to this statement.  Presumably they'll do
> what the CSP 1.1 specification says to do.

I can 100% guarantee that a majority of people writing CSP policies 
won't read the specification.  They'll look up some examples, then just 
write something.

Note that the above paragraph was no longer talking about the 
"converting a Unicode string to a policy" bit.  It was just talking 
about how authors are supposed to create these policy byte arrays.
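
As an illustration of the mismatch I mean, here is a small Python sketch. The path is made up, and `urllib.parse.quote` stands in for whatever escape() method an author happens to reach for:

```python
from urllib.parse import quote

# Hypothetical path ("/café") that an author wants in a policy.
path = "/caf\u00e9"

# Same string, two plausible %-encodings depending on the charset assumed.
as_latin1 = quote(path, encoding="iso-8859-1")
as_utf8 = quote(path, encoding="utf-8")

print(as_latin1)  # /caf%E9
print(as_utf8)    # /caf%C3%A9
```

Both results are pure ASCII and both look reasonable to the author, but they are different byte sequences and won't match each other.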

>> It seems to me that a lot of these problems would be alleviated if CSP policies
>> were defined as sequences of Unicode codepoints, with a comparison function
>> to IRIs.
>
> I disagree.  IRIs are an interop can of worms.  We can paper over the
> problem in various ways.  The way we're currently papering over the
> problem is to ignore IRIs entirely and work only with URIs.

What that seems to do is just push the problem out elsewhere at best. 
At worst it means that this will all only be usable for ASCII hostnames 
and ASCII paths...

Is there any resource I can read to see what the problems with IRIs are 
in practice?  It's possible that they're worse than what I'm worried 
about here, but they'd have to be pretty bad.

> in CSP 1.0, the only way to author policies is via HTTP headers, which
> are not defined in terms of Unicode

While true, that doesn't prevent people from sending random non-ASCII 
bytes in HTTP headers, as you well know...

What's the general timeframe on 1.1?

-Boris

Received on Friday, 7 September 2012 01:19:33 UTC