Re: Interaction of CSP and IRIs

On Thu, Sep 6, 2012 at 6:19 PM, Boris Zbarsky <bzbarsky@mit.edu> wrote:
> On 9/6/12 6:57 PM, Adam Barth wrote:
>> CSP 1.0 operates in terms of URIs, not IRIs
>
> Yes, but the rest of the world operates on IRIs, which gives you an
> impedance mismatch.

HTTP operates in terms of URIs.

>>> A)  Lack of description for how one goes from an IRI or partial IRI to a
>>>      host-source expression.
>
> Really?  Authors never have to go from something that makes sense to them
> (an IRI, which has their non-ASCII hostname and non-ASCII path in it) to a
> host-source expression?
>
> I have a really hard time believing that.  Do only authors of sites with
> ASCII hostnames have to write host-source expressions?  Do authors never
> have to write host-source expressions?  Am I just totally missing something?

I'm not sure I understand your question.  Authors deal with
host-source expressions the same way they deal with the HTTP Host header.

>>> B)  Lack of description for how one compares a source expression to an
>>>      IRI.
>>
>> That never occurs.
>
> Why not?  Everything else a browser has lying around (e.g. document
> locations) is IRIs.  Are host-source expressions never compared to document
> locations?

In the end, the browser needs to translate IRIs into URIs for use in
HTTP.  Everything in CSP 1.0 is defined in terms of networking
operations, and therefore can be defined in the same terms as HTTP.
For CSP 1.1, the situation gets a bit more complicated, but we can
defer worrying about that until then.

>>> C)  Lack of description for how one goes from a Unicode string to
>>>      policy.
>>
>> That never occurs.
>
> It sure will for policies in <meta> tags, which I've seen people trying to
> implement.  Sounds like that's 1.1?

Correct.

>>> Non-ASCII hostnames are expected to be
>>> encoded as punycode, I guess (though this is not actually stated
>>> anywhere; see concern A above).
>>
>> We're operating in terms of URIs, so the notion of "punycode" doesn't
>> occur.
>
> It occurs as soon as someone has to produce a URI.

Indeed, but that's outside the scope of CSP 1.0.

>>> Non-ASCII characters in paths are presumably expected to
>>> be %-encoded, but the specification doesn't say what encoding should be
>>> used for this (concern A again).
>>
>> Are your comments about CSP 1.0 or 1.1?  We don't do anything with
>> paths in CSP 1.0.
>
> I was reading the editor's draft Google found, which seems to be 1.1. For
> 1.0, looks like there's no path stuff in the syntax, indeed.
>
>>> In practice, by the way, at least one
>>> implementation allows non-ASCII bytes in paths, though I think the spec
>>> is pretty clear that as things stand this is not allowed.
>>
>> Would you be willing to contribute a test case to that effect so we
>> can catch these sorts of bugs?
>
> I can try.  What format do you take tests in?

Actually, if your issue is with the WebKit implementation, you can
just file a bug and I'll write a test in the course of fixing it.

>>> 2)  When comparing a source expression to an IRI, the IRI needs to first
>>> be converted to a URI, presumably per RFC 3987.
>>
>> We never compare a source expression to an IRI.  We only compare
>> source expressions to URIs.
>
> "You" don't, but UAs have to.  Which brings me back to what I said about
> having to convert the IRI to a URI.

We don't define that in the specification.  I'd recommend using the
same algorithm for converting IRIs to URIs that you already use for
HTTP.  Yes, that doesn't interoperate between browsers, but that's not
a problem we're going to solve here.
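
As a rough illustration of the kind of IRI-to-URI step being discussed,
here is a minimal Python sketch.  It is not what any browser or the CSP
specification does; `iri_to_uri` and the `_SAFE` character set are
hypothetical names chosen for this example.  It assumes the host is
encoded with Python's built-in "idna" codec (which implements IDNA2003)
and that path, query and fragment are %-encoded as UTF-8.  Userinfo is
dropped for brevity.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when %-encoding (an assumption for this sketch).
# Including "%" means escapes that are already present are not re-encoded.
_SAFE = "/%:@!$&'()*+,;=~"


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: map an IRI to an ASCII-only URI."""
    parts = urlsplit(iri)
    # Python's built-in "idna" codec implements IDNA2003 (RFC 3490).
    host = (parts.hostname or "").encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=_SAFE),            # UTF-8 is quote()'s default
        quote(parts.query, safe=_SAFE + "?"),
        quote(parts.fragment, safe=_SAFE + "?"),
    ))


print(iri_to_uri("http://bücher.example/päth"))
# prints http://xn--bcher-kva.example/p%C3%A4th
```

A different user agent could legitimately pick another IDNA variant or
another byte encoding for the path, which is exactly the interop gap
mentioned above.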

>>> If the presumption is correct,
>>> this should probably be explicitly called out (concern B above).
>>
>> Defining how a user agent ought to translate an IRI to a URI is
>> outside the scope of this document.
>
> OK, so any specification that actually wants to hook into CSP needs to
> define this, right?
>
> I can live with that as long as those other specifications actually do.

Converting IRIs to URIs is a big mess.  It's something that needs to
be solved for the whole platform.  I tried to solve it a couple years
ago, but gave up.

>>> 3)  When converting a Unicode string to a policy, presumably one does it
>>> by taking the numeric value of each codepoint and treating it as an ASCII
>>> character index?  If so, this should be explicitly called out (concern C
>>> above).
>>
>> Are your comments about CSP 1.0 or 1.1?  We don't ever convert a
>> Unicode string to a policy in CSP 1.0.  We do that in CSP 1.1, and I
>> agree that we should add some further explanation of how to do that to
>> 1.1.
>
> See above about which draft I was reading.
>
>>> In practice, I expect people to just call their favorite escape() method
>>> on their strings if they have to shoehorn them into an ASCII format, which
>>> means that we'll get a mix of %-encoding as ISO-8859-1 and UTF-8 at the
>>> very least, and very possibly others.  The result will be lack of interop
>>> (concern D).
>>
>> I'm not sure how to respond to this statement.  Presumably they'll do
>> what the CSP 1.1 specification says to do.
>
> I can 100% guarantee that a majority of people writing CSP policies won't
> read the specification.  They'll look up some examples, then just write
> something.
>
> Note that the above paragraph was no longer talking about the "converting a
> Unicode string to a policy" bit.  It was just talking about how authors are
> supposed to create these policy byte arrays.
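
The %-encoding mismatch raised in concern D can be shown with a small
Python sketch.  The string "café" is an arbitrary example, and using
`urllib.parse.quote` as a stand-in for "their favorite escape() method"
is an assumption made only for illustration.

```python
from urllib.parse import quote, unquote

segment = "café"  # arbitrary non-ASCII path segment

as_utf8 = quote(segment, encoding="utf-8")
as_latin1 = quote(segment, encoding="latin-1")

print(as_utf8)    # prints caf%C3%A9
print(as_latin1)  # prints caf%E9

# The two ASCII forms differ, so a byte-wise comparison of one author's
# policy against another tool's URI would fail.
print(as_utf8 == as_latin1)  # prints False

# Decoding with the wrong assumption silently yields a different string.
print(unquote(as_utf8, encoding="latin-1"))  # prints cafÃ©
```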
>
>>> It seems to me that a lot of these problems would be alleviated if CSP
>>> policies were defined as sequences of Unicode codepoints, with a
>>> comparison function to IRIs.
>>
>> I disagree.  IRIs are an interop can of worms.  We can paper over the
>> problem in various ways.  The way we're currently papering over the
>> problem is to ignore IRIs entirely and work only with URIs.
>
> What that seems to do is just push the problem out elsewhere at best. At
> worst it means that this will all only be usable for ASCII hostnames and
> ASCII paths...

This problem is not at all specific to CSP.  It's a serious problem in
the entire platform.  Insert rant about IDNA2008.

> Is there any resource I can read to see what the problems with IRIs are in
> practice?  It's possible that they're worse than what I'm worried about
> here, but they'd have to be pretty bad.

The short version is that the IETF insists that folks use IDNA2008,
but most browsers implement something closer to IDNA2003.  IDNA2008 is
not backwards compatible with IDNA2003 and so will never actually be
deployed.  Any attempts to hammer out a browser-consensus spec get
shouted down by folks who are pushing IDNA2008.
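
One well-known example of the incompatibility is the German sharp s.
The Python sketch below is only an illustration: the standard library's
"idna" codec implements IDNA2003, and there is no IDNA2008 codec in the
standard library, so the second form is produced by Punycode-encoding
the label without any mapping, which approximates the IDNA2008 result
for this particular label.  The label "straße" is an arbitrary example.

```python
label = "straße"  # arbitrary example label containing U+00DF

# IDNA2003 (stdlib "idna" codec): nameprep maps "ß" to "ss", so the
# result is plain ASCII with no xn-- prefix.
idna2003 = label.encode("idna")

# No mapping step: "ß" is kept and Punycode-encoded.  This approximates
# the IDNA2008 treatment of this label (an assumption of the sketch).
unmapped = b"xn--" + label.encode("punycode")

print(idna2003)  # prints b'strasse'
print(unmapped)  # prints b'xn--strae-oqa'
```

The same Unicode hostname can therefore resolve to two different ASCII
names depending on which variant the user agent implements.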

>> in CSP 1.0, the only way to author policies is via HTTP headers, which
>> are not defined in terms of Unicode
>
> While true, that doesn't prevent people from sending random non-ASCII bytes
> in HTTP headers, as you well know...
>
> What's the general timeframe on 1.1?

We're hoping to put out an FPWD immediately after 1.0 goes to CR.
We're currently in the call-for-consensus to advance 1.0 to CR, so
"very soon."  We're actively working on a 1.1 implementation.

Thanks,
Adam

Received on Friday, 7 September 2012 03:49:31 UTC