Re: CfC: Transition CSP2 to CR.

Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
>>> * Recommend that folks %-encode unicode characters when delivered as an HTTP
>>> header
>>
>>Not just %-encoded, but convert the IRI to a URI. In particular,
>>punycode should be used for the domain labels in the authority, and
>>the path and query string should be converted to UTF-8 and then
>>normalized and URL-encoded.
>
> I am intimately familiar with the relevant standards here, but I don't
> really understand your comment. Could you take a step back and describe
> the problems you see? Some things to note:
>
>   * HTTP generally does not use "non-ASCII octets" in headers
>   * host names in URIs can use UTF-8+%xx-encoding
>   * CSP uses bare host names in some protocol elements
>   * urlencode(normalize(utf8encode(...))) is most probably wrong,
>     whatever that is trying to do.

Thanks for offering to help. I am going to use IETF terminology since
I think you and I are both most familiar with that and it is less
verbose than other alternatives.

Basically, in CSP, anywhere a URI or URI reference is accepted, I want
CSP to accept IRIs to the same extent that HTML supports IRIs. This
seems very straightforward for <meta> CSP, and possible but
problematic for CSP in HTTP header fields.

As you know, there are a lot of reasons why it is better to keep HTTP
header field values as pure ASCII, so there needs to be a way to
specify any IRI in an ASCII encoding--i.e. IRIs that have been
converted to URIs in the CSP policy need to match the same things that
the native Unicode IRI encoding would match.

Note that although hostnames in URIs can use UTF-8+%xx-encoding, the
punycode encoding of hostnames must also be accepted.
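
To make the conversion concrete, here is a rough Python sketch of what I
have in mind. The function names (canonical_host, iri_to_uri) are made up
for illustration and are not from any spec. It assumes Python's built-in
"idna" codec (IDNA 2003) is acceptable for the host, and it ignores
userinfo and fragments, which CSP source expressions do not carry:

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def canonical_host(host: str) -> str:
    """Map either spelling of a host (UTF-8+%xx or punycode) to punycode."""
    decoded = unquote(host)  # "%C3%BC" -> "ü"; plain ASCII passes through
    # Python's built-in "idna" codec implements IDNA 2003 (RFC 3490).
    return decoded.encode("idna").decode("ascii").lower()


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: rewrite an IRI as an ASCII-only URI."""
    parts = urlsplit(iri)
    netloc = canonical_host(parts.hostname or "")
    if parts.port is not None:
        netloc += f":{parts.port}"
    # quote() UTF-8 encodes non-ASCII text and then %-escapes the octets.
    # "%" is kept safe so that existing escapes are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    # Userinfo and fragment are dropped in this simplified sketch.
    return urlunsplit((parts.scheme, netloc, path, query, ""))
```

With this, canonical_host("b%C3%BCcher.example") and
canonical_host("xn--bcher-kva.example") give the same result, which is the
matching behavior I am asking for.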

You mentioned that urlencode(normalize(utf8encode(...))) is most
probably wrong. However, consider a document that is NOT in UTF-8
encoding, but instead in Shift-JIS. I believe that there does need to
be a first step of converting the text to Unicode and then UTF-8
encoding the Unicode text. However, I could very well be wrong here.
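
A small Python sketch of the step I mean, assuming the policy text arrives
as raw bytes in the document's encoding. The helper name is hypothetical,
and this only illustrates the transcoding step; it is not a claim about
what any browser does today:

```python
from urllib.parse import quote


def percent_encode_from_legacy(raw: bytes, document_encoding: str) -> str:
    """Decode bytes from the document encoding, then %-encode as UTF-8."""
    text = raw.decode(document_encoding)       # legacy bytes -> Unicode
    return quote(text.encode("utf-8"), safe="/")  # Unicode -> UTF-8 -> %xx


# The same two characters, as they would appear in a Shift-JIS document.
sjis_bytes = "日本".encode("shift_jis")

transcoded = percent_encode_from_legacy(sjis_bytes, "shift_jis")
untranscoded = quote(sjis_bytes, safe="/")  # %-encodes the Shift-JIS octets
```

Here transcoded is "%E6%97%A5%E6%9C%AC", while untranscoded escapes the
Shift-JIS octets and so would not match a UTF-8 based URI.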

Here is what the HTTP specification [1] says about the encoding of
header fields:

   Historically, HTTP has allowed field content with text in the
   ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
   through use of [RFC2047] encoding.  In practice, most HTTP header
   field values use only a subset of the US-ASCII charset [USASCII].
   Newly defined header fields SHOULD limit their field values to
   US-ASCII octets.  A recipient SHOULD treat other octets in field
   content (obs-text) as opaque data.

[1] http://tools.ietf.org/html/rfc7230#section-3.2.4
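
As a rough illustration of the "limit field values to US-ASCII"
recommendation, a server could check a policy before sending it. This is
only a sketch; the function name is made up, and I am assuming that
"US-ASCII" here means visible ASCII characters plus space and horizontal
tab:

```python
def is_ascii_field_value(value: str) -> bool:
    """Return True if value uses only visible US-ASCII, space, or tab."""
    return all(ch == "\t" or 0x20 <= ord(ch) <= 0x7E for ch in value)


# A policy with a punycode host is fine to send as a header value.
ok = is_ascii_field_value("script-src https://xn--bcher-kva.example")

# The same policy written as an IRI would need converting first.
needs_conversion = not is_ascii_field_value("script-src https://bücher.example")
```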

Cheers,
Brian

Received on Wednesday, 11 February 2015 11:10:53 UTC