- From: Brian Smith <brian@briansmith.org>
- Date: Wed, 11 Feb 2015 03:10:25 -0800
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- Cc: Mike West <mkwst@google.com>, "public-webappsec@w3.org" <public-webappsec@w3.org>, Brad Hill <hillbrad@gmail.com>, Dan Veditz <dveditz@mozilla.com>, Wendy Seltzer <wseltzer@w3.org>
Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
>>> * Recommend that folks %-encode unicode characters when delivered as an HTTP
>>>   header
>>
>>Not just %-encoded, but convert the IRI to a URI. In particular,
>>punycode should be used for the domain labels in the authority, and
>>the path and query string should be converted to UTF-8 and then
>>normalized and URL-encoded.
>
> I am intimately familiar with the relevant standards here, but I don't
> really understand your comment. Could you take a step back and describe
> the problems you see? Some things to note:
>
>   * HTTP generally does not use "non-ASCII octets" in headers
>   * host names in URIs can use UTF-8+%xx-encoding
>   * CSP uses bare host names in some protocol elements
>   * urlencode(normalize(utf8encode(...))) is most probably wrong,
>     whatever that is trying to do.

Thanks for offering to help. I am going to use IETF terminology since I
think you and I are both most familiar with that and it is less verbose
than other alternatives.

Basically, in CSP, anywhere a URI or URI reference is accepted, I want
CSP to accept IRIs to the same extent that HTML supports IRIs. This
seems very straightforward for <meta> CSP, and possible but problematic
for CSP in HTTP header fields.

As you know, there are a lot of reasons why it is better to keep HTTP
header field values as pure ASCII, so there needs to be a way to specify
any IRI in an ASCII encoding. That is, IRIs that have been converted to
URIs in the CSP policy need to match the same things that the native
Unicode IRI encoding would match.

Note that although hostnames in URIs can use UTF-8+%xx-encoding, the
punycode encoding of hostnames must also be accepted.

You mentioned that urlencode(normalize(utf8encode(...))) is most
probably wrong. However, consider a document that is NOT in UTF-8
encoding, but instead in Shift-JIS. I believe that there does need to be
a first step of converting the text to Unicode and then UTF-8 encoding
the Unicode text.
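
To make the conversion concrete, here is a minimal Python sketch of the
steps described above. It is only an illustration under several
assumptions, not a proposal for the spec text: the names iri_to_uri and
decode_legacy are made up, the host is converted with Python's built-in
"idna" codec (which implements IDNA 2003), any userinfo in the authority
is dropped, and the NFC normalization step on the legacy-charset path
reflects my reading of RFC 3987 section 3.1.

```python
import unicodedata
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left unescaped. "%" is kept so that existing escapes are
# not encoded a second time. (Illustrative sets, not taken from any spec.)
_PATH_SAFE = "/%:@!$&'()*+,;="
_QUERY_SAFE = _PATH_SAFE + "?"


def decode_legacy(raw: bytes, charset: str) -> str:
    """Decode bytes from a non-Unicode charset, then normalize to NFC."""
    return unicodedata.normalize("NFC", raw.decode(charset))


def iri_to_uri(iri: str) -> str:
    """Convert an IRI (as a Unicode string) to an ASCII-only URI."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    # Domain labels become punycode ("xn--..."); assumption: IDNA 2003.
    netloc = host.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    # quote() encodes non-ASCII characters as UTF-8 and %-escapes them.
    path = quote(parts.path, safe=_PATH_SAFE)
    query = quote(parts.query, safe=_QUERY_SAFE)
    fragment = quote(parts.fragment, safe=_QUERY_SAFE)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))
```

For a document in Shift-JIS, the bytes of the IRI would first go through
decode_legacy(raw, "shift_jis") and the resulting string through
iri_to_uri, so that the %-escapes are always derived from UTF-8 rather
than from the document's own encoding.
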
However, I could very well be wrong here. Here is what the HTTP
specification [1] says about the encoding of header fields:

   Historically, HTTP has allowed field content with text in the
   ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
   through use of [RFC2047] encoding.  In practice, most HTTP header
   field values use only a subset of the US-ASCII charset [USASCII].
   Newly defined header fields SHOULD limit their field values to
   US-ASCII octets.  A recipient SHOULD treat other octets in field
   content (obs-text) as opaque data.

[1] http://tools.ietf.org/html/rfc7230#section-3.2.4

Cheers,
Brian
Received on Wednesday, 11 February 2015 11:10:53 UTC