Re: Unicode in HTTP headers (was Re: json-string for HTTP header field parameter values) from Mark Nottingham on 2011-10-31 (ietf-http-wg@w3.org from October to December 2011)

From: Mark Nottingham <mnot@mnot.net>
Date: Tue, 1 Nov 2011 09:12:04 +1100
To: Sam Johnston <samj@samj.net>
Cc: httpbis Group <ietf-http-wg@w3.org>
Message-Id: <289B9A30-396E-4437-BA2F-40E4C08769E0@mnot.net>
Hi Sam,

See <http://tools.ietf.org/html/draft-ietf-httpbis-p1-messaging-17#section-3.2.1>, which says:

>    Historically, HTTP has allowed field content with text in the ISO-
>    8859-1 [ISO-8859-1] character encoding and supported other character
>    sets only through use of [RFC2047] encoding.  In practice, most HTTP
>    header field values use only a subset of the US-ASCII character
>    encoding [USASCII].  Newly defined header fields SHOULD limit their
>    field values to US-ASCII octets.  Recipients SHOULD treat other (obs-
>    text) octets in field content as opaque data.


The BNF is now:

>      header-field   = field-name ":" OWS field-value BWS
>      field-name     = token
>      field-value    = *( field-content / obs-fold )
>      field-content  = *( HTAB / SP / VCHAR / obs-text )

where 

>      obs-text       = %x80-FF
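As a rough illustration of that grammar (a sketch only, not taken from the draft), the octet classes can be checked in Python as follows. The helper names are made up for this example, and VCHAR is assumed to be %x21-7E as in RFC 5234.

```python
# Octet classes from the quoted ABNF. All helper names here are hypothetical.
HTAB = 0x09
SP = 0x20


def is_vchar(octet: int) -> bool:
    # VCHAR: visible US-ASCII characters, %x21-7E (assumed from RFC 5234).
    return 0x21 <= octet <= 0x7E


def is_obs_text(octet: int) -> bool:
    # obs-text = %x80-FF
    return 0x80 <= octet <= 0xFF


def is_field_content(value: bytes) -> bool:
    """True if every octet matches *( HTAB / SP / VCHAR / obs-text )."""
    return all(o in (HTAB, SP) or is_vchar(o) or is_obs_text(o) for o in value)


def is_ascii_only(value: bytes) -> bool:
    """True if the value is field-content and contains no obs-text octets."""
    return is_field_content(value) and not any(is_obs_text(o) for o in value)
```

A value containing UTF-8 octets passes is_field_content (the octets fall under obs-text) but fails is_ascii_only, which is the distinction the SHOULD-level advice above is drawing.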


Basically, it's leaving the door ever-so-slightly open for Unicode in the future (by having recipients treat non-ASCII octets as opaque data), but cautioning that it may cause interoperability problems (in particular, libraries that assume ASCII or Latin-1 and thereby make the Unicode unavailable). Practically speaking, if you're creating a new HTTP header that you want to interoperate today, you'll use ASCII.
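To make the interoperability risk concrete, here is a small Python sketch. It assumes a hypothetical sender that serializes Sam's example value as UTF-8, and shows what a recipient sees under each decoding assumption; none of this is mandated by the draft.

```python
# Sam's example value, serialized as UTF-8 by a hypothetical sender.
original = 'address="Hochbrückenstraße, 18, München"'
wire_octets = original.encode("utf-8")

# A library that assumes ISO-8859-1 never raises, but it garbles the text.
assumed_latin1 = wire_octets.decode("latin-1")

# A library that assumes US-ASCII rejects the value outright.
try:
    wire_octets.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Treating the octets as opaque keeps the data intact for a caller that
# knows, out of band, that the field carries UTF-8.
recovered = wire_octets.decode("utf-8")

# A Latin-1 decode maps each octet to one code point, so a caller that is
# handed the garbled string can still undo it if it knows what happened.
undone = assumed_latin1.encode("latin-1").decode("utf-8")
```

The Latin-1 case is the quiet failure: nothing errors, the application simply receives mojibake unless it knows to reverse the decode.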

Cheers,


On 31/10/2011, at 11:41 PM, Sam Johnston wrote:

> On one hand I agree with James' observation that "neither supports Unicode, which isn't really acceptable today" — while I'm no i18n expert, having lived in Europe for almost a decade (prior to which I, among other things, also worked at Telstra) it is still infuriating and often dangerous when sites and applications inexplicably garble names, addresses, text, etc. This is of such importance that I even suggested a version bump could be justified a few years ago (http://lists.w3.org/Archives/Public/ietf-http-wg/2009JulSep/0842.html), prior to my "year of darkness" at Google.
> 
> On the other hand I think putting JSON anywhere near the headers is a terrible, horrible, no good, very bad idea — even if constrained to the string syntax as suggested, among other things for reasons Mark has already covered — if anything it belongs as a character set (http://www.iana.org/assignments/character-sets) for use with RFC5987, no? Rather we should either try to support unicode natively, at least for field-content, or just decide to leave it to the clients (which would be a significant barrier to entry to HTTP programming, albeit backwards compatible with compliant implementations).
> 
> Ultimately I'd like to be able to do things like this:
> 
> Attribute: address="Hochbrückenstraße, 18, München"
> 
> 
> This is what headers look like today:
> 
>        field-name     = token
>        field-value    = *( field-content | LWS )
>        field-content  = <the OCTETs making up the field-value
>                         and consisting of either *TEXT or combinations
>                         of token, separators, and quoted-string>
> 
> 
> On first pass, the reference to OCTET ("any 8-bit sequence of data") looks promising, but it appears to be further constrained to *TEXT ("any OCTET except CTLs, but including LWS", where CTLs are "any US-ASCII control character (octets 0 - 31) and DEL (127)" and LWS is "[CRLF] 1*( SP | HT )"), or combinations of token ("1*<any CHAR except CTLs or separators>", where CHAR is "any US-ASCII character (octets 0 - 127)" and separators are any of ()<>@,;:\"/[]?={} or SP or HT). Furthermore, "*TEXT MAY contain characters from character sets other than ISO-8859-1 only when encoded according to the rules of RFC 2047", and quoted-string is a derivative of TEXT constrained by quoting and escaping. That's a lot for a developer to understand, and they've probably already lost interest and chosen an envelope format (JSON, XML/SOAP, etc.) by now.
> 
> I'm sure someone (Mark?) said that I could just dump Unicode (or any other binary for that matter) in any new headers I defined, but from what I can tell above we're largely restricted to printable ASCII characters? That being the case, to fix it we would need a version bump(?), in which case we may as well deal with other issues (e.g. performance) à la SPDY? Or we just prioritise backwards compatibility and declare it a client library problem?
> 
> Cheers,
> 
> Sam
> 
> On Sun, Oct 30, 2011 at 2:07 PM, Manger, James H <James.H.Manger@team.telstra.com> wrote:
> HTTP currently uses token and quoted-string for various header field parameter values, and recommends these syntaxes for new headers. However neither supports Unicode, which isn't really acceptable today.
> 
> I would like to recommend the JSON string syntax for new header field parameter values. JSON is very widely used on the web, particularly by protocols built on HTTP. There are JSON implementations for basically every computer language. JSON supports the full range of Unicode characters. Developers love it.
> 
> A JSON string: is enclosed in double quotes; uses \" and \\ to represent " and \; uses six other \x sequences for other chars; and allows \uXXXX as an escape sequence for any Unicode character [json.org, RFC4627]. An HTTP header profile of JSON string would require any chars outside the printable ASCII set to be escaped.
> 
> 
> RFC5987 "Character Set and Language Encoding for HTTP Header Field Parameters" already offers one way to represent any Unicode string in an HTTP header parameter value, e.g. foo*=UTF-8''coll%C3%A8gues. However this is not very appealing when defining a new parameter. HTTPbis-p2 already recommends new parameters allow the token and quoted-string syntaxes so supporting RFC5987 for Unicode means implementations have to support 2 parameter names (foo and foo*), 3 syntaxes, and 2 escaping mechanisms (\x in quoted-string, and %xx in RFC5987) -- all for a brand new parameter. Yuck.
> 
> 
> I think the considerations for new headers (issue #231), and advice on defining auth scheme parameters (issue #320), should consider how to support Unicode parameter values -- and json-string would be a good way to do that.
> 
> 
> 
> P.S. json-string could also work in practice in places where quoted-string is defined (such as for parameters of new authentication schemes), since no actual quoted-string value will ever have escaped 'u' as '\u' so '\uXXXX' could be safely interpreted as per JSON instead of as 'uXXXX' as per quoted-string rules.
> 
> --
> James Manger
> 
> 
> 
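For reference, the two ASCII-safe encodings compared in the quoted message can be sketched in Python. This is only an illustration: the function names are hypothetical, percent-encoding everything outside Python's default safe set is stricter than RFC 5987 requires, and json.dumps with ensure_ascii=True only approximates the proposed "escape everything outside printable ASCII" profile (it leaves DEL unescaped, for instance).

```python
import json
from urllib.parse import quote, unquote


def encode_rfc5987(value: str) -> str:
    # charset, an empty language tag, then percent-encoded UTF-8 octets.
    return "UTF-8''" + quote(value, safe="", encoding="utf-8")


def decode_rfc5987(ext_value: str) -> str:
    # Split into charset, language and the percent-encoded payload.
    charset, _language, encoded = ext_value.split("'", 2)
    return unquote(encoded, encoding=charset)


def encode_json_string(value: str) -> str:
    # ensure_ascii=True emits \uXXXX escapes for non-ASCII characters.
    return json.dumps(value, ensure_ascii=True)


def decode_json_string(text: str) -> str:
    return json.loads(text)


value = "collègues"
ext_form = encode_rfc5987(value)       # for a parameter named foo*
json_form = encode_json_string(value)  # for a parameter named foo
```

Both forms travel as plain US-ASCII octets; the difference is the one raised above, namely whether a second parameter name and a second escaping mechanism are needed.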

--
Mark Nottingham   http://www.mnot.net/
Received on Monday, 31 October 2011 22:12:43 UTC