Unicode in HTTP headers (was Re: json-string for HTTP header field parameter values) from Sam Johnston on 2011-10-31 (ietf-http-wg@w3.org from October to December 2011)

From: Sam Johnston <samj@samj.net>
Date: Mon, 31 Oct 2011 13:41:18 +0100
To: "Manger, James H" <James.H.Manger@team.telstra.com>
Cc: httpbis Group <ietf-http-wg@w3.org>
Message-ID: <CAKTR03-V+JMN7se4=DXjCjTN2JMJzsV4KYjTuuwsDz_D1LseZQ@mail.gmail.com>

On one hand I agree with James' observation that "neither supports Unicode,
which isn't really acceptable today" — while I'm no i18n expert, having
lived in Europe for almost a decade (prior to which I, among other things,
also worked at Telstra) it is still infuriating and often dangerous when
sites and applications inexplicably garble names, addresses, text, etc.
This is of such importance that I even suggested a version bump could be
justified a few years ago (
http://lists.w3.org/Archives/Public/ietf-http-wg/2009JulSep/0842.html),
prior to my "year of darkness" at Google.

On the other hand I think putting JSON anywhere near the headers is a
terrible, horrible, no good, very bad idea — even if constrained to the
string syntax as suggested, among other things for reasons Mark has already
covered — if anything it belongs as a character set (
http://www.iana.org/assignments/character-sets) for use with RFC5987, no?
Rather, we should either try to support Unicode natively, at least for
field-content, or just decide to leave it to the clients (which would be a
significant barrier to entry to HTTP programming, albeit backwards
compatible with compliant implementations).

Ultimately I'd like to be able to do things like this:

Attribute: address="Hochbrückenstraße, 18, München"
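
For comparison, here is a rough sketch of how that value could be carried in
printable ASCII today, once using the RFC5987 ext-value form and once using
the proposed json-string profile (all non-printable-ASCII characters escaped).
The function names are made up for illustration, and the json-string profile
is only a proposal from this thread, not a defined syntax:

```python
import json
from urllib.parse import quote

# RFC 5987 attr-char, beyond ALPHA / DIGIT; everything else is %-encoded.
ATTR_CHAR_EXTRA = "!#$&+-.^_`|~"


def rfc5987_ext_value(value: str, language: str = "") -> str:
    """Hypothetical helper: build charset'language'pct-encoded-value."""
    # quote() encodes str input as UTF-8 before percent-encoding.
    return f"UTF-8'{language}'" + quote(value, safe=ATTR_CHAR_EXTRA)


def json_string_param(value: str) -> str:
    """Hypothetical helper: JSON string restricted to printable ASCII."""
    # ensure_ascii escapes non-ASCII and C0 controls; DEL (0x7f) is left
    # as-is by json.dumps, so it is escaped separately here.
    return json.dumps(value, ensure_ascii=True).replace("\x7f", "\\u007f")


address = "Hochbrückenstraße, 18, München"
print("Attribute: address*=" + rfc5987_ext_value(address))
print("Attribute: address=" + json_string_param(address))
```

Either form survives an ASCII-only path, but neither is something you would
want to read or write by hand.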


This is what headers look like today:

       field-name     = token
       field-value    = *( field-content | LWS )
       field-content  = <the OCTETs making up the field-value
                        and consisting of either *TEXT or combinations
                        of token, separators, and quoted-string>


At first glance the reference to OCTET ("any 8-bit sequence of data") looks
promising, but it appears to be further constrained to *TEXT ("any OCTET
except CTLs, but including LWS", where CTLs are "any US-ASCII control
character (octets 0 - 31) and DEL (127)" and LWS is "[CRLF] 1*( SP | HT )"),
or combinations of token ("1*<any CHAR except CTLs or separators>", where
CHAR is "any US-ASCII character (octets 0 - 127)" and separators are any
of ()<>@,;:\"/[]?={} or SP or HT). Furthermore, "*TEXT MAY contain
characters from character sets other than ISO-8859-1 only when encoded
according to the rules of RFC 2047", and quoted-string is a derivative of
TEXT constrained by quoting and escaping. That's a lot for a developer to
understand, and they've probably already lost interest and chosen an
envelope format (JSON, XML/SOAP, etc.) by now.
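
To make those rules a little more concrete, here is a minimal sketch of the
token and TEXT definitions quoted above as two octet-level checks. The names
is_token and is_text are invented for this example, and it is a literal
reading of the quoted grammar rather than a full header parser:

```python
import re

# CTL: US-ASCII control characters (octets 0 - 31) and DEL (127).
CTLS = set(range(32)) | {127}
# separators: ( ) < > @ , ; : \ " / [ ] ? = { } SP HT
SEPARATORS = set(b'()<>@,;:\\"/[]?={} \t')


def is_token(octets: bytes) -> bool:
    """token = 1*<any CHAR except CTLs or separators>, CHAR = octets 0 - 127."""
    return bool(octets) and all(
        o < 128 and o not in CTLS and o not in SEPARATORS for o in octets
    )


# TEXT = any OCTET except CTLs, but including LWS = [CRLF] 1*( SP | HT ).
# Each step consumes either one unit of whitespace (optionally preceded by
# CRLF) or one octet that is neither a CTL nor SP.
_TEXT = re.compile(rb"(?:(?:\r\n)?[ \t]|[^\x00-\x20\x7f])*")


def is_text(octets: bytes) -> bool:
    return _TEXT.fullmatch(octets) is not None


value = "Hochbrückenstraße, 18, München".encode("utf-8")
print(is_token(b"Attribute"))      # a plain field-name passes
print(is_token(value))             # separators and octets > 127 do not
print(is_text(value))              # the raw octets fit *TEXT syntactically
print(value.decode("iso-8859-1"))  # but read as ISO-8859-1 they are garbled
```

So the octets themselves are not rejected by *TEXT; the trouble is that a
recipient following the ISO-8859-1 rule will not see the string that was sent.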

I'm sure someone (Mark?) said that I could just dump Unicode (or any other
binary for that matter) in any new headers I defined, but from what I can
tell above we're largely restricted to printable ASCII characters? That
being the case, to fix it we would need a version bump(?), in which case we
may as well deal with other issues (e.g. performance) à la SPDY? Or we just
prioritise backwards compatibility and declare it a client library problem?

Cheers,

Sam

On Sun, Oct 30, 2011 at 2:07 PM, Manger, James H <
James.H.Manger@team.telstra.com> wrote:

> HTTP currently uses token and quoted-string for various header field
> parameter values, and recommends these syntaxes for new headers. However
> neither supports Unicode, which isn't really acceptable today.
>
> I would like to recommend the JSON string syntax for new header field
> parameter values. JSON is very widely used on the web, particularly by
> protocols built on HTTP. There are JSON implementations for basically every
> computer language. JSON supports the full range of Unicode characters.
> Developers love it.
>
> A JSON string: is enclosed in double quotes; uses \" and \\ to represent "
> and \; uses six other \x sequences for other chars; and allows \uXXXX as an
> escape sequence for any Unicode character [json.org, RFC4627]. An HTTP
> header profile of JSON string would require any chars outside the printable
> ASCII set to be escaped.
>
>
> RFC5987 "Character Set and Language Encoding for HTTP Header Field
> Parameters" already offers one way to represent any Unicode string in an HTTP
> header parameter value, eg foo*=UTF-8''coll%C3%A8gues. However this is not
> very appealing when defining a new parameter. HTTPbis-p2 already recommends
> new parameters allow the token and quoted-string syntaxes so supporting
> RFC5987 for Unicode means implementations have to support 2 parameter names
> (foo and foo*), 3 syntaxes, and 2 escaping mechanisms (\x in quoted-string,
> and %xx in RFC5987) -- all for a brand new parameter. Yuck.
>
>
> I think the considerations for new headers (issue #231), and advice on
> defining auth scheme parameters (issue #320), should consider how to
> support Unicode parameter values -- and json-string would be a good way to
> do that.
>
>
>
> P.S. json-string could also work in practice in places where quoted-string
> is defined (such as for parameters of new authentication schemes), since no
> actual quoted-string value will ever have escaped 'u' as '\u' so '\uXXXX'
> could be safely interpreted as per JSON instead of as 'uXXXX' as per
> quoted-string rules.
>
> --
> James Manger
>
>
>
Received on Monday, 31 October 2011 12:42:17 UTC