Re: Unicode in HTTP headers (was Re: json-string for HTTP header field parameter values) from Dale Anderson on 2011-11-11 (ietf-http-wg@w3.org from October to December 2011)

From: Dale Anderson <dra@redevised.net>
Date: Fri, 11 Nov 2011 14:45:34 -0800
To: Sam Johnston <samj@samj.net>
Cc: "Manger, James H" <James.H.Manger@team.telstra.com>, httpbis Group <ietf-http-wg@w3.org>
Message-ID: <CANNRn6JWG6h71Wc+-R7a_5LaEO=+nyzNHROz_aHtFno9Hx2bEg@mail.gmail.com>
Hey Sam, I read some of your threads. If you want to make the programmer's
work easier, just write a helper library call that acts like what you want
but (for example) base64-encodes the header field values.

It doesn't make deciphering the protocol on the wire any easier for test and
support, hehe, but it is plainly easy for client and server programming:
just about as easy as using arbitrary Unicode literals directly in the
field-values, and without the 7-bit or 8-bit-dirty side effects.

One might need a slightly more elaborate scheme for the field names.

Seems to satisfy your requirement though, in a kind of standard way.

samshttpclient.request('GET', '/', headers={'Attribute':
'address="Hochbrückenstraße, 18, München"'})

..8<.. meanwhile, on the wire..

GET / HTTP/1.1
Host: johnston
Attribute: YWRkcmVzcz0iSG9jaGJyw7xja2Vuc3RyYcOfZSwgMTgsIE3DvG5jaGVuIg==

..>8..


Then somewhere at the data center..

samserverequest.headers['attribute']
=> address="Hochbrückenstraße, 18, München"
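
A minimal sketch of such a helper pair, assuming Python and assuming both
ends agree out of band that the field value is base64 over UTF-8. The names
encode_header_value and decode_header_value are hypothetical and not part of
any real HTTP library; they only show the round trip from the example above.

```python
import base64

# Hypothetical helpers for illustration; not part of any real HTTP library.

def encode_header_value(value: str) -> str:
    """Unicode text -> UTF-8 bytes -> base64 -> ASCII-only field value."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")

def decode_header_value(wire_value: str) -> str:
    """Reverse of encode_header_value; raises on malformed base64 or UTF-8."""
    return base64.b64decode(wire_value, validate=True).decode("utf-8")

address = 'address="Hochbrückenstraße, 18, München"'
wire = encode_header_value(address)

print("Attribute:", wire)          # what would travel on the wire
print(decode_header_value(wire))   # what the server-side code would see
```

A client wrapper would call the encoder just before sending and the server
wrapper would call the decoder on receipt, so application code only ever
handles the Unicode string.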


Regards,

Dale




2011/10/31 Sam Johnston <samj@samj.net>

> On one hand I agree with James' observation that "neither supports
> Unicode, which isn't really acceptable today" — while I'm no i18n expert,
> having lived in Europe for almost a decade (prior to which I, among other
> things, also worked at Telstra) it is still infuriating and often dangerous
> when sites and applications inexplicably garble names, addresses, text,
> etc. This is of such importance that I even suggested a version bump could
> be justified a few years ago (
> http://lists.w3.org/Archives/Public/ietf-http-wg/2009JulSep/0842.html),
> prior to my "year of darkness" at Google.
>
> On the other hand I think putting JSON anywhere near the headers is a
> terrible, horrible, no good, very bad idea — even if constrained to the
> string syntax as suggested, among other things for reasons Mark has already
> covered — if anything it belongs as a character set (
> http://www.iana.org/assignments/character-sets) for use with RFC5987, no?
> Rather we should either try to support unicode natively, at least for
> field-content, or just decide to leave it to the clients (which would be a
> significant barrier to entry to HTTP programming, albeit backwards
> compatible with compliant implementations).
>
> Ultimately I'd like to be able to do things like this:
>
> Attribute: address="Hochbrückenstraße, 18, München"
>
>
> This is what headers look like today:
>
>        field-name     = token
>        field-value    = *( field-content | LWS )
>        field-content  = <the OCTETs making up the field-value
>                         and consisting of either *TEXT or combinations
>                         of token, separators, and quoted-string>
>
>
> On first pass reference to OCTET ("any 8-bit sequence of data") looks
> promising, but it appears to be further constrained to *TEXT ("any OCTET
> except CTLs, but including LWS", where CTLs are "any US-ASCII control
> character (octets 0 - 31) and DEL (127)" and LWS is "[CRLF] 1*( SP | HT )"),
> or combinations of token ("1*<any CHAR except CTLs or separators>", where
> CHAR is "any US-ASCII character (octets 0 - 127)" and separators are any
> of ()<>@,;:\"/[]?={} or SP or HT). Furthermore, "*TEXT MAY contain
> characters from character sets other than ISO-8859-1 only when encoded
> according to the rules of RFC 2047", and quoted-string is a derivative of
> TEXT constrained by quoting and escaping. That's a lot for a developer to
> understand, and they've probably already lost interest and chosen an
> envelope format (JSON, XML/SOAP, etc.) by now.
>
> I'm sure someone (Mark?) said that I could just dump Unicode (or any other
> binary for that matter) in any new headers I defined, but from what I can
> tell above we're largely restricted to printable ASCII characters? That
> being the case, to fix it we would need a version bump(?), in which case we
> may as well deal with other issues (e.g. performance) à la SPDY? Or we just
> prioritise backwards compatibility and declare it a client library problem?
>
> Cheers,
>
> Sam
>
> On Sun, Oct 30, 2011 at 2:07 PM, Manger, James H <
> James.H.Manger@team.telstra.com> wrote:
>
>> HTTP currently uses token and quoted-string for various header field
>> parameter values, and recommends these syntaxes for new headers. However
>> neither supports Unicode, which isn't really acceptable today.
>>
>> I would like to recommend the JSON string syntax for new header field
>> parameter values. JSON is very widely used on the web, particularly by
>> protocols built on HTTP. There are JSON implementations for basically every
>> computer language. JSON supports the full range of Unicode characters.
>> Developers love it.
>>
>> A JSON string: is enclosed in double quotes; uses \" and \\ to represent
>> " and \; uses six other \x sequences for other chars; and allows \uXXXX as
>> an escape sequence for any Unicode character [json.org, RFC4627]. An
>> HTTP header profile of JSON string would require any chars outside the
>> printable ASCII set to be escaped.
>>
>>
>> RFC5987 "Character Set and Language Encoding for HTTP Header Field
>> Parameters" already offers one way to represent any Unicode string in a HTTP
>> header parameter value, eg foo*=UTF-8''coll%C3%A8gues. However this is not
>> very appealing when defining a new parameter. HTTPbis-p2 already recommends
>> new parameters allow the token and quoted-string syntaxes so supporting
>> RFC5987 for Unicode means implementations have to support 2 parameter names
>> (foo and foo*), 3 syntaxes, and 2 escaping mechanisms (\x in quoted-string,
>> and %xx in RFC5987) -- all for a brand new parameter. Yuck.
>>
>>
>> I think the considerations for new headers (issue #231), and advice on
>> defining auth scheme parameters (issue #320), should consider how to
>> support Unicode parameter values -- and json-string would be a good way to
>> do that.
>>
>>
>>
>> P.S. json-string could also work in practice in places where
>> quoted-string is defined (such as for parameters of new authentication
>> schemes), since no actual quoted-string value will ever have escaped 'u' as
>> '\u' so '\uXXXX' could be safely interpreted as per JSON instead of as
>> 'uXXXX' as per quoted-string rules.
>>
>> --
>> James Manger
>>
>>
>>
>
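
For comparison, the two encodings discussed in the quoted messages can be
sketched side by side, again assuming Python. The function names are
hypothetical. The RFC 5987 form here always uses UTF-8 with an empty language
tag and percent-encodes more characters than strictly required, and the
json-string form simply relies on json.dumps escaping non-ASCII characters by
default.

```python
import json
from urllib.parse import quote

# Hypothetical helpers for illustration only.

def rfc5987_ext_value(value: str) -> str:
    """Build an RFC 5987 ext-value: charset, empty language, percent-encoded UTF-8."""
    # safe="" percent-encodes everything except letters, digits and "_.-~",
    # which is stricter than RFC 5987 requires but still valid.
    return "UTF-8''" + quote(value, safe="")

def json_string_value(value: str) -> str:
    """Build a JSON string with non-ASCII characters escaped as \\uXXXX."""
    return json.dumps(value)  # ensure_ascii=True is the default

print("foo*=" + rfc5987_ext_value("collègues"))
print("foo=" + json_string_value("collègues"))
```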
Received on Friday, 11 November 2011 22:46:14 UTC