Re: XHR LC comment: header encoding

On 1/4/10 11:17 AM, Julian Reschke wrote:
>>> For request headers, I would assume that the character encoding is
>>> ISO-8859-1, and if a character can't be encoded using ISO-8859-1,
>>> some kind of error handling occurs (ignore the character/ignore the
>>> header/throw?).
>>
>> From my limited testing it seems Firefox, Chrome, and Internet
>> Explorer use UTF-8 octets. E.g. "\xFF" in ECMAScript gets transmitted
>> as C3 BF (in octets). Opera sends "\xFF" as FF.

That's what Gecko does, correct.
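
For concreteness, here's a rough sketch of the two serializations being 
described (TextEncoder here just stands in for a UTF-8 encoder; none of 
this is from the testing itself):

  // "\xFF" is the single code unit U+00FF.
  const value = "\xFF";

  // UTF-8 encoding (what Firefox/Chrome/IE send, per the testing
  // above): the octets C3 BF.
  const utf8 = new TextEncoder().encode(value);
  console.log([...utf8].map(b => b.toString(16)));           // ["c3", "bf"]

  // Dropping the high byte of each code unit (what Opera sends,
  // per the testing above): the single octet FF.
  const raw = [...value].map(ch => ch.charCodeAt(0) & 0xff);
  console.log(raw.map(b => b.toString(16)));                 // ["ff"]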

>>> For response headers, I'd expect that the octet sequence is decoded
>>> using ISO-8859-1; so no specific error handling would be needed
>>> (although the result may be funny when the intended encoding was
>>> UTF-8).
>>
>> Firefox, Opera, and Internet Explorer indeed do this. Chrome decodes
>> as UTF-8 as far as I can tell.

More precisely, what Gecko does here is to take the raw byte string and 
byte-inflate it (each byte becomes a 16-bit code unit whose high byte is 
0 and whose low byte is the original byte) before returning it to JS.

This happens to more or less match "decoding as ISO-8859-1", but not quite.
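
A rough sketch of what that inflation looks like, and one example of 
how it can fail to line up with an actual "ISO-8859-1" decoder (in 
browsers that label means windows-1252 in practice, so bytes in the 
0x80-0x9F range come back as different characters):

  // Byte inflation: each byte becomes a 16-bit code unit with high
  // byte 0 and low byte equal to the original byte.
  function byteInflate(bytes: Uint8Array): string {
    return String.fromCharCode(...bytes);            // 0xNN -> U+00NN
  }

  byteInflate(new Uint8Array([0x61, 0xff]));         // "a\u00FF"

  // A decoder labelled "iso-8859-1" is actually windows-1252, so
  // e.g. 0x80 decodes to U+20AC rather than U+0080.
  new TextDecoder("iso-8859-1").decode(new Uint8Array([0x80]));  // "\u20AC"
  byteInflate(new Uint8Array([0x80]));                           // "\u0080"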

> Thanks for doing the testing. The discrepancy between setting and
> getting worries me a lot :-).

In Gecko's case it seems to be an accident, at least historically.  The 
getter and setter used to both do byte ops only (so byte inflation in 
the getter, and dropping the high byte in the setter) until the fix for 
<https://bugzilla.mozilla.org/show_bug.cgi?id=232493>.  The review 
comments at <https://bugzilla.mozilla.org/show_bug.cgi?id=232493#c4> 
point out the UTF-8-vs-byte-inflation inconsistency here, but didn't 
seem to get addressed...
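
As a hypothetical model of the inconsistency those comments flagged: if 
the setter UTF-8-encodes but the getter byte-inflates, then a non-ASCII 
value that a server echoes back byte-for-byte no longer round-trips:

  // Setter: UTF-8-encode the JS string into header bytes.
  function setterBytes(value: string): Uint8Array {
    return new TextEncoder().encode(value);
  }
  // Getter: byte-inflate the header bytes back into a JS string.
  function getterValue(bytes: Uint8Array): string {
    return String.fromCharCode(...bytes);
  }

  getterValue(setterBytes("\u00FF"));    // "\u00C3\u00BF", not "\u00FF"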

>  From HTTP's point of view, the header field value really is opaque. So
> you can put there anything, as long as it fits into the header field ABNF.

True; what does that mean for converting header values to 16-bit code 
units in practice?  Seems like byte-inflation might be the only 
reasonable thing to do...

> Of course that only helps if senders and receivers agree on the
> encoding.

True, but "encoding" here needs to mean more than just "encoding of 
Unicode", since one can just stick random byte arrays, within the ABNF 
restrictions, in the header, right?
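
For example (a sketch, not from the thread): a 0xFF byte is allowed in 
a field value by the header ABNF, but a lone 0xFF is not valid UTF-8, 
so a strict "decode as UTF-8" has to do some kind of error handling, 
while byte inflation always succeeds:

  const bytes = new Uint8Array([0x61, 0xff, 0x62]);  // "a", 0xFF, "b"

  try {
    // Strict UTF-8 decode rejects the lone 0xFF byte.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch (e) {
    console.log("not valid UTF-8:", e);
  }

  console.log(String.fromCharCode(...bytes));        // "a\u00FFb"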

-Boris
