- From: Boris Zbarsky <bzbarsky@MIT.EDU>
- Date: Mon, 04 Jan 2010 14:36:35 -0500
- To: Julian Reschke <julian.reschke@gmx.de>
- CC: WebApps WG <public-webapps@w3.org>
On 1/4/10 11:44 AM, Julian Reschke wrote:
>> This happens to more or less match "decoding as ISO-8859-1", but not
>> quite.
>> ...
>
> Not quite?

More precisely, it happens to not quite match what browsers call
ISO-8859-1, which is actually Windows-1252. In particular, ISO-8859-1
doesn't define the behavior of the 0x7F-0x9F range, whereas
byte-inflation does (mapping the range to various Unicode control
characters) and Windows-1252 does as well, in a different way (mapping
the range to various printable Unicode characters).

> It at least preserves all the information that was there and would allow
> a caller to re-decode as UTF-8 as a separate step.

Yep.

> Right now there is no interoperable encoding, so the best thing to do in
> APIs that use character sequences instead of octets is to preserve as
> much information as possible.

That seems reasonable...

> It would be nice if we could find out whether anybody relies on the
> current implementation. Maybe switch it back to byte inflation in
> Mozilla trunk?

Mozilla trunk already does byte _inflation_ when converting from header
bytes into a JavaScript string. I assume you meant converting JavaScript
strings into header bytes by dropping the high byte of each 16-bit code
unit. However, that fails the "preserve as much information as possible"
test... In particular, as soon as any Unicode character outside the
U+0000-U+00FF range is used, byte-dropping loses information.

-Boris
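P.S. For concreteness, here is a rough sketch of the two directions
being contrasted. This is illustrative code only, not Mozilla's actual
implementation; the function names are made up.

```js
// Byte inflation: map each header byte 0xNN to the code point U+00NN.
// Lossless in this direction: every byte value gets a distinct code
// point. Note that inflateBytes([0x80]) yields "\u0080" (a control
// character), whereas a Windows-1252 decoder would yield "\u20AC".
function inflateBytes(bytes) {        // bytes: array of integers 0-255
  var chars = [];
  for (var i = 0; i < bytes.length; i++) {
    chars.push(String.fromCharCode(bytes[i]));
  }
  return chars.join("");
}

// Byte dropping: keep only the low byte of each 16-bit code unit.
// Lossy as soon as the string contains any character above U+00FF.
function dropHighBytes(str) {
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    bytes.push(str.charCodeAt(i) & 0xFF);
  }
  return bytes;
}

// U+20AC (euro sign) and U+00AC both drop to the byte 0xAC, so the
// original string cannot be recovered from the bytes:
dropHighBytes("\u20AC")[0] === dropHighBytes("\u00AC")[0]; // true
```

Inflation followed by dropping round-trips any byte sequence losslessly;
the reverse direction does not once characters above U+00FF appear.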
Received on Monday, 4 January 2010 19:37:09 UTC