- From: Boris Zbarsky <bzbarsky@MIT.EDU>
- Date: Mon, 04 Jan 2010 14:36:35 -0500
- To: Julian Reschke <julian.reschke@gmx.de>
- CC: WebApps WG <public-webapps@w3.org>
On 1/4/10 11:44 AM, Julian Reschke wrote:
>> This happens to more or less match "decoding as ISO-8859-1", but not
>> quite.
>> ...
>
> Not quite?

More precisely, it happens to not quite match what browsers call
ISO-8859-1, which is actually Windows-1252. In particular, ISO-8859-1
doesn't define the behavior of the 0x7F-0x9F range, whereas
byte-inflation does (mapping the range to various Unicode control
characters) and Windows-1252 does as well, in a different way (mapping
the range to various printable Unicode characters).

> It at least preserves all the information that was there and would allow
> a caller to re-decode as UTF-8 as a separate step.

Yep.

> Right now there is no interoperable encoding, so the best thing to do in
> APIs that use character sequences instead of octets is to preserve as
> much information as possible.

That seems reasonable...

> It would be nice if we could find out whether anybody relies on the
> current implementation. Maybe switch it back to byte inflation in
> Mozilla trunk?

Mozilla trunk already does byte _inflation_ when converting from header
bytes into a JavaScript string. I assume you meant converting JavaScript
strings into header bytes by dropping the high byte of each 16-bit code
unit. However, that fails the "preserve as much information as possible"
test... In particular, as soon as any Unicode character outside the
U+0000-U+00FF range is used, byte-dropping loses information.

-Boris
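P.S. For concreteness, here is a rough sketch of the two directions
being contrasted. This is illustrative code only, not Mozilla's actual
implementation; the function names are made up.

```js
// Byte inflation: map each header byte 0xNN to the code point U+00NN.
// Lossless in this direction: every byte value gets a distinct code
// point. Note that inflateBytes([0x80]) yields "\u0080" (a control
// character), whereas a Windows-1252 decoder would yield "\u20AC".
function inflateBytes(bytes) {        // bytes: array of integers 0-255
  var chars = [];
  for (var i = 0; i < bytes.length; i++) {
    chars.push(String.fromCharCode(bytes[i]));
  }
  return chars.join("");
}

// Byte dropping: keep only the low byte of each 16-bit code unit.
// Lossy as soon as the string contains any character above U+00FF.
function dropHighBytes(str) {
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    bytes.push(str.charCodeAt(i) & 0xFF);
  }
  return bytes;
}

// U+20AC (euro sign) and U+00AC both drop to the byte 0xAC, so the
// original string cannot be recovered from the bytes:
dropHighBytes("\u20AC")[0] === dropHighBytes("\u00AC")[0]; // true
```

Inflation followed by dropping round-trips any byte sequence losslessly;
the reverse direction does not once characters above U+00FF appear.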
Received on Monday, 4 January 2010 19:37:09 UTC