Re: [FileAPI, common] UTF-16 to UTF-8 conversion from Jonas Sicking on 2012-02-28 (public-webapps@w3.org from January to March 2012)

From: Jonas Sicking <jonas@sicking.cc>
Date: Tue, 28 Feb 2012 13:05:37 +0100
To: Simon Pieters <simonp@opera.com>
Cc: Arun Ranganathan <aranganathan@mozilla.com>, Glenn Maynard <glenn@zewt.org>, Eric U <ericu@google.com>, public-webapps@w3.org
Message-ID: <CA+c2ei93qm3Ab=39UoqwxMnyQGHSniD9-ZUsVgtkkxYpnYmahw@mail.gmail.com>

On Tue, Feb 28, 2012 at 7:11 AM, Simon Pieters <simonp@opera.com> wrote:
> On Tue, 28 Feb 2012 01:05:44 +0100, Glenn Maynard <glenn@zewt.org> wrote:
>
>> On Mon, Feb 27, 2012 at 5:34 PM, Arun Ranganathan
>> <aranganathan@mozilla.com> wrote:
>>
>>> Simon,
>>>
>>> Is the relevant part of HTML sufficient to refer to?
>>> http://dev.w3.org/html5/spec/Overview.html#utf-8
>
>
> I was thinking of "If the data argument has any unpaired surrogates, then
> throw a SyntaxError exception.".
> http://www.whatwg.org/specs/web-apps/current-work/multipage/network.html#dom-websocket-send
>
>
>>
>> That defines decoding UTF-8 to Unicode strings.  You need the reverse.
>>
>> Using a replacement scheme like UTF-8 decoding, instead of a hard
>> exception, seems more consistent with how encodings in general are
>> handled.  Otherwise, you'll end up with bugs in code if, for example,
>> people paste in unpaired surrogates (Firefox allows this, last I checked),
>
>
> Maybe unpaired surrogates should be converted to U+FFFD on paste. Are there
> other cases?
>
>
>> causing unexpected exceptions in code.  Instead, just convert them to
>> U+FFFD, which gives much more graceful error handling for such a rare case
>> that most people will never handle explicitly.
>
>
> If we can't U+FFFD unpaired surrogates on paste, I agree it makes sense to
> U+FFFD them in APIs. If the only way to get them is a JS escape, then an
> exception seems OK.

People use JS strings to handle binary data. This is something that
has worked since the dawn of JS and is something that I believe is
defined to work in recent ECMAScript specs.

I don't think that we can start restricting that and try to enforce
that JS strings always contain valid UTF-16.
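
To illustrate the point, here is a minimal sketch showing that a JS string is just a sequence of 16-bit code units and will hold values that are not valid UTF-16. The `codeUnits` helper and the sample values are hypothetical, chosen only for this example.

```typescript
// Hypothetical helper: list the raw 16-bit code units stored in a string.
function codeUnits(s: string): number[] {
  const units: number[] = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i));
  }
  return units;
}

// Arbitrary 16-bit values packed into a string, as binary-data code tends to do.
// 0xDFFF is a low surrogate with no preceding high surrogate, and 0xD800 is a
// high surrogate with no following low surrogate, so this is not valid UTF-16.
const packed: string = String.fromCharCode(0x00ff, 0xdfff, 0xd800, 0x0000);

// The string stores every code unit unchanged, including both lone surrogates.
const roundTripped: number[] = codeUnits(packed);
```

Nothing in the language rejects `packed`, so any check has to happen at the API boundary where the conversion takes place.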

So I think our only option is to make all APIs which do UTF-16 -> UTF-8
conversion explicitly define how to deal with unpaired surrogates. My
preference would be to deal with them by replacing them with U+FFFD, for
the same reason that we let the HTML parser do error recovery rather
than XML-style draconian error handling.
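
A rough sketch of the replacement behaviour described above, assuming each unpaired surrogate becomes exactly one U+FFFD. The function name `encodeUtf8WithReplacement` is hypothetical and not taken from any spec; it is only meant to show the shape of the conversion.

```typescript
// Hypothetical helper: convert a JS string (UTF-16 code units) to UTF-8 bytes,
// replacing every unpaired surrogate with U+FFFD instead of throwing.
function encodeUtf8WithReplacement(input: string): number[] {
  const bytes: number[] = [];
  for (let i = 0; i < input.length; i++) {
    let cp = input.charCodeAt(i);
    if (cp >= 0xd800 && cp <= 0xdbff) {
      // High surrogate: only valid when a low surrogate follows it.
      const next = i + 1 < input.length ? input.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (next - 0xdc00);
        i++; // consume the low surrogate as part of the pair
      } else {
        cp = 0xfffd; // unpaired high surrogate
      }
    } else if (cp >= 0xdc00 && cp <= 0xdfff) {
      cp = 0xfffd; // low surrogate with no preceding high surrogate
    }
    // Standard UTF-8 encoding of the resulting code point.
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      bytes.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return bytes;
}
```

With this sketch, `"a\uD800b"` encodes as the bytes for "a", then EF BF BD (U+FFFD), then "b", and no exception reaches the caller.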

/ Jonas

Received on Tuesday, 28 February 2012 12:06:55 UTC