- From: Jonas Sicking <jonas@sicking.cc>
- Date: Tue, 28 Feb 2012 13:05:37 +0100
- To: Simon Pieters <simonp@opera.com>
- Cc: Arun Ranganathan <aranganathan@mozilla.com>, Glenn Maynard <glenn@zewt.org>, Eric U <ericu@google.com>, public-webapps@w3.org
On Tue, Feb 28, 2012 at 7:11 AM, Simon Pieters <simonp@opera.com> wrote:
> On Tue, 28 Feb 2012 01:05:44 +0100, Glenn Maynard <glenn@zewt.org> wrote:
>
>> On Mon, Feb 27, 2012 at 5:34 PM, Arun Ranganathan
>> <aranganathan@mozilla.com>wrote:
>>
>>> Simon,
>>>
>>> Is the relevant part of HTML sufficient to refer to?
>>> http://dev.w3.org/html5/spec/Overview.html#utf-8
>
> I was thinking of "If the data argument has any unpaired surrogates, then
> throw a SyntaxError exception.".
> http://www.whatwg.org/specs/web-apps/current-work/multipage/network.html#dom-websocket-send
>
>>
>> That defines decoding UTF-8 to Unicode strings. You need the reverse.
>>
>> Using a replacement scheme like UTF-8 decoding, instead of a hard
>> exception, seems more consistent with how encodings in general are
>> handled. Otherwise, you'll end up with bugs in code if, for example,
>> people paste in unpaired surrogates (Firefox allows this, last I checked),
>
> Maybe unpaired surrogates should be converted to U+FFFD on paste. Are there
> other cases?
>
>> causing unexpected exceptions in code. Instead, just convert them to
>> U+FFFD, which gives much more graceful error handling for such a rare case
>> that most people will never handle explicitly.
>
> If we can't U+FFFD unpaired surrogates on paste, I agree it makes sense to
> U+FFFD them in APIs. If the only way to get them is a JS escape, then an
> exception seems OK.

People use JS strings to handle binary data. This has worked since the dawn of JS, and I believe recent ECMAScript specs define it to work. I don't think we can start restricting that and try to enforce that JS strings always contain valid UTF-16. So I think our only option is to make every API that does UTF-16 -> UTF-8 conversion explicitly define how to deal with invalid surrogates.

My preference would be to deal with them by encoding them to U+FFFD for the same reason that we let the HTML parser do error recovery rather than XML-style draconian error handling.

/ Jonas
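
For illustration only, here is a minimal TypeScript sketch of the replacement behavior discussed above: a UTF-16 -> UTF-8 conversion that emits U+FFFD for every unpaired surrogate instead of throwing. The function name `encodeUtf8Lossy` is hypothetical and not taken from any spec or browser API; the sketch assumes the input is an ordinary JS string whose code units may include lone surrogates.

```typescript
// Hypothetical helper, not part of any spec: encodes a JS string (a sequence
// of UTF-16 code units) as UTF-8, replacing each unpaired surrogate with U+FFFD.
function encodeUtf8Lossy(input: string): Uint8Array {
  const bytes: number[] = [];
  for (let i = 0; i < input.length; i++) {
    let cp = input.charCodeAt(i);
    if (cp >= 0xd800 && cp <= 0xdbff) {
      // High surrogate: only valid when immediately followed by a low surrogate.
      const next = i + 1 < input.length ? input.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (next - 0xdc00);
        i++; // consume the low surrogate as well
      } else {
        cp = 0xfffd; // unpaired high surrogate
      }
    } else if (cp >= 0xdc00 && cp <= 0xdfff) {
      cp = 0xfffd; // low surrogate with no preceding high surrogate
    }
    // Standard UTF-8 encoding of the resulting code point.
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      bytes.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f),
      );
    }
  }
  return Uint8Array.from(bytes);
}
```

With this sketch, `encodeUtf8Lossy("a\uD800b")` yields the bytes 61 EF BF BD 62: the lone surrogate becomes the three-byte UTF-8 form of U+FFFD, and the surrounding characters survive. A throwing variant would instead reject the whole string, which is the draconian behavior argued against above.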
Received on Tuesday, 28 February 2012 12:06:55 UTC