Re: [FileAPI, common] UTF-16 to UTF-8 conversion from Simon Pieters on 2012-02-28 (public-webapps@w3.org from January to March 2012)

From: Simon Pieters <simonp@opera.com>
Date: Tue, 28 Feb 2012 13:57:03 +0100
To: "Jonas Sicking" <jonas@sicking.cc>
Cc: "Arun Ranganathan" <aranganathan@mozilla.com>, "Glenn Maynard" <glenn@zewt.org>, "Eric U" <ericu@google.com>, public-webapps@w3.org
Message-ID: <op.wad3ldgnidj3kv@simons-macbook-pro.local>

On Tue, 28 Feb 2012 13:05:37 +0100, Jonas Sicking <jonas@sicking.cc> wrote:

>> If we can't U+FFFD unpaired surrogates on paste, I agree it makes sense
>> to U+FFFD them in APIs. If the only way to get them is a JS escape, then
>> an exception seems OK.
>
> People use JS strings to handle binary data. This is something that
> has worked since the dawn of JS and is something that I believe is
> defined to work in recent ECMAScript specs.
>
> I don't think that we can start restricting that and try to enforce
> that JS-strings always contain valid UTF16.

Right.

> So I think our only option is to make all APIs that do UTF16->UTF8
> conversion explicitly define how to deal with invalid surrogates.

Sure, I don't suggest we leave it undefined.

> My
> preference would be to deal with them by encoding them to U+FFFD for
> the same reason that we let the HTML parser do error recovery rather
> than XML-style draconian error handling.

I'm not really opposed to making APIs use U+FFFD instead of throwing an
exception, but I'm not entirely convinced, either. If people store binary
data in strings and pass those strings to these APIs, U+FFFDing lone
surrogates is going to "silently" scramble their data. Why is this better
than throwing an exception?
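
To make the trade-off concrete, here is a minimal sketch of a UTF-16 to
UTF-8 encoder that supports both behaviours under discussion. The function
name utf16ToUtf8 and its "replace"/"throw" policy argument are hypothetical,
made up for illustration; nothing here is taken from a spec or from an
existing browser API.

```javascript
// Hypothetical helper for illustration only: encodes a JS string (a sequence
// of UTF-16 code units) as UTF-8 bytes. `policy` selects what happens when a
// lone (unpaired) surrogate is found: "replace" emits U+FFFD, "throw" raises.
function utf16ToUtf8(str, policy = "replace") {
  const bytes = [];

  // Shared handling for any lone surrogate, high or low.
  const loneSurrogate = (index) => {
    if (policy === "throw") {
      throw new RangeError("lone surrogate at index " + index);
    }
    return 0xFFFD; // REPLACEMENT CHARACTER
  };

  for (let i = 0; i < str.length; i++) {
    let cp = str.charCodeAt(i);

    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // High surrogate: only valid when a low surrogate follows it.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (next - 0xDC00);
        i++; // consume the low surrogate as well
      } else {
        cp = loneSurrogate(i);
      }
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      cp = loneSurrogate(i);
    }

    // Standard UTF-8 encoding of the resulting code point.
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      bytes.push(0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    } else {
      bytes.push(
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F)
      );
    }
  }
  return Uint8Array.from(bytes);
}

// Two different "binary" code units collapse to the same bytes (EF BF BD)
// under the "replace" policy, so the original values cannot be recovered.
const a = utf16ToUtf8("\uD800");
const b = utf16ToUtf8("\uDC00");
```

With "replace", a and b are byte-for-byte identical, which is the silent
scrambling I mean; with "throw", the caller at least finds out that the
string was not valid UTF-16.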

-- 
Simon Pieters
Opera Software

Received on Tuesday, 28 February 2012 12:57:42 UTC