Re: New full Unicode for ES6 idea from Brendan Eich on 2012-02-19 (public-script-coord@w3.org from January to March 2012)

From: Brendan Eich <brendan@mozilla.com>
Date: Sun, 19 Feb 2012 14:28:52 -0800
To: "Phillips, Addison" <addison@lab126.com>
CC: Wes Garland <wes@page.ca>, es-discuss <es-discuss@mozilla.org>, "public-script-coord@w3.org" <public-script-coord@w3.org>, "mranney@voxer.com" <mranney@voxer.com>
Message-ID: <4F4177A4.2090404@mozilla.com>

Phillips, Addison wrote:
> Why would converting the existing UCS-2 support to be UTF-16 not be a good idea? There is nothing intrinsically wrong that I can see with that approach and it would be the most compatible with existing scripts, with no special "modes", "flags", or interactions.

Allen proposed essentially this last year. (Some confusion arose in that 
discussion from mixing observable-in-language issues with 
encoding/format/serialization issues, which led to talk of 32-bit 
characters.) As I wrote in the o.p., the proposal drew two objections: 
a big implementation hit, and an incompatible change.

I tackled the second with the BRS and (in detail) mediation across DOM 
window boundaries. This, I believe, takes the sting out of the first (a 
lesser implementation change in light of existing mediation at those 
boundaries).

> Yes, the complexity of supplementary characters (i.e. non-BMP characters) represented as surrogate pairs must still be dealt with.

I'm not sure what you mean. JS today allows (ignoring invalid pairs) 
such surrogates but they count as two indexes and add two to length, not 
one. That is the first problem to fix (ignoring literal escape-notation 
expressiveness).
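
To make the length accounting concrete, here is a minimal sketch. The 
helper name countCodePoints is hypothetical (it is not part of any 
spec or proposal in this thread); it only illustrates what counting 
characters rather than uint16 units would mean.

```javascript
// U+1D306, a supplementary (non-BMP) character, written as a
// surrogate pair using today's escape notation.
var tetragram = "\uD834\uDF06";

// Hypothetical helper: counts characters by treating each valid
// high+low surrogate pair as one character. Unpaired surrogates
// are counted as one character each.
function countCodePoints(s) {
  var count = 0;
  for (var i = 0; i < s.length; i++) {
    var unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < s.length) {
      var next = s.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low surrogate of a valid pair
      }
    }
    count++;
  }
  return count;
}

var unitCount = tetragram.length;           // counts uint16 units
var charCount = countCodePoints(tetragram); // counts characters
```

Here unitCount is 2 while charCount is 1, which is the mismatch 
described above.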

>   It would also expose the possibility of invalid strings (with unpaired surrogates).

That problem exists today.
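
A short sketch of what "exists today" means in practice: a string 
with a lone surrogate is accepted without complaint. The helper 
hasUnpairedSurrogate is a hypothetical name used only for 
illustration.

```javascript
// Hypothetical helper: true if the string contains a surrogate
// code unit that is not part of a valid high+low pair.
function hasUnpairedSurrogate(s) {
  for (var i = 0; i < s.length; i++) {
    var unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      var next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // valid pair, skip the low surrogate
        continue;
      }
      return true; // high surrogate with no low surrogate after it
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return true; // low surrogate with no high surrogate before it
    }
  }
  return false;
}

var lone = "abc\uD834";      // accepted as a four-unit string
var paired = "\uD834\uDF06"; // a valid surrogate pair

var loneIsInvalid = hasUnpairedSurrogate(lone);     // true
var pairedIsInvalid = hasUnpairedSurrogate(paired); // false
```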

>   But this would not be unlike other programming languages--or even ES as it exists today.

Right! We should do better. As I noted, Node.js heavy hitters (mranney 
of Voxer) testify that they want full Unicode, not what's specified 
today with indexing and length-accounting by uint16 storage units.

>   The purity of a "Unicode string" would be watered down, but perhaps not fatally. The Java language went through this (yeah, I know, I know...) and seems to have emerged unscathed.

Java's dead on the client. It is used by botnets (bugzilla.mozilla.org 
recently suffered a DDoS from one; the bad guys didn't even bother 
changing the user-agent from the default one for the Java runtime). See 
Brian Krebs' blog.

>   Norbert has a lovely doc here about the choices that led to this, which seems useful to consider: [1]. W3C I18N Core WG has a wiki page shared with TC39 a while ago here: [2].
>
> To me, switching to UTF-16 seems like a relatively small, containable, non-destructive change to allow supplementary character support.

I still don't know what you mean. How would what you call "switching to 
UTF-16" differ from today, where one can inject surrogates into literals 
by transcoding from an HTML document or .js file CSE?

In particular, what do string indexing and .length count, uint16 units 
or characters?
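
To spell out the two possible answers for indexing, a rough sketch. 
The helper charAtCodePointIndex is hypothetical and only shows what 
character-based indexing could return; it is not a proposed API.

```javascript
var s = "a\uD834\uDF06b"; // "a", U+1D306, "b": four uint16 units

// Hypothetical helper: returns the character at a character index,
// as a string of one or two uint16 units, or "" if out of range.
function charAtCodePointIndex(str, index) {
  var pos = 0;
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    var width = 1;
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        width = 2; // valid surrogate pair
      }
    }
    if (pos === index) {
      return str.substring(i, i + width);
    }
    i += width - 1; // step over the low surrogate, if any
    pos++;
  }
  return "";
}

var byUnit = s.charAt(2);                 // uint16 indexing
var byChar = charAtCodePointIndex(s, 2);  // character indexing
```

With uint16 indexing, byUnit is the low surrogate "\uDF06" on its own; 
with character indexing, byChar is "b".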

/be

Received on Sunday, 19 February 2012 22:29:22 UTC