RE: New full Unicode for ES6 idea

Why would converting the existing UCS-2 support to be UTF-16 not be a good idea? There is nothing intrinsically wrong that I can see with that approach and it would be the most compatible with existing scripts, with no special "modes", "flags", or interactions. 

Yes, the complexity of supplementary characters (i.e. non-BMP characters) represented as surrogate pairs must still be dealt with. It would also expose the possibility of invalid strings (with unpaired surrogates). But this would not be unlike other programming languages--or even ES as it exists today. The purity of a "Unicode string" would be watered down, but perhaps not fatally. The Java language went through this (yeah, I know, I know...) and seems to have emerged unscathed. Norbert has a lovely doc here about the choices that led to this, which seems useful to consider: [1]. W3C I18N Core WG has a wiki page that was shared with TC39 a while ago here: [2].
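
To make the surrogate-pair and unpaired-surrogate points concrete, here is a minimal sketch of what a script already sees in ES as it exists today. The helper names (codePointFromPair, hasUnpairedSurrogate) are hypothetical and only for illustration; nothing here is a proposed API.

```javascript
// G clef, U+1D11E, written as its UTF-16 surrogate pair.
const clef = "\uD834\uDD1E";

// Today's strings count 16-bit code units, so one supplementary
// character has length 2.
console.log(clef.length);                      // 2
console.log(clef.charCodeAt(0).toString(16));  // "d834"

// Hypothetical helper: combine a high and a low surrogate into a code point.
function codePointFromPair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

// Hypothetical helper: report whether a string contains a surrogate
// that is not part of a well-formed high+low pair.
function hasUnpairedSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // well-formed pair, skip the low surrogate
        continue;
      }
      return true; // high surrogate with no low surrogate after it
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return true; // low surrogate with no high surrogate before it
    }
  }
  return false;
}

// Splitting a pair yields a perfectly legal, but ill-formed, string.
const lone = clef.charAt(0);
console.log(hasUnpairedSurrogate(clef)); // false
console.log(hasUnpairedSurrogate(lone)); // true
```

Under a UTF-16 reading of the existing strings, none of this behavior changes; the pair is simply understood to mean U+1D11E.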

To me, switching to UTF-16 seems like a relatively small, containable, non-destructive change to allow supplementary character support. It's not as pure as a true code-point-based "Unicode string" solution. But purity isn't everything.

What am I missing?

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG) <--- hat is OFF in this message

Internationalization is not a feature.
It is an architecture.

[1] http://java.sun.com/developer/technicalArticles/Intl/Supplementary/

[2] http://www.w3.org/International/wiki/JavaScriptInternationalization





> -----Original Message-----
> From: Brendan Eich [mailto:brendan@mozilla.com]
> Sent: Sunday, February 19, 2012 1:34 PM
> To: Wes Garland
> Cc: es-discuss; public-script-coord@w3.org; mranney@voxer.com
> Subject: Re: New full Unicode for ES6 idea
> 
> Wes Garland wrote:
> > Is there a proposal for interaction with JSON?
> 
>  From http://www.ietf.org/rfc/rfc4627, 2.5:
> 
>     To escape an extended character that is not in the Basic Multilingual
>     Plane, the character is represented as a twelve-character sequence,
>     encoding the UTF-16 surrogate pair.  So, for example, a string
>     containing only the G clef character (U+1D11E) may be represented as
>     "\uD834\uDD1E".
> 
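
As a small illustration of the RFC 4627 text quoted above, the following sketch parses the twelve-character escape with the standard JSON.parse and inspects the result as today's engines expose it. The variable names are only illustrative.

```javascript
// The twelve-character escape sequence from RFC 4627, section 2.5.
const escaped = "\\uD834\\uDD1E";
console.log(escaped.length); // 12

// Wrap it in quotes to form a complete JSON text, then parse it.
const jsonText = '"' + escaped + '"';
const parsed = JSON.parse(jsonText);

// The parsed value is the surrogate pair for U+1D11E: two code units.
console.log(parsed.length);                  // 2
console.log(parsed === "\uD834\uDD1E");      // true
console.log(parsed.charCodeAt(0) === 0xD834); // true
console.log(parsed.charCodeAt(1) === 0xDD1E); // true
```
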
> >
> >     Also because inter-compartment traffic is (we conjecture)
> >     infrequent enough to tolerate the proxy/copy overhead.
> >
> >
> > Not to mention that the only thing you'd have to do is to tweak
> > [[get]], charCodeAt and .length when crossing boundaries; you can keep
> > the same backing store.
> 
> String methods are not generally self-hosted, so internal C++ vector access
> would need to change depending on the string's flag bit, in this
> implementation approach.
> 
> > You might not even need to do this if the engine keeps the same
> > backing store for both kinds of strings.
> 
> Yes, sharing the uint16 vector is good. But string methods would have to
> index and .length differently (if I can verb .length ;-).
> 
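
A rough sketch of the "same uint16 vector, different indexing and length" point above. This is not how any particular engine works; the Uint16Array stands in for an internal backing store, and the function names (lengthInCodeUnits, lengthInCodePoints, codePointAtIndex) are hypothetical.

```javascript
// Assumed shared backing store: UTF-16 code units for "a", U+1D11E, "b".
const units = new Uint16Array([0x0061, 0xD834, 0xDD1E, 0x0062]);

function isHighSurrogate(u) { return u >= 0xD800 && u <= 0xDBFF; }
function isLowSurrogate(u) { return u >= 0xDC00 && u <= 0xDFFF; }

// True when a well-formed surrogate pair starts at position i.
function pairAt(store, i) {
  return isHighSurrogate(store[i]) &&
         i + 1 < store.length &&
         isLowSurrogate(store[i + 1]);
}

// View 1 (today's behavior): length counts code units.
function lengthInCodeUnits(store) {
  return store.length;
}

// View 2 (full Unicode): a well-formed pair counts as one character.
function lengthInCodePoints(store) {
  let count = 0;
  let i = 0;
  while (i < store.length) {
    i += pairAt(store, i) ? 2 : 1;
    count++;
  }
  return count;
}

// View 2 indexing: linear scan to the index-th code point.
// Unpaired surrogates are returned as-is.
function codePointAtIndex(store, index) {
  let i = 0;
  for (let n = 0; i < store.length; n++) {
    const paired = pairAt(store, i);
    if (n === index) {
      return paired
        ? (store[i] - 0xD800) * 0x400 + (store[i + 1] - 0xDC00) + 0x10000
        : store[i];
    }
    i += paired ? 2 : 1;
  }
  return undefined; // index out of range
}

console.log(lengthInCodeUnits(units));                  // 4
console.log(lengthInCodePoints(units));                 // 3
console.log(codePointAtIndex(units, 1).toString(16));   // "1d11e"
```

The linear scan is only the simplest way to show the difference; a real implementation would presumably want something cheaper than O(n) indexing.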
> >     This means a script intent on comparing strings from two globals
> >     with different BRS settings could indeed tell that one discloses
> >     non-BMP char/codes, e.g. charCodeAt return values >= 0x10000. This
> >     is the *small* new observable I claim we can live with, because
> >     someone opted into it at least in one of the related global objects.
> >
> >
> > Funny question, if I have two strings, both "hello", from two globals
> > with different BRS settings,  are they ==? How about ===?
> 
> Of course, strings with the same characters are == and ===. Strings appear to
> be values. If you think of them as immutable reference types there's still an
> obligation to compare characters for strings because computed strings are
> not intern'ed.
> 
> >     R1. To keep compatibility with DOM APIs, the DOM glue used to
> >     mediate calls from JS to (typically) C++ would have to proxy or
> >     copy any strings containing non-BMP characters. Strings with only
> >     BMP characters would work as today.
> >
> >
> > Is that true if the "full unicode" backing store is 16-bit code units
> > using UTF-16 encoding?  (Any way, it's an implementation detail)
> 
> Yes, because DOMString has intrinsic length and indexing notions and these
> must (pending any coordination with w3c) remain ignorant of the BRS and
> livin' in the '90s (DOM too emerged in the UCS-2 era).
> 
> /be

Received on Sunday, 19 February 2012 22:02:21 UTC