FW: New full Unicode for ES6 idea

Yet another proposal to extend full Unicode support to ECMAScript.

> -----Original Message-----
> From: Brendan Eich [mailto:brendan@mozilla.com]
> Sent: Sunday, February 19, 2012 12:34 AM
> To: es-discuss
> Cc: public-script-coord@w3.org; Isaac Schlueter; mranney@voxer.com
> Subject: New full Unicode for ES6 idea
> Once more unto the breach, dear friends!
> ES1 dates from when Unicode fit in 16 bits, and in those days, nickels had
> pictures of bumblebees on 'em ("Gimme five bees for a quarter", you'd say
> ;-).
> Clearly that was a while ago. These days, we would like full 21-bit Unicode
> character support in JS. Some (mranney at Voxer) contend that it is a
> requirement.
> Full 21-bit Unicode support means all of:
> * indexing by characters, not uint16 storage units;
> * counting length as one greater than the last index; and
> * supporting escapes with (up to) six hexadecimal digits.
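
A minimal sketch of what the three bullets above mean in practice, assuming an engine with later additions to the language (string iteration by code point, String.prototype.codePointAt, and the "\u{...}" escape). The helper names codePointLength and codePointAtIndex are hypothetical and are not part of the proposal; they only emulate character-based indexing and length on top of uint16-indexed strings.

```javascript
// Hypothetical helpers, not from the proposal. Assumes an ES2015+ engine.

// Length counted in code points rather than uint16 code units.
function codePointLength(s) {
  let n = 0;
  for (const _ of s) n++; // string iteration yields whole code points
  return n;
}

// Code point at a character index (a linear scan over the UTF-16 data).
function codePointAtIndex(s, index) {
  let i = 0;
  for (const ch of s) {
    if (i === index) return ch.codePointAt(0);
    i++;
  }
  return undefined; // index past the last character
}

const s = "a\u{1F638}b"; // escape with more than four hex digits

console.log(s.length);                            // 4 (uint16 units)
console.log(codePointLength(s));                  // 3 (characters)
console.log(codePointAtIndex(s, 1).toString(16)); // 1f638
```

The linear scan in codePointAtIndex is the kind of indexing cost that comes up below as the non-O(1)-indexing concern.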
> ES4 saw bold proposals including Lars Hansen's, to allow implementations to
> change string indexing and length incompatibly, and let Darwin sort it out. I
> recall that was when we agreed to support "\u{XXXXXX}" as an extension for
> spelling non-BMP characters.
> Allen's strawman from last year,
> http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings,
> proposed a brute-force change to support full Unicode (albeit with too many
> hex digits allowed in "\u{...}"), observing that "There are very few places
> where the ECMAScript specification has actual dependencies upon the size
> of individual characters so the compatibility impact of supporting full Unicode
> is quite small." But two problems remained:
> P1. As Allen wrote, "There is a larger impact on actual implementations", and
> no implementors that I can recall were satisfied that the cost was acceptable.
> It might be, we just didn't know, and there are enough signs of high cost to
> create this concern.
> P2. The change is not backward compatible. In JS today, one can read a string s
> from somewhere and hard-code, e.g., s.indexOf("\ud800") to find part of a
> surrogate pair, then advance to the next-indexed uint16 unit and read the
> other half, then combine to compute some result. Such usage would break.
> Example from Allen:
> var c = "😸"; // where the single character between the quotes is the Unicode
>              // character U+1f638
> c.length == 2;
> c === "\ud83d\ude38"; // the two-character UTF-16 encoding of 0x1f638
> c.charCodeAt(0) == 0xd83d;
> c.charCodeAt(1) == 0xde38;
> (Allen points out how browsers, node.js, and other environments blindly
> handle UTF-8 or whatever incoming format recoding to UTF-16 upstream of
> the JS engine, so the above actually works without any spec-language in
> ECMA-262 saying it should.)
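
The hard-coded pattern described in P2 can be sketched as follows. This is only an illustration of the kind of existing code at risk, assuming today's uint16 indexing; decodePairAt is a hypothetical name and does not come from Allen's strawman.

```javascript
// Hypothetical helper showing the surrogate-pair arithmetic that P2 says
// existing code hard-codes. It relies on uint16 ("UCS-2") indexing.
function decodePairAt(s, i) {
  const hi = s.charCodeAt(i);
  if (hi < 0xd800 || hi > 0xdbff) return hi;      // not a high surrogate
  const lo = s.charCodeAt(i + 1);                 // advance to the next unit
  if (!(lo >= 0xdc00 && lo <= 0xdfff)) return hi; // unpaired high surrogate
  // Combine the two halves into one code point.
  return 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00);
}

const c = "\ud83d\ude38";
console.log(decodePairAt(c, 0).toString(16)); // 1f638
```

If indexing switched to characters without opt-in, the string would have no index 1 and charCodeAt(0) would presumably already return 0x1f638, so code that goes looking for the two halves would silently stop finding them.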
> So based on a recent twitter/github exchange, gist recorded at
> https://gist.github.com/1850768, I would like to propose a variation on
> Allen's proposal that resolves both of these problems. Here are resolutions in
> reverse order:
> R2. No incompatible change without opt-in. If you hardcode as in Allen's
> example, don't opt in without changing your index, length, and char/code-at
> assumptions.
> Such opt-in cannot be a pragma since those have lexical scope and affect
> code, not the heap where strings and String.prototype methods live.
> We also wish to avoid exposing a "full Unicode" representation type and
> duplicated suite of the String static and prototype methods, as Java did. (We
> may well want UTF-N transcoding helpers; we certainly want ByteArray <->
> UTF-8 transcoding APIs.)
> True, R2 implies there are two string primitive representations at most, or
> more likely "1.x" for some fraction .x. Say, a flag bit in the string header to
> distinguish JS's uint16-based indexing ("UCS-2") from non-O(1)-indexing
> UTF-16. Lots of non-observable implementation options here.
> Instead of any such *big* new observables, I propose a so-called "Big Red
> [opt-in] Switch" (BRS) on the side of a unit of VM isolation:
> specifically the global object.
> Why the global object? Because for many VMs, each global has its own heap
> or sub-heap ("compartment"), and all references outside that heap are to
> local proxies that copy from, or in the case of immutable data, reference the
> remote heap. Also because inter-compartment traffic is (we
> conjecture) infrequent enough to tolerate the proxy/copy overhead.
> For strings and String objects, such proxies would consult the remote heap's
> BRS setting and transcode indexed access, and .length gets, accordingly. It
> doesn't matter if the BRS is in the global or its String constructor or
> String.prototype, as the latter are unforgeably linked to the global.
> This means a script intent on comparing strings from two globals with
> different BRS settings could indeed tell that one discloses non-BMP
> char/codes, e.g. charCodeAt return values >= 0x10000. This is the
> *small* new observable I claim we can live with, because someone opted
> into it at least in one of the related global objects.
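
A rough user-level sketch of what such a proxy might disclose, assuming the remote string is stored as UTF-16 and the local global has opted in to full Unicode. CodePointView is a hypothetical class made up for illustration; in the proposal this transcoding would happen inside the engine, not in script.

```javascript
// Hypothetical sketch only: presents a UTF-16 string with character-based
// .length and charCodeAt, as a full-Unicode global might see it.
class CodePointView {
  constructor(utf16) {
    this.source = utf16;
    this.offsets = []; // uint16 offset at which each character starts
    for (let i = 0; i < utf16.length; ) {
      this.offsets.push(i);
      // A non-BMP character occupies two uint16 units, anything else one.
      i += utf16.codePointAt(i) > 0xffff ? 2 : 1;
    }
  }

  get length() {
    return this.offsets.length;
  }

  charCodeAt(index) {
    if (!Number.isInteger(index) || index < 0 || index >= this.offsets.length) {
      return NaN; // mirrors String.prototype.charCodeAt for out-of-range indexes
    }
    return this.source.codePointAt(this.offsets[index]);
  }
}

const remote = "x\ud83d\ude38";
const view = new CodePointView(remote);

console.log(remote.length, view.length);       // 3 2
console.log(view.charCodeAt(1) >= 0x10000);    // true
```

The offsets table is one possible way to get O(1) indexing back at the cost of extra memory; the message leaves such choices to implementations.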
> Note that implementations such as Node.js can pre-set the BRS to "full
> Unicode" at startup. Embeddings that fully isolate each global and its
> reachable objects and strings pay no string-proxy or -copy overhead.
> R1. To keep compatibility with DOM APIs, the DOM glue used to mediate calls
> from JS to (typically) C++ would have to proxy or copy any strings containing
> non-BMP characters. Strings with only BMP characters would work as today.
> Note that we are dealing only in spec observables here. It doesn't matter
> whether the JS engine uses UTF-8 and the DOM UCS-2 (in which case there is
> already a transcoding penalty; IIRC WebKit libxml and libxslt use UTF-8 and so
> must transcode to interface with WebKit's DOM). The only issue at this
> boundary, I believe, is how indexing and .length work.
> Ok, there you have it: resolutions for both problems that killed the last
> assault on Castle '90s-JS.
> Implementations that use uint16 vectors as the character data
> representation type for both "UCS-2" and "UTF-16" string variants would
> probably want another flag bit per string header indicating whether, for the
> UTF-16 case, the string indeed contained any non-BMP characters. If not, no
> proxy/copy needed.
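
The per-string flag described above could be computed with a single scan, sketched here in script form. containsNonBMP is a hypothetical name, and treating unpaired surrogates as BMP-only content is an assumption of this sketch, not something the message specifies.

```javascript
// Hypothetical sketch: true if the uint16 data holds at least one
// well-formed surrogate pair, i.e. at least one non-BMP character.
function containsNonBMP(s) {
  for (let i = 0; i < s.length - 1; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) return true;
    }
  }
  return false; // BMP-only: no proxy or copy would be needed
}

console.log(containsNonBMP("plain BMP text")); // false
console.log(containsNonBMP("a\ud83d\ude38"));  // true
```

An engine would more likely set the bit once when the string is created rather than rescanning on each boundary crossing.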
> Such implementations probably would benefit from string (primitive
> value) proxies not just copies, since the underlying uint16 vector could be
> shared by two different string headers with whatever metadata flag bits, etc.,
> are needed to disclose different length values, access different methods
> from distinct globals' String.prototype objects, etc.
> We could certainly also work with the W3C to revise the DOM to check the
> BRS setting, if that is possible, to avoid this non-BMP-string proxy/copy
> overhead.
> How is the BRS configured? Again, not via a pragma, and not by imperative
> state update inside the language (mutating hidden BRS state at a given
> program point could leave strings created before mutation observably
> different from those created after, unless the implementation in effect
> scanned the local heap and wrapped or copied any non-BMP-char-bearing
> ones created before).
> The obvious way to express the BRS in HTML is a <meta> tag in document
> <head>, but I don't want to get hung up on this point. I do welcome expert
> guidance. Here is another W3C/WHATWG interaction point. For this reason
> I'm cc'ing public-script-coord.
> The upshot of this proposal is to get JS out of the '90s without a mandatory
> breaking change. With simple-enough opt-in expressed at coarse-enough
> boundaries so as not to impose high cost or unintended string type confusion
> bugs, the complexity is mostly borne by implementors, and at less than a 2x
> cost comparing string implementations (I think -- demonstration required of
> course).
> In particular, Node.js can get modern at startup, and perhaps engines such as
> V8 as used in Node could even support compile-time (#ifdef) configury by
> which to support only full Unicode.
> Comments welcome.
> /be

Received on Sunday, 19 February 2012 18:40:14 UTC