- From: Phillips, Addison <addison@lab126.com>
- Date: Sun, 19 Feb 2012 10:39:43 -0800
- To: "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Yet another proposal to extend full Unicode support to ECMAScript.

> -----Original Message-----
> From: Brendan Eich [mailto:brendan@mozilla.com]
> Sent: Sunday, February 19, 2012 12:34 AM
> To: es-discuss
> Cc: public-script-coord@w3.org; Isaac Schlueter; mranney@voxer.com
> Subject: New full Unicode for ES6 idea
>
> Once more unto the breach, dear friends!
>
> ES1 dates from when Unicode fit in 16 bits, and in those days, nickels had
> pictures of bumblebees on 'em ("Gimme five bees for a quarter", you'd say ;-).
>
> Clearly that was a while ago. These days, we would like full 21-bit Unicode
> character support in JS. Some (mranney at Voxer) contend that it is a
> requirement.
>
> Full 21-bit Unicode support means all of:
>
> * indexing by characters, not uint16 storage units;
> * counting length as one greater than the last index; and
> * supporting escapes with (up to) six hexadecimal digits.
>
> ES4 saw bold proposals, including Lars Hansen's, to allow implementations to
> change string indexing and length incompatibly, and let Darwin sort it out. I
> recall that was when we agreed to support "\u{XXXXXX}" as an extension for
> spelling non-BMP characters.
>
> Allen's strawman from last year,
> http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings,
> proposed a brute-force change to support full Unicode (albeit with too many
> hex digits allowed in "\u{...}"), observing that "There are very few places
> where the ECMAScript specification has actual dependencies upon the size
> of individual characters so the compatibility impact of supporting full Unicode
> is quite small." But two problems remained:
>
> P1. As Allen wrote, "There is a larger impact on actual implementations", and
> no implementors that I can recall were satisfied that the cost was acceptable.
> It might be, we just didn't know, and there are enough signs of high cost to
> create this concern.
>
> P2. The change is not backward compatible.
> In JS today, one reads a string s
> from somewhere and hard-codes, e.g., s.indexOf("\ud800") to find part of a
> surrogate pair, then advances to the next-indexed uint16 unit and reads the
> other half, then combines the two to compute some result. Such usage would break.
>
> Example from Allen:
>
> var c = "😁" // where the single character between the quotes is the Unicode
> character U+1f638
>
> c.length == 2;
> c === "\ud83d\ude38"; // the two-character UTF-16 encoding of 0x1f638
> c.charCodeAt(0) == 0xd83d;
> c.charCodeAt(1) == 0xde38;
>
> (Allen points out how browsers, node.js, and other environments blindly
> handle UTF-8 or whatever incoming format, recoding to UTF-16 upstream of
> the JS engine, so the above actually works without any spec language in
> ECMA-262 saying it should.)
>
> So based on a recent twitter/github exchange, gist recorded at
> https://gist.github.com/1850768, I would like to propose a variation on
> Allen's proposal that resolves both of these problems. Here are resolutions in
> reverse order:
>
> R2. No incompatible change without opt-in. If you hardcode as in Allen's
> example, don't opt in without changing your index, length, and char/code-at
> assumptions.
>
> Such opt-in cannot be a pragma, since pragmas have lexical scope and affect
> code, not the heap where strings and String.prototype methods live.
>
> We also wish to avoid exposing a "full Unicode" representation type and a
> duplicated suite of the String static and prototype methods, as Java did. (We
> may well want UTF-N transcoding helpers; we certainly want ByteArray <->
> UTF-8 transcoding APIs.)
>
> True, R2 implies there are at most two string primitive representations, or
> more likely "1.x" for some fraction .x. Say, a flag bit in the string header to
> distinguish JS's uint16-based indexing ("UCS-2") from non-O(1)-indexing
> UTF-16. Lots of non-observable implementation options here.
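[Editor's note: a minimal sketch of the distinction the post is drawing, runnable in the pre-ES6 JS it describes. It counts "characters" (code points) by hand over today's uint16-indexed strings; the function names are illustrative, not part of the proposal.]

```javascript
// True if code unit u is a high (leading) surrogate.
function isHighSurrogate(u) {
  return u >= 0xd800 && u <= 0xdbff;
}

// Count code points, treating a valid surrogate pair as one character.
function codePointLength(s) {
  var count = 0;
  for (var i = 0; i < s.length; i++) {
    count++;
    if (isHighSurrogate(s.charCodeAt(i)) && i + 1 < s.length) {
      var lo = s.charCodeAt(i + 1);
      if (lo >= 0xdc00 && lo <= 0xdfff) i++; // skip the trailing surrogate
    }
  }
  return count;
}

var c = "\ud83d\ude38";           // U+1F638 as a surrogate pair
console.log(c.length);            // 2  (uint16 units: today's semantics)
console.log(codePointLength(c));  // 1  (characters: the proposed semantics)
```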
>
> Instead of any such *big* new observables, I propose a so-called "Big Red
> [opt-in] Switch" (BRS) on the side of a unit of VM isolation:
> specifically, the global object.
>
> Why the global object? Because for many VMs, each global has its own heap
> or sub-heap ("compartment"), and all references outside that heap are to
> local proxies that copy from, or in the case of immutable data, reference, the
> remote heap. Also because inter-compartment traffic is (we
> conjecture) infrequent enough to tolerate the proxy/copy overhead.
>
> For strings and String objects, such proxies would consult the remote heap's
> BRS setting and transcode indexed access, and .length gets, accordingly. It
> doesn't matter whether the BRS is on the global or its String constructor or
> String.prototype, as the latter are unforgeably linked to the global.
>
> This means a script intent on comparing strings from two globals with
> different BRS settings could indeed tell that one discloses non-BMP
> char/codes, e.g. charCodeAt return values >= 0x10000. This is the
> *small* new observable I claim we can live with, because someone opted
> into it in at least one of the related global objects.
>
> Note that implementations such as Node.js can pre-set the BRS to "full
> Unicode" at startup. Embeddings that fully isolate each global and its
> reachable objects and strings pay no string-proxy or -copy overhead.
>
> R1. To keep compatibility with DOM APIs, the DOM glue used to mediate calls
> from JS to (typically) C++ would have to proxy or copy any strings containing
> non-BMP characters. Strings with only BMP characters would work as today.
>
> Note that we are dealing only in spec observables here. It doesn't matter
> whether the JS engine uses UTF-8 and the DOM UCS-2 (in which case there is
> already a transcoding penalty; IIRC WebKit's libxml and libxslt use UTF-8 and so
> must transcode to interface with WebKit's DOM).
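[Editor's note: a hypothetical sketch of what such a boundary string proxy might present: full-Unicode indexing and .length over an underlying UTF-16 string, as a BRS-on compartment would see a BRS-off string. The names and eager decoding are illustrative only; a real engine would likely keep the uint16 vector and transcode lazily.]

```javascript
function FullUnicodeView(utf16) {
  // Decode the UTF-16 code units into an array of code points once.
  this.codePoints = [];
  for (var i = 0; i < utf16.length; i++) {
    var hi = utf16.charCodeAt(i);
    if (hi >= 0xd800 && hi <= 0xdbff && i + 1 < utf16.length) {
      var lo = utf16.charCodeAt(i + 1);
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        this.codePoints.push(0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00));
        i++; // the pair is one character
        continue;
      }
    }
    this.codePoints.push(hi); // BMP code unit, or an unpaired surrogate
  }
  this.length = this.codePoints.length; // one greater than the last index
}

// Indexed access returns a code point, possibly >= 0x10000 -- the "small
// new observable" the post describes.
FullUnicodeView.prototype.charCodeAt = function (index) {
  return this.codePoints[index];
};

var v = new FullUnicodeView("a\ud83d\ude38b");
console.log(v.length);          // 3, not 4
console.log(v.charCodeAt(1));   // 128568 (0x1f638), >= 0x10000
```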
> The only issue at this
> boundary, I believe, is how indexing and .length work.
>
> Ok, there you have it: resolutions for both problems that killed the last
> assault on Castle '90s-JS.
>
> Implementations that use uint16 vectors as the character data
> representation type for both "UCS-2" and "UTF-16" string variants would
> probably want another flag bit per string header indicating whether, for the
> UTF-16 case, the string indeed contained any non-BMP characters. If not, no
> proxy/copy is needed.
>
> Such implementations would probably benefit from string (primitive
> value) proxies, not just copies, since the underlying uint16 vector could be
> shared by two different string headers with whatever metadata flag bits, etc.,
> are needed to disclose different length values, access different methods
> from distinct globals' String.prototype objects, and so on.
>
> We could certainly also work with the W3C to revise the DOM to check the
> BRS setting, if that is possible, to avoid this non-BMP-string proxy/copy
> overhead.
>
> How is the BRS configured? Again, not via a pragma, and not by imperative
> state update inside the language (mutating hidden BRS state at a given
> program point could leave strings created before the mutation observably
> different from those created after, unless the implementation in effect
> scanned the local heap and wrapped or copied any non-BMP-char-bearing
> ones created before).
>
> The obvious way to express the BRS in HTML is a <meta> tag in document
> <head>, but I don't want to get hung up on this point. I do welcome expert
> guidance. Here is another W3C/WHATWG interaction point. For this reason
> I'm cc'ing public-script-coord.
>
> The upshot of this proposal is to get JS out of the '90s without a mandatory
> breaking change.
> With simple-enough opt-in expressed at coarse-enough
> boundaries so as not to impose high cost or unintended string-type-confusion
> bugs, the complexity is mostly borne by implementors, and at less than a 2x
> cost comparing string implementations (I think -- demonstration required, of
> course).
>
> In particular, Node.js can get modern at startup, and perhaps engines such as
> V8 as used in Node could even support compile-time (#ifdef) configury by
> which to support only full Unicode.
>
> Comments welcome.
>
> /be
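[Editor's note: the "(up to) six hexadecimal digits" escape from the bullet list above can be sketched as a small transcoder from a "\u{...}" payload to the UTF-16 code units a pre-ES6 engine actually stores. The function name is illustrative only; nothing below is from the proposal itself.]

```javascript
// Turn the hex payload of a "\u{XXXXXX}" escape (a 21-bit code point)
// into today's uint16 storage units.
function escapeToUtf16Units(hex) {
  var cp = parseInt(hex, 16);
  if (!(cp >= 0) || cp > 0x10ffff) {
    throw new RangeError("code point out of range: " + hex);
  }
  if (cp <= 0xffff) return [cp]; // BMP: a single uint16 unit
  cp -= 0x10000;
  return [0xd800 + (cp >> 10), 0xdc00 + (cp & 0x3ff)]; // surrogate pair
}

console.log(escapeToUtf16Units("41"));    // [0x41] -- "\u{41}" is "A"
console.log(escapeToUtf16Units("1f638")); // [0xd83d, 0xde38]
```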
Received on Sunday, 19 February 2012 18:40:14 UTC