- From: Norbert Lindenberg <ecmascript@norbertlindenberg.com>
- Date: Tue, 21 Feb 2012 18:05:24 -0800
- To: Brendan Eich <brendan@mozilla.com>, mranney@voxer.com, i@izs.me
- Cc: Norbert Lindenberg <ecmascript@norbertlindenberg.com>, es-discuss <es-discuss@mozilla.org>, public-script-coord@w3.org
I'll reply to Brendan's proposal in two parts: first about the goals for supplementary character support, second about the BRS.

> Full 21-bit Unicode support means all of:
>
> * indexing by characters, not uint16 storage units;
> * counting length as one greater than the last index; and
> * supporting escapes with (up to) six hexadecimal digits.

For me, full 21-bit Unicode support has a different priority list. First come the essentials: Regular expressions; functions that interpret strings; the overall sense that all Unicode characters are supported.

1) Regular expressions must recognize supplementary characters as atomic entities, and interpret them according to Unicode semantics. Look at the contortions one has to go through currently to describe a simple character class that includes supplementary characters:
https://github.com/roozbehp/yui3-gallery/blob/master/src/gallery-intl-bidi/js/intl-bidi.js
Read up on why it has to be done this way, and see to what extremes some people are going to make supplementary characters work despite ECMAScript:
http://inimino.org/~inimino/blog/javascript_cset
Now, try to figure out how you'd convert a user-entered string to a regular expression such that you can search for the string without case distinction, where the string may contain supplementary characters such as "𐐶𐐲𐑌" (Deseret for "one"). Regular expressions matter a lot here because, if done properly, they eliminate much of the need for iterating over strings manually.

2) Built-in functions that interpret strings have to recognize supplementary characters as atomic entities and interpret them according to their Unicode semantics. The list of functions in ES5 that violate this principle is actually rather short: Besides the String functions relying on regular expressions (match, replace, search, split), they're the String case conversion functions (toLowerCase, toLocaleLowerCase, toUpperCase, toLocaleUpperCase) and the relational comparison for strings (11.8.5). But the principle is also important for new functionality being considered for ES6 and above. (A concrete sketch of this gap follows after this list.)

3) It must be clear that the full Unicode character set is allowed and supported. This means at least getting rid of the reference to UCS-2 (clause 2) and the bizarre equivalence between characters and UTF-16 code units (clause 6). ECMAScript has already defined several ways to create UTF-16 strings containing supplementary characters (parsing UTF-8 source; using Unicode escapes for surrogate pairs), and lets applications freely pass around such strings. Browsers have surrounded ECMAScript implementations with text input, text rendering, DOM APIs, and XMLHttpRequest with full Unicode support, and generally use full UTF-16 to exchange text with their ECMAScript subsystem. Developers have used this to build applications that support supplementary characters, hacking around the remaining gaps in ECMAScript as seen above. But, as in the bug report that Brendan pointed to this morning (http://code.google.com/p/v8/issues/detail?id=761), the mention of UCS-2 is still used by some to excuse bugs.

Only after these essentials come the niceties of String representation and Unicode escapes:

4) 1 String element to 1 Unicode code point is indeed a very nice and desirable relationship. Unlike Java, where binary compatibility between virtual machines made a change from UTF-16 to UTF-32 impossible, JavaScript needs to be compatible only at the source code level - or maybe, with a BRS, not even that.

5) If we don't go for UTF-32, then there should be a few functions to simplify access to strings in terms of code points, such as String.fromCodePoint, String.prototype.codePointAt (one possible shape is sketched after this list).

6) I strongly prefer the use of plain characters over Unicode escapes in source code, because plain text is much easier to read than sequences of hex values. However, the need for Unicode escapes is greater in the space of supplementary characters because here we often have to reference characters for which our operating systems don't have glyphs yet. And \u{1D11E} certainly makes it easier to cross-reference a character than \uD834\uDD1E. The new escape syntax therefore should be on the list, at low priority.
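
To make points 1) and 2) concrete, here is a minimal illustrative sketch of how an ES5 engine without supplementary-character support sees the Deseret word above (the escapes below are the UTF-16 surrogate pairs for U+10436, U+10432, and U+1044C; the variable names are just for illustration):

var one = "\uD801\uDC36\uD801\uDC32\uD801\uDC4C";  // "𐐶𐐲𐑌", three characters
one.length;                        // 6 -- counted in uint16 code units, not characters

// A character class covering the Deseret block (U+10400..U+1044F) cannot be
// written as a range of characters; it has to be spelled out as a
// surrogate-pair pattern:
var deseretLetter = /\uD801[\uDC00-\uDC4F]/;
deseretLetter.test(one);           // true, but the pattern depends on UTF-16 encoding details

// Case conversion operates on individual code units, and lone surrogates
// have no case mappings, so the Deseret case mappings are never applied:
one.toUpperCase() === one;         // true in such an engine; the uppercase forms are never produced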
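
For point 5), one possible shape for such a codePointAt function, sketched purely as an illustration on top of the existing charCodeAt (this is not an existing API; the exact semantics would be up to the proposal):

// Illustrative only: combine a surrogate pair into a single code point.
function codePointAt(s, i) {
  var first = s.charCodeAt(i);
  if (first >= 0xD800 && first <= 0xDBFF && i + 1 < s.length) {
    var second = s.charCodeAt(i + 1);
    if (second >= 0xDC00 && second <= 0xDFFF) {
      return (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000;
    }
  }
  return first;  // BMP character or unpaired surrogate
}

codePointAt("\uD801\uDC36", 0).toString(16);  // "10436"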

I think it would help if other people involved in this discussion also clarified what exactly their requirements are for "full Unicode support".

Norbert


On Feb 19, 2012, at 0:33 , Brendan Eich wrote:

> Once more unto the breach, dear friends!
>
> ES1 dates from when Unicode fit in 16 bits, and in those days, nickels had pictures of bumblebees on 'em ("Gimme five bees for a quarter", you'd say ;-).
>
> Clearly that was a while ago. These days, we would like full 21-bit Unicode character support in JS. Some (mranney at Voxer) contend that it is a requirement.
>
> Full 21-bit Unicode support means all of:
>
> * indexing by characters, not uint16 storage units;
> * counting length as one greater than the last index; and
> * supporting escapes with (up to) six hexadecimal digits.
>
> ES4 saw bold proposals including Lars Hansen's, to allow implementations to change string indexing and length incompatibly, and let Darwin sort it out. I recall that was when we agreed to support "\u{XXXXXX}" as an extension for spelling non-BMP characters.
>
> Allen's strawman from last year, http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings, proposed a brute-force change to support full Unicode (albeit with too many hex digits allowed in "\u{...}"), observing that "There are very few places where the ECMAScript specification has actual dependencies upon the size of individual characters so the compatibility impact of supporting full Unicode is quite small." But two problems remained:
>
> P1. As Allen wrote, "There is a larger impact on actual implementations", and no implementors that I can recall were satisfied that the cost was acceptable. It might be, we just didn't know, and there are enough signs of high cost to create this concern.
>
> P2. The change is not backward compatible. In JS today, one can read a string s from somewhere and hard-code, e.g., s.indexOf("\ud800") to find part of a surrogate pair, then advance to the next-indexed uint16 unit and read the other half, then combine to compute some result. Such usage would break.
>
> Example from Allen:
>
> var c = "😸" // where the single character between the quotes is the Unicode character U+1F638
>
> c.length == 2;
> c === "\ud83d\ude38"; // the two-character UTF-16 encoding of 0x1f638
> c.charCodeAt(0) == 0xd83d;
> c.charCodeAt(1) == 0xde38;
>
> (Allen points out how browsers, node.js, and other environments blindly handle UTF-8 or whatever incoming format, recoding to UTF-16 upstream of the JS engine, so the above actually works without any spec-language in ECMA-262 saying it should.)
>
> So based on a recent twitter/github exchange, gist recorded at https://gist.github.com/1850768, I would like to propose a variation on Allen's proposal that resolves both of these problems. Here are resolutions in reverse order:
>
> R2. No incompatible change without opt-in. If you hardcode as in Allen's example, don't opt in without changing your index, length, and char/code-at assumptions.
>
> Such opt-in cannot be a pragma since those have lexical scope and affect code, not the heap where strings and String.prototype methods live.
>
> We also wish to avoid exposing a "full Unicode" representation type and duplicated suite of the String static and prototype methods, as Java did. (We may well want UTF-N transcoding helpers; we certainly want ByteArray <-> UTF-8 transcoding APIs.)
>
> True, R2 implies there are two string primitive representations at most, or more likely "1.x" for some fraction .x. Say, a flag bit in the string header to distinguish JS's uint16-based indexing ("UCS-2") from non-O(1)-indexing UTF-16. Lots of non-observable implementation options here.
>
> Instead of any such *big* new observables, I propose a so-called "Big Red [opt-in] Switch" (BRS) on the side of a unit of VM isolation: specifically the global object.
>
> Why the global object? Because for many VMs, each global has its own heap or sub-heap ("compartment"), and all references outside that heap are to local proxies that copy from, or in the case of immutable data, reference the remote heap. Also because inter-compartment traffic is (we conjecture) infrequent enough to tolerate the proxy/copy overhead.
>
> For strings and String objects, such proxies would consult the remote heap's BRS setting and transcode indexed access, and .length gets, accordingly. It doesn't matter if the BRS is in the global or its String constructor or String.prototype, as the latter are unforgeably linked to the global.
>
> This means a script intent on comparing strings from two globals with different BRS settings could indeed tell that one discloses non-BMP char/codes, e.g. charCodeAt return values >= 0x10000. This is the *small* new observable I claim we can live with, because someone opted into it at least in one of the related global objects.
>
> Note that implementations such as Node.js can pre-set the BRS to "full Unicode" at startup. Embeddings that fully isolate each global and its reachable objects and strings pay no string-proxy or -copy overhead.
>
> R1. To keep compatibility with DOM APIs, the DOM glue used to mediate calls from JS to (typically) C++ would have to proxy or copy any strings containing non-BMP characters. Strings with only BMP characters would work as today.
>
> Note that we are dealing only in spec observables here. It doesn't matter whether the JS engine uses UTF-8 and the DOM UCS-2 (in which case there is already a transcoding penalty; IIRC WebKit libxml and libxslt use UTF-8 and so must transcode to interface with WebKit's DOM). The only issue at this boundary, I believe, is how indexing and .length work.
>
> Ok, there you have it: resolutions for both problems that killed the last assault on Castle '90s-JS.
>
> Implementations that use uint16 vectors as the character data representation type for both "UCS-2" and "UTF-16" string variants would probably want another flag bit per string header indicating whether, for the UTF-16 case, the string indeed contained any non-BMP characters. If not, no proxy/copy needed.
>
> Such implementations probably would benefit from string (primitive value) proxies, not just copies, since the underlying uint16 vector could be shared by two different string headers with whatever metadata flag bits, etc., are needed to disclose different length values, access different methods from distinct globals' String.prototype objects, etc.
>
> We could certainly also work with the W3C to revise the DOM to check the BRS setting, if that is possible, to avoid this non-BMP-string proxy/copy overhead.
>
> How is the BRS configured? Again, not via a pragma, and not by imperative state update inside the language (mutating hidden BRS state at a given program point could leave strings created before mutation observably different from those created after, unless the implementation in effect scanned the local heap and wrapped or copied any non-BMP-char-bearing ones created before).
>
> The obvious way to express the BRS in HTML is a <meta> tag in document <head>, but I don't want to get hung up on this point. I do welcome expert guidance. Here is another W3C/WHATWG interaction point. For this reason I'm cc'ing public-script-coord.
>
> The upshot of this proposal is to get JS out of the '90s without a mandatory breaking change. With simple-enough opt-in expressed at coarse-enough boundaries so as not to impose high cost or unintended string type confusion bugs, the complexity is mostly borne by implementors, and at less than a 2x cost comparing string implementations (I think -- demonstration required of course).
>
> In particular, Node.js can get modern at startup, and perhaps engines such as V8 as used in Node could even support compile-time (#ifdef) configury by which to support only full Unicode.
>
> Comments welcome.
>
> /be
>
> _______________________________________________
> es-discuss mailing list
> es-discuss@mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss
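
As a closing illustration, a minimal sketch of the observable contrast the BRS proposal describes, reusing Allen's cat example (the second half shows hypothetical behavior under the proposed full-Unicode setting; no engine exposes such a switch today):

var c = "\ud83d\ude38";          // U+1F638, written as a surrogate pair

// With the BRS left in its default ("UCS-2"-compatible) setting:
c.length;                        // 2
c.charCodeAt(0).toString(16);    // "d83d"

// In a global that has opted in to full Unicode (hypothetical):
c.length;                        // 1
c.charCodeAt(0).toString(16);    // "1f638"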
Received on Thursday, 23 February 2012 09:24:42 UTC