Re: New full Unicode for ES6 idea

On Feb 19, 2012, at 2:44 PM, Brendan Eich wrote:

> Allen Wirfs-Brock wrote:
>> On Feb 19, 2012, at 2:15 PM, Brendan Eich wrote:
>>> I'm not a Unicode expert but I believe the latter is called "character". 
>> 
>> Me neither, but I believe the correct term is "code point", which refers to the full 21-bit code, while "Unicode character" is the logical entity corresponding to that code point. That usage of "character" is different from the current usage within ECMAScript, where "character" is what we call the elements of the vector of 16-bit numbers that are used to represent a String value. You can access them as string values of length 1 via [ ] or as numeric values via the charCodeAt method.
> 
> Thanks. We have a confusing transposition of terms between Unicode and ECMA-262, it seems. Should we fix?

The ES5.1 spec is OK because it always uses (as defined in section 6) the term "Unicode character" when it means exactly that, and uses "character" when talking about the elements of String values. It says that both "code unit" and "character" refer to a 16-bit unsigned value.
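To make the ES5 terminology concrete, here is a small illustration (runnable in any ES5 engine) of how a non-BMP code point surfaces as two 16-bit "characters" today:

```javascript
// U+1F600 lies outside the BMP, so in ES5 it must be written as a
// surrogate pair of two 16-bit code units.
var s = "\uD83D\uDE00";
console.log(s.length);                      // 2 -- counts 16-bit code units
console.log(s.charCodeAt(0).toString(16));  // "d83d" -- lead surrogate
console.log(s.charCodeAt(1).toString(16));  // "de00" -- trail surrogate
console.log(s[0].length);                   // 1 -- each element is a length-1 string
```

So in ES5 terms each half of the pair is a "character", even though neither half is a "Unicode character" on its own.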

Your proposal would change that equivalence. In one sense, the BRS would be a switch that controls whether an ES "character" corresponds to a "code unit" or to a "code point".

> 
>>> JS must keep the "\uXXXX" notation for uint16 storage units, and one can create invalid Unicode strings already. This hazard does not go away, we keep compatibility, but the BRS adds no new hazards and in practice, if well-used, should reduce the incidence of invalid-Unicode-string bugs.
>>> 
>>> The "\u{...}" notation is independent and should work whatever the BRS setting, IMHO. In "UCS-2" (default) setting, "\u{...}" can make pairs. In "UTF-16" setting, it makes only characters. And of course in the latter case indexing and length count characters.
>> 
>> I think your names for the BRS modes are misleading.
> 
> You got me, in fact I used "full Unicode" for the BRS-thrown setting elsewhere.
> 
> My implementor's bias is showing, because I expect many engines would use UTF-16 internally and have non-O(1) indexing for strings with the contains-non-BMP-and-BRS-set-to-full-Unicode flag bit.

A fine implementation, but not observable. Another implementation approach that would preserve O(1) indexing would be to simply have two or three different internal string representations with 1-, 2-, or 4-byte internal characters. (You can automatically pick the needed character size when the string is created because strings are immutable and created with their value.) A not-quite-O(1) approach would segment strings into substring spans using such a representation. Representation choice probably depends a lot on what you think are the most common use cases. If it is string processing in JS, then a fast representation is probably what you want to choose. If it is just passing text that is already UTF-8 or UTF-16 encoded from inputs to outputs, then a representation that minimizes transcoding would probably be a higher priority.

Allen

Received on Sunday, 19 February 2012 23:14:13 UTC