Re: New full Unicode for ES6 idea

On Feb 19, 2012, at 2:15 PM, Brendan Eich wrote:

> Anne van Kesteren wrote:
>> ...
> 
>> As far as the DOM and Web IDL are concerned, I think we would need two definitions for "code unit". One that means 16-bit code unit and one that means "Unicode code unit"
> 
> I'm not a Unicode expert but I believe the latter is called "character".

Me neither, but I believe the correct term is "code point", which refers to the full 21-bit code, while "Unicode character" is the logical entity corresponding to that code point.  That usage of "character" is different from the current usage within ECMAScript, where "character" is what we call an element of the vector of 16-bit numbers that is used to represent a String value.  You can access them as string values of length 1 via [ ] or as numeric values via the charCodeAt method.
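To make the current ECMAScript usage concrete, here is a small sketch of the two access paths (nothing proposal-specific here, just today's semantics):

```javascript
// An ES String is a vector of 16-bit elements ("characters" in
// current spec terminology). Each element can be read two ways:
var s = "ab";
console.log(s[1]);             // "b" -- a string value of length 1
console.log(s.charCodeAt(1));  // 98  -- the 16-bit element as a number
```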

> 
>> or some such. Looking at http://dvcs.w3.org/hg/domcore/raw-file/tip/Overview.html#characterdata the rest should follow quite naturally.
>> 
>> What happens with surrogate code points in these new strings? I think we do not want to change that each unit is an integer of some kind and can be set to any value. And if that is the case, will it hold values greater than U+10FFFF?
> 
> JS must keep the "\uXXXX" notation for uint16 storage units, and one can create invalid Unicode strings already. This hazard does not go away, we keep compatibility, but the BRS adds no new hazards and in practice, if well-used, should reduce the incidence of invalid-Unicode-string bugs.
> 
> The "\u{...}" notation is independent and should work whatever the BRS setting, IMHO. In "UCS-2" (default) setting, "\u{...}" can make pairs. In "UTF-16" setting, it makes only characters. And of course in the latter case indexing and length count characters.

I think your names for the BRS modes are misleading.  What you call "UTF-16" actually manifests itself to the ES programmer as UTF-32, as each index position within a string corresponds to an unencoded Unicode code point.  There are no visible UTF-16 surrogate pairs, even if the implementation is internally using a UTF-16 encoding.
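As a sketch of what that code-point-level view would look like: the later-standardized code point APIs (Array.from, codePointAt) can stand in for the proposed BRS behavior here, purely for illustration; they are an anachronism relative to this thread.

```javascript
// U+1F600 stored, as today, as a surrogate pair of 16-bit units.
var s = "\uD83D\uDE00";
console.log(s.length);                       // 2 -- 16-bit unit count
// Code-point-level view (what the "UTF-16" BRS setting would expose
// directly through indexing and length):
console.log(Array.from(s).length);           // 1 -- one code point
console.log(s.codePointAt(0).toString(16));  // "1f600"
```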

Similarly, "UCS-2" as currently implemented actually manifests itself to the ES programmer as UTF-16 because implementations turn non-BMP string literal characters into UTF-16 surrogate pairs that visibly occupy two index positions.
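That current behavior is easy to demonstrate (this runs as-is in today's engines):

```javascript
// A non-BMP character in a string literal becomes a visible
// UTF-16 surrogate pair occupying two index positions.
var face = "\u{1F600}" ? "😀" : "";          // U+1F600 in source text
console.log(face.length);                     // 2
console.log(face.charCodeAt(0).toString(16)); // "d83d" -- high surrogate
console.log(face.charCodeAt(1).toString(16)); // "de00" -- low surrogate
```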

Allen

Received on Sunday, 19 February 2012 22:41:28 UTC