Re: New full Unicode for ES6 idea from Brendan Eich on 2012-02-21 (public-script-coord@w3.org from January to March 2012)

From: Brendan Eich <brendan@mozilla.com>
Date: Tue, 21 Feb 2012 09:55:15 -0800
To: "Phillips, Addison" <addison@lab126.com>
CC: Wes Garland <wes@page.ca>, Allen Wirfs-Brock <allen@wirfs-brock.com>, "public-script-coord@w3.org" <public-script-coord@w3.org>, Anne van Kesteren <annevk@opera.com>, "mranney@voxer.com" <mranney@voxer.com>, es-discuss discussion <es-discuss@mozilla.org>
Message-ID: <4F43DA83.70105@mozilla.com>

Phillips, Addison wrote:
>
> Because it has always been possible, it’s difficult to say how many 
> scripts have transported byte-oriented data by “punning” the data into 
> strings. Actually, I think this is more likely to be truly binary data 
> rather than text in some non-Unicode character encoding, but anything 
> is possible, I suppose. This could include using non-character values 
> like “FFFE”, “FFFF” in addition to the surrogates. A BRS-running 
> implementation would break a script that relied on String being a 
> sequence of 16-bit unsigned integer values with no error checking.
>

Allen's view of the BRS-enabled semantics would have 16-bit "GIGO" 
without exceptions -- you'd be storing 16-bit values, whatever their 
source (including "\uXXXX" literals spelling invalid characters and 
unmatched surrogates) in at-least-21-bit elements of strings, and 
reading them back.

My concern and reason for advocating early or late errors on shenanigans 
was that people today writing surrogate pairs literally and then taking 
extra pains in JS or C++ (whatever the host language might be) to 
process them as single code points and characters would be broken by the 
BRS-enabled behavior of separating the parts into distinct code points.

But that's pessimistic. It could happen, but OTOH anyone coding 
surrogate pairs might want them to read back piece-wise when indexing. 
In that case what Allen proposes, storing each formerly 16-bit code 
unit, however expressed, in the wider 21-or-more-bits unit, and reading 
back likewise, would "just work".
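
A minimal sketch of the piece-wise case, using only today's 16-bit 
semantics. The helper name codePointFromPair is hypothetical and not 
from this thread. Code of this shape reads the two halves by index and 
combines them by hand, so it should see the same values if each 16-bit 
unit is stored and read back unchanged:

```javascript
// Today a string is a sequence of 16-bit code units, so a surrogate
// pair written literally occupies two elements.
const s = "\uD83D\uDE00"; // lead + trail surrogate spelling U+1F600

// Hypothetical helper (illustration only): combine a lead/trail pair
// starting at index i into one code point, else return the unit as-is.
function codePointFromPair(str, i) {
  const hi = str.charCodeAt(i);
  const lo = str.charCodeAt(i + 1); // NaN past the end, so the test below fails safely
  if (hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF) {
    return ((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000;
  }
  return hi;
}

const units = [s.charCodeAt(0), s.charCodeAt(1)]; // read back piece-wise
const cp = codePointFromPair(s, 0);               // combined by hand
```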

Sorry if this is all obvious. Mainly I want to throw in my lot with 
Allen's exception-free literal/constructor approach. The encoding APIs 
should throw on invalid Unicode but literals and strings as immutable 
16-bit storage buffers should work as today.
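
As a rough illustration of that split in today's engines: 
encodeURIComponent is used here only as a familiar existing function 
that encodes to UTF-8 and rejects unmatched surrogates. It is my choice 
of example, not an API named in this thread.

```javascript
// A literal spelling an unmatched surrogate is accepted and stored as-is.
const lone = "\uD800";
const storedUnit = lone.charCodeAt(0); // reads back the same 16-bit value

// An encoding step, by contrast, rejects the invalid sequence.
let threwURIError = false;
try {
  encodeURIComponent(lone);
} catch (e) {
  threwURIError = e instanceof URIError;
}

// A well-formed pair encodes without error (UTF-8 bytes of U+1F600).
const encoded = encodeURIComponent("\uD83D\uDE00");
```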

/be

Received on Tuesday, 21 February 2012 17:55:45 UTC