Re: New full Unicode for ES6 idea from Wes Garland on 2012-02-20 (public-script-coord@w3.org from January to March 2012)

From: Wes Garland <wes@page.ca>
Date: Mon, 20 Feb 2012 07:45:38 -0500
To: Brendan Eich <brendan@mozilla.com>
Cc: es-discuss <es-discuss@mozilla.org>, "public-script-coord@w3.org" <public-script-coord@w3.org>, mranney@voxer.com
Message-ID: <CAHB0tE7_kdXSidT+fWEP8gUeb=i-putFogKuyfMJyUAPiSwksA@mail.gmail.com>

On 19 February 2012 16:34, Brendan Eich <brendan@mozilla.com> wrote:

> Wes Garland wrote:
>
>> Is there a proposal for interaction with JSON?
>>
>
> From http://www.ietf.org/rfc/rfc4627, 2.5
>

*snip* - so the proposal is to keep encoding JSON in UTF-16.  What happens
if the BRS is set to Unicode and we want to encode the string
"\uD834\uDD1E" -- the Unicode string which contains two reserved code
points? We do not want to deserialize this as U+1D11E.
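
To make the concern concrete, here is a minimal sketch of what an ES5-style (BRS-off) engine does with those two escapes today. It only uses JSON.parse, charCodeAt and codePointAt (the last of which arrived later, in ES2015); the variable names are mine, not part of any proposal.

```javascript
// Parse the 12-character escape sequence from RFC 4627 section 2.5.
const parsed = JSON.parse('"\\uD834\\uDD1E"');

// The backing store holds two 16-bit code units...
const units = [parsed.charCodeAt(0), parsed.charCodeAt(1)];

// ...but a UTF-16-aware reader sees a single code point, U+1D11E.
const asCodePoint = parsed.codePointAt(0);

console.log(parsed.length, units.map(u => u.toString(16)), asCodePoint.toString(16));
```

Nothing in the serialized form records whether the writer meant two surrogate code points or one supplementary character, which is the ambiguity a BRS-on serializer would have to resolve.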

I think we should consider that BRS-on should mean six-character escapes in
JSON for non-BMP characters.  It might even be possible to add matching
support for JSON.parse() when BRS-off.  The one caveat is that this might make
JSON interchange fragile between BRS-on systems and ES5 engines.

> Yes, sharing the uint16 vector is good. But string methods would have to
> index and .length differently (if I can verb .length ;-).
>

.lengthing is easy; cost is about the same as strlen() and can be cached.
Indexed access is something I have thought about from the implementor's POV
for a while [but not heavily].  I haven't come up with a ground-breaking
technique; I keep coming up with something that looks like a lookup table
for surrogate pairs, degrading to an extra uint32[] when there are many of
them. Anyhow, implementation detail.
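
For what it's worth, the lookup-table idea can be sketched roughly as below. This is only an illustration under my own assumptions: buildPairIndex and codePointAtIndex are hypothetical names, the table is a plain sorted array of code-point indices, and well-formed surrogate pairs are counted as one character while lone surrogates count as themselves.

```javascript
// One linear pass (about the cost of strlen()) yields both the code-point
// length and a sorted table of where the surrogate pairs sit.
function buildPairIndex(s) {
  const pairs = [];   // code-point indices of characters stored as pairs
  let cpIndex = 0;
  for (let i = 0; i < s.length; i++, cpIndex++) {
    const hi = s.charCodeAt(i);
    const lo = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
    if (hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF) {
      pairs.push(cpIndex);
      i++;            // skip the low surrogate
    }
  }
  return { pairs, length: cpIndex };
}

// Indexed access: count the pairs strictly before code point n with a
// binary search, then shift the uint16 offset by that count.
function codePointAtIndex(s, index, n) {
  if (n < 0 || n >= index.length) return undefined;
  let lo = 0, hi = index.pairs.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (index.pairs[mid] < n) lo = mid + 1; else hi = mid;
  }
  return s.codePointAt(n + lo);
}

const sample = "a\uD834\uDD1Eb\uD83D\uDE00c";
const sampleIndex = buildPairIndex(sample);
console.log(sampleIndex.length, codePointAtIndex(sample, sampleIndex, 3).toString(16));
```

The table stays empty for BMP-only strings, so the common case costs nothing beyond the cached length; a string dense with pairs is where the extra uint32[] would start to look attractive.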

> Of course, strings with the same characters are == and ===. Strings appear
> to be values. If you think of them as immutable reference types there's
> still an obligation to compare characters for strings because computed
> strings are not intern'ed.
>

What about strings with the same sequence of code units but different code
points? They would have identical backing stores in UTF-16, but not if the
backing store were either UTF-8 or uint32. This can happen if we have BRS-on
Strings which contain non-BMP code points. (Actually, does BRS-on mean that
we have to abandon UTF-16 to store Unicode strings containing invalid code
points? Mark Davis, are you reading?)
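
A small sketch of the collision I have in mind: two sequences that are distinct as arrays of uint32 code points collapse to the same uint16 code units once encoded. The encoder below (toUtf16Units, a hypothetical name) is deliberately naive and passes surrogate code points through unchanged, which is an assumption about how a BRS-on engine sharing the uint16 vector might behave.

```javascript
// Naive encoder: uint32 code points -> UTF-16 code units.
function toUtf16Units(codePoints) {
  const units = [];
  for (const cp of codePoints) {
    if (cp > 0xFFFF) {
      const v = cp - 0x10000;
      units.push(0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF));
    } else {
      units.push(cp); // surrogate code points pass through unchanged
    }
  }
  return units;
}

const onePoint = [0x1D11E];          // a single non-BMP code point
const twoPoints = [0xD834, 0xDD1E];  // two surrogate code points

const encodedOne = toUtf16Units(onePoint);
const encodedTwo = toUtf16Units(twoPoints);
console.log(encodedOne, encodedTwo);
```

Both calls produce the units D834 DD1E, so a comparison over the uint16 vector alone cannot tell the two strings apart.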

How about strings which are considered equal by Unicode but which do not
share the same representation? Will Unicode normalization be performed when
Strings are created/parsed? On comparison? If on compare, would we skip
normalization for ===?

I assume normalizing to NFC form, similar to what W3C does, is the target?

http://www.macchiato.com/unicode/nfc-faq  (Mark Davis)
http://unicode.org/faq/normalization.html
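
As an illustration of the normalization question, here is a minimal sketch using String.prototype.normalize, which was standardized later in ES2015 and is not something this thread assumes. It shows compare-time normalization only; whether === should ever behave this way is exactly the open question.

```javascript
const composed = "\u00E9";     // LATIN SMALL LETTER E WITH ACUTE, one code point
const decomposed = "e\u0301";  // "e" followed by COMBINING ACUTE ACCENT

// Code-unit comparison sees two different strings.
const strictlyEqual = composed === decomposed;

// After normalizing both sides to NFC they compare equal.
const equalAfterNfc = composed.normalize("NFC") === decomposed.normalize("NFC");

console.log(strictlyEqual, equalAfterNfc);
```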

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102

Received on Monday, 20 February 2012 12:46:10 UTC