Re: New full Unicode for ES6 idea from Brendan Eich on 2012-02-19 (public-script-coord@w3.org from January to March 2012)

From: Brendan Eich <brendan@mozilla.com>
Date: Sun, 19 Feb 2012 13:34:05 -0800
To: Wes Garland <wes@page.ca>
CC: es-discuss <es-discuss@mozilla.org>, "public-script-coord@w3.org" <public-script-coord@w3.org>, mranney@voxer.com
Message-ID: <4F416ACD.4000206@mozilla.com>

Wes Garland wrote:
> Is there a proposal for interaction with JSON?

From http://www.ietf.org/rfc/rfc4627, section 2.5:

    To escape an extended character that is not in the Basic Multilingual
    Plane, the character is represented as a twelve-character sequence,
    encoding the UTF-16 surrogate pair.  So, for example, a string
    containing only the G clef character (U+1D11E) may be represented as
    "\uD834\uDD1E".

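A minimal sketch of that escaping rule, assuming a hypothetical helper named jsonEscapeCodePoint (the name is illustrative and comes from neither the RFC nor any proposal). It computes the surrogate pair arithmetically and round-trips the result through JSON.parse:

```javascript
// Hypothetical helper: escape one code point as RFC 4627 section 2.5 describes.
function jsonEscapeCodePoint(cp) {
  // Format a 16-bit value as four uppercase hex digits.
  const hex4 = (n) => n.toString(16).toUpperCase().padStart(4, "0");
  if (cp < 0x10000) {
    return "\\u" + hex4(cp);           // BMP: a single six-character escape
  }
  const v = cp - 0x10000;              // 20-bit offset into the supplementary planes
  const hi = 0xD800 + (v >> 10);       // high (lead) surrogate
  const lo = 0xDC00 + (v & 0x3FF);     // low (trail) surrogate
  return "\\u" + hex4(hi) + "\\u" + hex4(lo); // twelve-character sequence
}

const escaped = jsonEscapeCodePoint(0x1D11E);   // G clef
const parsed = JSON.parse('"' + escaped + '"');
console.log(escaped, escaped.length, parsed.length);
```

In an engine without the BRS, parsed.length counts the two UTF-16 code units; what a full-Unicode string would report there is exactly the question under discussion.
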
>
>     Also because inter-compartment traffic is (we conjecture)
>     infrequent enough to tolerate the proxy/copy overhead.
>
>
> Not to mention that the only thing you'd have to do is to tweak 
> [[get]], charCodeAt and .length when crossing boundaries; you can keep 
> the same backing store.

String methods are not generally self-hosted, so internal C++ vector 
access would need to change depending on the string's flag bit, in this 
implementation approach.

> You might not even need to do this if the engine keeps the same 
> backing store for both kinds of strings.

Yes, sharing the uint16 vector is good. But string methods would have to 
index and .length differently (if I can verb .length ;-).
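
A rough sketch of that idea, written in JavaScript only for illustration (a real engine would do this in C++). StringView, its fullUnicode flag, and decodeAt are hypothetical names, and the linear scan is an assumption made for brevity, not a claim about how any engine would index:

```javascript
// Hypothetical model: one shared uint16 backing store, two ways to index it.
class StringView {
  constructor(units, fullUnicode) {
    this.units = units;             // shared Uint16Array, never copied
    this.fullUnicode = fullUnicode; // stands in for the per-string flag bit
  }

  // Decode the code point that starts at code-unit offset i.
  // Returns [codePoint, widthInCodeUnits]; lone surrogates pass through as-is.
  static decodeAt(units, i) {
    const hi = units[i];
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < units.length) {
      const lo = units[i + 1];
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        return [0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00), 2];
      }
    }
    return [hi, 1];
  }

  get length() {
    if (!this.fullUnicode) return this.units.length; // count code units
    let count = 0;
    let i = 0;
    while (i < this.units.length) {                  // count code points
      i += StringView.decodeAt(this.units, i)[1];
      count++;
    }
    return count;
  }

  charCodeAt(index) {
    if (!this.fullUnicode) {
      return index >= 0 && index < this.units.length ? this.units[index] : NaN;
    }
    let i = 0;
    for (let n = 0; i < this.units.length; n++) {
      const [cp, width] = StringView.decodeAt(this.units, i);
      if (n === index) return cp;
      i += width;
    }
    return NaN;
  }
}

const store = Uint16Array.of(0x61, 0xD834, 0xDD1E); // "a" followed by U+1D11E
const legacy = new StringView(store, false);
const full = new StringView(store, true);
console.log(legacy.length, full.length);
console.log(legacy.charCodeAt(1).toString(16), full.charCodeAt(1).toString(16));
```

Both views read the same three code units; only the arithmetic behind .length and indexing differs.
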

>     This means a script intent on comparing strings from two globals
>     with different BRS settings could indeed tell that one discloses
>     non-BMP char/codes, e.g. charCodeAt return values >= 0x10000. This
>     is the *small* new observable I claim we can live with, because
>     someone opted into it at least in one of the related global objects.
>
>
> Funny question, if I have two strings, both "hello", from two globals 
> with different BRS settings,  are they ==? How about ===?

Of course, strings with the same characters are == and ===. Strings 
appear to be values. Even if you think of them as immutable reference 
types, there's still an obligation to compare characters, because 
computed strings are not intern'ed.
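
To make the comparison concrete, here is a small sketch. The first part is plain JavaScript as it behaves today; sameCharacters is a hypothetical helper that assumes both strings share the uint16 backing-store representation, so the BRS flag never enters the comparison:

```javascript
// A computed string is a distinct allocation, yet it compares equal by value.
const literal = "hello";
const computed = ["hel", "lo"].join("");
const equalByValue = literal === computed;

// Hypothetical engine-side check: compare backing stores unit by unit.
function sameCharacters(unitsA, unitsB) {
  if (unitsA.length !== unitsB.length) return false;
  for (let i = 0; i < unitsA.length; i++) {
    if (unitsA[i] !== unitsB[i]) return false;
  }
  return true;
}

// "hello" as UTF-16 code units, as two globals might each hold it.
const fromGlobalA = Uint16Array.from("hello", (ch) => ch.charCodeAt(0));
const fromGlobalB = Uint16Array.from("hello", (ch) => ch.charCodeAt(0));
console.log(equalByValue, sameCharacters(fromGlobalA, fromGlobalB));
```
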

>     R1. To keep compatibility with DOM APIs, the DOM glue used to
>     mediate calls from JS to (typically) C++ would have to proxy or
>     copy any strings containing non-BMP characters. Strings with only
>     BMP characters would work as today.
>
>
> Is that true if the "full unicode" backing store is 16-bit code units 
> using UTF-16 encoding?  (Anyway, it's an implementation detail.)

Yes, because DOMString has intrinsic length and indexing notions, and 
these must (pending any coordination with the W3C) remain ignorant of the 
BRS and livin' in the '90s (the DOM, too, emerged in the UCS-2 era).
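
A sketch of the check such glue might make, using only string operations that exist in JavaScript today. The helper names (hasNonBMP, codeUnitLength, codePointLength) are hypothetical, and treating a lone surrogate the same as a real pair is a conservative assumption of this sketch:

```javascript
// True if the string holds any surrogate code unit, i.e. it may contain
// non-BMP characters and so could not be handed to the DOM unchanged.
function hasNonBMP(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDFFF) return true;
  }
  return false;
}

// The DOMString notion of length: UTF-16 code units.
function codeUnitLength(s) {
  return s.length;
}

// The full-Unicode notion of length: code points (string iteration
// walks code points, so surrogate pairs count once).
function codePointLength(s) {
  return Array.from(s).length;
}

const clef = "\uD834\uDD1E"; // U+1D11E
console.log(hasNonBMP("hello"), hasNonBMP(clef));
console.log(codeUnitLength(clef), codePointLength(clef));
```

Strings for which hasNonBMP is false have the same length and indices under both notions, which is why they would work as today; only the others would need the proxy or copy.
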

/be

Received on Sunday, 19 February 2012 21:34:38 UTC