W3C home > Mailing lists > Public > public-script-coord@w3.org > January to March 2012

Re: New full Unicode for ES6 idea

From: Wes Garland <wes@page.ca>
Date: Sun, 19 Feb 2012 10:06:19 -0500
Message-ID: <CAHB0tE7_RchXUqtx=Bm3zwkfvOe3QfdYCZorfXMgME_pftckcQ@mail.gmail.com>
To: Brendan Eich <brendan@mozilla.com>
Cc: es-discuss <es-discuss@mozilla.org>, "public-script-coord@w3.org" <public-script-coord@w3.org>, mranney@voxer.com
On 19 February 2012 03:33, Brendan Eich <brendan@mozilla.com> wrote:

> S1 dates from when Unicode fit in 16 bits, and in those days, nickels had
> pictures of bumblebees on 'em ("Gimme five bees for a quarter", you'd say
> ;-).
>

Say, is that an onion on your belt?


> * indexing by characters, not uint16 storage units;
> * counting length as one greater than the last index; and
>

These are the two items that, in my experience, trip up developers who are
either not careful about or not aware of UTF-16 encoding details and don't
test with non-BMP input.  Frankly, JS developers should not have to be aware
of character encodings. Strings should "just work".
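To make the trip-ups concrete, here is a small sketch of how today's
UTF-16-based semantics behave once input leaves the BMP (the character and
values are just an illustration):

```javascript
// U+1F600 is one character to the user, but two uint16 storage units
// to today's JS strings, so it must be written as a surrogate pair.
var s = "\uD83D\uDE00";

console.log(s.length);         // 2, not 1 -- length counts uint16 units
console.log(s.charCodeAt(0));  // 0xD83D -- a lone high surrogate
console.log(s.charAt(0));      // half a character, meaningless on its own
```

Code that indexes or measures such a string without handling surrogate
pairs silently computes the wrong answer, which is exactly the class of bug
the BRS proposal would eliminate.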

I think that explicitly making strings Unicode and applying the fix above
would solve a *lot* of problems.  If I had this option, I would go so far
as to throw the BRS in my build processes, hg grep all our source code for
strings like D800 and eliminate all the extra UTF-16 machinations.

Another option might be to make ES.next have full Unicode strings; fix
.length and .charCodeAt etc. when we are in ES.next context, leaving them
"broken" otherwise.  I'm not fond of this option, though: since there would
be no BRS, developers might often find themselves unsure of just what the
heck it is they are working with.

So, I like per-global BRS.

> * supporting escapes with (up to) six hexadecimal digits.
>

This is necessary too; developers should be thinking about code points, not
encoding details.
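For illustration, here is what the proposed escape form would look like next
to what developers must write today (assuming an engine that supports the
ES6-style `\u{...}` syntax):

```javascript
// Proposed: name the code point directly, up to six hex digits.
var rocket = "\u{1F680}";

// Today: hand-compute and spell out the surrogate pair for U+1F680.
var sameToday = "\uD83D\uDE80";

console.log(rocket === sameToday); // true
```

The point is that the first form expresses the developer's intent (a code
point), while the second leaks the UTF-16 encoding into source text.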


> P2. The change is not backward compatible. In JS today, one reads a string
> s from somewhere and hard-codes, e.g., s.indexOf("\ud800") to find part of
> a surrogate pair, then advances to the next-indexed uint16 unit and reads
> the other half, then combines to compute some result. Such usage would
> break.
>

While that is true in the general case, there are many specific cases where
that would not break. I'm thinking I have an implementation of
UnicodeStrlen around here somewhere which works by subtracting the number of
high-surrogate code units (0xD800-0xDBFF) from .length.  In this case, that
code would continue to generate correct length counts because it would never
find a surrogate in a valid Unicode string (surrogates are reserved code
points).
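A minimal sketch of that idea (unicodeStrlen is my own name here, not a
standard API): each non-BMP character contributes exactly one high surrogate
plus one low surrogate, so subtracting the high-surrogate count from .length
yields the code point count.

```javascript
// Count code points in a (well-formed) UTF-16 JS string by subtracting
// the number of high surrogates from the uint16-unit length.
function unicodeStrlen(s) {
  var highSurrogates = 0;
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    if (c >= 0xD800 && c <= 0xDBFF) highSurrogates++;
  }
  return s.length - highSurrogates;
}

console.log(unicodeStrlen("hello"));          // 5
console.log(unicodeStrlen("\uD83D\uDE00!"));  // 2: one emoji + "!"
```

After the BRS is thrown, this function keeps returning correct counts: it
never finds a surrogate, so it degenerates to returning .length.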


> We also wish to avoid exposing a "full Unicode" representation type and
> duplicated suite of the String static and prototype methods, as Java did.
> (We may well want UTF-N transcoding helpers; we certainly want ByteArray
> <-> UTF-8 transcoding APIs.)
>

These are both good goals, in particular, avoiding a "full Unicode" type
means reducing bug counts in the long term.

Is there a proposal for interaction with JSON?


> Also because inter-compartment traffic is (we conjecture) infrequent
> enough to tolerate the proxy/copy overhead.
>

Not to mention that the only thing you'd have to do is to tweak [[Get]],
charCodeAt and .length when crossing boundaries; you can keep the same
backing store.

You might not even need to do this if the engine keeps the same backing
store for both kinds of strings.
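To sketch what "tweak the accessors, keep the backing store" means, here is
a hypothetical code-point-indexed view over an unchanged UTF-16 string
(codePointView is my own illustration, not a proposed API; a real engine
would do this natively, not in JS):

```javascript
// Wrap a UTF-16 string in an object whose length and charCodeAt work in
// code points, without copying or re-encoding the underlying data.
function codePointView(s) {
  return {
    get length() {
      var n = 0;
      for (var i = 0; i < s.length; i++) {
        var c = s.charCodeAt(i);
        if (c >= 0xD800 && c <= 0xDBFF) i++; // skip the paired low surrogate
        n++;
      }
      return n;
    },
    charCodeAt: function (index) {
      for (var i = 0, n = 0; i < s.length; n++) {
        var c = s.charCodeAt(i);
        var wide = c >= 0xD800 && c <= 0xDBFF;
        if (n === index) {
          return wide
            ? (c - 0xD800) * 0x400 + (s.charCodeAt(i + 1) - 0xDC00) + 0x10000
            : c;
        }
        i += wide ? 2 : 1;
      }
    }
  };
}

var v = codePointView("a\uD83D\uDE00b");
console.log(v.length);         // 3
console.log(v.charCodeAt(1));  // 0x1F600
```

The scan is O(n) per access here purely for brevity; the point is only that
both views can share one backing store.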


> This means a script intent on comparing strings from two globals with
> different BRS settings could indeed tell that one discloses non-BMP
> char/codes, e.g. charCodeAt return values >= 0x10000. This is the *small*
> new observable I claim we can live with, because someone opted into it at
> least in one of the related global objects.
>

Funny question: if I have two strings, both "hello", from two globals with
different BRS settings, are they ==? How about ===?


> R1. To keep compatibility with DOM APIs, the DOM glue used to mediate
> calls from JS to (typically) C++ would have to proxy or copy any strings
> containing non-BMP characters. Strings with only BMP characters would work
> as today.
>

Is that true if the "full Unicode" backing store is 16-bit code units using
UTF-16 encoding?  (Anyway, it's an implementation detail.)

> In particular, Node.js can get modern at startup, and perhaps engines such
> as V8 as used in Node could even support compile-time (#ifdef) configury by
> which to support only full Unicode.
>

Sure, this is analogous to how SpiderMonkey deals with UTF-8 C Strings.
Flip a BRS before creating the runtime. :)

Wes

-- 
Wesley W. Garland
Director, Product Development
PageMail, Inc.
+1 613 542 2787 x 102
Received on Monday, 20 February 2012 08:50:39 UTC
