Re: New full Unicode for ES6 idea from Brendan Eich on 2012-02-20 (public-script-coord@w3.org from January to March 2012)

From: Brendan Eich <brendan@mozilla.com>
Date: Sun, 19 Feb 2012 19:52:23 -0800
To: Gavin Barraclough <barraclough@apple.com>
CC: Allen Wirfs-Brock <allen@wirfs-brock.com>, public-script-coord@w3.org, Anne van Kesteren <annevk@opera.com>, mranney@voxer.com, es-discuss discussion <es-discuss@mozilla.org>
Message-ID: <4F41C377.20505@mozilla.com>

Gavin Barraclough wrote:
> One way in which the proposal under discussion seems to differ from 
> the previous strawman is in the behavior arising from concatenation of 
> strings ending/beginning with a surrogate hi and lo element.
> How do we want to handle unpaired UTF-16 
> surrogates in a full-unicode string?  I can see three options:
>
> 1) Prohibit values from strings that do not map to valid unicode 
> characters (either throw an exception, or replace with the unicode 
> replacement character).
> 2) Allow invalid unicode characters in strings, and preserve them over 
> concatenation – ("\uD800" + "\uDC00").length == 2.
> 3) Allow invalid unicode characters in strings, but allow surrogate 
> pairs to fuse over concatenation – ("\uD800" + "\uDC00").length == 1.
>
> It seems desirable for full-unicode strings to logically be a sequence 
> of unicode characters, stored and processed in an encoding-agnostic 
> manner.  Option 3 would seem to violate that, exposing the underlying 
> UTF-16 implementation.  It also loses a distributive property of 
> .length over concatenation that I believe is true in ES5 for strings, 
> in that currently for all strings s1 & s2:
> s1.length + s2.length == (s1 + s2).length
> However if we allow concatenation to fuse surrogate pairs into a 
> single character (e.g. s1 = "\uD800", s2 = "\uDC00") this will no 
> longer be true.
>
> I guess I wonder if it's worth considering either option 1) or 2) – 
> either prohibiting invalid unicode characters in strings, or considering 
> something closer to the previous strawman, where string storage is 
> defined to be 32-bit (with a BRS that instead of changing iteration 
> would change string creation, introducing an implicit UTF16-UTF32 
> conversion).
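
To make the length argument above concrete, here is a minimal sketch of how 
strings of 16-bit code units behave today (which matches option 2), next to 
what a code-point count would report if surrogates fused as in option 3. It 
assumes a modern JavaScript engine (for-of and const postdate ES5), and 
codePointCount is a hypothetical helper written for this illustration, not 
part of any proposal.

```javascript
// Two unpaired surrogates, spelled as code units.
const s1 = "\uD800"; // lone high surrogate
const s2 = "\uDC00"; // lone low surrogate
const joined = s1 + s2;

// .length counts 16-bit code units, so it is additive over concatenation:
// s1.length + s2.length == (s1 + s2).length
const lengthIsAdditive = s1.length + s2.length === joined.length; // true

// Hypothetical helper (an assumption for this sketch): counts code points
// by iterating the string, which visits a valid surrogate pair as one item
// and a lone surrogate as one item.
function codePointCount(s) {
  let n = 0;
  for (const _ of s) n++;
  return n;
}

// Counted as code points, each half is 1 but the concatenation is also 1,
// which is the loss of additivity described for option 3.
const separateCount = codePointCount(s1) + codePointCount(s2); // 2
const fusedCount = codePointCount(joined); // 1
```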

Great post. I agree 3 is not good. I was thinking based on today's 
exchanges that the BRS being set to "full Unicode" *could* mean that 
"\uXXXX" is illegal and you *must* use "\u{...}" to write Unicode *code 
points* (not code units).
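
As a rough illustration of the two spellings, the sketch below uses the 
"\u{...}" escape as it later shipped in ES2015. That is an assumption about 
the environment, not the BRS design discussed here: in shipped engines 
"\uXXXX" stays legal and .length still counts code units, so the example 
only shows how one code point corresponds to two code units.

```javascript
// The same character, U+1F600, written two ways.
const viaCodePoint = "\u{1F600}";      // one code point escape
const viaCodeUnits = "\uD83D\uDE00";   // two code unit escapes (surrogate pair)

// In shipped engines both literals produce the identical string.
const sameString = viaCodePoint === viaCodeUnits; // true

// Code unit view versus code point view of that string.
const codeUnits = viaCodePoint.length;              // 2
const codePoints = Array.from(viaCodePoint).length; // 1

// Reading the value back as a code point rather than a code unit.
const hex = viaCodePoint.codePointAt(0).toString(16); // "1f600"
```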

Last year we dispensed with the use case of hacking binary data into 
strings. I don't see the hardship. But rather than throw exceptions on 
concatenation I would simply eliminate the ability to spell code units 
with "\uXXXX" escapes. Who's with me?

/be

Received on Monday, 20 February 2012 03:52:51 UTC