Re: New full Unicode for ES6 idea

On Feb 19, 2012, at 7:52 PM, Brendan Eich wrote:

> Gavin Barraclough wrote:
>> One way in which the proposal under discussion seems to differ from the previous strawman is in the behavior arising from concatenation of strings ending/beginning with a surrogate hi and lo element.
>> How do we want to handle how do we want to handle unpaired UTF-16 surrogates in a full-unicode string?  I can see three options:
>> 
>> 1) Prohibit values from strings that do not map to valid unicode characters (either throw an exception, or replace with the unicode replacement character).
>> 2) Allow invalid unicode characters in strings, and preserve them over concatenation – ("\uD800" + "\uDC00").length == 2.
>> 3) Allow invalid unicode characters in strings, but allow surrogate pairs to fuse over concatenation – ("\uD800" + "\uDC00").length == 1.
>> 
>> It seems desirable for full-unicode strings to logically be a sequence of unicode characters, stored and processed in a encoding-agnostic manner.  option 3 would seem to violate that, exposing the underlying UTF-16 implementation.  It also loses a distributive property of .length over concatenation that I believe is true in ES5 for strings, in that currently for all strings s1 & s2:
>> s1.length + s2.length == (s1 + s2).length
>> However if we allow concatenation to fuse surrogate pairs into a single character (e.g. s1 = "\uD800", s2 = "\uDC00") this will no longer be true.
>> 
>> I guess I wonder if it's worth considering either options 1) or 2) – either prohibiting invalid unicode characters in strings, or consider something closer to the previous strawman, where string storage is defined to be 32-bit (with a BRS that instead of changing iteration would change string creation, introducing an implicit UTF16-UTF32 conversion).
> 
> Great post. I agree 3 is not good. I was thinking based on today's exchanges that the BRS being set to "full Unicode" *could* mean that "\uXXXX" is illegal and you *must* use "\u{...}" to write Unicode *code points* (not code units).
> 
> Last year we dispensed with the binary data hacking in strings use-case. I don't see the hardship. But rather than throw exceptions on concatenation I would simply eliminate the ability to spell code units with "\uXXXX" escapes. Who's with me?

I think we need to be careful not to equate the syntax of ES string literals with the actual encoding space of string elements.  Whether you say "\ud800" or "\u{00d800}", or call a function that does full-unicode to UTF-16 encoding, or simply create a string from file contents you may end up with string elements containing upper or lower half surrogates.    Eliminating the "\uXXXX" syntax really doesn't change anything regarding actual string processing. 

What it might do, however, is eliminate the ambiguity about the intended meaning of  "\uD800\uDc00" in legacy code.  If "full unicode string mode" only supported \u{} escapes then existing code that uses \uXXXX would have to be updated before it could be used in that mode.  That might be a good thing.

Allen









> 
> /be
> 

Received on Monday, 20 February 2012 06:05:39 UTC