
Re: New full Unicode for ES6 idea

From: Brendan Eich <brendan@mozilla.com>
Date: Mon, 20 Feb 2012 10:52:28 -0800
Message-ID: <4F42966C.5060602@mozilla.com>
To: Allen Wirfs-Brock <allen@wirfs-brock.com>
CC: Gavin Barraclough <barraclough@apple.com>, public-script-coord@w3.org, Anne van Kesteren <annevk@opera.com>, mranney@voxer.com, es-discuss discussion <es-discuss@mozilla.org>
Allen Wirfs-Brock wrote:
> For the moment, I'll simply take Wes' word for the above, as it logically makes sense.  For some uses, you want to process all possible code points (for example, when validating data from an external source).  At this lowest level you don't want to impose higher level Unicode semantic constraints:
>
>         if (stringFromElseWhere.indexOf("\u{d800}")) ....

Sorry, I disagree. We have a chance to make Strings consistent with
"full Unicode", or to leave them broken into uint16 pieces. There is no
self-consistent third way that has 21-bit code points but allows one to
jam what up until now have been both code points and code units into
code points, where they will be misinterpreted.

If someone wants to do data hacking, Binary Data (Typed Arrays) are 
there (even in IE10pp).
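
As a rough illustration of that kind of data hacking, the sketch below keeps raw UTF-16 code units in a Uint16Array and scans them for unpaired surrogates without building a String. The helper name findLoneSurrogate is hypothetical, and the code is written against current JavaScript rather than any engine of the time.

```javascript
// Hypothetical helper: return the index of the first unpaired surrogate
// in an array of UTF-16 code units, or -1 if every surrogate is paired.
function findLoneSurrogate(codeUnits) {
  for (let i = 0; i < codeUnits.length; i++) {
    const u = codeUnits[i];
    if (u >= 0xd800 && u <= 0xdbff) {
      // High surrogate: it is only valid if a low surrogate follows.
      const next = codeUnits[i + 1];
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of a well-formed pair
        continue;
      }
      return i;
    }
    if (u >= 0xdc00 && u <= 0xdfff) {
      return i; // low surrogate with no preceding high surrogate
    }
  }
  return -1;
}

// Raw code units, including an unpaired high surrogate at index 1.
const units = new Uint16Array([0x0041, 0xd800, 0x0042]);
console.log(findLoneSurrogate(units)); // 1
```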

>>>      Eliminating the "\uXXXX" syntax really doesn't change anything regarding actual string processing.
>> True, but not my point!
>
> but else where you said you would reject String.fromCharCode(0xd800)

I'm being consistent (I hope!). I'd reject "\uXXXX" altogether with the 
BRS set. It's ambiguous at best, or (I argue, and you argue some of the 
time) it means code units, not code points. We're doing points now, no 
units, with the BRS set, so it has to go.

Same goes for constructive APIs taking (with the BRS set) code points. I 
see nothing but mischief arising from allowing [D800-DFFF]. Unicode 
gurus should school us if there's a use-case that can be sanely composed 
with "full Unicode" and "code points, not units" iteration.

> so it sounds to me like you are trying to actually ban the occurrence of 0xd800 as the value of a string element.

Under the BRS set to "full Unicode", as a code point, yes.
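
A minimal sketch of what such a constructive API could look like, assuming a hypothetical helper named fromCodePointsStrict. It is not a real or proposed API name. It builds on String.fromCodePoint, which current JavaScript provides and which does accept surrogate values, so the rejection is added by hand here.

```javascript
// Hypothetical helper: build a string from code points, rejecting
// anything in the surrogate range D800-DFFF.
function fromCodePointsStrict(...codePoints) {
  let out = "";
  for (const cp of codePoints) {
    if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff) {
      throw new RangeError("not a code point: " + cp);
    }
    if (cp >= 0xd800 && cp <= 0xdfff) {
      throw new RangeError(
        "surrogate rejected: U+" + cp.toString(16).toUpperCase()
      );
    }
    out += String.fromCodePoint(cp);
  }
  return out;
}

console.log(fromCodePointsStrict(0x41, 0x10000).length); // 3 (code units)

try {
  fromCodePointsStrict(0xd800);
} catch (e) {
  console.log(e instanceof RangeError); // true
}
```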

>>> What it might do, however, is eliminate the ambiguity about the intended meaning of  "\uD800\uDc00" in legacy code.
>> And arising from concatenations, avoiding the loss of Gavin's distributive .length property.
>
> These aren't the same thing.
>
>     "\ud8000\udc00" is a specific syntactic construct where there must have been a specific user intent in writing it.

(One too many 0s there.)

We do not want to guess. All I know is that "\ud800\udc00" means what it 
means today in ECMA-262 and conforming implementations. With the BRS set 
to "full Unicode", it could be taken to mean two code points, but that 
results in invalid Unicode and is not backward compatible. It could be 
read as one code point but that is what "\u{...}" is for and we want 
anyone migrating such "hardcoded" code into the BRS to check and choose.
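
For reference, this is how the two spellings behave in current JavaScript, where strings remain sequences of 16-bit code units and "\u{...}" is available. It shows only today's semantics, not the behavior under the proposed switch.

```javascript
const legacy = "\ud800\udc00"; // two \uXXXX escapes: a surrogate pair
const braced = "\u{10000}";    // one \u{...} escape for the same character

console.log(legacy.length);                      // 2 (code units)
console.log(legacy === braced);                  // true
console.log(legacy.codePointAt(0).toString(16)); // "10000"
console.log([...legacy].length);                 // 1 (iteration is by code point)
```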

>   Our legacy problem is that the intent becomes ambiguous when that same sequence might be interpreted under different BRS settings.

I propose to solve that by forbidding "\uXXXX" when the BRS is set.

>     str1 + str2 is much less specific and all we know at runtime (assuming either str1 or str2 are strings) is that the user wants to concatenate them.   The values might be:
>         str1= String.fromCharCode(0xd800);
>         str2=String.fromCharCode(0xddc00);
>
> and the user might be intentionally constructing a string containing an explicit UTF-16 encoding that is going to be passed off to an external agent that specifically requires UTF-16.

Nope, cuz I'm proposing String.fromCharCode calls such as those throw.
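
The concatenation hazard can be seen in current JavaScript, where these calls do not throw. The sketch assumes the second value in the quoted example was meant to be the low surrogate 0xdc00, and countCodePoints is a hypothetical helper. Code-unit lengths distribute over concatenation; code-point counts do not once two lone surrogates meet.

```javascript
// Assumption: 0xdc00 (a low surrogate) stands in for the second value.
const str1 = String.fromCharCode(0xd800); // lone high surrogate
const str2 = String.fromCharCode(0xdc00); // lone low surrogate
const joined = str1 + str2;

// Hypothetical helper: count code points by iterating the string.
const countCodePoints = (s) => [...s].length;

// Code-unit lengths add up across concatenation.
const unitsDistribute = joined.length === str1.length + str2.length;

// Code-point counts do not: 1 + 1 before joining, 1 after.
const pointsDistribute =
  countCodePoints(joined) === countCodePoints(str1) + countCodePoints(str2);

console.log(unitsDistribute, pointsDistribute); // true false
```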

We should not be making more type-confusion hazards just to play a 
guessing game that might (but probably won't) preserve some edge-case 
"hardcoded" surrogate hacking that exists in code on the Web or behind a 
firewall today. Such code can do what it has always done, unless and 
until its maintainer throws the BRS. At that point early and runtime
errors will provoke a rewrite to "\u{...}" and, with fromCharCode etc.,
to 21-bit code points that are not reserved for surrogates.

> Another way to express what I see as the problem with what you are proposing about imposing such string semantics:
>
> Could the revised ECMAScript be used to implement a language that had similar but not identical semantic rules to those you are suggested for ES strings.  My sense is that if we went down the path you are suggesting, such a implementation would have to use binary data arrays for all of its internal string processing and could not use ES string functions to process them.

If you mean a metacircular evaluator, I don't think so. Can you show a 
counterexample?

If you mean a UTF-transcoder, then yes: binary data / typed arrays are 
required. That's the right answer.
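
A minimal sketch of such a transcoder, working purely on typed arrays: UTF-16 code units in a Uint16Array go in, UTF-8 bytes in a Uint8Array come out. The function name utf16ToUtf8 is hypothetical, and replacing unpaired surrogates with U+FFFD is one policy choice among several, not something this thread specifies.

```javascript
// Hypothetical transcoder: UTF-16 code units -> UTF-8 bytes.
function utf16ToUtf8(units) {
  const bytes = [];
  for (let i = 0; i < units.length; i++) {
    let cp = units[i];
    // Combine a high surrogate with a following low surrogate.
    if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < units.length) {
      const lo = units[i + 1];
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (lo - 0xdc00);
        i++;
      }
    }
    // Assumed policy: an unpaired surrogate becomes U+FFFD.
    if (cp >= 0xd800 && cp <= 0xdfff) cp = 0xfffd;

    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      bytes.push(
        0xe0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    } else {
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return new Uint8Array(bytes);
}

// U+10000 as a surrogate pair encodes to the four bytes F0 90 80 80.
console.log(Array.from(utf16ToUtf8(new Uint16Array([0xd800, 0xdc00]))));
```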

/be
Received on Monday, 20 February 2012 18:52:56 UTC
