Re: New full Unicode for ES6 idea from Norbert Lindenberg on 2012-03-02 (public-script-coord@w3.org from January to March 2012)

From: Norbert Lindenberg <ecmascript@norbertlindenberg.com>
Date: Thu, 1 Mar 2012 23:09:15 -0800
To: Allen Wirfs-Brock <allen@wirfs-brock.com>
Cc: Norbert Lindenberg <ecmascript@norbertlindenberg.com>, Brendan Eich <brendan@mozilla.com>, Wes Garland <wes@page.ca>, "public-script-coord@w3.org" <public-script-coord@w3.org>, mranney@voxer.com, es-discuss <es-discuss@mozilla.org>
Message-Id: <B03378FE-87E2-481D-B5AF-7131C653CEC7@norbertlindenberg.com>
Comments:

1) In terms of the prioritization I suggested a few days ago
https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html
it seems you're considering item 6 essential, item 1 a side effect (whose consequences are not mentioned - see below), items 2-5 nice to have. Do I understand that correctly? What is this prioritization based on?


2) The description of the current situation seems incorrect. The strawman says: "As currently specified by ES5.1, supplementary characters cannot be used in the source code of ECMAScript programs." I don't see anything in the spec saying this. To the contrary, the following statement in clause 6 of the spec opens the door to supplementary characters: "If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16." Actual source text outside of an ECMAScript runtime is rarely stored in streams of 16-bit code units; it's normally stored and transmitted in UTF-8 (including its subset ASCII) or some other single-byte or multi-byte character encoding. Interpreting source text therefore almost always requires conversion to UTF-16 as a first step. UTF-8 and several other encodings (GB18030, Big5-HKSCS, EUC-TW) can represent supplementary characters, and correct conversion to UTF-16 will convert them to surrogate pairs.

When I mentioned this before, you said that the intent of the ES5 wording was to keep ECMAScript limited to the BMP (the "UCS-2 world").
https://mail.mozilla.org/pipermail/es-discuss/2011-May/014337.html
https://mail.mozilla.org/pipermail/es-discuss/2011-May/014342.html
However, I don't see that intent reflected in the actual text of clause 6.

I have since also tested with supplementary characters in UTF-8 source text on a variety of current browsers (Safari / (Mac, iOS), (Firefox, Chrome, Opera) / (Mac, Windows), Explorer / Windows), and they all handle the conversion from UTF-8 to UTF-16 correctly. Do you know of one that doesn't? The only ECMAScript implementation I encountered that fails here is Node.js.

In addition to plain text encoding in UTF-8, supplementary characters can also be represented in source code as a sequence of two Unicode escapes. It's not as convenient, but it works in all implementations I've tested, including Node.js.


3) Changing the source code to be just a stream of Unicode characters seems a good idea overall. However, just changing the definition of SourceCharacter is going to break things. SourceCharacter isn't only used for source syntax and JSON syntax, where the change seems benign; it's also used to define the content of String values and the interpretation of regular expression patterns:
- Subclause 7.8.4 contains the statements "The SV of DoubleStringCharacters :: DoubleStringCharacter is a sequence of one character, the CV of DoubleStringCharacter." and "The CV of DoubleStringCharacter :: SourceCharacter but not one of " or \ or LineTerminator is the SourceCharacter character itself." If SourceCharacter becomes a Unicode character, then this means coercing a 21-bit code point into a single 16-bit code unit, and that's not going to end well.
- Subclauses 15.10.1 and 15.10.2 use SourceCharacter to define PatternCharacter, IdentityEscape, RegularExpressionNonTerminator, ClassAtomNoDash. While this could potentially be part of a set of changes to make regular expression correctly support full Unicode, by itself it means that 21-bit code points will be coerced into or compared against 16-bit code units. Changing regular expressions to be code-point based has some compatibility risk which we need to carefully evaluate.


4) The statement about UnicodeEscapeSequence: "This production is limited to only expressing 16-bit code point values." is incorrect. Unicode escape sequences express 16-bit code units, not code points (remember that any use of the word "character" without the prefix "Unicode" in the spec after clause 6 means "16-bit code unit"). A supplementary character can be represented in source code as a sequence of two Unicode escapes. The proposed new Unicode escape syntax is more convenient and more legible, but doesn't provide new functionality.


5) I don't understand the sentence "For that reason, it is impossible to know for sure whether pairs of existing 16-bit Unicode escapes are intended to represent a single logical character or an explicit two character UTF-16 encoding of a Unicode characters." - what do you mean by "an explicit two character UTF-16 encoding of a Unicode characters"? In any case, it seems pretty clear to me that a Unicode escape for a high surrogate value followed by a Unicode escape for a low surrogate value, with the spec based on 16-bit values, means a surrogate pair representing a supplementary character. Even if the system were then changed to be 32-bit based, it's hard to imagine that the intent was to create a sequence of two invalid code points.

Norbert


On Feb 29, 2012, at 19:54 , Allen Wirfs-Brock wrote:

> I posted a new stawman that describes what I think should is that most minimal support that we must provide for "full unicode" in ES.next: http://wiki.ecmascript.org/doku.php?id=strawman:full_unicode_source_code 
> 
> I'm not suggesting that we must stop at this level of support, but I think not doing at least what is describe in this proposal would would be mistake.
> 
> Thoughts?
> 
> 
> Allen
> 
> 
> 
> On Feb 28, 2012, at 3:49 AM, Brendan Eich wrote:
> 
>> Wes Garland wrote:
>>> If four-byte escapes are statically rejected in BRS-on, we have a problem -- we should be able to use old code that runs in either mode unchanged when said code only uses characters in the BMP.
>> 
>> We've been over this and I conceded to Allen that "four-byte escapes" (I'll use \uXXXX to be clear from now on) must work as today with BRS-on. Otherwise we make it hard to impossible to migrate code that knows what it is doing with 16-bit code units that round-trip properly.
>> 
>>> Accepting both 4 and 6 byte escapes is a problem, though -- what is "\u123456".length?  1 or 3?
>> 
>> This is not a problem. We want .length to distribute across concatenation, so 3 is the only answer and in particular ("\u1234" + "\u5678").length === 2 irrespective of BRS.
>> 
>>> If we accept "\u1234" in BRS-on as a string with length 5 -- as we do today in ES5 with "\u123".length===4 -- we give developers a way to feature-test and conditionally execute code, allowing libraries to run with BRS-on and BRS-off.
>> 
>> Feature-testing should be done using a more explicit test. API TBD, but I don't think breaking "\uXXXX" with BRS on is a good idea.
>> 
>> I agree with you that Roozbeh is hardly used, so it can take the hit of having to feature-test the BRS. The much more common case today is JS code that blithely ignores non-BMP characters that make it into strings as pairs, treating them blindly as two "characters" (ugh; must purge that "c-word" abusage from the spec).
>> 
>> /be
>> 
>
Received on Friday, 2 March 2012 07:09:58 UTC