Re: New full Unicode for ES6 idea

On Mar 1, 2012, at 11:09 PM, Norbert Lindenberg wrote:

> Comments:
> 
> 1) In terms of the prioritization I suggested a few days ago
> https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html
> it seems you're considering item 6 essential, item 1 a side effect (whose consequences are not mentioned - see below), items 2-5 nice to have. Do I understand that correctly? What is this prioritization based on?

The main intent of this proposal was to push forward with including \u{ } in ES6, regardless of any other ongoing full-Unicode discussions we are having. Hopefully we can achieve more than that, but if we don't, the inclusion of \u{ } now should make it easier the next time we attack this problem, by reducing the use of \uxxxx\uxxxx pairs, which are ambiguous in intent. My expectation is that we would tell the world that \u{ } is the new \uxxxx\uxxxx and that they should avoid using the latter form to inject supplementary characters into strings (and RegExps).

However, that usage depends upon the fact that today's implementations generally do allow supplementary characters to exist in ECMAScript source code and do something rational with them. ES5 botched this by saying that supplementary characters can't exist in ECMAScript source code, so we also need to fix that.
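As a concrete sketch (the code point U+1D306 is chosen arbitrarily), here is how the two forms would compare once \u{ } is available, assuming string values remain sequences of 16-bit code units:

    // Today a supplementary character has to be written as an explicit
    // surrogate pair of 16-bit escapes.
    var oldForm = "\uD834\uDF06";   // U+1D306 TETRAGRAM FOR CENTRE

    // With the proposed syntax the intent is stated directly.
    var newForm = "\u{1D306}";

    // Assuming string values remain sequences of 16-bit code units,
    // both forms would denote the same two-element string.
    oldForm === newForm;            // expected: true
    newForm.length;                 // expected: 2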


> 
> 
> 2) The description of the current situation seems incorrect. The strawman says: "As currently specified by ES5.1, supplementary characters cannot be used in the source code of ECMAScript programs." I don't see anything in the spec saying this. To the contrary, the following statement in clause 6 of the spec opens the door to supplementary characters: "If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16." Actual source text outside of an ECMAScript runtime is rarely stored in streams of 16-bit code units; it's normally stored and transmitted in UTF-8 (including its subset ASCII) or some other single-byte or multi-byte character encoding. Interpreting source text therefore almost always requires conversion to UTF-16 as a first step. UTF-8 and several other encodings (GB18030, Big5-HKSCS, EUC-TW) can represent supplementary characters, and correct conversion to UTF-16 will convert them to surrogate pairs.
> 
> When I mentioned this before, you said that the intent of the ES5 wording was to keep ECMAScript limited to the BMP (the "UCS-2 world").
> https://mail.mozilla.org/pipermail/es-discuss/2011-May/014337.html
> https://mail.mozilla.org/pipermail/es-discuss/2011-May/014342.html
> However, I don't see that intent reflected in the actual text of clause 6.
> 
> I have since also tested with supplementary characters in UTF-8 source text on a variety of current browsers (Safari / (Mac, iOS), (Firefox, Chrome, Opera) / (Mac, Windows), Explorer / Windows), and they all handle the conversion from UTF-8 to UTF-16 correctly. Do you know of one that doesn't? The only ECMAScript implementation I encountered that fails here is Node.js.

http://code.google.com/p/v8/issues/detail?id=761 suggests that V8 truncates supplementary characters rather than converting them to surrogate pairs. However, it is unclear whether that refers to string literals in the source code or only to computationally generated strings.
> 
> In addition to plain text encoding in UTF-8, supplementary characters can also be represented in source code as a sequence of two Unicode escapes. It's not as convenient, but it works in all implementations I've tested, including Node.js.

The main problem is
  SourceCharacter :: 
     any Unicode code unit

and "...the phrase 'code unit'  and the word 'character' will be used to refer to a 16-bit unsigned value..."

All of the lexical rules in clause 7 are defined in terms of "characters" (i.e., code units). So, for example, a supplementary character in category Lo occurring in an Identifier context would, at best, be seen as a pair of code units, neither of which is in a category valid for IdentifierPart, so the identifier would be invalid. Similarly, a pair of \uXXXX escapes representing such a character would also be lexed as two distinct characters and result in an invalid identifier.
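A minimal sketch of that identifier problem (the choice of U+20000, a CJK Extension B ideograph in category Lo, is arbitrary); under a strict reading of the ES5 grammar both constructions below should be rejected:

    // Helper: does this implementation's parser accept the given source text?
    function accepts(src) {
      try { Function(src); return true; } catch (e) { return false; }
    }

    // U+20000 reaches the lexer as the surrogate pair D840 DC00, and
    // neither code unit is a valid IdentifierStart/IdentifierPart.
    accepts("var \uD840\uDC00 = 1;");     // raw supplementary character in the source
    accepts("var \\uD840\\uDC00 = 1;");   // the same character as two \uXXXX escapes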

Regarding the intent of the current wording, I was speaking of my own intent when I was actually editing that text for the ES5 spec. My understanding at the time was that the lexical alphabet of ECMAScript was 16-bit code units, and I was trying to clarify that, but I think I botched it. In reality, I think that understanding is still correct: as I noted in the previous paragraph, there is nothing in the lexical grammar that deals with anything other than 16-bit code units. Any conversion from a non-16-bit character encoding is something that logically happens prior to processing as "ECMAScript source code".

> 
> 
> 3) Changing the source code to be just a stream of Unicode characters seems a good idea overall. However, just changing the definition of SourceCharacter is going to break things. SourceCharacter isn't only used for source syntax and JSON syntax, where the change seems benign; it's also used to define the content of String values and the interpretation of regular expression patterns:
> - Subclause 7.8.4 contains the statements "The SV of DoubleStringCharacters :: DoubleStringCharacter is a sequence of one character, the CV of DoubleStringCharacter." and "The CV of DoubleStringCharacter :: SourceCharacter but not one of " or \ or LineTerminator is the SourceCharacter character itself." If SourceCharacter becomes a Unicode character, then this means coercing a 21-bit code point into a single 16-bit code unit, and that's not going to end well.
> - Subclauses 15.10.1 and 15.10.2 use SourceCharacter to define PatternCharacter, IdentityEscape, RegularExpressionNonTerminator, ClassAtomNoDash. While this could potentially be part of a set of changes to make regular expression correctly support full Unicode, by itself it means that 21-bit code points will be coerced into or compared against 16-bit code units. Changing regular expressions to be code-point based has some compatibility risk which we need to carefully evaluate.

Yes, but it isn't clear that it will change anything. We've just discussed that, in practice, JS implementations accept supplementary characters in string and RegExp literals. This proposal says that however implementations treat such characters, they must treat \u{ } characters in the same way.

The interesting thing about JSON and eval is that they take their input from actual JS strings rather than from some abstract input source. The SourceCharacters they currently process correspond to single 16-bit string elements. Changing the grammar would change that correspondence unless we also change the semantics of string element values. This proposal leaves that issue for independent consideration.
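A small illustration of that correspondence, as current implementations behave (U+1D11E chosen arbitrarily):

    // JSON.parse consumes a JS string whose elements are 16-bit code
    // units, so a supplementary character arrives as two SourceCharacters.
    var s = JSON.parse('"\uD834\uDD1E"');   // U+1D11E MUSICAL SYMBOL G CLEF
    s.length;                     // 2: two 16-bit string elements
    s.charCodeAt(0).toString(16); // "d834"
    s.charCodeAt(1).toString(16); // "dd1e"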




> 
> 
> 4) The statement about UnicodeEscapeSequence: "This production is limited to only expressing 16-bit code point values." is incorrect. Unicode escape sequences express 16-bit code units, not code points (remember that any use of the word "character" without the prefix "Unicode" in the spec after clause 6 means "16-bit code unit"). A supplementary character can be represented in source code as a sequence of two Unicode escapes. The proposed new Unicode escape syntax is more convenient and more legible, but doesn't provide new functionality.

As I said above, any such surrogate pairs aren't recognized by the grammar as single Unicode characters. What I meant by the quoted phrase is something like "This production is limited to only expressing values in the 16-bit subset of code point values".
> 
> 
> 5) I don't understand the sentence "For that reason, it is impossible to know for sure whether pairs of existing 16-bit Unicode escapes are intended to represent a single logical character or an explicit two character UTF-16 encoding of a Unicode characters." - what do you mean by "an explicit two character UTF-16 encoding of a Unicode characters"? In any case, it seems pretty clear to me that a Unicode escape for a high surrogate value followed by a Unicode escape for a low surrogate value, with the spec based on 16-bit values, means a surrogate pair representing a supplementary character. Even if the system were then changed to be 32-bit based, it's hard to imagine that the intent was to create a sequence of two invalid code points.

We don't know whether the intent is to explicitly construct a UTF-16-encoded string that is to be passed to a consumer that demands UTF-16 encoding, or whether the intent is simply to logically express a specific supplementary character in a context where the internal encoding isn't known or relevant. ES5 doesn't have a way to distinguish those two use cases.
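To make the ambiguity concrete, a sketch of the two readings of the same ES5 source text:

    // Reading 1: "give me exactly these two 16-bit code units" --
    // a deliberate UTF-16 encoding for a consumer that demands it.
    var explicitUtf16 = "\uD834\uDD1E";

    // Reading 2: "give me the character U+1D11E, however it is stored".
    var logicalChar = "\uD834\uDD1E";

    // ES5 reads both identically; nothing in the syntax records which
    // was meant.  The proposed \u{1D11E} form would express the second
    // intent unambiguously.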

Allen

Received on Friday, 2 March 2012 16:32:27 UTC