- From: Norbert Lindenberg <ecmascript@norbertlindenberg.com>
- Date: Thu, 1 Mar 2012 23:09:15 -0800
- To: Allen Wirfs-Brock <allen@wirfs-brock.com>
- Cc: Norbert Lindenberg <ecmascript@norbertlindenberg.com>, Brendan Eich <brendan@mozilla.com>, Wes Garland <wes@page.ca>, "public-script-coord@w3.org" <public-script-coord@w3.org>, mranney@voxer.com, es-discuss <es-discuss@mozilla.org>
Comments:

1) In terms of the prioritization I suggested a few days ago
https://mail.mozilla.org/pipermail/es-discuss/2012-February/020721.html
it seems you're considering item 6 essential, item 1 a side effect (whose consequences are not mentioned - see below), and items 2-5 nice to have. Do I understand that correctly? What is this prioritization based on?

2) The description of the current situation seems incorrect. The strawman says: "As currently specified by ES5.1, supplementary characters cannot be used in the source code of ECMAScript programs." I don't see anything in the spec saying this. To the contrary, the following statement in clause 6 of the spec opens the door to supplementary characters: "If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16."

Actual source text outside of an ECMAScript runtime is rarely stored in streams of 16-bit code units; it's normally stored and transmitted in UTF-8 (including its subset ASCII) or some other single-byte or multi-byte character encoding. Interpreting source text therefore almost always requires conversion to UTF-16 as a first step. UTF-8 and several other encodings (GB18030, Big5-HKSCS, EUC-TW) can represent supplementary characters, and correct conversion to UTF-16 will convert them to surrogate pairs.

When I mentioned this before, you said that the intent of the ES5 wording was to keep ECMAScript limited to the BMP (the "UCS-2 world"):
https://mail.mozilla.org/pipermail/es-discuss/2011-May/014337.html
https://mail.mozilla.org/pipermail/es-discuss/2011-May/014342.html
However, I don't see that intent reflected in the actual text of clause 6. I have since also tested supplementary characters in UTF-8 source text on a variety of current browsers (Safari on Mac and iOS; Firefox, Chrome, and Opera on Mac and Windows; Internet Explorer on Windows), and they all handle the conversion from UTF-8 to UTF-16 correctly. Do you know of one that doesn't? The only ECMAScript implementation I have encountered that fails here is Node.js.

In addition to plain-text encoding in UTF-8, supplementary characters can also be represented in source code as a sequence of two Unicode escapes. It's not as convenient, but it works in all implementations I've tested, including Node.js.

3) Changing the source code to be just a stream of Unicode characters seems a good idea overall. However, just changing the definition of SourceCharacter is going to break things. SourceCharacter isn't only used for source syntax and JSON syntax, where the change seems benign; it's also used to define the content of String values and the interpretation of regular expression patterns:

- Subclause 7.8.4 contains the statements "The SV of DoubleStringCharacters :: DoubleStringCharacter is a sequence of one character, the CV of DoubleStringCharacter." and "The CV of DoubleStringCharacter :: SourceCharacter but not one of " or \ or LineTerminator is the SourceCharacter character itself." If SourceCharacter becomes a Unicode character, then this means coercing a 21-bit code point into a single 16-bit code unit, and that's not going to end well.

- Subclauses 15.10.1 and 15.10.2 use SourceCharacter to define PatternCharacter, IdentityEscape, RegularExpressionNonTerminator, and ClassAtomNoDash. While this could potentially be part of a set of changes to make regular expressions correctly support full Unicode, by itself it means that 21-bit code points will be coerced into or compared against 16-bit code units. Changing regular expressions to be code-point based has some compatibility risk which we need to carefully evaluate.
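To make 2) and 3) concrete, here is a small sketch (my own example, not taken from the strawman or the spec) of what the engines listed above do with a supplementary character, U+1D11E MUSICAL SYMBOL G CLEF, whether it appears literally in UTF-8-encoded source or as two Unicode escapes:

```js
// My example: U+1D11E MUSICAL SYMBOL G CLEF, a supplementary character,
// written both as a literal character in UTF-8-encoded source and as a
// pair of 16-bit Unicode escapes.
var clefLiteral = "𝄞";                    // literal supplementary character
var clefEscaped = "\uD834\uDD1E";         // the same character as a surrogate pair

clefLiteral === clefEscaped;              // expected: true in the engines listed above
clefLiteral.length;                       // expected: 2 (two 16-bit code units)
clefLiteral.charCodeAt(0).toString(16);   // expected: "d834" (high surrogate)
clefLiteral.charCodeAt(1).toString(16);   // expected: "dd1e" (low surrogate)
```

Either way, the resulting String value holds two 16-bit code units, which is exactly why redefining SourceCharacter without also adjusting 7.8.4 would mean coercing a 21-bit code point into a single code unit.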
4) The statement about UnicodeEscapeSequence, "This production is limited to only expressing 16-bit code point values.", is incorrect. Unicode escape sequences express 16-bit code units, not code points (remember that any use of the word "character" without the prefix "Unicode" in the spec after clause 6 means "16-bit code unit"). A supplementary character can be represented in source code as a sequence of two Unicode escapes. The proposed new Unicode escape syntax is more convenient and more legible, but doesn't provide new functionality. (See the sketch after the quoted message below.)

5) I don't understand the sentence "For that reason, it is impossible to know for sure whether pairs of existing 16-bit Unicode escapes are intended to represent a single logical character or an explicit two character UTF-16 encoding of a Unicode characters." - what do you mean by "an explicit two character UTF-16 encoding of a Unicode characters"? In any case, it seems pretty clear to me that a Unicode escape for a high surrogate value followed by a Unicode escape for a low surrogate value, with the spec based on 16-bit values, means a surrogate pair representing a supplementary character. Even if the system were then changed to be 32-bit based, it's hard to imagine that the intent was to create a sequence of two invalid code points.

Norbert

On Feb 29, 2012, at 19:54, Allen Wirfs-Brock wrote:

> I posted a new strawman that describes what I think should be the most minimal support that we must provide for "full unicode" in ES.next: http://wiki.ecmascript.org/doku.php?id=strawman:full_unicode_source_code
>
> I'm not suggesting that we must stop at this level of support, but I think not doing at least what is described in this proposal would be a mistake.
>
> Thoughts?
>
> Allen
>
> On Feb 28, 2012, at 3:49 AM, Brendan Eich wrote:
>
>> Wes Garland wrote:
>>> If four-byte escapes are statically rejected in BRS-on, we have a problem -- we should be able to use old code that runs in either mode unchanged when said code only uses characters in the BMP.
>>
>> We've been over this and I conceded to Allen that "four-byte escapes" (I'll use \uXXXX to be clear from now on) must work as today with BRS-on. Otherwise we make it hard to impossible to migrate code that knows what it is doing with 16-bit code units that round-trip properly.
>>
>>> Accepting both 4 and 6 byte escapes is a problem, though -- what is "\u123456".length? 1 or 3?
>>
>> This is not a problem. We want .length to distribute across concatenation, so 3 is the only answer and in particular ("\u1234" + "\u5678").length === 2 irrespective of BRS.
>>
>>> If we accept "\u1234" in BRS-on as a string with length 5 -- as we do today in ES5 with "\u123".length===4 -- we give developers a way to feature-test and conditionally execute code, allowing libraries to run with BRS-on and BRS-off.
>>
>> Feature-testing should be done using a more explicit test. API TBD, but I don't think breaking "\uXXXX" with BRS on is a good idea.
>>
>> I agree with you that Roozbeh is hardly used, so it can take the hit of having to feature-test the BRS. The much more common case today is JS code that blithely ignores non-BMP characters that make it into strings as pairs, treating them blindly as two "characters" (ugh; must purge that "c-word" abusage from the spec).
>>
>> /be
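Here is the sketch referenced in 4) above: the existing pair of 16-bit escapes already expresses a supplementary character, and, as the quoted discussion notes, .length counts 16-bit code units and distributes across concatenation. The \u{...} line reflects my reading of the strawman's proposed syntax and is commented out because it does not parse in current engines.

```js
// A supplementary character written as a pair of 16-bit escapes, which works
// today, versus the strawman's proposed \u{...} form (assumed syntax, shown
// in a comment because current engines reject it).
var viaPair = "\uD834\uDD1E";      // U+1D11E as two 16-bit code unit escapes
// var viaCodePoint = "\u{1D11E}"; // proposed: same character, more legible

viaPair.length;                        // 2 under ES5 semantics: two code units
("\u1234" + "\u5678").length === 2;    // true: .length distributes across concatenation
```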