- From: Brendan Eich <brendan@mozilla.com>
- Date: Mon, 20 Feb 2012 10:52:28 -0800
- To: Allen Wirfs-Brock <allen@wirfs-brock.com>
- CC: Gavin Barraclough <barraclough@apple.com>, public-script-coord@w3.org, Anne van Kesteren <annevk@opera.com>, mranney@voxer.com, es-discuss discussion <es-discuss@mozilla.org>
Allen Wirfs-Brock wrote:
> For the moment, I'll simply take Wes' word for the above, as it logically makes sense. For some uses, you want to process all possible code points (for example, when validating data from an external source). At this lowest level you don't want to impose higher level Unicode semantic constraints:
>
>    if (stringFromElseWhere.indexOf("\u{d800}")) ....

Sorry, I disagree. We have a chance to keep Strings consistent with "full Unicode", or broken into uint16 pieces. There is no self-consistent third way that has 21-bit code points but allows one to jam what up until now have been both code points and code units into code points, where they will be misinterpreted.

If someone wants to do data hacking, Binary Data (Typed Arrays) are there (even in IE10pp).

>>> Eliminating the "\uXXXX" syntax really doesn't change anything regarding actual string processing.
>>
>> True, but not my point!
>
> but elsewhere you said you would reject String.fromCharCode(0xd800)

I'm being consistent (I hope!). I'd reject "\uXXXX" altogether with the BRS set. It's ambiguous at best, or (I argue, and you argue some of the time) it means code units, not code points. We're doing points now, no units, with the BRS set, so it has to go.

Same goes for constructive APIs taking (with the BRS set) code points. I see nothing but mischief arising from allowing [D800-DFFF]. Unicode gurus should school us if there's a use-case that can be sanely composed with "full Unicode" and "code points, not units" iteration.

> so it sounds to me like you are trying to actually ban the occurrence of 0xd800 as the value of a string element.

Under the BRS set to "full Unicode", as a code point, yes.

>>> What it might do, however, is eliminate the ambiguity about the intended meaning of "\uD800\uDc00" in legacy code.
>>
>> And arising from concatenations, avoiding the loss of Gavin's distributive .length property.
>
> These aren't the same thing.
>
> "\ud8000\udc00" is a specific syntactic construct where there must have been a specific user intent in writing it.

(One too many 0s there.) We do not want to guess. All I know is that "\ud800\udc00" means what it means today in ECMA-262 and conforming implementations. With the BRS set to "full Unicode", it could be taken to mean two code points, but that results in invalid Unicode and is not backward compatible. It could be read as one code point, but that is what "\u{...}" is for, and we want anyone migrating such "hardcoded" code into the BRS to check and choose.

> Our legacy problem is that the intent becomes ambiguous when that same sequence might be interpreted under different BRS settings.

I propose to solve that by forbidding "\uXXXX" when the BRS is set.

> str1 + str2 is much less specific and all we know at runtime (assuming either str1 or str2 are strings) is that the user wants to concatenate them. The values might be:
>
>    str1 = String.fromCharCode(0xd800);
>    str2 = String.fromCharCode(0xdc00);
>
> and the user might be intentionally constructing a string containing an explicit UTF-16 encoding that is going to be passed off to an external agent that specifically requires UTF-16.

Nope, cuz I'm proposing String.fromCharCode calls such as those throw.

We should not be making more type-confusion hazards just to play a guessing game that might (but probably won't) preserve some edge-case "hardcoded" surrogate hacking that exists in code on the Web or behind a firewall today. Such code can do what it has always done, unless and until its maintainer throws the BRS. At that point early and runtime errors will provoke a rewrite to "\u{...}" and, with fromCharCode etc., to 21-bit code points that are not reserved for surrogates.
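To make the code unit vs. code point split concrete, here is a rough sketch of how the same data reads today (16-bit string elements, assuming the proposed "\u{...}" escape and nothing else new) and how it would read with the BRS thrown; the BRS lines describe the proposal, not anything implemented:

    // Today, without the BRS: string elements are 16-bit code units.
    var s = "\uD800\uDC00";                 // two code units -- one surrogate pair
    console.log(s.length);                  // 2
    console.log(s === "\u{10000}");         // true -- same code unit sequence
    console.log(String.fromCharCode(0xD800).length);  // 1 -- a lone surrogate is legal

    // With the BRS set, as proposed: "\uD800" is an early error,
    // String.fromCharCode(0xD800) throws, "\u{10000}" is a single string
    // element (s.length would be 1), and .length stays distributive over
    // concatenation because surrogates never pair up after the fact.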
> Another way to express what I see as the problem with what you are proposing about imposing such string semantics:
>
> Could the revised ECMAScript be used to implement a language that had similar but not identical semantic rules to those you are suggesting for ES strings? My sense is that if we went down the path you are suggesting, such an implementation would have to use binary data arrays for all of its internal string processing and could not use ES string functions to process them.

If you mean a metacircular evaluator, I don't think so. Can you show a counterexample?

If you mean a UTF-transcoder, then yes: binary data / typed arrays are required. That's the right answer.
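What I mean by the transcoder case, as a rough sketch on top of typed arrays (the encodeUTF16 name and its error behavior are illustrative, not anything specified):

    // Encode an array of code points as UTF-16 code units in a Uint16Array.
    function encodeUTF16(codePoints) {
      var units = new Uint16Array(codePoints.length * 2);  // worst case: 2 units per code point
      var n = 0;
      for (var i = 0; i < codePoints.length; i++) {
        var cp = codePoints[i];
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
          throw new RangeError("invalid code point: " + cp);
        }
        if (cp <= 0xFFFF) {
          units[n++] = cp;                      // BMP: one code unit
        } else {
          cp -= 0x10000;
          units[n++] = 0xD800 + (cp >> 10);     // high surrogate
          units[n++] = 0xDC00 + (cp & 0x3FF);   // low surrogate
        }
      }
      return units.subarray(0, n);
    }

    encodeUTF16([0x48, 0x10000]);  // Uint16Array: 0x0048, 0xD800, 0xDC00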
/be

Received on Monday, 20 February 2012 18:52:56 UTC