Re: New full Unicode for ES6 idea from Brendan Eich on 2012-02-21 (public-script-coord@w3.org from January to March 2012)

From: Brendan Eich <brendan@mozilla.com>
Date: Mon, 20 Feb 2012 21:03:03 -0800
To: Allen Wirfs-Brock <allen@wirfs-brock.com>
CC: Gavin Barraclough <barraclough@apple.com>, public-script-coord@w3.org, Anne van Kesteren <annevk@opera.com>, mranney@voxer.com, es-discuss discussion <es-discuss@mozilla.org>
Message-ID: <4F432587.9020107@mozilla.com>
Allen Wirfs-Brock wrote:
> On Feb 20, 2012, at 3:14 PM, Brendan Eich wrote:
> Note that the above say "invalid Unicode code point". 0xd800 is a 
> valid Unicode code point. It isn't a valid Unicode character.
>
> See 
> http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html#isValidCodePoint(int) 
>
>
>     Determines whether the specified code point is a valid Unicode
>     code point value in the range of |0x0000| to |0x10FFFF| inclusive.
>     This method is equivalent to the expression:
>
>           codePoint >= 0x0000 && codePoint <= 0x10FFFF
>

I should have remembered this, from the old days of Java and JS talking 
(LiveConnect). Strike one for me.
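
To make the code point versus character distinction concrete, here is a minimal sketch in JS. Only isValidCodePoint mirrors the Java method quoted above; isSurrogate and isScalarValue are hypothetical helper names added for illustration, not part of any proposal in this thread.

```javascript
// Mirrors Java's Character.isValidCodePoint: any integer in 0x0000..0x10FFFF.
function isValidCodePoint(cp) {
  return Number.isInteger(cp) && cp >= 0x0000 && cp <= 0x10FFFF;
}

// Hypothetical helper: true for surrogate code points (0xD800..0xDFFF),
// which are valid code points but not valid characters on their own.
function isSurrogate(cp) {
  return cp >= 0xD800 && cp <= 0xDFFF;
}

// Hypothetical helper: a valid code point that is not a surrogate.
function isScalarValue(cp) {
  return isValidCodePoint(cp) && !isSurrogate(cp);
}
```

With these definitions 0xD800 passes isValidCodePoint but fails isScalarValue, which is exactly the gap Allen points out.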

>> If that's true then we should have enough evidence that I'll happily 
>> concede the point and the spec will allow "\uD800" etc. in BRS-enabled 
>> literals. I do not see such evidence.
>
> Note my concern isn't so much about literals as it is about string 
> elements created via String.fromCharCode
>
> The only String.prototype method algorithms that seem to have any Unicode 
> dependencies are toLowerCase/toUpperCase and the locale variants of 
> those methods, and perhaps localeCompare, trim (which knows the Unicode 
> white space character classification), and the regular expression based 
> methods if the regexp is constructed with literal chars or uses 
> character classes.
>
> The concat, slice, substring, indexOf/lastIndexOf, non-regexp 
> based replace and split calls are all defined in terms of string 
> element value comparisons and don't really care about what character 
> set is used.
>
> Wes Garland mentioned the possibility of using non-Unicode character 
> sets such as Big5

These are byte-based encodings, no? What is the problem inflating them by 
zero extension to 16 bits now (or 21 bits in the future)? You can't make 
an invalid Unicode character from a byte value.

Anyway, Big5 punned into JS strings (via a C or C++ API?) is *not* a 
strong use-case for ignoring invalid characters.

Ball one. :-P
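
A minimal sketch of the zero-extension point, assuming the foreign-charset data arrives as plain bytes. The names inflateBytes and hasSurrogateUnit are hypothetical, and the two sample bytes are arbitrary values, not checked against any Big5 table.

```javascript
// Inflate a byte sequence into a JS string by zero extension:
// each byte (0..255) becomes one string element.
function inflateBytes(bytes) {
  var out = "";
  for (var i = 0; i < bytes.length; i++) {
    out += String.fromCharCode(bytes[i] & 0xFF);
  }
  return out;
}

// True if any string element lies in the surrogate range 0xD800..0xDFFF.
function hasSurrogateUnit(s) {
  for (var i = 0; i < s.length; i++) {
    var u = s.charCodeAt(i);
    if (u >= 0xD800 && u <= 0xDFFF) return true;
  }
  return false;
}

// Two arbitrary bytes standing in for one double-byte sequence.
var inflated = inflateBytes(new Uint8Array([0xA4, 0x40]));
```

Since every element is at most 0xFF, hasSurrogateUnit(inflated) can never be true for a string built this way.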

>> Python: 
>> http://docs.python.org/release/3.0.1/howto/unicode.html#the-string-type 
>> -- lots here, similar to Ruby 1.9 (see below) but not obviously in 
>> need of invalid Unicode stored in uint16 vectors handled as JS strings.
>>
>
> I don't see any restrictions in that doc on strings 
> containing \ud800 and friends. Unless there are, BRS enabled ES 
> strings couldn't be used as the representation type for Python strings.

You're right, you can make a literal in Python 3 such as '\ud800' 
without error. Strike two.

>> Ruby: Ruby supports strings with multiple encodings; the encoding is 
>> part of the string's metadata. I am not the expert here, but I found
>>
>> http://www.tbray.org/ongoing/When/200x/2008/09/18/Ruby-I18n
>>
>> helpful, and the more recent
>>
>> http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/
>>
>> too. See also these very interesting posts from Sam Ruby in 2007:
>>
>> http://intertwingly.net/blog/2007/12/28/3-1-2
>> http://intertwingly.net/blog/2007/12/29/Ruby-1-9-Strings-Updated
>>
>> Ruby raises exceptions when you mix two strings with different 
>> encodings incorrectly, e.g. by concatenation.
>>
>> I'm not sure about surrogate validation, but from all this I gather 
>> that compiling Ruby to JS in full would need Uint8Array, along with 
>> lots of runtime helpers that do not come for free from JS's 
>> String.prototype methods, in order to handle all the encodings.
>>
>>> If Typed Arrays are going to be the new "string" type (maybe better 
>>> stated as array of chars) for people doing systems programming in JS 
>>> then we should probably start thinking about a broader set of 
>>> utility functions/methods that support them.
>
> But, using current 16-bit JS string semantics a JS string could still 
> be used as the character store for many of these encodings with the 
> meta data stored separately (probably a RubyString wrapper object) and 
> the charset-insensitive JS string methods could be used to implement 
> the Ruby semantics.

Did I get a hit off your pitch, then? Because Ruby does at least raise 
exceptions on mixed encoding concatenations.
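
A rough sketch of the wrapper Allen describes, assuming the encoding is carried as a plain label next to the JS string. RubyString and its methods are hypothetical here, and the mismatch check is deliberately stricter than real Ruby, which tolerates some mixes (for example ASCII-only strings).

```javascript
// Hypothetical wrapper: a JS string as the element store, with the
// encoding name kept separately as metadata.
function RubyString(units, encoding) {
  this.units = units;       // plain JS string, one element per stored unit
  this.encoding = encoding; // label only, e.g. "UTF-8" or "Big5"; not validated
}

// Concatenate, refusing to mix encodings. Simplification: this sketch
// throws on any label mismatch.
RubyString.prototype.concat = function (other) {
  if (other.encoding !== this.encoding) {
    throw new TypeError(
      "incompatible encodings: " + this.encoding + " and " + other.encoding);
  }
  return new RubyString(this.units + other.units, this.encoding);
};

// Charset-insensitive operations can delegate straight to String.prototype.
RubyString.prototype.indexOf = function (other) {
  return this.units.indexOf(other.units);
};
```

The encoding check is the part that does not come for free from String.prototype; the element-wise work does.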

But I'm about to strike out on the next pitch (language). You're almost 
certainly right that most languages with "full Unicode" support allow 
the programmer to create invalid strings via literals and constructors. 
It also seems common for charset-encoding APIs to validate and throw on 
invalid characters, which makes sense.

I could live with this, especially for String.fromCharCode.
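
As an illustration of "validate at the encoding boundary", here is a minimal sketch of a strict UTF-8 encoder over today's 16-bit string elements. The name encodeUtf8Strict is made up for this example, and the choice of RangeError is an assumption, not something any spec in this thread prescribes.

```javascript
// Encode a JS string (16-bit elements) as UTF-8 bytes, throwing on
// unpaired surrogates instead of silently passing them through.
function encodeUtf8Strict(s) {
  var bytes = [];
  for (var i = 0; i < s.length; i++) {
    var cp = s.charCodeAt(i);
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      var lo = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (lo < 0xDC00 || lo > 0xDFFF) {
        throw new RangeError("unpaired high surrogate at index " + i);
      }
      cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
      i++; // consume the low surrogate as well
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      throw new RangeError("unpaired low surrogate at index " + i);
    }
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      bytes.push(0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                 0x80 | (cp & 0x3F));
    } else {
      bytes.push(0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    }
  }
  return bytes;
}
```

A string made with String.fromCharCode(0xD800) stays legal as a string under this scheme; it only fails when something tries to encode it.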

For "\uD800..." in a BRS-enabled string literal, it still seems to me 
something is going to go wrong right away. Or rather, something *should* 
(like, early error). But based on precedent, and for the odd usage that 
doesn't go wrong ever (reads back code units, or has native code reading 
them and reassembling uint16 elements), I'll go along here too.

This means Gavin's option

2) Allow invalid unicode characters in strings, and preserve them over 
concatenation – ("\uD800" + "\uDC00").length == 2.

as you noted in reply to him.
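
Option 2 can be checked against current 16-bit strings, where the observable result coincides with what the option asks of BRS-enabled strings. This is only a sketch of the property, with variable names chosen for illustration.

```javascript
var hi = "\uD800";   // lone high surrogate, stored as one element
var lo = "\uDC00";   // lone low surrogate, stored as one element
var joined = hi + lo;

// .length distributes over concatenation: 1 + 1 == 2.
var distributes = joined.length === hi.length + lo.length;

// The original elements read back unchanged after concatenation.
var units = [joined.charCodeAt(0), joined.charCodeAt(1)];
```

The alternative, fusing the pair into a single element on concatenation, would make joined.length differ from hi.length + lo.length.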

> BRS excluding surrogate codes would at the very least require 
> additional special case handling when dealing with Ruby strings 
> containing those code points.

I suspect Ruby-on-JS is best done via Emscripten (demo'ed at JSConf.eu 
2011), which makes this moot. With Emscripten you get the canonical Ruby 
implementation, not a hand-coded JS work(mostly)alike.

> Yes I meant the Emscripten runtime "foreign" call support for calling 
> JS functions. I did mean censor in that sense. Assume that you want to 
> automatically convert WCHAR*

Again, wchar_t is not uint16 on all platforms.

At this point I'm not going to try my luck at bat again. Gavin's option 
2 at least preserves .length distributivity over concatenation. So let's 
go on to other issues. What's next?

/be
Received on Tuesday, 21 February 2012 05:03:34 UTC