Re: New full Unicode for ES6 idea from Brendan Eich on 2012-02-20 (public-script-coord@w3.org from January to March 2012)

From: Brendan Eich <brendan@mozilla.com>
Date: Mon, 20 Feb 2012 15:14:13 -0800
To: Allen Wirfs-Brock <allen@wirfs-brock.com>
CC: Gavin Barraclough <barraclough@apple.com>, public-script-coord@w3.org, Anne van Kesteren <annevk@opera.com>, mranney@voxer.com, es-discuss discussion <es-discuss@mozilla.org>
Message-ID: <4F42D3C5.9030701@mozilla.com>
Allen Wirfs-Brock wrote:
> On Feb 20, 2012, at 12:32 PM, Brendan Eich wrote:
>
>> Allen Wirfs-Brock wrote:
>>> ...
>>> You are essentially saying that a compiler targeting ES for a language X that includes a string data type that does not conform to your rules (for example, by allowing occurrences of surrogate code points within string data)
>> First, as a point of order: yes, JS strings as full Unicode does not want stray surrogate pair-halves. Does anyone disagree?
>
> Well, I'm disagreeing.  Do you know of any other language that has imposed these sorts of semantic restrictions on runtime string data?
Sure, Java:


      String

public String(int[] codePoints,
              int offset,
              int count)

    Allocates a new String that contains characters from a subarray of
    the Unicode code point array argument. The offset argument is the
    index of the first code point of the subarray and the count argument
    specifies the length of the subarray. The contents of the subarray
    are converted to chars; subsequent modification of the int array
    does not affect the newly created string.

    Parameters:
        codePoints - array that is the source of Unicode code points.
        offset - the initial offset.
        count - the length.
    Throws:
        IllegalArgumentException
        <http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/IllegalArgumentException.html> -
        if any invalid Unicode code point is found in codePoints
        IndexOutOfBoundsException
        <http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/IndexOutOfBoundsException.html> -
        if the offset and count arguments index characters outside the
        bounds of the codePoints array.
    Since:
        1.5
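
For comparison, here is a rough JavaScript sketch of a constructor-like helper in the same spirit. The name stringFromCodePoints and the rejectSurrogates flag are hypothetical, not part of any spec or proposal. The quoted Java documentation only says "invalid Unicode code point", so whether lone surrogates count as invalid is left as a switch here rather than assumed.

```javascript
// Hypothetical helper, loosely modeled on the Java constructor quoted above.
// Builds a string from codePoints[offset .. offset + count - 1].
function stringFromCodePoints(codePoints, offset, count, rejectSurrogates = true) {
  if (offset < 0 || count < 0 || offset > codePoints.length - count) {
    throw new RangeError("offset/count index outside the codePoints array");
  }
  let result = "";
  for (let i = offset; i < offset + count; i++) {
    const cp = codePoints[i];
    // Anything outside 0..0x10FFFF is never a Unicode code point.
    if (!Number.isInteger(cp) || cp < 0 || cp > 0x10FFFF) {
      throw new RangeError("invalid Unicode code point: " + cp);
    }
    // Assumption: treating D800-DFFF as an error is a policy choice of this sketch.
    if (rejectSurrogates && cp >= 0xD800 && cp <= 0xDFFF) {
      throw new RangeError("lone surrogate code point: U+" + cp.toString(16).toUpperCase());
    }
    result += String.fromCodePoint(cp);
  }
  return result;
}
```

With the flag off, the helper behaves like today's 16-bit-unit strings and lets a pair-half through; with it on, the same input is an error.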



>> Second, binary data / typed arrays stand ready for any such not-full-Unicode use-cases.
>
> But lacks the same level of utility function support, not the least of which is RegExp

RegExp is miserable for Unicode, it's true. That doesn't strike me as 
compelling for making full-Unicode strings more bug-prone.

There is a strong case to be made for evolving RegExp to be usable with 
certain typed arrays (byte, uint16 at least). But that's another thread.

We should beef up RegExp Unicode escapes; another 'nother thread.

>>> could not use ES strings as the target representation of its string data type.  It also could not use the built-in ES string functions in the implementation of language X's built-in functions.
>> Not if this hypothetical source language being compiled to JS wants other than full Unicode, no.
>>
>> Why is this a problem, even hypothetically? Such a use-case has binary data and typed arrays standing ready, and if it really could use String.prototype.* methods I would be greatly surprised.
>
> My sense is that there are a fairly large variety of string data types that could use the existing ES5 string type as a target type and for which many of the String.prototype.* methods would function just fine. The reason is that most of the ES5 methods don't impose this sort of semantic restriction on string elements.

If that's true then we should have enough evidence that I'll happily 
concede the point and the spec will allow "\uD800" etc. in BRS-enabled 
literals. I do not see such evidence.
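
To make the disputed case concrete: a string built from 16-bit code units can hold a pair-half such as "\uD800" with no partner. Below is a minimal sketch of a check for that condition. The function name indexOfLoneSurrogate is made up for illustration, and nothing here is meant to describe how a BRS-enabled engine would actually report the error.

```javascript
// Hypothetical helper: returns the index of the first unpaired surrogate
// code unit in str, or -1 if every surrogate is part of a well-formed pair.
function indexOfLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High (lead) surrogate: must be followed by a low (trail) surrogate.
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the trail half of a valid pair
        continue;
      }
      return i;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      return i;
    }
  }
  return -1;
}
```

A compiler targeting JS could run a check like this over its string constants to see whether it depends on unpaired surrogates at all, which is the kind of evidence asked for above.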

>>> It could not leverage any optimizations that a ES engine may apply to strings and string functions.
>> Emscripten already compiles LLVM source languages (C, C++, and Objective-C at least) to JS and does a very good job (getting better day by day). The utility of string functions today (including uint16 indexing and length) is immaterial. Typed arrays are quite important, though.
>
> There are a lot of reasons why ES strings are not a good backing representation for C/C++ strings (to the extent that there even is a C string data type).  But there are also lots of high-level languages that do not have those sorts of mapping issues.

Let's name some:

Java: see above. There may be some legacy need to support invalid 
Unicode but I'm not seeing it right now. Anyone?

Python: 
http://docs.python.org/release/3.0.1/howto/unicode.html#the-string-type 
-- lots here, similar to Ruby 1.9 (see below) but not obviously in need 
of invalid Unicode stored in uint16 vectors handled as JS strings.

Ruby: Ruby supports strings with multiple encodings; the encoding is 
part of the string's metadata. I am not the expert here, but I found

http://www.tbray.org/ongoing/When/200x/2008/09/18/Ruby-I18n

helpful, and the more recent

http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/

too. See also these very interesting posts from Sam Ruby in 2007:

http://intertwingly.net/blog/2007/12/28/3-1-2
http://intertwingly.net/blog/2007/12/29/Ruby-1-9-Strings-Updated

Ruby raises exceptions when you mix two strings with different encodings 
incorrectly, e.g. by concatenation.

I'm not sure about surrogate validation, but from all this I gather that 
compiling Ruby to JS in full would need Uint8Array, along with lots of 
runtime helpers that do not come for free from JS's String.prototype 
methods, in order to handle all the encodings.
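
As a rough illustration of the kind of runtime helper meant here, the sketch below pairs a Uint8Array with an encoding tag and refuses to concatenate mismatched encodings. The class name TaggedByteString is hypothetical, and the strict equality test is a simplifying assumption: Ruby's real compatibility rules are more nuanced than this.

```javascript
// Hypothetical runtime helper for a Ruby-to-JS compiler: bytes plus an
// encoding label, with none of String.prototype available for free.
class TaggedByteString {
  constructor(bytes, encoding) {
    this.bytes = Uint8Array.from(bytes); // defensive copy of the raw bytes
    this.encoding = encoding;            // e.g. "UTF-8", "ISO-8859-1"
  }

  // Concatenate two byte strings; mixing encodings is an error.
  // Assumption: encodings are compatible only when their labels are equal.
  concat(other) {
    if (other.encoding !== this.encoding) {
      throw new TypeError(
        "incompatible encodings: " + this.encoding + " and " + other.encoding
      );
    }
    const joined = new Uint8Array(this.bytes.length + other.bytes.length);
    joined.set(this.bytes, 0);
    joined.set(other.bytes, this.bytes.length);
    return new TaggedByteString(joined, this.encoding);
  }

  get length() {
    return this.bytes.length; // length in bytes, not characters
  }
}
```

Every other string operation (indexing by character, case mapping, searching) would need a similar hand-written, encoding-aware helper.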

> If typed arrays are going to be the new "string" type (maybe better stated as array of chars) for people doing systems programming in JS then we should probably start thinking about a broader set of utility functions/methods that support them.

Yes, that's probably true. We'll keep feedback coming from Emscripten 
users and experts.

>>> Also, values of X's string type can not be directly passed in foreign calls to ES functions. Etc.
>> Emscripten does have a runtime that maps browser functionality exposed to JS to the guest language. It does not AFAIK need to encode surrogate pairs in JS strings by hand, let alone make pair-halves.
>
> But with the BRS flipped it would have to censor C "strings" passed to JS to ensure that no unmatched surrogates are present.

Your "it" there would be "Emscripten [would have to censor]"? I don't 
think so, or: I do not agree that "censor" is an apt description -- it 
seems loaded by implying something censorious is needed where without 
the error I'm proposing for [D800-DFFF], no censoring action would be 
needed.

ISO C says sizeof(char) == 1, so byte strings / string constants are 
either ISO 8859-1 and cannot form surrogates when zero-extended to 16 or 
21 bits, or they're in some character set that needs more involved 
transcoding but again cannot by itself create surrogates.
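
The zero-extension half of that argument is easy to check mechanically: a byte is at most 0xFF, far below the surrogate range that starts at 0xD800. A small sketch follows; the helper names bytesToString and containsSurrogateUnit are illustrative only.

```javascript
// Zero-extend each byte to one 16-bit code unit, as for ISO 8859-1 data.
function bytesToString(bytes) {
  let s = "";
  for (const b of bytes) {
    s += String.fromCharCode(b); // b is 0..255, so the code unit is 0x0000..0x00FF
  }
  return s;
}

// True if any code unit of str falls in the surrogate range D800-DFFF.
function containsSurrogateUnit(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDFFF) return true;
  }
  return false;
}

// Exercise every possible byte value.
const allBytes = Uint8Array.from({ length: 256 }, (_, i) => i);
const widened = bytesToString(allBytes);
const foundSurrogate = containsSurrogateUnit(widened);
```

Here foundSurrogate is false for every possible byte value, so no surrogate check is needed on this path.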

C wide strings vary by platform. On some platforms wchar_t is 32 bits.

In any event, Emscripten currently does not use JS strings at all in its 
code generation (only internally in its JS-hosted libc).

> Probably not such a big deal because it isn't using JS strings as its representation, but as hypothesized above that wouldn't necessarily be the case for other languages.

I don't see it. I may have missed it in my survey of Java, Python and 
Ruby. Please let me know if so.

/be
Received on Monday, 20 February 2012 23:14:37 UTC