Re: New full Unicode for ES6 idea

On Feb 20, 2012, at 3:14 PM, Brendan Eich wrote:

> Allen Wirfs-Brock wrote:
>> On Feb 20, 2012, at 12:32 PM, Brendan Eich wrote:
>> 
>>> Allen Wirfs-Brock wrote:
>>>> ...
>>>> You are essentially saying that a compiler targeting ES for a language X that includes a string data type that does not conform to your rules (for example, by allowing occurrences of surrogate code points within string data)
>>> First, as a point of order: yes, JS strings as full Unicode does not want stray surrogate pair-halves. Does anyone disagree?
>> 
>> Well, I'm disagreeing.  Do you know of any other language that has imposed these sorts of semantic restrictions on runtime string data?
> Sure, Java:
> 
> 
>     String
> 
> public String(int[] codePoints,
>              int offset,
>              int count)
> 
>   Allocates a new String that contains characters from a subarray of
>   the Unicode code point array argument. The offset argument is the
>   index of the first code point of the subarray and the count argument
>   specifies the length of the subarray. The contents of the subarray
>   are converted to chars; subsequent modification of the int array
>   does not affect the newly created string.
> 
>   Parameters:
>       codePoints - array that is the source of Unicode code points.
>       offset - the initial offset.
>       count - the length.
>   Throws:
>       IllegalArgumentException
>       <http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/IllegalArgumentException.html>
>       - if any invalid Unicode code point is found in codePoints
>       IndexOutOfBoundsException
>       <http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/IndexOutOfBoundsException.html>
>       - if the offset and count arguments index characters outside the
>       bounds of the codePoints array.
>   Since:
>       1.5
> 

Note that the above says "invalid Unicode code point". 0xD800 is a valid Unicode code point. It isn't a valid Unicode character.

See http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html#isValidCodePoint(int) 

Determines whether the specified code point is a valid Unicode code point value in the range of 0x0000 to 0x10FFFF inclusive. This method is equivalent to the expression:
 codePoint >= 0x0000 && codePoint <= 0x10FFFF
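
In JS terms (just a sketch; isScalarValue is a name I'm making up here, not part of any standard API):

    // Java's Character.isValidCodePoint test, expressed in JS
    function isValidCodePoint(cp) {
      return cp >= 0x0000 && cp <= 0x10FFFF;
    }

    // The stricter test for Unicode scalar values excludes surrogates
    function isScalarValue(cp) {
      return isValidCodePoint(cp) && !(cp >= 0xD800 && cp <= 0xDFFF);
    }

    isValidCodePoint(0xD800); // true  -- passes Java's validity test
    isScalarValue(0xD800);    // false -- not a valid Unicode character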
> 
>>> Why is this a problem, even hypothetically? Such a use-case has binary data and typed arrays standing ready, and if it really could use String.prototype.* methods I would be greatly surprised.
>> 
>> My sense is that there are a fairly large variety of string data types that could use the existing ES5 string type as a target type and for which many of the String.prototype.* methods would function just fine.  The reason is that most of the ES5 methods don't impose this sort of semantic restriction on string elements.
> 
> If that's true then we should have enough evidence that I'll happily concede the point and the spec will allow "\uD800" etc. in BRS-enabled literals. I do not see such evidence.

Note that my concern isn't so much about literals as it is about string elements created via String.fromCharCode.
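
For concreteness, today (pre-BRS) you can write:

    var s = String.fromCharCode(0xD800); // an unpaired surrogate element
    s.length;            // 1
    s.charCodeAt(0);     // 0xD800
    s.indexOf("\uD800"); // 0 -- plain element-value comparison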

The only String.prototype method algorithms that seem to have any Unicode dependencies are toLowerCase/toUpperCase (and the locale variants of those methods), perhaps localeCompare, trim (which knows the Unicode white space character classification), and the regular-expression-based methods if the regexp is constructed with literal characters or uses character classes.
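
A few examples of that Unicode-sensitive minority (per the ES5 algorithms):

    "\u00DF".toUpperCase(); // "SS" -- uses the Unicode case mappings
    "\u00A0x\u00A0".trim(); // "x"  -- NBSP is Unicode white space
    /\s/.test("\u2028");    // true -- \s knows Unicode line terminators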

The concat, slice, substring, indexOf/lastIndexOf, and non-regexp-based replace and split calls are all defined in terms of string element value comparisons and don't really care about what character set is used.
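
For example, such methods happily round-trip arbitrary 16-bit element values, surrogates included:

    var data = String.fromCharCode(0xDC00, 0x0041, 0xD800); // arbitrary units
    data.concat(data).length; // 6
    data.slice(1, 2);         // "A"
    data.indexOf("\uD800");   // 2
    data.split("A").length;   // 2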

Wes Garland mentioned the possibility of using non-Unicode character sets such as Big5.

> 
>>>> It could not leverage any optimizations that a ES engine may apply to strings and string functions.
>>> Emscripten already compiles LLVM source languages (C, C++, and Objective-C at least) to JS and does a very good job (getting better day by day). The utility of string functions today (including uint16 indexing and length) is immaterial. Typed arrays are quite important, though.
>> 
>> There are a lot of reasons why ES strings are not a good backing representation for C/C++ strings (to the extent that there even is a C string data type).  But there are also lots of high-level languages that do not have those sorts of mapping issues.
> 
> Let's name some:
> 
> Java: see above. There may be some legacy need to support invalid Unicode but I'm not seeing it right now. Anyone?

See above: Java allows all Unicode code points and does not restrict strings to well-formed UTF-16 encodings.


> 
> Python: http://docs.python.org/release/3.0.1/howto/unicode.html#the-string-type -- lots here, similar to Ruby 1.9 (see below) but not obviously in need of invalid Unicode stored in uint16 vectors handled as JS strings.
> 

I don't see any restrictions in that doc on strings containing \ud800 and friends.  Unless there are, BRS-enabled ES strings couldn't be used as the representation type for Python strings.

The actual representation type used by the conventional Python implementation isn't yet clear to me, but clearly it supports many character encodings besides Unicode: http://docs.python.org/library/codecs.html#standard-encodings



> Ruby: Ruby supports strings with multiple encodings; the encoding is part of the string's metadata. I am not the expert here, but I found
> 
> http://www.tbray.org/ongoing/When/200x/2008/09/18/Ruby-I18n
> 
> helpful, and the more recent
> 
> http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/
> 
> too. See also these very interesting posts from Sam Ruby in 2007:
> 
> http://intertwingly.net/blog/2007/12/28/3-1-2
> http://intertwingly.net/blog/2007/12/29/Ruby-1-9-Strings-Updated
> 
> Ruby raises exceptions when you mix two strings with different encodings incorrectly, e.g. by concatenation.
> 
> I'm not sure about surrogate validation, but from all this I gather that compiling Ruby to JS in full would need Uint8Array, along with lots of runtime helpers that do not come for free from JS's String.prototype methods, in order to handle all the encodings.
> 
>> If typed arrays are going to be the new "string" type (maybe better stated as array of chars) for people doing systems programming in JS then we should probably start thinking about a broader set of utility functions/methods that support them.

But, using current 16-bit JS string semantics, a JS string could still be used as the character store for many of these encodings, with the metadata stored separately (probably in a RubyString wrapper object; see the sketch below), and the charset-insensitive JS string methods could be used to implement the Ruby semantics.

A BRS that excludes surrogate code points would at the very least require additional special-case handling when dealing with Ruby strings containing those code points.
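
A minimal sketch of such a wrapper under today's semantics (RubyString and its fields are hypothetical, not an existing API):

    // JS string as uninterpreted character store; the Ruby-style
    // encoding tag lives in separate metadata.
    function RubyString(chars, encoding) {
      this.chars = chars;       // e.g. one Big5 or UTF-8 byte per element
      this.encoding = encoding; // e.g. "UTF-8", "Big5", "ASCII-8BIT"
    }

    // Concatenation delegates to the charset-insensitive JS method,
    // with Ruby's encoding-compatibility check layered on top.
    RubyString.prototype.concat = function (other) {
      if (this.encoding !== other.encoding) {
        throw new Error("incompatible character encodings");
      }
      return new RubyString(this.chars.concat(other.chars), this.encoding);
    };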

> 
> Yes, that's probably true. We'll keep feedback coming from Emscripten users and experts.
> 
>>>> Also, values of X's string type can not be directly passed in foreign calls to ES functions. Etc.
>>> Emscripten does have a runtime that maps browser functionality exposed to JS to the guest language. It does not AFAIK need to encode surrogate pairs in JS strings by hand, let alone make pair-halves.
>> 
>> But with the BRS flipped it would have to censor C "strings" passed to JS to ensure that no unmatched surrogate halves are present.
> 
> Your "it" there would be "Emscripten [would have to censor]"? I don't think so, or: I do not agree that "censor" is an apt description -- it seems loaded by implying something censorious is needed where without the error I'm proposing for [D800-DFFF], no censoring action would be needed.

Yes, I meant the Emscripten runtime's "foreign" call support for calling JS functions, and I did mean censor in that sense.  Assume that you want to automatically convert WCHAR* strings to JS strings to pass as arguments to such calls.  Today, without the BRS, you can just form a JS string containing all the WCHARs without analyzing the UTF-16 well-formedness of the C string. With the BRS flipped you would at the very least have to make sure it is well-formed UTF-16 and either throw on or remove any unpaired surrogates.
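
Something like this hypothetical helper (not part of Emscripten) is what I mean; this version drops unpaired surrogates, and a throwing variant would raise instead:

    // units: an array of 16-bit values taken from the WCHAR* data
    function scrubUTF16(units) {
      var out = [];
      for (var i = 0; i < units.length; i++) {
        var u = units[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < units.length &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
          out.push(u, units[++i]); // keep a well-formed surrogate pair
        } else if (u < 0xD800 || u > 0xDFFF) {
          out.push(u);             // keep an ordinary code unit
        }                          // else: drop the unpaired surrogate
      }
      return String.fromCharCode.apply(null, out);
    }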

> 
> ISO C says sizeof(char) == 1, so byte strings / string constants are either ISO 8859-1 and cannot form surrogates when zero-extended to 16 or 21 bits, or they're in some character set that needs more involved transcoding but again cannot by itself create surrogates.
> 
> C wide strings vary by platform. On some platforms wchar_t is 32 bits.
> 
> In any event, Emscripten currently does not use JS strings at all in its code generation (only internally in its JS-hosted libc).
> 
>> Probably not such a big deal because it isn't using JS strings as its representation, but as hypothesized above that wouldn't necessarily be the case for other languages.
> 
> I don't see it. I may have missed it in my survey of Java, Python and Ruby. Please let me know if so.

:-)  How could I pass on the opportunity?  See above.
> 
> /be
> 

Received on Tuesday, 21 February 2012 01:54:34 UTC