Re: New full Unicode for ES6 idea from Allen Wirfs-Brock on 2012-02-20 (public-script-coord@w3.org from January to March 2012)

From: Allen Wirfs-Brock <allen@wirfs-brock.com>
Date: Sun, 19 Feb 2012 21:45:32 -0800
To: Gavin Barraclough <barraclough@apple.com>
Cc: Brendan Eich <brendan@mozilla.com>, public-script-coord@w3.org, Anne van Kesteren <annevk@opera.com>, mranney@voxer.com, es-discuss discussion <es-discuss@mozilla.org>
Message-Id: <7DC52439-53E7-4E3D-B0DF-E52E1F79EE06@wirfs-brock.com>

On Feb 19, 2012, at 6:54 PM, Gavin Barraclough wrote:

> On Feb 19, 2012, at 3:13 PM, Allen Wirfs-Brock wrote:
>>> My implementor's bias is showing, because I expect many engines would use UTF-16 internally and have non-O(1) indexing for strings with the contains-non-BMP-and-BRS-set-to-full-Unicode flag bit.
>> 
>> A fine implementation, but not observable. Another implementation approach that would preserve O(1) indexing would be to simply have two or three different internal string representations with 1-, 2-, or 4-byte internal characters. (You can automatically pick the needed character size when the string is created because strings are immutable and created with their value.) A not-quite-O(1) approach would segment strings into substring spans using such a representation. Representation choice probably depends a lot on what you think are the most common use cases. If it is string processing in JS, then a fast representation is probably what you want to choose. If it is just passing text that is already UTF-8 or UTF-16 encoded from input to output, then a representation that minimizes transcoding would probably be a higher priority.
> 
> 
> One way in which the proposal under discussion seems to differ from the previous strawman is in the behavior arising from concatenation of strings ending/beginning with a surrogate hi and lo element.
> How do we want to handle unpaired UTF-16 surrogates in a full-unicode string?  I can see three options:
> 
> 1) Prohibit values from strings that do not map to valid unicode characters (either throw an exception, or replace with the unicode replacement character).
> 2) Allow invalid unicode characters in strings, and preserve them over concatenation – ("\uD800" + "\uDC00").length == 2.
> 3) Allow invalid unicode characters in strings, but allow surrogate pairs to fuse over concatenation – ("\uD800" + "\uDC00").length == 1.
> 
> It seems desirable for full-unicode strings to logically be a sequence of unicode characters, stored and processed in an encoding-agnostic manner.  Option 3 would seem to violate that, exposing the underlying UTF-16 implementation.  It also loses a distributive property of .length over concatenation that I believe is true in ES5 for strings, in that currently for all strings s1 & s2:
> 	s1.length + s2.length == (s1 + s2).length
> However if we allow concatenation to fuse surrogate pairs into a single character (e.g. s1 = "\uD800", s2 = "\uDC00") this will no longer be true.
> 
> I guess I wonder if it's worth considering either option 1) or 2): either prohibiting invalid unicode characters in strings, or considering something closer to the previous strawman, where string storage is defined to be 32-bit (with a BRS that instead of changing iteration would change string creation, introducing an implicit UTF16-UTF32 conversion).

I think 2) is the only reasonable alternative.
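
As a minimal sketch of what option 2) preserves, the snippet below uses today's 16-bit string elements as a stand-in for the proposed full-Unicode elements (an assumption; the proposal's wider elements are not available to test here). The names s1, s2, joined, and additive are illustrative only.

```javascript
// Two strings, each holding a single unpaired surrogate element.
const s1 = "\uD800"; // lone high surrogate
const s2 = "\uDC00"; // lone low surrogate

// Concatenation keeps both elements; nothing fuses into one character.
const joined = s1 + s2;

// The additive property of .length over concatenation still holds.
const additive = s1.length + s2.length === joined.length;

console.log(joined.length, additive);
```

Under option 3) the same concatenation would yield a length of 1, so the additive check would fail.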

I don't think 1) would be a very good choice, if for no other reason than that the set of valid unicode characters is a moving target that you wouldn't want to hardwire into either the ES specification or implementations.

More importantly, some applications require "string processing" of strings containing invalid unicode characters.  In particular, any sort of transcoder between character sets requires this. If you want to take a full unicode string, convert it to UTF-16, and then output it, you may generate intermediate strings with elements that contain individual high and low surrogate codes.  If you were transcoding to a non-Unicode character set, any value might be possible.
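
The transcoding case can be sketched roughly as follows. The helper toUtf16Units is hypothetical, and representing the full-unicode input as a plain array of code point numbers is an assumption made for illustration.

```javascript
// Hypothetical helper: encode an array of code points as UTF-16 code units.
function toUtf16Units(codePoints) {
  const units = [];
  for (const cp of codePoints) {
    if (cp > 0xFFFF) {
      const offset = cp - 0x10000;
      units.push(0xD800 + (offset >> 10));   // high surrogate
      units.push(0xDC00 + (offset & 0x3FF)); // low surrogate
    } else {
      units.push(cp); // values up to 0xFFFF pass through unchanged
    }
  }
  return units;
}

// "a" followed by U+1F600, which lies outside the BMP.
const units = toUtf16Units([0x61, 0x1F600]);

// Intermediate strings that each hold one element, including lone surrogates.
const intermediate = units.map((u) => String.fromCharCode(u));

console.log(units.map((u) => u.toString(16)));
```

If strings rejected unpaired surrogates, the intermediate values here could not be held as strings at all.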

I really don't think any Unicode semantics should be built into the basic string representation.  We need to decide on a max element size, and Unicode motivates 21 bits, but it could be 32 bits.  Personally, I've lived through enough address space exhaustion episodes in my career to be skeptical of "small" values like 2^21 being good enough for the long term.
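
To make the element-size arithmetic concrete, here is a rough sketch. The function chooseStorage is hypothetical, and using typed arrays to model an engine's internal string storage is an assumption, not how any particular engine works.

```javascript
// Bits needed to hold the largest Unicode code point, U+10FFFF.
const bitsForUnicode = (0x10FFFF).toString(2).length;

// Hypothetical: pick the narrowest storage that fits every element.
// This is possible at creation time because strings are immutable.
function chooseStorage(elements) {
  let max = 0;
  for (const e of elements) {
    if (e > max) max = e;
  }
  if (max <= 0xFF) return Uint8Array.from(elements);
  if (max <= 0xFFFF) return Uint16Array.from(elements);
  return Uint32Array.from(elements); // covers 21-bit and full 32-bit elements
}

const narrow = chooseStorage([0x61, 0x62]);
const wide = chooseStorage([0x61, 0x1F600]);

console.log(bitsForUnicode, narrow.BYTES_PER_ELEMENT, wide.BYTES_PER_ELEMENT);
```

With a 4-byte widest representation, a 32-bit element limit costs no more storage than a 21-bit one in this sketch.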

Allen

Received on Monday, 20 February 2012 05:46:10 UTC