
Re: New full Unicode for ES6 idea

From: Erik Corry <erik.corry@gmail.com>
Date: Thu, 1 Mar 2012 13:58:04 +0100
Message-ID: <CAP40CR28yADVHQgALA3numziGKy_KWSKBjetZ5YG=zSavPyuAQ@mail.gmail.com>
To: "Phillips, Addison" <addison@lab126.com>
Cc: Mark Davis ☕ <mark@macchiato.com>, Cameron McCormack <cam@mcc.id.au>, "public-script-coord@w3.org" <public-script-coord@w3.org>, Brendan Eich <brendan@mozilla.com>, Anne van Kesteren <annevk@opera.com>, "mranney@voxer.com" <mranney@voxer.com>, es-discuss <es-discuss@mozilla.org>
I'm not in favour of big red switches, and I don't think the
compartment-based solution is going to be workable.

I'd like to plead for a solution rather like the one Java has, where
strings are sequences of UTF-16 code units and there are specialized
ways to iterate over them.  This entry from the Unicode FAQ:
http://unicode.org/faq/char_combmark.html#7 lists different ways to
describe the length of (and iteration over) a string.  The BRS
proposal favours #2, but I think for most applications the
UTF-16-based #1 is just fine, and for the applications that want to
"do it right" #3 is almost always the correct solution.  Solution #3
needs library support in any case and has no problems with UTF-16.
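For concreteness, here is a small sketch (plain ES5, runnable today;
the sample string and the helper name are mine) of how counts #1 and
#2 from that FAQ entry diverge on one string, with #3 noted in a
comment since real grapheme segmentation needs Unicode tables:

```javascript
// "Z" + U+0351 (a combining accent) + U+1D11E MUSICAL SYMBOL G CLEF
var s = "Z\u0351\uD834\uDD1E";

// #1: UTF-16 code units -- what s.length already gives
console.log(s.length);  // 4

// #2: code points -- count every unit that is not a trail surrogate
function codePointCount(str) {
  var n = 0;
  for (var i = 0; i < str.length; i++) {
    var c = str.charCodeAt(i);
    if (c < 0xDC00 || c > 0xDFFF) n++;
  }
  return n;
}
console.log(codePointCount(s));  // 3

// #3 would report 2: "Z" plus its accent form one grapheme,
// and the clef forms another.
```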

The central point here is that there are combining characters
(accents) that you can't just normalize away.  Getting them right
raises many of the same issues as surrogate pairs (you shouldn't
normally chop them up, they count as one 'character', you can't tell
how many of them there are in a string without looking, etc.).  If you
can handle combining characters, then surrogate pair support falls out
pretty much for free.

Advantages of my proposal:

* High level of backwards compatibility
* No issues of where to place the BRS
* Compact and simple in the implementation
* Can be polyfilled on most VMs
* Interaction with the DOM is unproblematic
* No issues of what happens on concatenation if a surrogate pair is created.
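The concatenation point can be demonstrated directly: with plain
UTF-16 semantics, joining two strings that each hold one half of a
surrogate pair simply yields a string containing the pair, with no
mode switch or representation change (my example):

```javascript
var lead  = "\uD834";  // unpaired lead surrogate
var trail = "\uDD1E";  // unpaired trail surrogate
var s = lead + trail;  // now a well-formed pair encoding U+1D11E

console.log(s.length);              // 2 -- still two UTF-16 code units
console.log(s === "\uD834\uDD1E");  // true
```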

Details:

* The built-in string charCodeAt, [], and length operations work in terms of UTF-16 code units
* String.fromCharCode(x) can return a string with a length of 2
* New object StringIterator

new StringIterator(backing) returns a string iterator.  The iterator
has the following methods:

hasNext();           // Returns this.index() != this.backing().length
nextGrapheme();      // Returns the next grapheme as a Unicode code point,
                     // or -1 if the next grapheme is a sequence of code points
nextGraphemeArray(); // Returns an array of numeric code points
                     // (possibly just one) representing the next grapheme
nextCodePoint();     // Returns the next code point, possibly consuming
                     // two UTF-16 code units (a surrogate pair)
index();             // Gets the current index in the string, from 0 to length
setIndex(i);         // Sets the current index in the string, from 0 to length
backing();           // Gets the backing string

// Optionally
hasPrevious();
previous*();  // Analogous to nextGrapheme etc.
codePointLength(); // Takes O(length), cache the answer if you care
graphemeLength();  // Ditto

If any of the next... functions encounter an unmatched half of a
surrogate pair, they just return its code unit value.
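As a sketch of the "can be polyfilled" claim, here is a minimal ES5
version of the proposed iterator covering hasNext(), nextCodePoint(),
index()/setIndex() and backing(); the grapheme methods need Unicode
property tables, so they are omitted:

```javascript
function StringIterator(backing) {
  this._backing = backing;
  this._index = 0;
}
StringIterator.prototype.backing = function () { return this._backing; };
StringIterator.prototype.index = function () { return this._index; };
StringIterator.prototype.setIndex = function (i) { this._index = i; };
StringIterator.prototype.hasNext = function () {
  return this._index != this._backing.length;
};
// Returns the next code point, consuming two UTF-16 code units when
// they form a surrogate pair; an unmatched half is returned as-is.
StringIterator.prototype.nextCodePoint = function () {
  var s = this._backing;
  var lead = s.charCodeAt(this._index++);
  if (lead >= 0xD800 && lead <= 0xDBFF && this._index < s.length) {
    var trail = s.charCodeAt(this._index);
    if (trail >= 0xDC00 && trail <= 0xDFFF) {
      this._index++;
      return (lead - 0xD800) * 0x400 + (trail - 0xDC00) + 0x10000;
    }
  }
  return lead;
};
```

Iterating over "A\uD834\uDD1E" with nextCodePoint() yields 0x41 and
then 0x1D11E, and an unmatched surrogate half comes back as its own
code unit value.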

Regexp support.  Regexps act 'as if' the following steps were performed:

Outside character classes, an extended (non-BMP) character turns into
(?:xy), where x and y are its lead and trail surrogates.
Inside positive character classes, the extended characters are
extracted, so [abz] becomes (?:[ab]|xy) where z is an extended
character and x and y are its lead and trail surrogates.
Negative character classes can be handled by transforming them into
negative lookaheads.
A decent set of Unicode character classes will likely subsume most
uses of these transformations.
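To make the first two transformation rules concrete, here is a
hand-worked instance (the rewriting would of course be done
mechanically by the engine or a preprocessor; the patterns below are
mine):

```javascript
// Goal: match U+1D11E (MUSICAL SYMBOL G CLEF) in a UTF-16 string.
// Its lead and trail surrogates are \uD834 and \uDD1E.

// Outside a character class: z becomes (?:xy).
var clef = /(?:\uD834\uDD1E)/;
console.log(clef.test("before \uD834\uDD1E after"));  // true

// Inside a positive class: [abz] is rewritten as (?:[ab]|xy).
var cls = /(?:[ab]|\uD834\uDD1E)/;
console.log(cls.test("x\uD834\uDD1Ey"));  // true
console.log(cls.test("xy"));              // false
```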

Perhaps the BRS 21-bit solution feels marginally cleaner, but having
two different kinds of strings in the same VM feels like a horrible
solution: it is user visible and will haunt implementations forever.
And the cleanliness difference is very marginal, given that
grapheme-based iteration is the correct solution for almost all the
cases where iterating over UTF-16 code units is not good enough.

-- 
Erik Corry

2012/2/20 Phillips, Addison <addison@lab126.com>:
> Mark wrote:
>
> First, it would be great to get full Unicode support in JS. I know that's
> been a problem for us at Google.
>
> AP> +1: I think we’ve waited for supplementary character support long
> enough!
>
> Secondly, while I agree with Addison that the approach that Java took is
> workable, it does cause problems.
>
> AP> The tension is between “compatibility” and “ease of use” here, I think.
> The question is whether very many scripts depend on the ‘uint16’ nature of a
> character in ES, use surrogates to effect supplementary character support,
> or are otherwise tied to the existing encoding model and are broken as a
> result of changes. In its ideal form, an ES string would logically be a
> sequence of Unicode characters (code points) and only the internal
> representation would worry about whatever character encoding scheme made the
> most sense (in many cases, this might actually be UTF-16).
>
> AP> … but what I think is hard to deal with are different modes of
> processing scripts depending on “fullness of the Unicode inside”.
> Admittedly, the approach I favor is rather conservative and presents a
> number of challenges, most notably in adapting regex or for users who want
> to work strictly in terms of character values.
>
> There are good reasons for why Java did what it did, basically for
> compatibility. But if there is some way that JS can work around those,
> that'd be great.
>
> AP> Yes, it would.
>
> ~Addison
>
> _______________________________________________
> es-discuss mailing list
> es-discuss@mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss
>
Received on Thursday, 1 March 2012 12:58:38 UTC

This archive was generated by hypermail 2.3.1 : Wednesday, 8 May 2013 19:30:05 UTC