Re: New full Unicode for ES6 idea

From: Erik Corry <erik.corry@gmail.com>
Date: Fri, 2 Mar 2012 08:58:12 +0100
Message-ID: <CAP40CR3MS3xEzn4eOMeQfhZaiubpvvCLt-_bq79+aTvvACJkng@mail.gmail.com>
To: Glenn Adams <glenn@skynav.com>
Cc: <mark@macchiato.com>, Cameron McCormack <cam@mcc.id.au>, "public-script-coord@w3.org" <public-script-coord@w3.org>, Brendan Eich <brendan@mozilla.com>, Anne van Kesteren <annevk@opera.com>, "mranney@voxer.com" <mranney@voxer.com>, es-discuss <es-discuss@mozilla.org>

2012/3/1 Glenn Adams <glenn@skynav.com>:
> 2012/3/1 Erik Corry <erik.corry@gmail.com>
>> I'm not in favour of big red switches, and I don't think the
>> compartment based solution is going to be workable.
>> I'd like to plead for a solution rather like the one Java has, where
>> strings are sequences of UTF-16 codes and there are specialized ways
>> to iterate over them.  Looking at this entry from the Unicode FAQ:
>> http://unicode.org/faq/char_combmark.html#7 there are different ways
>> to describe the length (and iteration) of a string.  The BRS proposal
>> favours #2, but I think for most applications utf-16-based-#1 is just
>> fine, and for the applications that want to "do it right" #3 is almost
>> always the correct solution.  Solution #3 needs library support in any
>> case and has no problems with UTF-16.
>> The central point here is that there are combining characters
>> (accents) that you can't just normalize away.  Getting them right has
>> a lot of the same issues as surrogate pairs (you shouldn't normally
>> chop them up, they count as one 'character', you can't tell how many
>> of them there are in a string without looking, etc.).  If you can
>> handle combining characters then the surrogate pair support falls out
>> pretty much for free.
> The problem here is that you are mixing apples and oranges. Although it
> *may* appear that surrogate pairs and grapheme clusters have features in
> common, they operate at different semantic levels entirely. A solution that
> attempts to conflate these two levels is going to cause problems at both
> levels. A distinction should be maintained between the following levels:
> 1. encoding units (e.g., UTF-16 coding units)
> 2. unicode scalar values (code points)
> 3. grapheme clusters

This distinction is not lost on me.  I propose that random access
indexing and .length in JS should work on level 1, and there should be
library support for levels 2 and 3.  In descending order of usefulness
I would rank them 1, 3, 2.  Therefore I don't want to cause a lot
of backwards compatibility headaches by prioritizing the efficient
handling of level 2.
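
To make the three levels concrete, here is a minimal sketch of what
"length" means at each of them.  The function names are hypothetical
and only for illustration.  Level 1 is what .length already reports,
level 2 is a small library-style loop that counts a well-formed
surrogate pair once, and level 3 assumes an engine that provides
Intl.Segmenter (an internationalization API that postdates this
thread).

```javascript
// Level 1: UTF-16 code units. This is what .length and indexing use.
function codeUnitLength(s) {
  return s.length;
}

// Level 2: code points. A high surrogate followed by a low surrogate
// counts once; an unpaired surrogate is counted as its own unit.
function codePointLength(s) {
  let count = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of the pair
      }
    }
    count++;
  }
  return count;
}

// Level 3: grapheme clusters. Assumes Intl.Segmenter is available;
// the segmentation rules live in the library, not in the string type.
function graphemeLength(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(s)) {
    count++;
  }
  return count;
}

// "e" + combining acute accent, then U+1D504 (outside the BMP).
const sample = "e\u0301\uD835\uDD04";
console.log(codeUnitLength(sample));  // 4
console.log(codePointLength(sample)); // 3
console.log(graphemeLength(sample));  // 2
```

Note that the level 2 count (3) matches neither what the string
stores nor what a user would call a character, while levels 2 and 3
both work as library code on top of an unchanged UTF-16 string.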

> IMO, the current discussion should limit itself to the interface between the
> first and second of these levels, and not introduce the third level into the
> mix.
> G.
Received on Friday, 2 March 2012 07:58:43 UTC
