
Re: New full Unicode for ES6 idea

From: Erik Corry <erik.corry@gmail.com>
Date: Fri, 2 Mar 2012 10:13:38 +0100
Message-ID: <CAP40CR1+iYXpBtS8sWTX9-++rw2OuYaw+-TK9hc3QmWASTXFsQ@mail.gmail.com>
To: Glenn Adams <glenn@skynav.com>
Cc: <mark@macchiato.com>, Cameron McCormack <cam@mcc.id.au>, "public-script-coord@w3.org" <public-script-coord@w3.org>, Brendan Eich <brendan@mozilla.com>, Anne van Kesteren <annevk@opera.com>, "mranney@voxer.com" <mranney@voxer.com>, es-discuss <es-discuss@mozilla.org>
2012/3/2 Glenn Adams <glenn@skynav.com>:
>
> On Fri, Mar 2, 2012 at 12:58 AM, Erik Corry <erik.corry@gmail.com> wrote:
>>
>> 2012/3/1 Glenn Adams <glenn@skynav.com>:
>> >> I'd like to plead for a solution rather like the one Java has, where
>> >> strings are sequences of UTF-16 codes and there are specialized ways
>> >> to iterate over them.  Looking at this entry from the Unicode FAQ:
>> >> http://unicode.org/faq/char_combmark.html#7 there are different ways
>> >> to describe the length (and iteration) of a string.  The BRS proposal
>> >> favours #2, but I think for most applications utf-16-based-#1 is just
>> >> fine, and for the applications that want to "do it right" #3 is almost
>> >> always the correct solution.  Solution #3 needs library support in any
>> >> case and has no problems with UTF-16.
>> >>
>> >> The central point here is that there are combining characters
>> >> (accents) that you can't just normalize away.  Getting them right has
>> >> a lot of the same issues as surrogate pairs (you shouldn't normally
>> >> chop them up, they count as one 'character', you can't tell how many
>> >> of them there are in a string without looking, etc.).  If you can
>> >> handle combining characters then the surrogate pair support falls out
>> >> pretty much for free.
>> >
>> >
>> > The problem here is that you are mixing apples and oranges. Although it
>> > *may* appear that surrogate pairs and grapheme clusters have features in
>> > common, they operate at different semantic levels entirely. A solution
>> > that attempts to conflate these two levels is going to cause problems at
>> > both levels. A distinction should be maintained between the following
>> > levels:
>> >
>> > (1) encoding units (e.g., UTF-16 code units)
>> > (2) unicode scalar values (code points)
>> > (3) grapheme clusters
>>
>> This distinction is not lost on me.  I propose that random access
>> indexing and .length in JS should work on level 1,
>
>
> that's where we are today: indexing and length based on 16-bit code units
> (of a UTF-16 encoding, likewise with Java)

Not really for JS.  Missing parts in the current UTF-16 support have
been listed in this thread, e.g. in Norbert Lindenberg's six-point
prioritization list, which I replied to yesterday.

>> and there should be
>> library support for levels 2 and 3.  In order of descending usefulness
>> I think the order is 1, 3, 2.  Therefore I don't want to cause a lot
>> of backwards compatibility headaches by prioritizing the efficient
>> handling of level 2.
>
>
> from a perspective of indexing "Unicode characters", level 2 is the correct
> place;

Yes, by definition.
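
To make levels 1 and 2 concrete, here is a minimal JavaScript sketch, not a
proposed API.  It assumes strings keep .length and indexing on UTF-16 code
units, and layers code point iteration on top as ordinary library code.  The
name codePoints is hypothetical; only String.prototype.charCodeAt is a real
built-in here.

```javascript
// Hypothetical helper: return the code points (level 2) of a string whose
// .length and indexing operate on UTF-16 code units (level 1).
function codePoints(str) {
  const result = [];
  for (let i = 0; i < str.length; i++) {
    const hi = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate encodes one code point.
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < str.length) {
      const lo = str.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        result.push(((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000);
        i++; // skip the low surrogate we just consumed
        continue;
      }
    }
    // BMP character, or an unpaired surrogate passed through unchanged.
    result.push(hi);
  }
  return result;
}

const sample = "a\uD835\uDD0Bb"; // 'a', U+1D50B, 'b'
const sampleCodePoints = codePoints(sample);
// sample.length is 4 (code units); sampleCodePoints has 3 entries.
```

Unpaired surrogates are passed through as their own values rather than
rejected; whether a real library should do that is a separate question.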

> level 3 is useful for higher level, language/locale sensitive text

No, the Unicode grapheme clustering algorithm is not locale- or
language-sensitive:
http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
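
As an illustration of level 3, here is a short sketch that splits a string
into grapheme clusters.  It assumes an engine that provides Intl.Segmenter,
an API that did not exist when this thread was written; the helper name
graphemes is hypothetical.  Intl.Segmenter does accept a locale argument; it
is left undefined here, and the sketch makes no claim about whether results
vary by locale.

```javascript
// Hypothetical helper: split a string into grapheme clusters (level 3).
// Assumes the engine ships Intl.Segmenter.
function graphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  // Each item yielded by segment() carries the cluster text in .segment.
  return Array.from(segmenter.segment(str), (part) => part.segment);
}

const word = "e\u0301x"; // 'e' + COMBINING ACUTE ACCENT, then 'x'
const clusters = graphemes(word);
// word.length counts 3 code units; clusters holds 2 entries.
```

The combining accent stays attached to its base letter, which is the
"shouldn't normally chop them up" behaviour discussed earlier in the thread.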

> processing, but not particularly interesting at the basic ES string
> processing level; we aren't talking about (or IMO should not be talking
> about) a level 3 text processing library in this thread;

I will continue to feel free to talk about it, as I believe that in the
cases where just indexing by UTF-16 code units is not sufficient, it is
normally level 3 that is the correct level.  Also, I think there
should be support for this level in JS, as it is not locale-dependent.

-- 
Erik Corry
Received on Friday, 2 March 2012 09:14:28 UTC
