
Re: New full Unicode for ES6 idea

From: Glenn Adams <glenn@skynav.com>
Date: Thu, 1 Mar 2012 10:37:52 -0700
Message-ID: <CACQ=j+dEo5zTQ9Mbp-pwxRuXYd4VheHXS6eV2h_52xEw=oPEjg@mail.gmail.com>
To: Erik Corry <erik.corry@gmail.com>
Cc: "Phillips, Addison" <addison@lab126.com>, Mark Davis ☕ <mark@macchiato.com>, Cameron McCormack <cam@mcc.id.au>, "public-script-coord@w3.org" <public-script-coord@w3.org>, Brendan Eich <brendan@mozilla.com>, Anne van Kesteren <annevk@opera.com>, "mranney@voxer.com" <mranney@voxer.com>, es-discuss <es-discuss@mozilla.org>
2012/3/1 Erik Corry <erik.corry@gmail.com>

> I'm not in favour of big red switches, and I don't think the
> compartment based solution is going to be workable.
>
> I'd like to plead for a solution rather like the one Java has, where
> strings are sequences of UTF-16 codes and there are specialized ways
> to iterate over them.  Looking at this entry from the Unicode FAQ:
> http://unicode.org/faq/char_combmark.html#7 there are different ways
> to describe the length (and iteration) of a string.  The BRS proposal
> favours #2, but I think for most applications utf-16-based-#1 is just
> fine, and for the applications that want to "do it right" #3 is almost
> always the correct solution.  Solution #3 needs library support in any
> case and has no problems with UTF-16.
>
> The central point here is that there are combining characters
> (accents) that you can't just normalize away.  Getting them right has
> a lot of the same issues as surrogate pairs (you shouldn't normally
> chop them up, they count as one 'character', you can't tell how many
> of them there are in a string without looking, etc.).  If you can
> handle combining characters then the surrogate pair support falls out
> pretty much for free.
>

The problem here is that you are mixing apples and oranges. Although it
*may* appear that surrogate pairs and grapheme clusters have features in
common, they operate at different semantic levels entirely. A solution that
attempts to conflate these two levels is going to cause problems at both
levels. A distinction should be maintained between the following levels:

   - encoding units (e.g., UTF-16 code units)
   - Unicode scalar values (code points, excluding surrogates)
   - grapheme clusters

IMO, the current discussion should limit itself to the interface between
the first and second of these levels, and not introduce the third level
into the mix.
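
To make the boundary between the first two levels concrete, here is a minimal sketch in TypeScript. It is only an illustration, not a proposed API: the helper name `toCodePoints` and the sample strings are my own assumptions. It decodes UTF-16 code units into code points and does nothing about grapheme clusters.

```typescript
// Hypothetical helper (name is an assumption, not part of any proposal):
// decode a string's UTF-16 code units (level 1) into code points (level 2).
function toCodePoints(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const hi = s.charCodeAt(i);
    // A high surrogate followed by a low surrogate encodes one code point.
    if (hi >= 0xd800 && hi <= 0xdbff && i + 1 < s.length) {
      const lo = s.charCodeAt(i + 1);
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        out.push(0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00));
        i++; // the low surrogate has been consumed
        continue;
      }
    }
    // BMP code unit. A lone surrogate is passed through unchanged, so the
    // result is a list of code points, not strictly of scalar values.
    out.push(hi);
  }
  return out;
}

// U+1F600 is outside the BMP: two code units, one code point.
const astral = "\uD83D\uDE00";
// "e" followed by COMBINING ACUTE ACCENT: two code units, two code points,
// yet a single grapheme cluster. That third level is not handled here.
const combining = "e\u0301";

console.log(astral.length, toCodePoints(astral).length);
console.log(combining.length, toCodePoints(combining).length);
```

The surrogate pair collapses to one value at the second level, while the combining sequence is still two values there. Treating it as one "character" requires a separate, higher-level segmentation step.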

G.
Received on Thursday, 1 March 2012 17:38:44 UTC
