Re: New full Unicode for ES6 idea from Glenn Adams on 2012-02-25 (public-script-coord@w3.org from January to March 2012)

From: Glenn Adams <glenn@skynav.com>
Date: Sat, 25 Feb 2012 09:19:22 -0700
To: Anne van Kesteren <annevk@opera.com>
Cc: Norbert Lindenberg <ecmascript@norbertlindenberg.com>, Brendan Eich <brendan@mozilla.com>, es-discuss <es-discuss@mozilla.org>, "public-script-coord@w3.org" <public-script-coord@w3.org>, mranney@voxer.com
Message-ID: <CACQ=j+c_1_7VuUgkRyk5XCv+AMwnZsS9aPvnCKt1hreh1iDA9w@mail.gmail.com>

On Sat, Feb 25, 2012 at 3:00 AM, Anne van Kesteren <annevk@opera.com> wrote:

> On Sat, 25 Feb 2012 04:50:28 +0100, Brendan Eich <brendan@mozilla.com>
> wrote:
>
>> Norbert Lindenberg wrote:
>>
>>> OK - migrations are hard. But so far most participants have only seen
>>> additional work, no benefits. How long will this take? When will it end?
>>> When will browsers make BRS-on the default, let alone eliminate the switch?
>>> When can Roozbeh abandon his original version? Where's the blue button?
>>>
>>
>> It may be that the BRS is worse than an incompatible change to "full
>> Unicode" as Allen proposed last year. But in either case, something gets
>> harder for Roozbeh. Which is worse?
>>
>
> Is the benefit of doing this switch at all large enough? Even though it
> becomes somewhat easier to deal with 💩 you still have grapheme clusters
> you will need to work around. That is, it is not clear code points are the
> right abstraction point and then you might as well keep 16-bit code units
> with support for surrogate code points so you can render everything above
> BMP too.

I recall the day in the early 90s in the Unicode Technical Committee when
we came up with UTF-16. At the time I feared the necessity for this, but
here we are.

To answer Anne, I concur that Unicode code points (strictly, Unicode scalar
values, which exclude the surrogate code points), as opposed to code units
such as the 16-bit units of UTF-16, are the correct choice. Grapheme clusters remain in
the text processing (i.e., abstract character) domain, and not the encoded
character domain. Any equation of "surrogate pair" with "grapheme cluster"
is an abuse of terminology: a surrogate pair is *not* a grapheme cluster,
notwithstanding misguided statements to the contrary.
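
To make the three levels concrete, here is a minimal sketch in TypeScript. The
helper name countCodePoints is hypothetical, not part of any proposal in this
thread; it relies only on charCodeAt and counts each well-formed surrogate
pair as one code point.

```typescript
// Hypothetical helper (an illustration, not a proposed API): counts code
// points in a UTF-16 string. A well-formed surrogate pair counts as one
// code point; a lone surrogate counts as one code point by itself.
function countCodePoints(s: string): number {
  let count = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    if (isHigh && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low surrogate that completes the pair
      }
    }
    count++;
  }
  return count;
}

// U+1F4A9 PILE OF POO: two UTF-16 code units, one code point.
const pileOfPoo = "\ud83d\udca9";
// "e" followed by U+0301 COMBINING ACUTE ACCENT: two code units and two
// code points, yet a single grapheme cluster as far as a reader is concerned.
const eAcute = "e\u0301";

console.log(pileOfPoo.length, countCodePoints(pileOfPoo));
console.log(eAcute.length, countCodePoints(eAcute));
```

The second string is the point: moving from code units to code points removes
the surrogate pair problem, but grapheme cluster segmentation is a separate
text-processing step that neither count addresses.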

As a reminder:

*Code Point*. (1) Any value in the Unicode codespace; that is, the range of
integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters
and Encoding <http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf#G2212>.)
(2) A value, or position, for a character, in any coded character set.

*Grapheme*. (1) A minimally distinctive unit of writing in the context of a
particular writing system. For example, ‹b› and ‹d› are distinct graphemes
in English writing systems because there exist distinct words like big and
dig. Conversely, a lowercase italiform letter *a* and a lowercase Roman
letter a are not distinct graphemes because no word is distinguished on the
basis of these two different forms. (2) What a user thinks of as a
character.

and

C1 A process shall not interpret a high-surrogate code point or a
low-surrogate code point as an abstract character.
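
As an illustration of what C1 implies for code that consumes UTF-16, here is
a rough sketch in TypeScript. The function name toScalarValues is hypothetical,
and replacing a lone surrogate with U+FFFD is an assumption on my part: it is
one common policy, not something C1 itself mandates.

```typescript
// Hypothetical decoder (a sketch, not a proposed API): converts UTF-16 code
// units into Unicode scalar values. Well-formed surrogate pairs are combined;
// a lone surrogate is never passed through as if it were a character.
// Assumption: lone surrogates are replaced with U+FFFD REPLACEMENT CHARACTER.
function toScalarValues(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Combine the high and low surrogate into a supplementary code point.
        out.push(0x10000 + ((unit - 0xd800) << 10) + (next - 0xdc00));
        i++; // the low surrogate has been consumed
        continue;
      }
    }
    if (unit >= 0xd800 && unit <= 0xdfff) {
      out.push(0xfffd); // unpaired surrogate: substitute, do not interpret
    } else {
      out.push(unit); // BMP code unit equals its scalar value
    }
  }
  return out;
}

const wellFormed = toScalarValues("a\ud83d\udca9"); // "a" then U+1F4A9
const illFormed = toScalarValues("\udca9\ud83d"); // pair in the wrong order
console.log(wellFormed.map((v) => v.toString(16)));
console.log(illFormed.map((v) => v.toString(16)));
```

A caller could equally choose to throw on ill-formed input; the only point of
the sketch is that surrogates show up as encoding machinery and never as
abstract characters in the output.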

Received on Saturday, 25 February 2012 16:20:10 UTC