- From: Glenn Adams <glenn@skynav.com>
- Date: Sat, 25 Feb 2012 09:19:22 -0700
- To: Anne van Kesteren <annevk@opera.com>
- Cc: Norbert Lindenberg <ecmascript@norbertlindenberg.com>, Brendan Eich <brendan@mozilla.com>, es-discuss <es-discuss@mozilla.org>, "public-script-coord@w3.org" <public-script-coord@w3.org>, mranney@voxer.com
- Message-ID: <CACQ=j+c_1_7VuUgkRyk5XCv+AMwnZsS9aPvnCKt1hreh1iDA9w@mail.gmail.com>
On Sat, Feb 25, 2012 at 3:00 AM, Anne van Kesteren <annevk@opera.com> wrote:

> On Sat, 25 Feb 2012 04:50:28 +0100, Brendan Eich <brendan@mozilla.com>
> wrote:
>
>> Norbert Lindenberg wrote:
>>
>>> OK - migrations are hard. But so far most participants have only seen
>>> additional work, no benefits. How long will this take? When will it end?
>>> When will browsers make BRS-on the default, let alone eliminate the switch?
>>> When can Roozbeh abandon his original version? Where's the blue button?
>>>
>>
>> It may be that the BRS is worse than an incompatible change to "full
>> Unicode" as Allen proposed last year. But in either case, something gets
>> harder for Roozbeh. Which is worse?
>>
>
> Is the benefit of doing this switch at all large enough? Even though it
> becomes somewhat easier to deal with 💩 you still have grapheme clusters
> you will need to work around. That is, it is not clear code points are the
> right abstraction point and then you might as well keep 16-bit code units
> with support for surrogate code points so you can render everything above
> BMP too.

I recall the day in the early 90s in the Unicode Technical Committee when we came up with UTF-16. At the time I feared the necessity for this, but here we are.

To answer Anne, I concur that Unicode scalar values (also known as Unicode code points) as opposed to encoded coding elements, i.e., code units, e.g., 16-bit units of UTF-16, are the correct choice.

Grapheme clusters remain in the text processing (i.e., abstract character) domain, and not the encoded character domain. Any equation of "surrogate pair" with "grapheme cluster" is an abuse of terminology, and is *not* a grapheme cluster notwithstanding misguided statements to the contrary.

As a reminder:

*Code Point*. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding <http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf#G2212>.) (2) A value, or position, for a character, in any coded character set.

*Grapheme*. (1) A minimally distinctive unit of writing in the context of a particular writing system. For example, ‹b› and ‹d› are distinct graphemes in English writing systems because there exist distinct words like big and dig. Conversely, a lowercase italiform letter *a* and a lowercase Roman letter a are not distinct graphemes because no word is distinguished on the basis of these two different forms. (2) What a user thinks of as a character.

and

C1 A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character.
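The distinction drawn above between code units, code points, and graphemes can be sketched in a few lines of TypeScript. This is only an illustration under stated assumptions: the helper names (`codeUnits`, `codePoints`, `isScalarValue`) are hypothetical and do not come from this thread or from any proposal discussed in it. The sketch decodes surrogate pairs by hand so that it depends only on `charCodeAt`. It also treats a scalar value, following the Unicode Standard's definition, as any code point outside the surrogate range D800..DFFF.

```typescript
// Hypothetical helpers for illustration; not part of any proposal in this thread.

// UTF-16 code units: what a string of 16-bit units exposes at each index.
function codeUnits(s: string): number[] {
  const units: number[] = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i));
  }
  return units;
}

function isHighSurrogate(u: number): boolean {
  return u >= 0xd800 && u <= 0xdbff;
}

function isLowSurrogate(u: number): boolean {
  return u >= 0xdc00 && u <= 0xdfff;
}

// Code points: a well-formed surrogate pair is combined into one value.
// A lone surrogate is passed through unchanged as a surrogate code point.
function codePoints(s: string): number[] {
  const points: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const hi = s.charCodeAt(i);
    if (isHighSurrogate(hi) && i + 1 < s.length) {
      const lo = s.charCodeAt(i + 1);
      if (isLowSurrogate(lo)) {
        points.push(0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00));
        i++; // skip the low surrogate that was just consumed
        continue;
      }
    }
    points.push(hi);
  }
  return points;
}

// Scalar values: code points in 0..10FFFF excluding the surrogate range.
function isScalarValue(cp: number): boolean {
  return cp >= 0 && cp <= 0x10ffff && !(cp >= 0xd800 && cp <= 0xdfff);
}

const pileOfPoo = "\uD83D\uDCA9"; // U+1F4A9, the character quoted above
const eAcute = "e\u0301"; // "e" followed by U+0301 COMBINING ACUTE ACCENT

console.log(codeUnits(pileOfPoo).length); // 2 code units (a surrogate pair)
console.log(codePoints(pileOfPoo).length); // 1 code point
console.log(codePoints(eAcute).length); // 2 code points, one user-perceived character
```

The last line is the point of the example: moving from code units to code points collapses a surrogate pair into a single value, but a user-perceived character such as "e" plus a combining accent still spans several code points, so grapheme handling remains a separate text-processing concern.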
Received on Saturday, 25 February 2012 16:20:10 UTC