Re: New full Unicode for ES6 idea from Glenn Adams on 2012-03-02 (public-script-coord@w3.org from January to March 2012)

From: Glenn Adams <glenn@skynav.com>
Date: Fri, 2 Mar 2012 01:18:14 -0700
To: Erik Corry <erik.corry@gmail.com>
Cc: "Phillips, Addison" <addison@lab126.com>, Mark Davis ☕ <mark@macchiato.com>, Cameron McCormack <cam@mcc.id.au>, "public-script-coord@w3.org" <public-script-coord@w3.org>, Brendan Eich <brendan@mozilla.com>, Anne van Kesteren <annevk@opera.com>, "mranney@voxer.com" <mranney@voxer.com>, es-discuss <es-discuss@mozilla.org>
Message-ID: <CACQ=j+cYdxW=xq7zYqKeZ13-Ao5HLfn+-U=kNaJyBRdoUm+vYQ@mail.gmail.com>

On Fri, Mar 2, 2012 at 12:58 AM, Erik Corry <erik.corry@gmail.com> wrote:

> 2012/3/1 Glenn Adams <glenn@skynav.com>:
> >> I'd like to plead for a solution rather like the one Java has, where
> >> strings are sequences of UTF-16 codes and there are specialized ways
> >> to iterate over them.  Looking at this entry from the Unicode FAQ:
> >> http://unicode.org/faq/char_combmark.html#7 there are different ways
> >> to describe the length (and iteration) of a string.  The BRS proposal
> >> favours #2, but I think for most applications utf-16-based-#1 is just
> >> fine, and for the applications that want to "do it right" #3 is almost
> >> always the correct solution.  Solution #3 needs library support in any
> >> case and has no problems with UTF-16.
> >>
> >> The central point here is that there are combining characters
> >> (accents) that you can't just normalize away.  Getting them right has
> >> a lot of the same issues as surrogate pairs (you shouldn't normally
> >> chop them up, they count as one 'character', you can't tell how many
> >> of them there are in a string without looking, etc.).  If you can
> >> handle combining characters then the surrogate pair support falls out
> >> pretty much for free.
> >
> >
> > The problem here is that you are mixing apples and oranges. Although it
> > *may* appear that surrogate pairs and grapheme clusters have features in
> > common, they operate at different semantic levels entirely. A solution
> > that attempts to conflate these two levels is going to cause problems at both
> > levels. A distinction should be maintained between the following levels:
> >
> > (1) encoding units (e.g., UTF-16 code units)
> > (2) Unicode scalar values (code points)
> > (3) grapheme clusters
>
> This distinction is not lost on me.  I propose that random access
> indexing and .length in JS should work on level 1,


that's where we are today: indexing and length are based on 16-bit code units
of a UTF-16 encoding, as in Java
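As a minimal sketch of what level 1 means in practice, the snippet below shows `.length` and `charCodeAt` counting 16-bit code units rather than characters. The helper name `codeUnits` and the sample string are only illustrative and do not come from any proposal in this thread.

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so UTF-16
// encodes it as a surrogate pair: two 16-bit code units.
const clef = "\uD834\uDD1E";

// Hypothetical helper: list the 16-bit code units of a string (level 1).
function codeUnits(s) {
  const units = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i));
  }
  return units;
}

console.log(clef.length); // 2: one character, two code units
console.log(codeUnits(clef).map((u) => u.toString(16)).join(" ")); // d834 dd1e
```

Indexing at position 0 or 1 of `clef` yields one half of the surrogate pair, not the character itself.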


> and there should be
> library support for levels 2 and 3.  In order of descending usefulness
> I think the order is 1, 3, 2.  Therefore I don't want to cause a lot
> of backwards compatibility headaches by prioritizing the efficient
> handling of level 2.


from the perspective of indexing "Unicode characters", level 2 is the correct
place;

level 3 is useful for higher-level, language/locale-sensitive text
processing, but it is not particularly interesting at the level of basic ES
string processing; we aren't talking about (or IMO should not be talking
about) a level 3 text processing library in this thread.
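To make the difference between the levels concrete, here is a rough sketch of level 2 access built on top of today's code-unit strings. The function name `codePoints` is an assumption for illustration, not an existing or proposed API, and unpaired surrogates are simply passed through as they are.

```javascript
// Hypothetical helper: decode a UTF-16 string into code points (level 2).
function codePoints(s) {
  const out = [];
  for (let i = 0; i < s.length; i++) {
    const hi = s.charCodeAt(i);
    // A high surrogate followed by a low surrogate forms one code point.
    if (hi >= 0xD800 && hi <= 0xDBFF && i + 1 < s.length) {
      const lo = s.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        out.push(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
        i++; // skip the low surrogate already consumed
        continue;
      }
    }
    // BMP code unit, or an unpaired surrogate passed through unchanged.
    out.push(hi);
  }
  return out;
}

const clefText = "a\uD834\uDD1Eb"; // "a", U+1D11E, "b"
console.log(clefText.length);             // 4 code units (level 1)
console.log(codePoints(clefText).length); // 3 code points (level 2)

// "e" followed by U+0301 COMBINING ACUTE ACCENT: one grapheme cluster
// for the reader (level 3), but still two code points at level 2.
console.log(codePoints("e\u0301").length); // 2
```

The last line is the reason level 3 is a separate matter: decoding surrogate pairs says nothing about combining sequences, which need Unicode segmentation data rather than arithmetic on code units.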

Received on Friday, 2 March 2012 08:19:03 UTC