Re: New full Unicode for ES6 idea

From: Mark Davis ☕ <mark@macchiato.com>
Date: Sun, 19 Feb 2012 16:25:31 -0800
Message-ID: <CAJ2xs_EiRVZmLJHTpx8yZ8R6sh85TqPmxqb9vAQzcvVCyZ5W8w@mail.gmail.com>
To: Cameron McCormack <cam@mcc.id.au>
Cc: Brendan Eich <brendan@mozilla.com>, "public-script-coord@w3.org" <public-script-coord@w3.org>, Anne van Kesteren <annevk@opera.com>, mranney@voxer.com, es-discuss <es-discuss@mozilla.org>
First, it would be great to get full Unicode support in JS. I know that's
been a problem for us at Google.

Secondly, while I agree with Addison that the approach that Java took is
workable, it does cause problems. Ideally someone would be able to loop (a
very common construct) with:

for (codepoint cp : someString) {
  doSomethingWith(cp);
}

In Java, you have to do:

int cp;
for (int i = 0; i < someString.length(); i += Character.charCount(cp)) {
  cp = someString.codePointAt(i);
  doSomethingWith(cp);
}

There are good reasons for why Java did what it did, basically for
compatibility. But if there is some way that JS can work around those,
that'd be great.
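
For reference, here is a minimal, self-contained sketch of the Java pattern above next to the `CharSequence.codePoints()` stream that later Java versions (8 and up) added, which comes closer to the ideal loop. The class and method names (`CodePointLoop`, `withIndexLoop`, `withStream`) are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

class CodePointLoop {
    // Index-based loop: read the code point at i, then advance by the
    // number of UTF-16 code units (1 or 2) that code point occupies.
    static List<Integer> withIndexLoop(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            i += Character.charCount(cp);
        }
        return out;
    }

    // Java 8+ alternative: codePoints() yields an IntStream of code points,
    // so no manual index arithmetic is needed.
    static List<Integer> withStream(String s) {
        return s.codePoints().boxed().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (written as a surrogate pair), 'b'
        String s = "a\uD83D\uDE00b";
        for (int cp : withIndexLoop(s)) {
            System.out.printf("U+%04X%n", cp);
        }
        System.out.println(withIndexLoop(s).equals(withStream(s)));
    }
}
```

Both methods visit three code points for the four-code-unit sample string, which is the behavior the ideal `for` loop is asking for.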

Third, there's some confusion about the Unicode terminology. Here's a quick
clarification:

code point: number from 0 to 0x10FFFF

character: a code point that is assigned. E.g., 0x61 represents 'a' and is a
character. 0x378 is a code point, but not (yet) a character.

code unit: an encoding 'chunk'.
UTF-8 represents a code point as 1-4 8-bit code units
UTF-16 represents a code point as 1 or 2 16-bit code units
UTF-32 represents a code point as 1 32-bit code unit
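
To make the code unit counts concrete, here is a rough Java sketch that reports how many code units one code point takes in each encoding form. The names (`CodeUnits`, `utf8Units`, and so on) are hypothetical, and the helpers assume a non-surrogate code point in the range 0 to 0x10FFFF.

```java
import java.nio.charset.StandardCharsets;

class CodeUnits {
    // UTF-8: encode the code point and count the resulting bytes (1 to 4).
    // Assumes cp is a valid, non-surrogate code point.
    static int utf8Units(int cp) {
        String s = new String(Character.toChars(cp));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF-16: 1 code unit in the BMP, 2 (a surrogate pair) above it.
    static int utf16Units(int cp) {
        return Character.charCount(cp);
    }

    // UTF-32: always exactly one 32-bit code unit.
    static int utf32Units(int cp) {
        return 1;
    }

    public static void main(String[] args) {
        int[] samples = {0x61, 0xE9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X: UTF-8=%d UTF-16=%d UTF-32=%d%n",
                    cp, utf8Units(cp), utf16Units(cp), utf32Units(cp));
        }
        // Code point vs. character: isDefined reports whether the code point
        // is assigned in the Unicode version this Java runtime ships with.
        System.out.println(Character.isDefined(0x61));
        System.out.println(Character.isDefined(0x378));
        // 0x110000 is outside the code point range entirely.
        System.out.println(Character.isValidCodePoint(0x110000));
    }
}
```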

------------------------------
Mark <https://plus.google.com/114199149796022210033>
*— Il meglio è l’inimico del bene —*

On Sun, Feb 19, 2012 at 16:00, Cameron McCormack <cam@mcc.id.au> wrote:

> Brendan Eich:
>
> > To hope to make this sideshow beneficial to all the cc: list, what do
> > DOM specs use to talk about uint16 units vs. code points?
>
> I say "code unit" as a shorter way of saying "16 bit unsigned integer code
> unit"
>
>  http://dev.w3.org/2006/webapi/WebIDL/#dfn-code-unit
>
> (which DOM4 also links to) and then just "code point" to refer to 21 bit
> numbers that might correspond to a Unicode character, which you can see
> used in
>
>  http://dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode
>
> _______________________________________________
> es-discuss mailing list
> es-discuss@mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss
>
Received on Monday, 20 February 2012 00:26:00 UTC