RE: New full Unicode for ES6 idea from Phillips, Addison on 2012-02-21 (public-script-coord@w3.org from January to March 2012)

From: Phillips, Addison <addison@lab126.com>
Date: Tue, 21 Feb 2012 09:14:41 -0800
To: Wes Garland <wes@page.ca>, Brendan Eich <brendan@mozilla.com>
CC: Allen Wirfs-Brock <allen@wirfs-brock.com>, "public-script-coord@w3.org" <public-script-coord@w3.org>, Anne van Kesteren <annevk@opera.com>, "mranney@voxer.com" <mranney@voxer.com>, es-discuss discussion <es-discuss@mozilla.org>
Message-ID: <131F80DEA635F044946897AFDA9AC3476AA7D5C951@EX-SEA31-D.ant.amazon.com>

Because it has always been possible, it’s difficult to say how many scripts have transported byte-oriented data by “punning” the data into strings. Actually, I think this is more likely to be truly binary data rather than text in some non-Unicode character encoding, but anything is possible, I suppose. This could include using non-character values like “FFFE”, “FFFF” in addition to the surrogates. A BRS-running implementation would break a script that relied on String being a sequence of 16-bit unsigned integer values with no error checking.

One of my examples, GB 18030, is a four-byte encoding and a Chinese government standard. It is a mapping onto Unicode, but this mapping is table-driven rather than algorithm driven like the UTF-* transport formats. To provide a single example, Unicode 0x2259 maps onto GB 18030 0x8136D830.
AP> GB 18030 is more complex than that. Not all characters are four-byte, for example. As a multibyte encoding, you might choose to “pun” GB 18030 into a String as 81 36 d8 30. There isn’t much attraction to punning it into 0x8136 0xd830, but, as noted above, someone might be foolish enough to try it ;-). Scripts that rely on this probably break under BRS.

You're right about Big5 being byte-oriented, maybe this was a bad example, although it is a double-byte charset. It works by putting ASCII down low making bytes above 0x7f escapes into code pages dereferenced by the next byte. Each code point is encoded with one or two bytes, never more. If I were developing with Big5 in JS, I would store the byte stream 4a 4b d8 00 c1 c2 4c as 004a 004b d800 c1c2 004c. This would allow me to use JS regular expressions and so on.
Not exactly. The trailing bytes in Big5 start at 0x40, for example. But it is certainly the case that some multibyte characters in Big5 happen to have the same byte-pair as a surrogate code point (when considered as a pair of bytes) or other non-character in the Unicode BMP, and one might (he says, squinting really hard) want to do as you suggest and record the multibyte sequence as a single code point.
But the data does not need to arrive from C API -- it could easily be delivered by an XHR request where, say, the remote end dumps database rows into a transport format based around evaluating JS string literals (like JSON).
Allowing isolated invalid sequences isn’t actually the problem, if you think about it. Yes, the data is bad and yes you can’t view it cleanly. But you can do whatever you need to on it.
The problem is when you intend to store two values that end up as a single character. If I have a string with code points “f235 5e7a e040 d800”, the d800 does no particular harm. The problem is: if I construct a BRS string using that sequence and then concatenate the sequence “dc00 a053 3254” onto it, the resulting string is only *six* characters long, rather than the expected seven, since presumably the d800 dc00 pair turns into U+10000.
Addison

Received on Tuesday, 21 February 2012 17:18:51 UTC