- From: Brendan Eich <brendan@mozilla.com>
- Date: Tue, 21 Feb 2012 09:55:15 -0800
- To: "Phillips, Addison" <addison@lab126.com>
- CC: Wes Garland <wes@page.ca>, Allen Wirfs-Brock <allen@wirfs-brock.com>, "public-script-coord@w3.org" <public-script-coord@w3.org>, Anne van Kesteren <annevk@opera.com>, "mranney@voxer.com" <mranney@voxer.com>, es-discuss discussion <es-discuss@mozilla.org>
Phillips, Addison wrote:
>
> Because it has always been possible, it’s difficult to say how many
> scripts have transported byte-oriented data by “punning” the data into
> strings. Actually, I think this is more likely to be truly binary data
> rather than text in some non-Unicode character encoding, but anything
> is possible, I suppose. This could include using non-character values
> like “FFFE”, “FFFF” in addition to the surrogates. A BRS-running
> implementation would break a script that relied on String being a
> sequence of 16-bit unsigned integer values with no error checking.
>

Allen's view of the BRS-enabled semantics would have 16-bit "GIGO" without exceptions -- you'd be storing 16-bit values, whatever their source (including "\uXXXX" literals spelling invalid characters and unmatched surrogates), in at-least-21-bit elements of strings, and reading them back.

My concern, and my reason for advocating early or late errors on shenanigans, was that people who today write surrogate pairs literally and then take extra pains in JS or C++ (whatever the host language might be) to process them as single code points and characters would be broken by the BRS-enabled behavior of separating the parts into distinct code points.

But that's pessimistic. It could happen, but OTOH anyone coding surrogate pairs might want them to read back piece-wise when indexing. In that case what Allen proposes, storing each formerly 16-bit code unit, however expressed, in the wider 21-or-more-bit unit, and reading back likewise, would "just work".

Sorry if this is all obvious. Mainly I want to throw in my lot with Allen's exception-free literal/constructor approach. The encoding APIs should throw on invalid Unicode, but literals and strings as immutable 16-bit storage buffers should work as today.

/be
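The piece-wise read-back and the "encoding APIs throw, literals don't" split can be sketched in JavaScript as it behaves today, without any BRS. This is only an illustration under stated assumptions: `codeUnits` and `encodes` are hypothetical helper names, and `encodeURIComponent` stands in for "an encoding API" because it throws a `URIError` on an unmatched surrogate. It does not show the proposed BRS-enabled semantics themselves.

```javascript
// A surrogate pair spelled as two \uXXXX escapes is stored as two
// 16-bit code units and reads back piece-wise when indexed.
const pair = "\uD83D\uDE00"; // the two halves of U+1F600

// Hypothetical helper: list the 16-bit code units of a string.
function codeUnits(s) {
  const units = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i));
  }
  return units;
}

// An unmatched surrogate and a noncharacter are both accepted in
// literals without any exception ("GIGO").
const lone = "\uD800";
const nonChar = "\uFFFF";

// Hypothetical helper: does an encoding API accept this string?
// encodeURIComponent must produce UTF-8, so it rejects a lone
// surrogate with a URIError.
function encodes(s) {
  try {
    encodeURIComponent(s);
    return true;
  } catch (e) {
    if (e instanceof URIError) return false;
    throw e;
  }
}

console.log(codeUnits(pair).map((u) => u.toString(16)));
console.log(encodes(pair), encodes(lone));
```

Here the literal and indexing paths never throw, whatever 16-bit values are stored; only the step that has to emit well-formed UTF-8 rejects the unmatched surrogate.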
Received on Tuesday, 21 February 2012 17:55:45 UTC