- From: Brendan Eich <brendan@mozilla.com>
- Date: Sun, 19 Feb 2012 19:52:23 -0800
- To: Gavin Barraclough <barraclough@apple.com>
- CC: Allen Wirfs-Brock <allen@wirfs-brock.com>, public-script-coord@w3.org, Anne van Kesteren <annevk@opera.com>, mranney@voxer.com, es-discuss discussion <es-discuss@mozilla.org>
Gavin Barraclough wrote:
> One way in which the proposal under discussion seems to differ from
> the previous strawman is in the behavior arising from concatenation of
> strings ending/beginning with a surrogate hi and lo element.
>
> How do we want to handle unpaired UTF-16 surrogates in a full-unicode
> string? I can see three options:
>
> 1) Prohibit values from strings that do not map to valid unicode
> characters (either throw an exception, or replace with the unicode
> replacement character).
> 2) Allow invalid unicode characters in strings, and preserve them over
> concatenation – ("\uD800" + "\uDC00").length == 2.
> 3) Allow invalid unicode characters in strings, but allow surrogate
> pairs to fuse over concatenation – ("\uD800" + "\uDC00").length == 1.
>
> It seems desirable for full-unicode strings to logically be a sequence
> of unicode characters, stored and processed in an encoding-agnostic
> manner. Option 3 would seem to violate that, exposing the underlying
> UTF-16 implementation. It also loses a distributive property of
> .length over concatenation that I believe is true in ES5 for strings,
> in that currently for all strings s1 & s2:
> s1.length + s2.length == (s1 + s2).length
> However if we allow concatenation to fuse surrogate pairs into a
> single character (e.g. s1 = "\uD800", s2 = "\uDC00") this will no
> longer be true.
>
> I guess I wonder if it's worth considering either options 1) or 2) –
> either prohibiting invalid unicode characters in strings, or consider
> something closer to the previous strawman, where string storage is
> defined to be 32-bit (with a BRS that instead of changing iteration
> would change string creation, introducing an implicit UTF16-UTF32
> conversion).

Great post. I agree 3 is not good.

I was thinking based on today's exchanges that the BRS being set to "full Unicode" *could* mean that "\uXXXX" is illegal and you *must* use "\u{...}" to write Unicode *code points* (not code units).
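
To make the length argument concrete, here is a rough ES5-style sketch. `countCodePoints` is a hypothetical helper written for this illustration (it is not part of any proposal in this thread); it counts a well-formed surrogate pair as one character and a lone surrogate as one character, which is the "fusing" view of option 3.

```javascript
// Hypothetical helper: counts characters the way option 3 would,
// treating a hi+lo surrogate pair as one unit and a lone surrogate as one.
function countCodePoints(s) {
  var count = 0;
  for (var i = 0; i < s.length; i++) {
    var unit = s.charCodeAt(i);
    var isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    if (isHigh && i + 1 < s.length) {
      var next = s.charCodeAt(i + 1);
      // Skip the low surrogate so the pair is counted once.
      if (next >= 0xDC00 && next <= 0xDFFF) i++;
    }
    count++;
  }
  return count;
}

var hi = "\uD800";    // lone high surrogate
var lo = "\uDC00";    // lone low surrogate
var joined = hi + lo; // the two code units now form a valid pair

// ES5 code-unit semantics: .length is additive over concatenation.
var additiveInCodeUnits = hi.length + lo.length === joined.length;

// Fusing semantics: the count is no longer additive (1 + 1 vs. 1).
var additiveWhenFused =
  countCodePoints(hi) + countCodePoints(lo) === countCodePoints(joined);
```

Under these assumptions `additiveInCodeUnits` holds while `additiveWhenFused` does not, which is the property Gavin points out would be lost.
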
Last year we dispensed with the binary data hacking in strings use-case. I don't see the hardship. But rather than throw exceptions on concatenation I would simply eliminate the ability to spell code units with "\uXXXX" escapes. Who's with me?

/be
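
For comparison, a small sketch of the two escape spellings, assuming an engine that accepts the "\u{...}" syntax and keeps code-unit strings (that is, without the "full Unicode" BRS setting discussed here). The variable names are only illustrative.

```javascript
// U+1F600 spelled as two UTF-16 code units with "\uXXXX" escapes.
var viaPair = "\uD83D\uDE00";

// The same character spelled as a single code point with "\u{...}".
var viaCodePoint = "\u{1F600}";

// With code-unit strings both spellings produce the same two-unit string.
var sameString = viaPair === viaCodePoint;
var lengthInCodeUnits = viaCodePoint.length;

// A lone surrogate can only be written via the code-unit spelling in the
// scheme proposed above; here it is simply a one-unit string.
var loneSurrogate = "\uD800";
```

In a mode where "\uXXXX" is disallowed, only the second spelling would remain, so unpaired surrogates could not be written as literals and the concatenation question would not arise from escapes.
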
Received on Monday, 20 February 2012 03:52:51 UTC