- From: Erik Corry <erik.corry@gmail.com>
- Date: Thu, 1 Mar 2012 13:58:04 +0100
- To: "Phillips, Addison" <addison@lab126.com>
- Cc: Mark Davis <mark@macchiato.com>, Cameron McCormack <cam@mcc.id.au>, "public-script-coord@w3.org" <public-script-coord@w3.org>, Brendan Eich <brendan@mozilla.com>, Anne van Kesteren <annevk@opera.com>, "mranney@voxer.com" <mranney@voxer.com>, es-discuss <es-discuss@mozilla.org>

I'm not in favour of big red switches, and I don't think the compartment based solution is going to be workable.

I'd like to plead for a solution rather like the one Java has, where strings are sequences of UTF-16 codes and there are specialized ways to iterate over them. Looking at this entry from the Unicode FAQ: http://unicode.org/faq/char_combmark.html#7 there are different ways to describe the length (and iteration) of a string. The BRS proposal favours #2, but I think for most applications utf-16-based-#1 is just fine, and for the applications that want to "do it right" #3 is almost always the correct solution. Solution #3 needs library support in any case and has no problems with UTF-16.

The central point here is that there are combining characters (accents) that you can't just normalize away. Getting them right has a lot of the same issues as surrogate pairs (you shouldn't normally chop them up, they count as one 'character', you can't tell how many of them there are in a string without looking, etc.). If you can handle combining characters then the surrogate pair support falls out pretty much for free.

Advantages of my proposal:

* High level of backwards compatibility
* No issues of where to place the BRS
* Compact and simple in the implementation
* Can be polyfilled on most VMs
* Interaction with the DOM is unproblematic
* No issues of what happens on concatenation if a surrogate pair is created.

Details:

* The built-in string charCodeAt, [], length operations work in terms of UTF-16
* String.fromCharCode(x) can return a string with a length of 2
* New object StringIterator

new StringIterator(backing) returns a string iterator.
The iterator has the following methods:

  hasNext();            // Returns this.index() != this.backing().length
  nextGrapheme();       // Returns the next grapheme as a unicode code point,
                        // or -1 if the next grapheme is a sequence of code points
  nextGraphemeArray();  // Returns an array of numeric code points (possibly
                        // just one) representing the next grapheme
  nextCodePoint();      // Returns the next code point, possibly consuming
                        // both halves of a surrogate pair
  index();              // Gets the current index in the string, from 0 to length
  setIndex();           // Sets the current index in the string, from 0 to length
  backing();            // Get the backing string

  // Optionally
  hasPrevious();
  previous*();          // Analogous to nextGrapheme etc.
  codePointLength();    // Takes O(length), cache the answer if you care
  graphemeLength();     // Ditto

If any of the next.. functions encounter an unmatched half of a surrogate pair they just return its number.

Regexp support. Regexps act 'as if' the following steps were performed.

Outside character classes an extended character turns into (?:xy) where x and y are the two halves of its surrogate pair.

Inside positive character classes the extended characters are extracted so [abz] becomes (?:[ab]|xy) where z is an extended character and x and y are the two halves of its surrogate pair.

Negative character classes can be handled by transforming into negative lookaheads.

A decent set of unicode character classes will likely subsume most uses of these transformations.

Perhaps the BRS 21 bit solution feels marginally cleaner, but having two different kinds of strings in the same VM feels like a horrible solution that is user visible and will haunt implementations forever, and the cleanliness difference is very marginal given that grapheme based iteration is the correct solution for almost all the cases where iterating over utf-16 codes is not good enough.

--
Erik Corry

2012/2/20 Phillips, Addison <addison@lab126.com>:
> Mark wrote:
>
> First, it would be great to get full Unicode support in JS.
> I know that's been a problem for us at Google.
>
> AP> +1: I think we’ve waited for supplementary character support long
> enough!
>
> Secondly, while I agree with Addison that the approach that Java took is
> workable, it does cause problems.
>
> AP> The tension is between “compatibility” and “ease of use” here, I think.
> The question is whether very many scripts depend on the ‘uint16’ nature of a
> character in ES, use surrogates to effect supplementary character support,
> or are otherwise tied to the existing encoding model and are broken as a
> result of changes. In its ideal form, an ES string would logically be a
> sequence of Unicode characters (code points) and only the internal
> representation would worry about whatever character encoding scheme made the
> most sense (in many cases, this might actually be UTF-16).
>
> AP> … but what I think is hard to deal with are different modes of
> processing scripts depending on “fullness of the Unicode inside”.
> Admittedly, the approach I favor is rather conservative and presents a
> number of challenges, most notably in adapting regex or for users who want
> to work strictly in terms of character values.
>
> There are good reasons for why Java did what it did, basically for
> compatibility. But if there is some way that JS can work around those,
> that'd be great.
>
> AP> Yes, it would.
>
> ~Addison
>
> _______________________________________________
> es-discuss mailing list
> es-discuss@mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss

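For illustration, the code point part of the StringIterator proposed in this message could be polyfilled roughly as follows. This is only a sketch under assumptions: the private field names (_backing, _index) and the RangeError behaviour are invented here and are not part of the proposal, and the grapheme methods are left out because they need Unicode data from a library.

```javascript
// Hypothetical sketch of the proposed StringIterator (code point methods only).
// Field names and error handling are assumptions made for this example.
class StringIterator {
  constructor(backing) {
    this._backing = String(backing);
    this._index = 0;
  }

  // Get the backing string.
  backing() { return this._backing; }

  // Current index in UTF-16 code units, from 0 to length.
  index() { return this._index; }

  // Set the current index; rejects values outside 0..length (an assumption).
  setIndex(i) {
    if (!Number.isInteger(i) || i < 0 || i > this._backing.length) {
      throw new RangeError("index out of range: " + i);
    }
    this._index = i;
  }

  hasNext() { return this._index !== this._backing.length; }

  // Returns the next code point, consuming two code units when they form
  // a surrogate pair. An unmatched surrogate is returned as its own number.
  nextCodePoint() {
    if (!this.hasNext()) throw new RangeError("no more code points");
    const s = this._backing;
    const hi = s.charCodeAt(this._index++);
    if (hi >= 0xD800 && hi <= 0xDBFF && this._index < s.length) {
      const lo = s.charCodeAt(this._index);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        this._index++;
        return ((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000;
      }
    }
    return hi;
  }

  // O(length): walks the whole string, then restores the saved index.
  codePointLength() {
    const saved = this._index;
    this._index = 0;
    let n = 0;
    while (this.hasNext()) { this.nextCodePoint(); n++; }
    this._index = saved;
    return n;
  }
}

// "a", U+1F600 as a surrogate pair, then an unmatched high surrogate.
const it = new StringIterator("a\uD83D\uDE00\uD800");
const seen = [];
while (it.hasNext()) seen.push(it.nextCodePoint().toString(16));
console.log(seen.join(" "));       // 61 1f600 d800
console.log(it.backing().length);  // 4 (UTF-16 code units)
console.log(it.codePointLength()); // 3
```

Note that length, charCodeAt and [] on the backing string keep their UTF-16 meaning; only the iterator is aware of surrogate pairs, which is why this kind of approach can be added without changing existing strings.
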
Received on Thursday, 1 March 2012 12:58:38 UTC