- From: L. David Baron <dbaron@dbaron.org>
- Date: Tue, 30 Jun 2009 15:43:24 -0700
- To: public-webapps@w3.org, jwalden@mit.edu, jonas@sicking.cc, annevk@opera.com
On Wednesday 2009-06-17 16:26 +1000, Cameron McCormack wrote:
> Jonas Sicking:
> > Yes, I don't see how we could handle this in WebIDL, other than
> > defining that all DOMStrings must be structurally correct UTF-16.
> > However that would be prohibitively expensive since we would have to
> > add checks in many many places.
>
> I agree, I don’t think it would be good to require this.
>
> Anne van Kesteren:
> > Web IDL could define algorithms how you convert a DOMString to and
> > from UTF-8. And maybe other encodings if that is desirable.
>
> I added a simple algorithm that converts a sequence of 16 bit code units
> to a sequence of Unicode characters, inserting U+FFFD characters when
> bad surrogates are used:
>
>   http://dev.w3.org/2006/webapi/WebIDL/#dfn-obtain-unicode
>
> Nothing in Web IDL references this algorithm. Other specs can do so if
> it is useful.

This algorithm seems incorrect in two ways:

 * It gets the ranges for high and low surrogates backwards. (High
   surrogates are U+D800 - U+DBFF, low surrogates are U+DC00 -
   U+DFFF, and in UTF-16 a surrogate pair is a high surrogate
   followed by a low surrogate. So swapping the ranges in the
   headings should make the algorithm correct, modulo the next
   point. But you should definitely double-check this. :-)

 * It incorrectly handles unpaired high surrogates by eating the
   next character. Instead, I would prefer that the unpaired high
   surrogate is replaced by U+FFFD, and the following character is
   interpreted normally. (That's what Gecko does, anyway.
   Furthermore, I think it makes sense to match the handling of
   unpaired low surrogates.)

-David

-- 
L. David Baron                                 http://dbaron.org/
Mozilla Corporation                       http://www.mozilla.com/
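
For illustration only, here is a minimal Python sketch of the behavior suggested in the two points above: high surrogates are U+D800 - U+DBFF, low surrogates are U+DC00 - U+DFFF, and an unpaired surrogate of either kind becomes U+FFFD without consuming the code unit that follows it. The function name and the choice to return a list of code points are assumptions made for this sketch; it is not the text of the Web IDL algorithm and not Gecko's implementation.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def code_units_to_code_points(units):
    """Convert a sequence of 16-bit code units to a list of code points.

    Hypothetical helper for illustration: unpaired surrogates are
    replaced by U+FFFD and the following code unit is read normally.
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: valid only when a low surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
            # Unpaired high surrogate: replace it, but do not eat
            # the next code unit.
            out.append(REPLACEMENT)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            out.append(REPLACEMENT)
        else:
            # Ordinary BMP code unit.
            out.append(u)
        i += 1
    return out
```

With this sketch, the sequence 0xD800, 0x0041 yields U+FFFD followed by U+0041, so the "A" after the unpaired high surrogate is kept rather than swallowed.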
Received on Tuesday, 30 June 2009 22:44:17 UTC