- From: Ian Hickson <ian@hixie.ch>
- Date: Tue, 15 Sep 2009 02:03:54 +0000 (UTC)
On Tue, 8 Sep 2009, Øistein E. Andersen wrote:
>
> According to the spec, character references may cause surrogate characters
> (0xD800 to 0xDFFF) to be inserted into the DOM. Assuming that the DOM is an
> UTF-16 environment, &#xD800;&#xDC00; and &#x10000; will both result in
> \xD800\xDC00 or U+1,0000. This should probably be pointed out explicitly
> since extra processing has to be done to achieve the same result in a parser
> that is not built atop UTF-16.

Actually it's the other way around. Extra work has to be done in UTF-16
environments to make sure that Unicode characters in the surrogate
character range don't get processed as surrogate characters.

(That is, regardless of the environment, &#xD800;&#xDC00; and &#x10000;
are not the same -- the first has two invalid characters U+D800 and
U+DC00, the second has one character U+10000.)

I'm not really sure how to make that clearer in the spec. I suppose we
could just change the spec and say that surrogate characters (whether
literal characters, e.g. in UTF-8, or from character references) all get
converted to U+FFFD?

> Furthermore, it is not entirely clear whether a mixed form like \xD800&#xDC00;
> encoded in UTF-16BE should give \xD800\xDC00 or \xFFFD\xDC00.

It should give U+FFFD U+DC00. It's not clear to me why that is not clear.
:-) Could you walk me through the spec interpreting it in such a way that
you get any other result?

> Not all browsers convert unpaired surrogates in UTF-16 to U+FFFD, so the
> mixed form may be interpreted as U+1,0000.

The spec says "Bytes or sequences of bytes in the original byte stream
that could not be converted to Unicode characters must be converted to
U+FFFD REPLACEMENT CHARACTER code points".

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
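The two cases discussed above can be sketched in Python, whose strings are sequences of code points rather than UTF-16 code units. This is only a rough illustration, not the HTML5 tokenizer: expand_hex_refs, code_points and decode_mixed are hypothetical helper names, and Python's "replace" error handler is assumed here as a stand-in for the spec's rule about bytes that cannot be converted to Unicode characters.

```python
import re

def expand_hex_refs(text):
    """Toy expander (hypothetical): each &#xHHHH; becomes exactly one code point."""
    return re.sub(r"&#x([0-9A-Fa-f]+);",
                  lambda m: chr(int(m.group(1), 16)), text)

def code_points(s):
    """List the code points of s as U+XXXX labels (safe to print, unlike lone surrogates)."""
    return [f"U+{ord(c):04X}" for c in s]

def decode_mixed():
    """A lone high-surrogate code unit followed by the text '&#xDC00;', all in UTF-16BE."""
    raw = b"\xD8\x00" + "&#xDC00;".encode("utf-16-be")
    # Assumption: errors="replace" models "could not be converted ... U+FFFD".
    decoded = raw.decode("utf-16-be", errors="replace")
    return expand_hex_refs(decoded)

pair = expand_hex_refs("&#xD800;&#xDC00;")
astral = expand_hex_refs("&#x10000;")

print(code_points(pair))            # ['U+D800', 'U+DC00']
print(code_points(astral))          # ['U+10000']
print(pair == astral)               # False
print(code_points(decode_mixed()))  # ['U+FFFD', 'U+DC00']
```

In an environment like this the two references never collapse into one character, and the mixed form yields U+FFFD U+DC00. A UTF-16 based implementation would have to do extra bookkeeping to keep the two code units from being read as a pair.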
Received on Monday, 14 September 2009 19:03:54 UTC