- From: Øistein E. Andersen <liszt@coq.no>
- Date: Tue, 8 Sep 2009 23:39:03 +0100
According to the spec, character references may cause surrogate characters (0xD800 to 0xDFFF) to be inserted into the DOM. Assuming that the DOM is a UTF-16BE environment, &#xD800;&#xDC00; and &#x10000; will both result in \xD800\xDC00, i.e. U+10000. This should probably be pointed out explicitly, since a parser that is not built atop UTF-16BE has to do extra processing to achieve the same result.

Furthermore, it is not entirely clear whether a mixed form like \xD800&#xDC00; (a literal unpaired high surrogate encoded in UTF-16BE, followed by a character reference to a low surrogate) should give \xD800\xDC00 or \xFFFD\xDC00. Not all browsers convert unpaired surrogates in UTF-16 input to U+FFFD, so the mixed form may be interpreted as U+10000.

-- Øistein E. Andersen
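To make the two possible outcomes concrete, here is a minimal Python sketch of the extra processing a parser not built on UTF-16 would need. It is an illustration only, not what the spec or any browser does: the function names (`combine`, `pair_surrogates`, `sanitize_unpaired`) are hypothetical, and code units are modelled as plain integers.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def is_high(u):
    return 0xD800 <= u <= 0xDBFF


def is_low(u):
    return 0xDC00 <= u <= 0xDFFF


def combine(high, low):
    """Standard UTF-16 arithmetic: map a surrogate pair to a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def pair_surrogates(units):
    """Merge adjacent high+low surrogates; leave unpaired ones untouched.

    This mimics what a UTF-16 DOM does implicitly when two surrogate
    code units end up next to each other.
    """
    out, i = [], 0
    while i < len(units):
        if is_high(units[i]) and i + 1 < len(units) and is_low(units[i + 1]):
            out.append(combine(units[i], units[i + 1]))
            i += 2
        else:
            out.append(units[i])
            i += 1
    return out


def sanitize_unpaired(units):
    """Replace unpaired surrogates with U+FFFD; keep well-formed pairs.

    This models an input decoder that rejects lone surrogates (assumption:
    it runs before character references are expanded).
    """
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if is_high(u) and i + 1 < len(units) and is_low(units[i + 1]):
            out.extend(units[i:i + 2])
            i += 2
        else:
            out.append(REPLACEMENT if is_high(u) or is_low(u) else u)
            i += 1
    return out


# Two surrogate references in a row behave like one reference to U+10000.
both_references = pair_surrogates([0xD800, 0xDC00])

# Mixed form: a literal 0xD800 in the input, then a reference giving 0xDC00.
lenient = pair_surrogates([0xD800] + [0xDC00])   # decoder keeps the lone surrogate
strict = sanitize_unpaired([0xD800]) + [0xDC00]  # decoder replaces it first
```

Under these assumptions, `lenient` ends up as the single code point U+10000, whereas `strict` ends up as U+FFFD followed by a lone 0xDC00, which are the two readings of the mixed form described above.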
Received on Tuesday, 8 September 2009 15:39:03 UTC