[whatwg] Surrogate pairs and character references

According to the spec, character references may cause surrogate
code points (0xD800 to 0xDFFF) to be inserted into the DOM.  Assuming
that the DOM is a UTF-16BE environment, &#xD800;&#xDC00; and
&#x10000; will both result in \xD800\xDC00, i.e. U+10000.  This should
probably be pointed out explicitly, since extra processing has to be
done to achieve the same result in a parser that is not built atop
UTF-16BE.
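
To illustrate the extra processing meant here, the following is a minimal
sketch (not taken from the spec) of how a parser that stores code points
rather than UTF-16 code units might fold two adjacent surrogate references
into one code point.  The function name combine_surrogates is hypothetical.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 surrogate pair to the code point it encodes.

    Hypothetical helper: a parser that is not built atop UTF-16 would
    need something like this after seeing two adjacent references such
    as &#xD800; followed by &#xDC00;.
    """
    if not (0xD800 <= high <= 0xDBFF):
        raise ValueError("first value is not a high surrogate")
    if not (0xDC00 <= low <= 0xDFFF):
        raise ValueError("second value is not a low surrogate")
    # Standard UTF-16 arithmetic: 10 bits from each half, offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# &#xD800;&#xDC00; then yields the same code point as &#x10000;
assert combine_surrogates(0xD800, 0xDC00) == 0x10000
```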

Furthermore, it is not entirely clear whether a mixed form like
\xD800&#xDC00; encoded in UTF-16BE should give \xD800\xDC00 or
\xFFFD\xDC00.  Not all browsers convert unpaired surrogates in UTF-16
to U+FFFD, so the mixed form may be interpreted as U+10000.
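
The two readings can be made concrete with a small sketch.  It assumes a
simplified model in which the literal input is a list of UTF-16 code units
and the character reference contributes one further code unit; the names
decode_units and replace_unpaired are hypothetical and do not describe any
particular browser.

```python
REPLACEMENT = 0xFFFD


def is_high(u: int) -> bool:
    return 0xD800 <= u <= 0xDBFF


def is_low(u: int) -> bool:
    return 0xDC00 <= u <= 0xDFFF


def decode_units(units, replace_unpaired=True):
    """Decode UTF-16 code units into a list of code points.

    With replace_unpaired=True, a lone surrogate becomes U+FFFD;
    with False, it is passed through unchanged.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if is_high(u) and i + 1 < len(units) and is_low(units[i + 1]):
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
            continue
        if replace_unpaired and (is_high(u) or is_low(u)):
            out.append(REPLACEMENT)
        else:
            out.append(u)
        i += 1
    return out


literal = [0xD800]   # the raw \xD800 in the byte stream
reference = 0xDC00   # the value of &#xDC00;

# Reading 1: the input decoder replaces the lone surrogate before the
# reference is expanded, giving \xFFFD\xDC00.
strict = decode_units(literal, replace_unpaired=True) + [reference]

# Reading 2: the lone surrogate survives decoding, the reference is
# appended, and the two units then read as a pair, giving U+10000.
lenient_units = decode_units(literal, replace_unpaired=False) + [reference]
lenient = decode_units(lenient_units, replace_unpaired=False)

assert strict == [0xFFFD, 0xDC00]
assert lenient == [0x10000]
```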

-- 
Øistein E. Andersen

Received on Tuesday, 8 September 2009 15:39:03 UTC