[whatwg] Surrogate pairs and character references from Øistein E. Andersen on 2009-09-08 (public-whatwg-archive@w3.org from September 2009)

From: Øistein E. Andersen <liszt@coq.no>
Date: Tue, 8 Sep 2009 23:39:03 +0100
Message-ID: <3CE0B6C3-1A6D-44AC-A9AA-D7C7B4C71363@coq.no>

According to the spec, character references may cause surrogate
characters (0xD800 to 0xDFFF) to be inserted into the DOM.  Assuming
that the DOM is a UTF-16BE environment, &#xD800;&#xDC00; and
&#x10000; will both result in \xD800\xDC00, i.e. U+10000.  This
should probably be pointed out explicitly, since extra processing has
to be done to achieve the same result in a parser that is not built
atop UTF-16BE.
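
To illustrate the extra processing, here is a minimal Python sketch of
the arithmetic that relates a surrogate pair to a single code point.
The function names are made up for this example and are not taken from
the spec or from any parser.

```python
from typing import Tuple


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first value is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second value is not a low surrogate")
    # Each surrogate carries 10 bits of the offset above U+FFFF.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point into its surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary planes")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# &#xD800;&#xDC00; and &#x10000; name the same character.
print(hex(combine_surrogates(0xD800, 0xDC00)))
print([hex(u) for u in split_code_point(0x10000)])
```

A parser that stores code points rather than UTF-16 code units would
need something like combine_surrogates() whenever a low-surrogate
reference directly follows a high surrogate.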

Furthermore, it is not entirely clear whether a mixed form like
\xD800&#xDC00; encoded in UTF-16BE should give \xD800\xDC00 or
\xFFFD\xDC00.  Not all browsers convert unpaired surrogates in UTF-16
to U+FFFD, so the mixed form may be interpreted as U+10000.
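
The two possible outcomes for the mixed form can be sketched as
follows.  This is only a model under stated assumptions: the input is
taken to be a list of 16-bit code units, the character reference is
assumed to be expanded after the byte stream has been decoded, and the
names decode_units() and mixed_form() are hypothetical.  It does not
describe what any particular browser does.

```python
REPLACEMENT = 0xFFFD


def is_high(u: int) -> bool:
    return 0xD800 <= u <= 0xDBFF


def is_low(u: int) -> bool:
    return 0xDC00 <= u <= 0xDFFF


def decode_units(units, replace_unpaired):
    """Turn UTF-16 code units into code points.

    Valid pairs are always combined.  An unpaired surrogate becomes
    U+FFFD when replace_unpaired is true and is passed through
    unchanged otherwise.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if is_high(u) and i + 1 < len(units) and is_low(units[i + 1]):
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        elif (is_high(u) or is_low(u)) and replace_unpaired:
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(u)
            i += 1
    return out


def mixed_form(replace_unpaired):
    """Model \\xD800&#xDC00;: a raw high surrogate, then a reference."""
    decoded = decode_units([0xD800], replace_unpaired)  # byte stream part
    return decoded + [0xDC00]  # value produced by &#xDC00;


strict = mixed_form(replace_unpaired=True)    # decoder replaces lone \xD800
lenient = mixed_form(replace_unpaired=False)  # decoder lets \xD800 through
print([hex(u) for u in strict])
print([hex(u) for u in lenient])
# Read back as UTF-16, the lenient result forms a single character.
print([hex(c) for c in decode_units(lenient, replace_unpaired=False)])
```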

-- 
Øistein E. Andersen

Received on Tuesday, 8 September 2009 15:39:03 UTC