Re: numeric character references and Unicode surrogate pairs: part of my review of 8 The HTML syntax from Philip Taylor on 2007-08-20 (public-html@w3.org from August 2007)

From: Philip Taylor <philip@zaynar.demon.co.uk>
Date: Mon, 20 Aug 2007 16:32:38 +0100
To: public-html WG <public-html@w3.org>
Message-ID: <46C9B416.5090503@zaynar.demon.co.uk>

Cameron McCormack wrote:
> Robert Burns:
>>> I believe this is not consistent with existing browser behavior. That is  
>>> that while surrogate pairs, expressed as pairs of numeric character  
>>> references, are not supposed to resolve to the non-BMP character,  
>>> browsers do it anyway.
> 
> Anne van Kesteren:
>> Do you have any tests to demonstrate that?
> 
> Here’s one:
> 
>   data:text/html,%26%23xD800%3B%26%23xDC00%3B
> 
> Shows as a single U+10000 character in Firefox 2.0.0.5 and Opera 9.23,
> at least.

I also get a single character rendered in FF2, Opera 9.2, IE6, IE7 and 
Safari 3 (Windows). I get two rendered U+FFFD characters in FF3 (build 
2007081904).
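
For anyone reading along, the test case above is just two adjacent numeric character references. The Python sketch below decodes the percent-encoded data: URL body and shows the standard UTF-16 arithmetic that maps the pair to U+10000. The language and the helper name combine_surrogates are illustrative choices only; nothing in the thread depends on them.

```python
from urllib.parse import unquote

# Body of the data: URL, after the "data:text/html," prefix.
encoded = "%26%23xD800%3B%26%23xDC00%3B"
markup = unquote(encoded)
print(markup)  # &#xD800;&#xDC00;


def combine_surrogates(high: int, low: int) -> int:
    """Standard UTF-16 decoding of a surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(combine_surrogates(0xD800, 0xDC00)))  # 0x10000
```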

There's less consistency in other edge cases: 
http://software.hixie.ch/utilities/js/live-dom-viewer/?%3C!DOCTYPE%20HTML%3E%3Cp%3E0%3A%20%26%23xd800%3B%26%23xdc00%3B%3Cp%3E1%3A%20%26%23xd800%3B%3Cscript%3Edocument.write(%27\udc00%27)%3C/script%3E%3Cp%3E2%3A%20%3Cscript%3Edocument.write(%27\ud800%27)%3C/script%3E%26%23xdc00%3B%3Cp%3E3%3A%20%3Cscript%3Edocument.write(%27%26%23xd800%3B\udc00%27)%3C/script%3E

It's not obvious to me which cases should be handled in which way.

If "&#xd800;&#xdc00;" were to be handled the way every browser except FF3 
handles it, the only straightforward implementation I can think of is 
something like:

   In the intro to the algorithm, add:
   "A 'surrogate entity' flag is used to handle characters encoded as a 
surrogate pair split over two numeric entities. It is either true or 
false, and must initially be false. A 'high surrogate code point' 
variable is used for the same purpose, and contains values in the range 
0xD800 to 0xDBFF."
   ...
   "Immediately before a token is emitted, if the 'surrogate entity' 
flag is true, then a U+FFFD character token must be emitted and the 
'surrogate entity' flag must be set to false."

   In "Tokenising entities", change to:
   "If it is a high surrogate code point (in the range 0xD800 to 
0xDBFF), then: This is a parse error. If the 'surrogate entity' flag is 
true, emit a U+FFFD character token. Set the 'surrogate entity' flag to 
true, and set the 'high surrogate code point' to the entity's number. 
Return no characters.
    If it is a low surrogate code point (in the range 0xDC00 to 0xDFFF), 
then: This is a parse error. If the 'surrogate entity' flag is false, 
return a U+FFFD character token. If the 'surrogate entity' flag is true, 
set the 'surrogate entity' flag to false and return a character token 
whose code point is [however you combine high and low surrogates]."
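
To make the proposed state handling concrete, here is a rough Python sketch of the flag and the variable. It is only one reading of the text above, not spec wording. The class and method names are hypothetical, tokens are modelled as single-character strings, and I assume the pre-emission U+FFFD rule also fires at end of input and is applied once (not recursively) when a second high surrogate arrives. The combining step uses the standard UTF-16 formula.

```python
REPLACEMENT = "\uFFFD"


class EntityTokeniser:
    """Tracks the 'surrogate entity' flag and 'high surrogate code point'."""

    def __init__(self):
        self.surrogate_entity = False  # the proposed flag, initially false
        self.high_surrogate = 0        # the proposed variable
        self.tokens = []               # emitted character tokens

    def _flush_pending(self):
        # "Immediately before a token is emitted" rule: an unpaired high
        # surrogate turns into U+FFFD and the flag is reset.
        if self.surrogate_entity:
            self.tokens.append(REPLACEMENT)
            self.surrogate_entity = False

    def emit(self, char):
        self._flush_pending()
        self.tokens.append(char)

    def numeric_entity(self, code_point):
        if 0xD800 <= code_point <= 0xDBFF:
            # High surrogate: replace any earlier unpaired one, then wait.
            self._flush_pending()
            self.surrogate_entity = True
            self.high_surrogate = code_point
        elif 0xDC00 <= code_point <= 0xDFFF:
            if self.surrogate_entity:
                # Low surrogate completing a pair: combine the two halves.
                self.surrogate_entity = False
                combined = (0x10000
                            + ((self.high_surrogate - 0xD800) << 10)
                            + (code_point - 0xDC00))
                self.emit(chr(combined))
            else:
                self.emit(REPLACEMENT)
        else:
            # Not covered by the proposal; assumed to pass through as-is.
            self.emit(chr(code_point))

    def end_of_input(self):
        # Assumption: end of input counts as a token emission.
        self._flush_pending()


t = EntityTokeniser()
t.numeric_entity(0xD800)
t.numeric_entity(0xDC00)
t.end_of_input()
print([hex(ord(c)) for c in t.tokens])  # ['0x10000']
```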

The "Return no characters" isn't really correct, because it ought to be 
treated as a successful consumption. The "emit ..." isn't correct 
either, because it might be appending to an attribute value instead. It 
may be easier if the "Tokenising entities" section said:
   "If the entity is being consumed as part of an attribute, 'produce a 
character' means the character is appended to the current attribute 
value. Otherwise, it means the character is emitted as a character token."
and so the entity-consumption algorithm produces characters directly, 
and just returns a boolean 'was successfully consumed' value.
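
A minimal Python sketch of that shape, to illustrate what I mean: the consumption routine is handed a "produce a character" callback and returns only a success flag. The function name, the regular expression, and the restriction to hexadecimal references are simplifications of mine. Surrogate pairing is left out here, so surrogates and out-of-range values simply become U+FFFD, which is an assumption and not proposed behaviour.

```python
import re

# Simplification: only hexadecimal references with a terminating semicolon.
_HEX_REF = re.compile(r"&#[xX]([0-9A-Fa-f]+);")


def consume_entity(text, produce):
    """Try to consume a numeric reference at the start of text.

    Characters are handed to produce() directly; the return value only
    says whether an entity was successfully consumed.
    """
    match = _HEX_REF.match(text)
    if match is None:
        return False
    code_point = int(match.group(1), 16)
    if 0xD800 <= code_point <= 0xDFFF or code_point > 0x10FFFF:
        produce("\uFFFD")  # assumption: no surrogate pairing in this sketch
    else:
        produce(chr(code_point))
    return True


# In content, "produce a character" emits a character token.
tokens = []
consume_entity("&#x41;", tokens.append)

# In an attribute, the same call appends to the current attribute value.
attribute_value = []
consume_entity("&#x42;", attribute_value.append)

print(tokens, attribute_value)  # ['A'] ['B']
```

With this arrangement the caller never needs to know how many characters an entity produced, which is what makes the "Return no characters" case above unproblematic.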

-- 
Philip Taylor
philip@zaynar.demon.co.uk

Received on Monday, 20 August 2007 15:32:51 UTC