- From: Philip Taylor <philip@zaynar.demon.co.uk>
- Date: Mon, 20 Aug 2007 16:32:38 +0100
- To: public-html WG <public-html@w3.org>
Cameron McCormack wrote:
> Robert Burns:
>>> I believe this is not consistent with existing browser behavior. That is
>>> that while surrogate pairs, expressed as pairs of numeric character
>>> references, are not supposed to resolve to the non-BMP character,
>>> browsers do it anyway.
>
> Anne van Kesteren:
>> Do you have any tests to demonstrate that?
>
> Here’s one:
>
> data:text/html,%26%23xD800%3B%26%23xDC00%3B
>
> Shows as a single U+10000 character in Firefox 2.0.0.5 and Opera 9.23,
> at least.
I also get a single character rendered in FF2, Opera 9.2, IE6, IE7 and
Safari 3 (Windows). I get two rendered U+FFFD characters in FF3 (build
2007081904).
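For reference, a small sketch of what the data: URL above contains and of the arithmetic behind the "single U+10000 character" result. The helper name `combine_surrogates` is illustrative only; the formula is the usual UTF-16 pairing, which is presumably what the browsers that render one character are applying.

```python
from urllib.parse import unquote

# Percent-decode the body of the data: URL used in the test case.
markup = unquote("%26%23xD800%3B%26%23xDC00%3B")


def combine_surrogates(high: int, low: int) -> int:
    """Standard UTF-16 pairing.

    Assumes high is in 0xD800..0xDBFF and low is in 0xDC00..0xDFFF.
    """
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD800, 0xDC00)
print(markup)                   # the two numeric character references
print(f"U+{code_point:04X}")    # the combined code point
```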
There's less consistency in other edge cases:
http://software.hixie.ch/utilities/js/live-dom-viewer/?%3C!DOCTYPE%20HTML%3E%3Cp%3E0%3A%20%26%23xd800%3B%26%23xdc00%3B%3Cp%3E1%3A%20%26%23xd800%3B%3Cscript%3Edocument.write(%27\udc00%27)%3C/script%3E%3Cp%3E2%3A%20%3Cscript%3Edocument.write(%27\ud800%27)%3C/script%3E%26%23xdc00%3B%3Cp%3E3%3A%20%3Cscript%3Edocument.write(%27%26%23xd800%3B\udc00%27)%3C/script%3E
It's not obvious to me which cases should be handled in which way.
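The percent-encoded URL above is hard to read, so here is a minimal sketch that decodes its query string into the four test cases it carries. Nothing here is specific to the live DOM viewer; it only assumes that the markup is whatever follows the `?`.

```python
from urllib.parse import unquote

# Raw string so that the literal \udc00 / \ud800 in the URL stay as
# backslash sequences instead of being interpreted by Python.
url = r"http://software.hixie.ch/utilities/js/live-dom-viewer/?%3C!DOCTYPE%20HTML%3E%3Cp%3E0%3A%20%26%23xd800%3B%26%23xdc00%3B%3Cp%3E1%3A%20%26%23xd800%3B%3Cscript%3Edocument.write(%27\udc00%27)%3C/script%3E%3Cp%3E2%3A%20%3Cscript%3Edocument.write(%27\ud800%27)%3C/script%3E%26%23xdc00%3B%3Cp%3E3%3A%20%3Cscript%3Edocument.write(%27%26%23xd800%3B\udc00%27)%3C/script%3E"

# Everything after the first '?' is the percent-encoded test markup.
markup = unquote(url.split("?", 1)[1])

# Each test case starts with a <p>; the first chunk is the doctype.
cases = markup.split("<p>")[1:]
for case in cases:
    print(case)
```

Case 0 is the plain pair of numeric character references; cases 1 to 3 split the pair between a reference and a `document.write` call in different orders.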
If "&#xD800;&#xDC00;" were to be handled as in everything except FF3, the
only straightforward implementation I can think of is something like:
In the intro to the algorithm, add:
"A 'surrogate entity' flag is used to handle characters encoded as a
surrogate pair split over two numeric entities. It is either true or
false, and must initially be false. A 'high surrogate code point'
variable is used for the same purpose, and contains values in the range
0xD800 to 0xDBFF."
...
"Immediately before a token is emitted, if the 'surrogate entity'
flag is true, then a U+FFFD character token must be emitted and the
'surrogate entity' flag must be set to false."
In "Tokenising entities", change to:
"If it is a high surrogate code point (in the range 0xD800 to
0xDBFF), then: This is a parse error. If the 'surrogate entity' flag is
true, emit a U+FFFD character token. Set the 'surrogate entity' flag to
true, and set the 'high surrogate code point' to the entity's number.
Return no characters.
If it is a low surrogate code point (in the range 0xDC00 to 0xDFFF),
then: This is a parse error. If the 'surrogate entity' flag is false,
return a U+FFFD character token. If the 'surrogate entity' flag is true,
set the 'surrogate entity' flag to false and return a character token
whose code point is [however you combine high and low surrogates]."
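The proposed wording can be read as a small state machine. The following is only a sketch of one reading of it: the class and method names are made up for illustration, the combining step uses the usual UTF-16 formula (the proposal leaves that step open), and non-surrogate numbers are simply flushed and passed through, which ignores every other rule in "Tokenising entities".

```python
REPLACEMENT = "\uFFFD"


class SurrogateEntityState:
    """Hypothetical model of the two proposed state variables."""

    def __init__(self):
        self.surrogate_entity = False  # the 'surrogate entity' flag
        self.high_surrogate = 0        # the 'high surrogate code point'

    def before_token(self) -> str:
        """Rule applied immediately before any other token is emitted."""
        if self.surrogate_entity:
            self.surrogate_entity = False
            return REPLACEMENT
        return ""

    def numeric_entity(self, number: int) -> str:
        """Characters produced by one numeric entity (parse errors not modelled)."""
        if 0xD800 <= number <= 0xDBFF:
            # A pending high surrogate that never got its partner becomes U+FFFD.
            out = REPLACEMENT if self.surrogate_entity else ""
            self.surrogate_entity = True
            self.high_surrogate = number
            return out
        if 0xDC00 <= number <= 0xDFFF:
            if not self.surrogate_entity:
                return REPLACEMENT
            self.surrogate_entity = False
            combined = 0x10000 + ((self.high_surrogate - 0xD800) << 10) + (number - 0xDC00)
            return chr(combined)
        # Simplification: any other entity counts as an ordinary token.
        return self.before_token() + chr(number)


def decode(numbers):
    """Run a sequence of numeric entity values, flushing at end of input."""
    state = SurrogateEntityState()
    out = "".join(state.numeric_entity(n) for n in numbers)
    return out + state.before_token()
```

With this reading, `decode([0xD800, 0xDC00])` yields the single character U+10000, while a lone high or lone low surrogate yields U+FFFD.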
The "Return no characters" isn't really correct, because it ought to be
treated as a successful consumption. The "emit ..." isn't correct
either, because it might be appending to an attribute value instead. It
may be easier if the "Tokenising entities" section said:
"If the entity is being consumed as part of an attribute, 'produce a
character' means the character is appended to the current attribute
value. Otherwise, it means the character is emitted as a character token."
and so the entity-consumption algorithm produces characters directly,
and just returns a boolean 'was successfully consumed' value.
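A rough sketch of that restructuring, under the same assumptions as before: all names are hypothetical, only hexadecimal numeric entities are recognised, and out-of-range numbers are mapped to U+FFFD purely to keep the example self-contained. The point is that the consumer pushes characters to whichever sink is active and reports only whether it consumed anything.

```python
import re

HEX_ENTITY = re.compile(r"&#[xX]([0-9A-Fa-f]+);")


class EntityConsumer:
    """Hypothetical consumer that produces characters directly."""

    def __init__(self):
        self.tokens = []             # emitted character tokens
        self.attribute_value = None  # a string while inside an attribute
        self.pending_high = None     # high surrogate awaiting its partner

    def produce_character(self, ch: str) -> None:
        # 'Produce a character': append to the attribute, else emit a token.
        if self.attribute_value is not None:
            self.attribute_value += ch
        else:
            self.tokens.append(ch)

    def consume(self, text: str) -> bool:
        """Return True if text was consumed as a hex numeric entity."""
        match = HEX_ENTITY.fullmatch(text)
        if match is None:
            return False
        number = int(match.group(1), 16)
        if 0xD800 <= number <= 0xDBFF:
            if self.pending_high is not None:
                self.produce_character("\uFFFD")
            self.pending_high = number  # consumed, but nothing produced yet
        elif 0xDC00 <= number <= 0xDFFF:
            if self.pending_high is None:
                self.produce_character("\uFFFD")
            else:
                combined = 0x10000 + ((self.pending_high - 0xD800) << 10) + (number - 0xDC00)
                self.pending_high = None
                self.produce_character(chr(combined))
        else:
            if self.pending_high is not None:
                self.pending_high = None
                self.produce_character("\uFFFD")
            # Assumption: numbers beyond U+10FFFF become U+FFFD.
            self.produce_character(chr(number) if number <= 0x10FFFF else "\uFFFD")
        return True
```

A high surrogate then counts as a successful consumption even though no character has been produced yet, and the same code path serves both attribute values and character tokens.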
--
Philip Taylor
philip@zaynar.demon.co.uk
Received on Monday, 20 August 2007 15:32:51 UTC