- From: Philip Taylor <philip@zaynar.demon.co.uk>
- Date: Mon, 20 Aug 2007 16:32:38 +0100
- To: public-html WG <public-html@w3.org>
Cameron McCormack wrote:
> Robert Burns:
>>> I believe this is not consistent with existing browser behavior. That is
>>> that while surrogate pairs, expressed as pairs of numeric character
>>> references, are not supposed to resolve to the non-BMP character,
>>> browsers do it anyway.
>
> Anne van Kesteren:
>> Do you have any tests to demonstrate that?
>
> Here’s one:
>
>   data:text/html,%26%23xD800%3B%26%23xDC00%3B
>
> Shows as a single U+10000 character in Firefox 2.0.0.5 and Opera 9.23,
> at least.

I also get a single character rendered in FF2, Opera 9.2, IE6, IE7 and
Safari 3 (Windows). I get two rendered U+FFFD characters in FF3 (build
2007081904).

There's less consistency in other edge cases:

http://software.hixie.ch/utilities/js/live-dom-viewer/?%3C!DOCTYPE%20HTML%3E%3Cp%3E0%3A%20%26%23xd800%3B%26%23xdc00%3B%3Cp%3E1%3A%20%26%23xd800%3B%3Cscript%3Edocument.write(%27\udc00%27)%3C/script%3E%3Cp%3E2%3A%20%3Cscript%3Edocument.write(%27\ud800%27)%3C/script%3E%26%23xdc00%3B%3Cp%3E3%3A%20%3Cscript%3Edocument.write(%27%26%23xd800%3B\udc00%27)%3C/script%3E

It's not obvious to me which cases should be handled in which way.

If "&#xD800;&#xDC00;" were to be handled as in everything except FF3,
the only straightforward implementation I can think of is like:

In the intro to the algorithm, add:

"A 'surrogate entity' flag is used to handle characters encoded as a
surrogate pair split over two numeric entities. It is either true or
false, and must initially be false. A 'high surrogate code point'
variable is used for the same purpose, and contains values in the range
0xD800 to 0xDBFF."

...

"Immediately before a token is emitted, if the 'surrogate entity' flag
is true, then a U+FFFD character token must be emitted and the
'surrogate entity' flag must be set to false."

In "Tokenising entities", change to:

"If it is a high surrogate code point (in the range 0xD800 to 0xDBFF),
then: This is a parse error. If the 'surrogate entity' flag is true,
emit a U+FFFD character token. Set the 'surrogate entity' flag to true,
and set the 'high surrogate code point' to the entity's number. Return
no characters.

If it is a low surrogate code point (in the range 0xDC00 to 0xDFFF),
then: This is a parse error. If the 'surrogate entity' flag is false,
return a U+FFFD character token. If the 'surrogate entity' flag is
true, set the 'surrogate entity' flag to false and return a character
token whose code point is [however you combine high and low
surrogates]."

The "Return no characters" isn't really correct, because it ought to be
treated as a successful consumption. The "emit ..." isn't correct
either, because it might be appending to an attribute value instead.

It may be easier if the "Tokenising entities" section said:

"If the entity is being consumed as part of an attribute, 'produce a
character' means the character is appended to the current attribute
value. Otherwise, it means the character is emitted as a character
token."

and so the entity-consumption algorithm produces characters directly,
and just returns a boolean 'was successfully consumed' value.

-- 
Philip Taylor
philip@zaynar.demon.co.uk
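
For illustration only, the flag-based handling proposed above could be
sketched roughly as follows. This is not spec text or any browser's
implementation. The names SurrogateModel, numeric_entity, character and
flush_pending are hypothetical, code points are modelled as plain
integers, and "producing" a character just appends to a list (standing
in for either a character token or an attribute value). The sketch
assumes one U+FFFD per unpaired high surrogate, which is one reading of
the proposal. The combining step uses the standard UTF-16 formula.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def combine_surrogates(high: int, low: int) -> int:
    """Standard UTF-16 decoding of a surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


class SurrogateModel:
    """Hypothetical model of the 'surrogate entity' flag and variable."""

    def __init__(self) -> None:
        self.surrogate_entity = False  # the 'surrogate entity' flag
        self.high_surrogate = 0        # the 'high surrogate code point'
        self.output: list[int] = []    # produced code points

    def flush_pending(self) -> None:
        """Run before anything else is produced, and at end of input."""
        if self.surrogate_entity:
            self.output.append(REPLACEMENT)
            self.surrogate_entity = False

    def numeric_entity(self, number: int) -> bool:
        """Consume one numeric entity; True means 'successfully consumed'."""
        if 0xD800 <= number <= 0xDBFF:
            # High surrogate: an earlier pending one can no longer pair up.
            self.flush_pending()
            self.surrogate_entity = True
            self.high_surrogate = number
        elif 0xDC00 <= number <= 0xDFFF:
            if self.surrogate_entity:
                self.surrogate_entity = False
                self.output.append(
                    combine_surrogates(self.high_surrogate, number))
            else:
                # Low surrogate with no preceding high surrogate.
                self.output.append(REPLACEMENT)
        else:
            self.flush_pending()
            self.output.append(number)
        return True

    def character(self, code_point: int) -> None:
        """Any ordinary (non-entity) character from the input stream."""
        self.flush_pending()
        self.output.append(code_point)


def run(events: list[tuple[str, int]]) -> list[int]:
    """Feed ('entity', n) / ('char', n) events and return the output."""
    model = SurrogateModel()
    for kind, value in events:
        if kind == "entity":
            model.numeric_entity(value)
        else:
            model.character(value)
    model.flush_pending()  # end of input
    return model.output
```

Under these assumptions, run([("entity", 0xD800), ("entity", 0xDC00)])
yields the single code point 0x10000, while a high surrogate entity
followed by an ordinary character yields U+FFFD and then that character.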
Received on Monday, 20 August 2007 15:32:51 UTC