numeric character references and Unicode surrogate pairs: part of my review of 8 The HTML syntax from Robert Burns on 2007-08-19 (public-html@w3.org from August 2007)

From: Robert Burns <rob@robburns.com>
Date: Sun, 19 Aug 2007 05:05:15 -0500
To: public-html WG <public-html@w3.org>
Message-Id: <39AF5F3D-BF44-48D6-8C5A-68DB6184735F@robburns.com>

Regarding diff:

<http://html5.org/tools/web-apps-tracker?from=942&to=943>

"Otherwise, if the number is zero, if the number is higher than  
0x10FFFF, or if it's one of the surrogate characters (characters in  
the range 0xD800 to 0xDFFF), then this is a parse error; return a  
character token for the U+FFFD REPLACEMENT CHARACTER character instead."

I believe this is not consistent with existing browser behavior.
Although a surrogate pair written as two numeric character
references is not supposed to resolve to the corresponding non-BMP
character, browsers resolve it that way anyway.
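
To make the difference concrete, here is a minimal sketch of the two behaviors. It is only an illustration, not the draft's algorithm or any browser's actual code: the function names resolve_strict and combine_surrogates are hypothetical, and the lenient path simply assumes the standard UTF-16 pairing arithmetic.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def resolve_strict(n: int) -> str:
    """Rule quoted from the diff: zero, anything above 0x10FFFF,
    or a surrogate code point becomes U+FFFD (a parse error)."""
    if n == 0 or n > 0x10FFFF or 0xD800 <= n <= 0xDFFF:
        return REPLACEMENT
    return chr(n)


def combine_surrogates(high: int, low: int) -> str:
    """Assumed lenient behavior: treat two adjacent references as a
    UTF-16 surrogate pair and return the non-BMP character."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))


# Example input: &#xD834;&#xDD1E; (the surrogate pair for U+1D11E)
strict = resolve_strict(0xD834) + resolve_strict(0xDD1E)  # two U+FFFD
lenient = combine_surrogates(0xD834, 0xDD1E)              # one U+1D11E
```

Under the quoted rule the example yields two replacement characters, while the pairing approach yields the single character U+1D11E. That is the kind of divergence I mean.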

So while I think we should count this as a parse error, we may want  
to include it in a list of parse errors that are handled differently  
by different browsers.

I think the following would be the best procedure for our WG. We
should maintain a list of every parse error in the draft and, for
each one, record how it is currently handled in top-of-tree versions
of the various browsers. We will then be in a better position to
decide what interoperable error handling HTML5 should recommend in
each case. Obviously we may still have to decide between conflicting
implementations, but at least we can do that through proper
deliberation and consensus-building steps.

Received on Sunday, 19 August 2007 10:06:25 UTC