hixie: Make surrogates in UTF-8 and character references turn into U+FFFD to prevent UTF-16 environments having hard-to-handle bugs. (whatwg r3871)

http://dev.w3.org/cvsweb/html5/spec/Overview.html?r1=1.3035&r2=1.3036&f=h
http://html5.org/tools/web-apps-tracker?from=3870&to=3871

===================================================================
RCS file: /sources/public/html5/spec/Overview.html,v
retrieving revision 1.3035
retrieving revision 1.3036
diff -u -d -r1.3035 -r1.3036
--- Overview.html 16 Sep 2009 08:07:24 -0000 1.3035
+++ Overview.html 16 Sep 2009 09:16:20 -0000 1.3036
@@ -55883,23 +55883,25 @@
   motivated by a desire to increase the resilience of user agents in
   the face of na&iuml;ve transcoders.</p>
 
-  <p>All U+0000 NULL characters in the input must be replaced by
-  U+FFFD REPLACEMENT CHARACTERs. Any occurrences of such characters is
-  a <a href="#parse-error">parse error</a>.</p>
+  <p>All U+0000 NULL characters and characters in the range U+D800 to
+  U+DFFF<!-- surrogates not allowed e.g. in UTF-8, and we don't want
+  them to suddenly turn into codepoints when they go through a UTF-16
+  pipe --> in the input must be replaced by U+FFFD REPLACEMENT
+  CHARACTERs. Any occurrences of such characters is a <a href="#parse-error">parse
+  error</a>.</p>
 
   <p>Any occurrences of any characters in the ranges U+0001 to U+0008,
   <!-- HT, LF allowed --> <!-- U+000B is in the next list --> <!-- FF,
   CR allowed --> U+000E to U+001F, <!-- ASCII allowed --> U+007F
-  <!--to U+0084, (U+0085 NEL not allowed), U+0086--> to U+009F, U+D800
-  to U+DFFF<!-- surrogates not allowed -->, U+FDD0 to U+FDEF, and
-  characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE,
-  U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF,
-  U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE,
-  U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF,
-  U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and
-  U+10FFFF are <a href="#parse-error" title="parse error">parse errors</a>. (These
-  are all control characters or permanently undefined Unicode
-  characters.)</p>
+  <!--to U+0084, (U+0085 NEL not allowed), U+0086--> to U+009F, U+FDD0
+  to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF,
+  U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE,
+  U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF,
+  U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE,
+  U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF,
+  U+10FFFE, and U+10FFFF are <a href="#parse-error" title="parse error">parse
+  errors</a>. (These are all control characters or permanently
+  undefined Unicode characters.)</p>
 
   <p>U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
   characters are treated specially. Any CR characters that are
@@ -57734,9 +57736,11 @@
       <tr><td>0x9D <td>U+009D <td>&lt;control&gt;
       <tr><td>0x9E <td>U+017E <td>LATIN SMALL LETTER Z WITH CARON ('&#382;')
       <tr><td>0x9F <td>U+0178 <td>LATIN CAPITAL LETTER Y WITH DIAERESIS ('&Yuml;')
-    </table><p>Otherwise, if the number is greater than 0x10FFFF, then this is
-    a <a href="#parse-error">parse error</a>. Return a U+FFFD REPLACEMENT
-    CHARACTER.</p>
+    </table><p>Otherwise, if the number is in the range 0xD800 to 0xDFFF<!--
+    surrogates not allowed; see the comment in the "preprocessing the
+    input stream" section for details --> or is greater than 0x10FFFF,
+    then this is a <a href="#parse-error">parse error</a>. Return a U+FFFD
+    REPLACEMENT CHARACTER.</p>
 
     <p>Otherwise, return a character token for the Unicode character
     whose code point is that number.
@@ -57746,14 +57750,14 @@
     If the number is in the range 0x0001 to 0x0008, <!-- HT, LF
     allowed --> <!-- U+000B is in the next list --> <!-- FF, CR
     allowed --> 0x000E to 0x001F, <!-- ASCII allowed --> 0x007F <!--to
-    0x0084, (0x0085 NEL not allowed), 0x0086--> to 0x009F, 0xD800 to
-    0xDFFF<!-- surrogates not allowed -->, 0xFDD0 to 0xFDEF, or is one
-    of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF,
-    0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE,
-    0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF,
-    0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
-    0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, or
-    0x10FFFF, then this is a <a href="#parse-error">parse error</a>.</p>
+    0x0084, (0x0085 NEL not allowed), 0x0086--> to 0x009F, 0xFDD0 to
+    0xFDEF, or is one of 0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF,
+    0x2FFFE, 0x2FFFF, 0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
+    0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF,
+    0x9FFFE, 0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE,
+    0xCFFFF, 0xDFFFE, 0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF,
+    0x10FFFE, or 0x10FFFF, then this is a <a href="#parse-error">parse
+    error</a>.</p>
 
    </dd>
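
The first hunk above moves U+D800 to U+DFFF out of the "parse error only" list and into the rule that replaces characters with U+FFFD. A minimal Python sketch of that replacement rule alone, assuming the input has already been decoded to a string that may still contain lone surrogates. The names REPLACEMENT and preprocess are hypothetical and do not come from the spec. The sketch does not cover the other parse-error-only characters or the CR/LF handling described in the same section.

```python
from typing import Tuple

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def preprocess(text: str) -> Tuple[str, int]:
    """Replace U+0000 and surrogates (U+D800 to U+DFFF) with U+FFFD.

    Returns the cleaned text and the number of parse errors, counting
    one parse error per replaced character (an assumption made here
    for illustration).
    """
    out = []
    parse_errors = 0
    for ch in text:
        cp = ord(ch)
        if cp == 0x0000 or 0xD800 <= cp <= 0xDFFF:
            out.append(REPLACEMENT)
            parse_errors += 1
        else:
            out.append(ch)
    return "".join(out), parse_errors
```

With this sketch, a lone surrogate never reaches later stages, so a UTF-16 based consumer cannot accidentally pair it with a neighbouring code unit.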

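The second and third hunks make the same change for numeric character references: a number in the range 0xD800 to 0xDFFF is now treated like a number greater than 0x10FFFF. A rough Python sketch of that single step follows. The function name resolve_numeric_reference is hypothetical. The sketch assumes the number was not matched by the preceding table, and it leaves out the later check that flags control characters and noncharacters as parse errors without replacing them.

```python
from typing import Tuple


def resolve_numeric_reference(number: int) -> Tuple[str, bool]:
    """Return (character, is_parse_error) for a numeric character reference.

    Covers only the step changed in this revision: surrogates and
    out-of-range numbers yield U+FFFD and a parse error.
    """
    if 0xD800 <= number <= 0xDFFF or number > 0x10FFFF:
        return "\ufffd", True
    # Otherwise return the character whose code point is that number.
    return chr(number), False
```

For example, under these assumptions `&#xD800;` would resolve to U+FFFD with a parse error, while `&#x41;` would resolve to "A" without one.
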
Received on Wednesday, 16 September 2009 09:31:55 UTC