html5/spec infrastructure.html,1.1052,1.1053 spec.html,1.1402,1.1403 from Michael Smith via cvs-syncmail on 2011-03-04 (public-html-commits@w3.org from March 2011)

From: Michael Smith via cvs-syncmail <cvsmail@w3.org>
Date: Fri, 04 Mar 2011 03:48:11 +0000
To: public-html-commits@w3.org
Message-Id: <E1PvLzv-0002YE-R7@lionel-hutz.w3.org>
Update of /sources/public/html5/spec
In directory hutz:/tmp/cvs-serv9690

Modified Files:
	infrastructure.html spec.html 
Log Message:
Fix the UTF-8 decoder error handling to handle a few errors I'd missed, including in particular surrogate halves. This may be a mistake; if I'm forgetting something please let me know so I can fix it. (e.g. did we decide not to catch surrogates or something?) (whatwg r5942)

[updated by splitter]


Index: infrastructure.html
===================================================================
RCS file: /sources/public/html5/spec/infrastructure.html,v
retrieving revision 1.1052
retrieving revision 1.1053
diff -u -d -r1.1052 -r1.1053
--- infrastructure.html	4 Mar 2011 02:46:34 -0000	1.1052
+++ infrastructure.html	4 Mar 2011 03:48:09 -0000	1.1053
@@ -1262,39 +1262,47 @@
 
   </p><dl class="switch"><dt>One byte in the range FE to FF</dt>
 
+
    <dt><a href="#overlong-form" title="overlong form">Overlong forms</a> (e.g. F0 80 80 A0)</dt>
 
-   <dt>One byte in the range C0 to C1, followed by one byte in the range 80 to BF</dt>
+   <dt>One byte in the range C0 to C1, followed by one byte in the range 80 to BF</dt> <!-- overlong ASCII (redundant with the previous line, really, but worth calling out separately as it's especially dangerous to miss this case) -->
+
 
    <dt>One byte in the range F0 to F4, followed by three bytes in the range 80 to BF that represent a code point above U+10FFFF</dt>
 
-   <dt>One byte in the range F5 to F7, followed by three bytes in the range 80 to BF</dt>
+   <dt>One byte in the range F5 to F7, followed by three bytes in the range 80 to BF</dt> <!-- above U+10FFFF -->
 
-   <dt>One byte in the range F8 to FB, followed by four bytes in the range 80 to BF</dt>
+   <dt>One byte in the range F8 to FB, followed by four bytes in the range 80 to BF</dt> <!-- above U+10FFFF -->
 
-   <dt>One byte in the range FC to FD, followed by five bytes in the range 80 to BF</dt>
+   <dt>One byte in the range FC to FD, followed by five bytes in the range 80 to BF</dt> <!-- above U+10FFFF -->
 
-   <dt>One byte in the range E0 to FD, followed by a byte in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>
 
-   <dt>One byte in the range F0 to FD, followed by two bytes in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>
+   <dt>One byte in the range C0 to FD that is not followed by a byte in the range 80 to BF</dt> <!-- too short -->
 
-   <dt>One byte in the range F8 to FD, followed by three bytes in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>
+   <dt>One byte in the range E0 to FD, followed by a byte in the range 80 to BF that is not followed by a byte in the range 80 to BF</dt> <!-- too short -->
 
-   <dt>One byte in the range FC to FD, followed by four bytes in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>
+   <dt>One byte in the range F0 to FD, followed by two bytes in the range 80 to BF, the last of which is not followed by a byte in the range 80 to BF</dt> <!-- too short -->
 
+   <dt>One byte in the range F8 to FD, followed by three bytes in the range 80 to BF, the last of which is not followed by a byte in the range 80 to BF</dt> <!-- too short -->
 
-   <dd>The whole sequence must be replaced by a single U+FFFD
+   <dt>One byte in the range FC to FD, followed by four bytes in the range 80 to BF, the last of which is not followed by a byte in the range 80 to BF</dt> <!-- too short -->
+
+
+   <dt>Any byte sequence that represents a code point in the range U+D800 to U+DFFF</dt> <!-- surrogate halves -->
+
+
+   <dd>The whole matched sequence must be replaced by a single U+FFFD
    REPLACEMENT CHARACTER.</dd>
 
 
    <dt>One byte in the range 80 to BF not preceded by a byte in the range 80 to FD</dt>
 
-   <dt>A sequence of bytes in the range 80 to BF that does not follow a byte in the range C0 to FD</dt>
+   <dt>One byte in the range 80 to BF preceded by a byte that is part of a complete UTF-8 sequence that does not include this byte</dt>
 
-   <dt>One byte in the range C0 to FD not followed by a byte in the range 80 to BF</dt>
+   <dt>One byte in the range 80 to BF preceded by a byte that is part of a sequence that has been replaced by a U+FFFD REPLACEMENT CHARACTER, either alone or as part of a sequence</dt>
 
+   <dd>Each such byte must be replaced with a U+FFFD REPLACEMENT CHARACTER.</dd>
 
-   <dd>Each byte must be replaced with a U+FFFD REPLACEMENT CHARACTER.</dd>
 
   </dl><p>For the purposes of the above requirements, an <dfn id="overlong-form">overlong
   form</dfn> in UTF-8 is a sequence that encodes a code point using

Index: spec.html
===================================================================
RCS file: /sources/public/html5/spec/spec.html,v
retrieving revision 1.1402
retrieving revision 1.1403
diff -u -d -r1.1402 -r1.1403
--- spec.html	4 Mar 2011 02:46:34 -0000	1.1402
+++ spec.html	4 Mar 2011 03:48:09 -0000	1.1403
@@ -369,7 +369,7 @@
     <a href="Overview.html">single page HTML</a>,
     <a href="spec.html">multipage HTML</a>,
     <a href="author/">web developer edition</a>.
-This is revision 1.4782.
+This is revision 1.4783.
    </p> 
      <p class="copyright"><a href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a>
    &#169; 2010 <a href="http://www.w3.org/"><abbr title="World Wide
Received on Friday, 4 March 2011 03:48:13 UTC
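
The first group of conditions in the infrastructure.html hunk above (overlong forms, code points above U+10FFFF, and surrogate halves) can be illustrated with a small classifier. This is a minimal sketch in Python, not the specification's algorithm: the function name `classify_sequence` and its return labels are hypothetical, and it only inspects a single candidate sequence. It does not implement the replacement rules themselves (one U+FFFD for the whole matched sequence versus one per stray byte).

```python
def classify_sequence(seq: bytes) -> str:
    """Classify one candidate UTF-8 sequence (hypothetical helper).

    Returns "ok", "overlong", "surrogate", "out-of-range" or "malformed".
    """
    if not seq:
        return "malformed"
    lead = seq[0]
    # Expected length and initial payload bits, derived from the lead byte.
    # Five- and six-byte forms (F8..FD) are parsed so they can be rejected.
    if lead <= 0x7F:
        length, cp = 1, lead
    elif 0xC0 <= lead <= 0xDF:
        length, cp = 2, lead & 0x1F
    elif 0xE0 <= lead <= 0xEF:
        length, cp = 3, lead & 0x0F
    elif 0xF0 <= lead <= 0xF7:
        length, cp = 4, lead & 0x07
    elif 0xF8 <= lead <= 0xFB:
        length, cp = 5, lead & 0x03
    elif 0xFC <= lead <= 0xFD:
        length, cp = 6, lead & 0x01
    else:
        return "malformed"  # a lone 80..BF byte, or FE/FF

    # Every trailing byte must be a continuation byte in the range 80..BF.
    if len(seq) != length or any(not 0x80 <= b <= 0xBF for b in seq[1:]):
        return "malformed"  # too short, too long, or bad continuation

    for b in seq[1:]:
        cp = (cp << 6) | (b & 0x3F)

    # Smallest code point that legitimately needs `length` bytes.
    minimum = (0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000)[length - 1]
    if cp < minimum:
        return "overlong"      # e.g. F0 80 80 A0, or C0/C1 followed by 80..BF
    if cp > 0x10FFFF:
        return "out-of-range"  # e.g. F4 90 80 80, or any F5..FD lead
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"     # e.g. ED A0 80
    return "ok"


if __name__ == "__main__":
    for sample in (b"\xf0\x80\x80\xa0", b"\xc0\x80", b"\xed\xa0\x80",
                   b"\xf4\x90\x80\x80", b"\xe2\x82\xac"):
        print(sample.hex(" "), classify_sequence(sample))
```

Under this sketch, any result other than "ok" for a structurally complete sequence corresponds to a case the diff says must be replaced as a whole by a single U+FFFD REPLACEMENT CHARACTER.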