- From: poot <cvsmail@w3.org>
- Date: Thu, 03 Mar 2011 21:58:09 -0500
- To: public-html-diffs@w3.org
hixie: Fix the UTF-8 decoder error handling to handle a few errors I'd missed, including in particular surrogate halves. This may be a mistake; if I'm forgetting something please let me know so I can fix it. (e.g. did we decide not to catch surrogates or something?) (whatwg r5942)

http://dev.w3.org/cvsweb/html5/spec/Overview.html?r1=1.4782&r2=1.4783&f=h
http://html5.org/tools/web-apps-tracker?from=5941&to=5942

===================================================================
RCS file: /sources/public/html5/spec/Overview.html,v
retrieving revision 1.4782
retrieving revision 1.4783
diff -u -d -r1.4782 -r1.4783
--- Overview.html 4 Mar 2011 02:10:50 -0000 1.4782
+++ Overview.html 4 Mar 2011 02:56:55 -0000 1.4783
@@ -3230,39 +3230,47 @@
 <dl class="switch"><dt>One byte in the range FE to FF</dt>
+ <dt><a href="#overlong-form" title="overlong form">Overlong forms</a> (e.g. F0 80 80 A0)</dt>
- <dt>One byte in the range C0 to C1, followed by one byte in the range 80 to BF</dt>
+ <dt>One byte in the range C0 to C1, followed by one byte in the range 80 to BF</dt> <!-- overlong ASCII (redundant with the previous line, really, but worth calling out separately as it's especially dangerous to miss this case) -->
+ <dt>One byte in the range F0 to F4, followed by three bytes in the range 80 to BF that represent a code point above U+10FFFF</dt>
- <dt>One byte in the range F5 to F7, followed by three bytes in the range 80 to BF</dt>
+ <dt>One byte in the range F5 to F7, followed by three bytes in the range 80 to BF</dt> <!-- above U+10FFFF -->
- <dt>One byte in the range F8 to FB, followed by four bytes in the range 80 to BF</dt>
+ <dt>One byte in the range F8 to FB, followed by four bytes in the range 80 to BF</dt> <!-- above U+10FFFF -->
- <dt>One byte in the range FC to FD, followed by five bytes in the range 80 to BF</dt>
+ <dt>One byte in the range FC to FD, followed by five bytes in the range 80 to BF</dt> <!-- above U+10FFFF -->
- <dt>One byte in the range E0 to FD, followed by a byte in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>
- <dt>One byte in the range F0 to FD, followed by two bytes in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>
+ <dt>One byte in the range C0 to FD that is not followed by a byte in the range 80 to BF</dt> <!-- too short -->
- <dt>One byte in the range F8 to FD, followed by three bytes in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>
+ <dt>One byte in the range E0 to FD, followed by a byte in the range 80 to BF that is not followed by a byte in the range 80 to BF</dt> <!-- too short -->
- <dt>One byte in the range FC to FD, followed by four bytes in the range 80 to BF, not followed by a byte in the range 80 to BF</dt>
+ <dt>One byte in the range F0 to FD, followed by two bytes in the range 80 to BF, the last of which is not followed by a byte in the range 80 to BF</dt> <!-- too short -->
+ <dt>One byte in the range F8 to FD, followed by three bytes in the range 80 to BF, the last of which is not followed by a byte in the range 80 to BF</dt> <!-- too short -->
- <dd>The whole sequence must be replaced by a single U+FFFD
+ <dt>One byte in the range FC to FD, followed by four bytes in the range 80 to BF, the last of which is not followed by a byte in the range 80 to BF</dt> <!-- too short -->
+
+
+ <dt>Any byte sequence that represents a code point in the range U+D800 to U+DFFF</dt> <!-- surrogate halves -->
+
+
+ <dd>The whole matched sequence must be replaced by a single U+FFFD
 REPLACEMENT CHARACTER.</dd>
 <dt>One byte in the range 80 to BF not preceded by a byte in the range 80 to FD</dt>
- <dt>A sequence of bytes in the range 80 to BF that does not follow a byte in the range C0 to FD</dt>
+ <dt>One byte in the range 80 to BF preceded by a byte that is part of a complete UTF-8 sequence that does not include this byte</dt>
- <dt>One byte in the range C0 to FD not followed by a byte in the range 80 to BF</dt>
+ <dt>One byte in the range 80 to BF preceded by a byte that is part of a sequence that has been replaced by a U+FFFD REPLACEMENT CHARACTER, either alone or as port of a sequence</dt>
+ <dd>Each such byte must be replaced with a U+FFFD REPLACEMENT CHARACTER.</dd>
- <dd>Each byte must be replaced with a U+FFFD REPLACEMENT CHARACTER.</dd>
 </dl><p>For the purposes of the above requirements, an <dfn id="overlong-form">overlong form</dfn> in UTF-8 is a sequence that encodes a code point using
Received on Friday, 4 March 2011 02:58:10 UTC
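
For readers following the error categories this change adds (overlong forms, code points above U+10FFFF, too-short sequences, surrogate halves), the following is a minimal Python sketch of how a single candidate sequence might be classified. It is not the spec's algorithm and not code from this commit: the names `expected_length`, `MIN_CODE_POINT`, and `classify` are hypothetical, and the sketch only labels one sequence rather than performing the U+FFFD replacement over a whole byte stream.

```python
# Illustrative sketch only; all names here are hypothetical helpers.

def expected_length(lead: int) -> int:
    """Sequence length implied by a lead byte, or 0 if it cannot start one."""
    if lead <= 0x7F:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0  # 80..BF (continuation byte) or FE..FF


# Smallest code point that genuinely needs a sequence of each length;
# anything smaller encoded at that length is an overlong form.
MIN_CODE_POINT = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
                  5: 0x200000, 6: 0x4000000}


def classify(seq: bytes) -> str:
    """Label the sequence starting at seq[0]; trailing extra bytes are ignored."""
    if not seq:
        return "empty"
    lead = seq[0]
    if 0x80 <= lead <= 0xBF:
        return "stray continuation byte"
    if lead >= 0xFE:
        return "invalid byte"
    n = expected_length(lead)
    tail = seq[1:n]
    if len(tail) < n - 1 or any(not 0x80 <= b <= 0xBF for b in tail):
        return "too short"
    if n == 1:
        cp = lead
    else:
        cp = lead & (0xFF >> (n + 1))  # payload bits of the lead byte
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)  # six payload bits per continuation
    if cp < MIN_CODE_POINT[n]:
        return "overlong"
    if cp > 0x10FFFF:
        return "above U+10FFFF"
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    return "ok"
```

Under these assumptions, `classify(bytes([0xF0, 0x80, 0x80, 0xA0]))` reports the overlong example quoted in the diff, and `classify(bytes([0xED, 0xA0, 0x80]))` reports a surrogate half. In the diff's terms, every label other than "ok" and "stray continuation byte" corresponds to a case where the whole matched sequence becomes a single U+FFFD.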