[Bug 9663] Should sequences of bytes be replaced by a single U+FFFD, or one U+FFFD per input byte?

http://www.w3.org/Bugs/Public/show_bug.cgi?id=9663


Simon Pieters <simonp@opera.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|NEEDSINFO                   |




--- Comment #5 from Simon Pieters <simonp@opera.com>  2010-09-13 18:42:23 ---
First try; this probably isn't quite right:

Numbers are bytes in hex. "Anything but ..." includes EOF.

Stray 80-BF:
FE-FF:
replace with one U+FFFD.

C0-C1 followed by 80-BF:
replace the 2-byte sequence with one U+FFFD.

C0-FD followed by anything but 80-BF:
replace the first byte with one U+FFFD and reprocess the second byte.

E0-FD followed by 80-BF followed by anything but 80-BF:
replace the first two bytes with one U+FFFD and reprocess the third byte.

F0-FD followed by two 80-BF followed by anything but 80-BF:
replace the first three bytes with one U+FFFD and reprocess the fourth byte.

F0-F4 followed by three 80-BF that represent a code point above U+10FFFF:
replace all four bytes with one U+FFFD.

F5-FD followed by three 80-BF followed by anything but 80-BF:
replace the first four bytes with one U+FFFD and reprocess the fifth byte.

FC-FD followed by four 80-BF followed by anything but 80-BF:
replace the first five bytes with one U+FFFD and reprocess the sixth byte.

Overlong forms (e.g. F0 80 80 A0):
replace the whole byte sequence with one U+FFFD.
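
A rough Python sketch of one possible reading of the rules above, to make the
"replace N bytes, reprocess the next" behaviour concrete. It is not a spec and
not taken from any implementation. The names decode_with_replacement,
declared_length and MIN_CODE_POINT are made up for illustration. It assumes
lead bytes declare their length as in the original 1-6 byte UTF-8 scheme
(so F5-F7 are treated as four-byte leads), and it also collapses complete
5/6-byte sequences and encoded surrogates to a single U+FFFD, which the rules
above do not spell out.

```python
REPLACEMENT = "\ufffd"

# Smallest code point that legitimately needs n bytes; anything lower is overlong.
MIN_CODE_POINT = {2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def is_continuation(byte):
    return 0x80 <= byte <= 0xBF


def declared_length(lead):
    """Length implied by a lead byte (assumption: old 1-6 byte scheme)."""
    if lead <= 0xDF:
        return 2  # C0-DF
    if lead <= 0xEF:
        return 3  # E0-EF
    if lead <= 0xF7:
        return 4  # F0-F7
    if lead <= 0xFB:
        return 5  # F8-FB
    return 6      # FC-FD


def decode_with_replacement(data):
    out = []
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            out.append(chr(lead))
            i += 1
            continue
        if is_continuation(lead) or lead >= 0xFE:
            # Stray 80-BF, or FE-FF: one U+FFFD for this byte.
            out.append(REPLACEMENT)
            i += 1
            continue
        length = declared_length(lead)
        # Consume continuation bytes; stop at the first byte (or EOF) that is
        # not 80-BF. That byte is left in place and reprocessed.
        j = i + 1
        while j < n and j < i + length and is_continuation(data[j]):
            j += 1
        if j - i < length:
            # Truncated sequence: one U+FFFD for the bytes consumed so far.
            out.append(REPLACEMENT)
        else:
            cp = lead & (0xFF >> (length + 1))
            for byte in data[i + 1:j]:
                cp = (cp << 6) | (byte & 0x3F)
            valid = (MIN_CODE_POINT[length] <= cp <= 0x10FFFF
                     and not 0xD800 <= cp <= 0xDFFF)
            # Overlong, above U+10FFFF, or surrogate: one U+FFFD for the lot.
            out.append(chr(cp) if valid else REPLACEMENT)
        i = j
    return "".join(out)
```

With this reading, decode_with_replacement(b"\xe2\x82A") gives U+FFFD followed
by "A", and the overlong F0 80 80 A0 gives a single U+FFFD. Note that F5-F7
followed by four 80-BF bytes yields two U+FFFD here (one for the four-byte
sequence, one for the stray fifth byte), which may or may not be what the
F5-FD rule above intends.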


Received on Monday, 13 September 2010 18:42:25 UTC