[Bug 9989] Is the number of replacement characters supposed to be well-defined? If not this should be explicitly noted. If it is then more detail is required.

http://www.w3.org/Bugs/Public/show_bug.cgi?id=9989

Simon Pieters <simonp@opera.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|NEEDSINFO                   |

Ian 'Hixie' Hickson <ian@hixie.ch> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |ASSIGNED

--- Comment #7 from Simon Pieters <simonp@opera.com> 2010-09-27 11:22:36 UTC ---
It seems the bugzilla monster ate my comment. Trying again:


First try, probably isn't quite right:

Numbers are bytes in hex. "Anything but ..." includes EOF.

Stray 80-BF, or any FE-FF:
replace each such byte with one U+FFFD.

C0-C1 followed by 80-BF:
replace the 2-byte sequence with one U+FFFD.

C0-FD followed by anything but 80-BF:
replace the first byte with one U+FFFD and reprocess the second byte.

E0-FD followed by 80-BF followed by anything but 80-BF:
replace the first two bytes with one U+FFFD and reprocess the third byte.

F0-FD followed by two 80-BF followed by anything but 80-BF:
replace the first three bytes with one U+FFFD and reprocess the fourth byte.

F0-F4 followed by three 80-BF that represent a code point above U+10FFFF:
replace all four bytes with one U+FFFD.

F5-FD followed by three 80-BF followed by anything but 80-BF:
replace the first four bytes with one U+FFFD and reprocess the fifth byte.

FC-FD followed by four 80-BF followed by anything but 80-BF:
replace the first five bytes with one U+FFFD and reprocess the sixth byte.

Overlong forms (e.g. F0 80 80 A0):
replace the whole byte sequence with one U+FFFD.
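
For illustration only, here is a rough Python sketch of one way to read the
rules above. It is not taken from any spec or browser. The names
(decode_with_replacement, expected_length, MIN_CODE_POINT) are made up for
this example. The rules leave a few cases open, so the sketch assumes that a
complete 4-, 5- or 6-byte sequence starting with F5-FD, and an encoded
surrogate (U+D800 to U+DFFF), each become a single U+FFFD as well.

```python
REPLACEMENT = "\ufffd"

# Smallest code point that legitimately needs 2, 3 or 4 bytes;
# anything below that is an overlong form.
MIN_CODE_POINT = {2: 0x80, 3: 0x800, 4: 0x10000}


def expected_length(lead):
    """Sequence length implied by a lead byte, using the original
    (pre-RFC 3629) layout so that F5-FD still count as lead bytes."""
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    if 0xF8 <= lead <= 0xFB:
        return 5
    if 0xFC <= lead <= 0xFD:
        return 6
    return 0  # 80-BF, FE or FF: not a lead byte


def decode_with_replacement(data):
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:  # ASCII
            out.append(chr(b))
            i += 1
            continue
        n = expected_length(b)
        if n == 0:  # stray 80-BF, or FE-FF
            out.append(REPLACEMENT)
            i += 1
            continue
        # Consume as many continuation bytes as the lead byte asks for.
        j = i + 1
        while j < i + n and j < len(data) and 0x80 <= data[j] <= 0xBF:
            j += 1
        if j - i < n:
            # Truncated by "anything but 80-BF" (including EOF): one U+FFFD
            # for the bytes consumed so far, then reprocess the byte at j.
            out.append(REPLACEMENT)
            i = j
            continue
        # Complete sequence: decode it, then check range and overlong forms.
        cp = b & (0xFF >> (n + 1))
        for c in data[i + 1:j]:
            cp = (cp << 6) | (c & 0x3F)
        valid = (
            n <= 4
            and b <= 0xF4  # assumption: complete F5-FD sequences are rejected
            and MIN_CODE_POINT[n] <= cp <= 0x10FFFF
            and not 0xD800 <= cp <= 0xDFFF  # assumption: surrogates rejected
        )
        out.append(chr(cp) if valid else REPLACEMENT)
        i = j
    return "".join(out)
```

With this reading, decode_with_replacement(b"\xf0\x80\x80\xa0") yields a
single U+FFFD, and decode_with_replacement(b"\xe2\x82A") yields U+FFFD
followed by "A".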

--- Comment #8 from Ian 'Hixie' Hickson <ian@hixie.ch> 2010-09-28 07:29:37 UTC ---
Any volunteers for a Web UTF-8 spec?

I guess I'll put this in the HTML spec's infrastructure section and then refer
to it from all the other specs of relevance.


Received on Tuesday, 28 September 2010 07:42:28 UTC