[Bug 17151] How should UTF-16BE "\xD8\x00" be decoded? This is an ill-formed UTF-16 code unit sequence, but it can be converted to a Unicode code point. Firefox/Opera currently convert it to U+FFFD, which seems like the preferred behaviour.

https://www.w3.org/Bugs/Public/show_bug.cgi?id=17151

Geoffrey Sneddon <geoffers+w3cbugs@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |geoffers+w3cbugs@gmail.com

--- Comment #1 from Geoffrey Sneddon <geoffers+w3cbugs@gmail.com> 2012-05-22 17:23:15 UTC ---
This pertains to the following:

> Bytes or sequences of bytes in the original byte stream that could not be converted to Unicode code points must be converted to U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is UTF-8, the bytes must be decoded with the error handling defined in this specification.

> Note: Bytes or sequences of bytes in the original byte stream that did not conform to the encoding specification (e.g. invalid UTF-8 byte sequences in a UTF-8 input byte stream) are errors that conformance checkers are expected to report.

"\xD8\x00" can obviously be decoded as if it were UTF-16BE to HTML5's
definition of a "Unicode code point" (which include lone surrogates), but
according to the Unicode specification it is an invalid UTF-16 code unit
sequence.

It would seem preferable for lone surrogates to be converted to U+FFFD, as
Opera and Firefox currently do.
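
For reference, this preferred behaviour is what the Encoding Standard's UTF-16BE
decoder ended up specifying, and it can be observed with the TextDecoder API.
The sketch below is illustrative only (it is not part of the original report)
and assumes a runtime, such as a current browser, that implements that decoder:

  // A lone high surrogate: the single big-endian code unit 0xD800.
  const bytes = new Uint8Array([0xd8, 0x00]);

  // The default (non-fatal) error mode replaces the unpaired surrogate
  // with U+FFFD, matching the Opera/Firefox behaviour described above.
  const lenient = new TextDecoder("utf-16be");
  console.log(lenient.decode(bytes) === "\uFFFD"); // true

  // With { fatal: true } the same input is rejected outright.
  const strict = new TextDecoder("utf-16be", { fatal: true });
  try {
    strict.decode(bytes);
  } catch (e) {
    console.log((e as Error).name); // "TypeError"
  }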
