
[whatwg] Handling of invalid UTF-8

From: Cameron Zemek <grom358@gmail.com>
Date: Fri, 30 Aug 2013 08:29:40 +1000
Message-ID: <CAJnenoXWF4e_1p4xb8GkmJO+_GSyCvUdw=APLarNeSgSf8XC_A@mail.gmail.com>
To: whatwg@whatwg.org
The spec preview had a section on UTF-8 decoding and the handling of
invalid byte sequences:
http://dev.w3.org/html5/spec-preview/infrastructure.html#utf-8
I have noticed that this section has been removed from the current
version. So what algorithm is used for handling invalid UTF-8 byte
sequences? Or is this no longer part of the HTML5 specification?

My testing on Firefox and Chrome seems to indicate that they replace the
first byte of an invalid sequence with the replacement character "�"
(U+FFFD, http://en.wikipedia.org/wiki/Replacement_character) and then
continue parsing from the next byte.
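
To make the observed behaviour concrete, here is a minimal sketch in Python
of a decoder that does what is described above: when a sequence is
invalid, emit U+FFFD for its first byte only and resume at the very next
byte. The function name decode_utf8_lenient is hypothetical, and this is
only an illustration of the behaviour I think I am seeing, not the
algorithm from any spec or browser source.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_utf8_lenient(data: bytes) -> str:
    """Hypothetical helper: decode UTF-8, replacing the first byte of each
    invalid sequence with U+FFFD and resuming at the next byte."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # ASCII byte: always valid on its own.
            out.append(chr(lead))
            i += 1
            continue

        # Work out the sequence length, payload bits, and the smallest code
        # point this length may encode (to reject overlong forms).
        if 0xC2 <= lead <= 0xDF:
            length, cp, min_cp = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            length, cp, min_cp = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF4:
            length, cp, min_cp = 4, lead & 0x07, 0x10000
        else:
            # Stray continuation byte or a byte that can never be a lead.
            out.append(REPLACEMENT)
            i += 1
            continue

        tail = data[i + 1:i + length]
        valid = len(tail) == length - 1 and all(0x80 <= t <= 0xBF for t in tail)
        if valid:
            for t in tail:
                cp = (cp << 6) | (t & 0x3F)
            # Reject overlong encodings, surrogates, and values past U+10FFFF.
            valid = min_cp <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

        if valid:
            out.append(chr(cp))
            i += length
        else:
            # Replace only the first byte, then retry from the next byte.
            out.append(REPLACEMENT)
            i += 1
    return "".join(out)
```

For example, decode_utf8_lenient(b"a\xffb") gives "a\ufffdb". Note that
under this rule a truncated multi-byte sequence yields one U+FFFD per
byte, which may or may not be what the browsers actually do in every case.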
Received on Thursday, 29 August 2013 22:30:11 UTC
