[whatwg] Handling of illegal byte-sequences (typically in UTF-8) from Øistein E. Andersen on 2006-11-24 (public-whatwg-archive@w3.org from November 2006)

From: Øistein E. Andersen <html5@xn--istein-9xa.com>
Date: Fri, 24 Nov 2006 03:11:57 +0100
Message-ID: <E1GnQXl-0002VU-00@ws1.ou-data.net>

Section 8.1.4:
> Bytes that are not valid UTF-8 sequences must be interpreted as [...] U+FFFD

Section 9.2.2:
> Bytes or sequences of bytes [...] that could not be converted to Unicode characters
> must be converted to U+FFFD

If I read this correctly, section 8.1.4 requires that an illegal UTF-8 sequence like
F2 BF BF (the first three bytes of a four-byte sequence, obviously not followed by
a continuation byte) be converted into exactly three U+FFFD characters (one
for each byte), whereas section 9.2.2 also allows a single replacement character
(and possibly even two) in this case. By the same reading, section 9.2.2 permits
an arbitrary number n of repetitions of the three-byte sequence to be replaced by
any number of U+FFFD characters between 1 and 3n.
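
To make the difference concrete, here is a minimal sketch in Python of the two
readings. The function name decode_with_replacement and its per_byte flag are
hypothetical, chosen only for illustration; neither section prescribes this
algorithm. The sketch assumes that an illegal sequence extends as far as the bytes
still form a valid prefix of a UTF-8 sequence, then emits either one U+FFFD per
byte (the 8.1.4 reading) or one U+FFFD for the whole prefix (one of the outcomes
9.2.2 appears to allow).

```python
REPLACEMENT = "\ufffd"


def _expected_length(lead):
    """Length of the sequence introduced by a lead byte, or 0 if it cannot start one."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # stray continuation byte, C0/C1, or F5..FF


def _second_byte_range(lead):
    """Allowed range for the byte after the lead (excludes overlongs, surrogates, > U+10FFFF)."""
    if lead == 0xE0:
        return 0xA0, 0xBF
    if lead == 0xED:
        return 0x80, 0x9F
    if lead == 0xF0:
        return 0x90, 0xBF
    if lead == 0xF4:
        return 0x80, 0x8F
    return 0x80, 0xBF


def decode_with_replacement(data, per_byte):
    """Decode UTF-8, replacing illegal sequences under one of two policies.

    per_byte=True  -> one U+FFFD for each byte of an illegal sequence
    per_byte=False -> one U+FFFD for each illegal sequence as a whole
    """
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        need = _expected_length(lead)
        if need == 0:
            # A byte that can never begin a sequence is replaced on its own.
            out.append(REPLACEMENT)
            i += 1
            continue
        j = i + 1
        lo, hi = _second_byte_range(lead)
        # Consume continuation bytes for as long as the prefix stays valid.
        while j < i + need and j < len(data) and lo <= data[j] <= hi:
            j += 1
            lo, hi = 0x80, 0xBF
        if j == i + need:
            out.append(data[i:j].decode("utf-8"))  # complete, well-formed sequence
        else:
            out.append(REPLACEMENT * (j - i) if per_byte else REPLACEMENT)
        i = j
    return "".join(out)


truncated = bytes([0xF2, 0xBF, 0xBF]) + b"A"
print(len(decode_with_replacement(truncated, per_byte=True)) - 1)   # U+FFFD count, reading 1
print(len(decode_with_replacement(truncated, per_byte=False)) - 1)  # U+FFFD count, reading 2
```

With the input F2 BF BF followed by "A", the first policy yields three replacement
characters and the second yields one; for n repetitions of F2 BF BF the counts are
3n and n respectively, both within the range that section 9.2.2 seems to permit.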

I realise that the underspecification in section 9.2.2 may well be intentional, given that
this section is not limited to UTF-8, but the requirement can be expressed (more or less
easily, depending on the handling chosen) in such a way that it applies to any encoding.

Alternatively, a reference to an authoritative source would of course fulfil the purpose
in the particular case of UTF-8 (if such a document can be found).

[Currently, an alert reader might infer that the treatment indicated in section 8.1.4
would be preferable also in section 9.2.2, but such inference for consistency can
hardly be expected.]

-- 
Øistein E. Andersen

Received on Thursday, 23 November 2006 18:11:57 UTC