- From: Øistein E. Andersen <html5@xn--istein-9xa.com>
- Date: Fri, 24 Nov 2006 03:11:57 +0100
Section 8.1.4:
> Bytes that are not valid UTF-8 sequences must be interpreted as [...] U+FFFD

Section 9.2.2:
> Bytes or sequences of bytes [...] that could not be converted to Unicode characters
> must be converted to U+FFFD

If I read this correctly, section 8.1.4 requires that an illegal UTF-8 sequence like F2 BF BF (the first three bytes of a four-byte sequence, obviously not followed by a further continuation byte) be converted into exactly three U+FFFD characters (one for each byte), whereas section 9.2.2 also allows a single replacement character (and possibly even two) in this case. By the same reading, section 9.2.2 permits n repetitions of the three-byte sequence to be replaced by any number of U+FFFD characters between 1 and 3n.

I realise that the underspecification in section 9.2.2 may well be intentional, given that this section is not limited to UTF-8, but the required handling can (more or less easily, quite possibly depending on the handling chosen) be expressed in such a way that it applies to any encoding. Alternatively, a reference to an authoritative source would of course fulfil the purpose in the particular case of UTF-8 (if such a document can be found).

[Currently, an alert reader might infer that the treatment indicated in section 8.1.4 would be preferable also in section 9.2.2, but such inference for consistency can hardly be expected.]

-- 
Øistein E. Andersen
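
To make the difference concrete, here is a minimal Python sketch of the two readings applied to F2 BF BF. The function name `decode_utf8` and its `per_byte` flag are purely illustrative and come from neither section of the draft; the "one U+FFFD per ill-formed prefix" policy is only one of the several outcomes that the wording of section 9.2.2 appears to allow.

```python
REPLACEMENT = "\ufffd"


def _continuation_ranges(lead):
    """Allowed ranges for the bytes following a lead byte (well-formed UTF-8).

    Returns None when the byte cannot start a multi-byte sequence.
    """
    tail = (0x80, 0xBF)
    if 0xC2 <= lead <= 0xDF:
        return [tail]
    if lead == 0xE0:
        return [(0xA0, 0xBF), tail]      # excludes overlong forms
    if lead == 0xED:
        return [(0x80, 0x9F), tail]      # excludes surrogates
    if 0xE1 <= lead <= 0xEF:
        return [tail, tail]
    if lead == 0xF0:
        return [(0x90, 0xBF), tail, tail]
    if 0xF1 <= lead <= 0xF3:
        return [tail, tail, tail]
    if lead == 0xF4:
        return [(0x80, 0x8F), tail, tail]  # stays below U+110000
    return None


def decode_utf8(data, per_byte):
    """Decode bytes as UTF-8, replacing ill-formed input with U+FFFD.

    per_byte=True  : one U+FFFD for each offending byte (the 8.1.4 reading).
    per_byte=False : one U+FFFD for each ill-formed prefix (one of the
                     outcomes that 9.2.2 seems to permit).
    """
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:                  # ASCII passes through unchanged
            out.append(chr(byte))
            i += 1
            continue
        ranges = _continuation_ranges(byte)
        if ranges is None:               # stray continuation or invalid lead
            out.append(REPLACEMENT)
            i += 1
            continue
        j = i + 1
        for low, high in ranges:         # consume as many valid bytes as possible
            if j < len(data) and low <= data[j] <= high:
                j += 1
            else:
                break
        if j - i == len(ranges) + 1:     # complete, well-formed sequence
            out.append(data[i:j].decode("utf-8"))
        else:                            # truncated or interrupted sequence
            out.append(REPLACEMENT * (j - i) if per_byte else REPLACEMENT)
        i = j
    return "".join(out)


bad = bytes([0xF2, 0xBF, 0xBF])
print(len(decode_utf8(bad, per_byte=True)))    # 3
print(len(decode_utf8(bad, per_byte=False)))   # 1
```

For n repetitions of the sequence, the first policy yields 3n replacement characters and the second yields n, both of which fall within the 1 to 3n range described above.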
Received on Thursday, 23 November 2006 18:11:57 UTC