- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Fri, 24 Nov 2006 12:33:22 +0200
On Nov 24, 2006, at 04:11, Øistein E. Andersen wrote:

> Section 8.1.4:
>> Bytes that are not valid UTF-8 sequences must be interpreted as
>> [...] U+FFFD
>
> Section 9.2.2:
>> Bytes or sequences of bytes [...] that could not be converted to
>> Unicode characters must be converted to U+FFFD
>
> If I read this correctly, section 8.1.4 requires that an illegal
> UTF-8 sequence like F2 BF BF (the three first bytes of a four-byte
> sequence, obviously not followed by a continuation byte) be
> converted into exactly three U+FFFD characters (one for each byte),
> whereas section 9.2.2 also allows one single replacement character
> (and possibly even two) in this case (and permits an arbitrary
> number n of repetitions of the three-byte sequence to be replaced
> by any number of U+FFFD characters between 1 and 3n).

I'm inclined to think that interop in error situations doesn't need
to go as deep as defining how many replacement characters (in the
range 1...number of bytes in a faulty sequence) a character decoder
has to emit.

Apps may want to delegate character decoding to an outside library
whose authors don't care about the details of HTML5. (For example, it
appears that Safari is leaving this stuff to ICU.) Chances are that
there's more value in being able to use a library than in getting a
specific number of replacement characters on error.

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
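To make the two readings concrete, here is a minimal Python sketch (not part of the original message). It decodes the truncated sequence F2 BF BF under two replacement policies: one U+FFFD per offending byte, as in the strict reading of section 8.1.4, and one U+FFFD per reported error, which section 9.2.2 would also seem to permit. The handler names `fffd-per-byte` and `fffd-per-error` are invented for this illustration. How the faulty bytes are grouped into errors is whatever CPython's UTF-8 decoder reports, so that part is an assumption about one particular library, not something either section prescribes.

```python
import codecs

REPLACEMENT = "\ufffd"


def _one_per_byte(exc):
    # Hypothetical policy: emit one U+FFFD for every byte in the faulty range.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    return (REPLACEMENT * (exc.end - exc.start), exc.end)


def _one_per_error(exc):
    # Hypothetical policy: emit a single U+FFFD for the whole faulty range.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    return (REPLACEMENT, exc.end)


# Register both policies under made-up names so bytes.decode() can use them.
codecs.register_error("fffd-per-byte", _one_per_byte)
codecs.register_error("fffd-per-error", _one_per_error)

# First three bytes of a four-byte sequence, with no continuation byte after.
truncated = bytes([0xF2, 0xBF, 0xBF])

per_byte = truncated.decode("utf-8", errors="fffd-per-byte")
per_error = truncated.decode("utf-8", errors="fffd-per-error")

# The two policies differ only in how many replacement characters they emit.
print(len(per_byte), len(per_error))
```

Both outputs consist solely of U+FFFD, and only the count differs. An application that hands decoding to an outside library inherits that library's choice of count.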
Received on Friday, 24 November 2006 02:33:22 UTC