- From: Ian Hickson <ian@hixie.ch>
- Date: Fri, 15 Jun 2007 01:15:47 +0000 (UTC)
On Fri, 24 Nov 2006, Øistein E. Andersen wrote:
>
> Section 8.1.4:
> > Bytes that are not valid UTF-8 sequences must be interpreted as [...] U+FFFD
>
> Section 9.2.2:
> > Bytes or sequences of bytes [...] that could not be converted to Unicode characters
> > must be converted to U+FFFD
>
> If I read this correctly, section 8.1.4 requires that an illegal UTF-8
> sequence like F2 BF BF (the three first bytes of a four-byte sequence,
> obviously not followed by a continuation byte) be converted into exactly
> three U+FFFD characters (one for each byte), whereas section 9.2.2 also
> allows one single replacement character (and possibly even two) in this
> case (and permits an arbitrary number n of repetitions of the three-byte
> sequence to be replaced by any number of U+FFFD characters between 1 and
> 3n).
>
> I realise that the underspecification in section 9.2.2 may well be
> intentional, given that this section is not limited to UTF-8, but (quite
> possibly depending on the handling chosen) this can (more or less
> easily) be expressed in such a way that it applies to any encoding.
>
> Alternatively, a reference to an authoritative source would of course
> fulfil the purpose in the particular case of UTF-8 (if such a document
> can be found).
>
> [Currently, an alert reader might infer that the treatment indicated in
> section 8.1.4 would be preferable also in section 9.2.2, but such
> inference for consistency can hardly be expected.]

On Fri, 24 Nov 2006, Henri Sivonen wrote:
>
> I'm inclined to think that interop in error situations doesn't need to
> go as deep as defining how many replacement characters (in the range
> 1...number of bytes in a faulty sequence) a character decoder has to
> emit. Apps may want to delegate character decoding to an outside library
> whose authors don't care about the details of HTML5. (For example, it
> appears that Safari is leaving this stuff to ICU.) Chances are that
> there's more value in being able to use a library than in getting a
> specific number of replacement characters on error.

On Sat, 25 Nov 2006, Øistein E. Andersen wrote:
>
> I agree. The current slight inconsistency should probably be amended by
> making section 8.1.4 more liberal rather than the other way round.

Done.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Thursday, 14 June 2007 18:15:47 UTC
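
For illustration only, and not part of the original message: a minimal Python sketch of the two behaviours discussed in the thread for the truncated sequence F2 BF BF. The function name `decode_utf8` and its `per_byte` flag are hypothetical, introduced here for the example. It is not taken from any spec text or from any browser or library implementation. With `per_byte=True` it emits one U+FFFD for each byte of a faulty sequence (the stricter reading of section 8.1.4); with `per_byte=False` it emits a single U+FFFD for the whole faulty prefix, which is one of the outcomes section 9.2.2 permits.

```python
REPLACEMENT = "\ufffd"


def _second_byte_range(lead):
    """Allowed range for the byte right after `lead`.

    The tighter ranges rule out overlong forms, surrogates and
    code points above U+10FFFF.
    """
    if lead == 0xE0:
        return 0xA0, 0xBF
    if lead == 0xED:
        return 0x80, 0x9F
    if lead == 0xF0:
        return 0x90, 0xBF
    if lead == 0xF4:
        return 0x80, 0x8F
    return 0x80, 0xBF


def decode_utf8(data, per_byte):
    """Decode bytes as UTF-8, replacing faulty sequences with U+FFFD.

    per_byte=True  -> one U+FFFD per byte of each faulty sequence
    per_byte=False -> one U+FFFD per faulty sequence
    (Hypothetical helper for illustration only.)
    """
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                      # ASCII
            out.append(chr(lead))
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:
            need = 1                         # continuation bytes expected
        elif 0xE0 <= lead <= 0xEF:
            need = 2
        elif 0xF0 <= lead <= 0xF4:
            need = 3
        else:                                # stray continuation or invalid lead
            out.append(REPLACEMENT)
            i += 1
            continue
        lo, hi = _second_byte_range(lead)
        j = i + 1
        # Consume as many valid continuation bytes as are present.
        while j < n and j - i <= need and lo <= data[j] <= hi:
            lo, hi = 0x80, 0xBF
            j += 1
        if j - i == need + 1:                # complete, well-formed sequence
            out.append(data[i:j].decode("utf-8"))
        else:                                # truncated or interrupted sequence
            out.append(REPLACEMENT * (j - i) if per_byte else REPLACEMENT)
        i = j
    return "".join(out)


faulty = b"\xf2\xbf\xbf"
print(len(decode_utf8(faulty, per_byte=True)))   # count under the per-byte policy
print(len(decode_utf8(faulty, per_byte=False)))  # count under the per-sequence policy
```

Both results fall within the range 1 to 3 that section 9.2.2 allows for this input, which is why a decoder delegated to an outside library can conform under the more liberal wording regardless of which policy it chose.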