- From: Øistein E. Andersen <html5@xn--istein-9xa.com>
- Date: Sat, 23 Jun 2007 02:27:30 +0200
Ian Hickson wrote:

> On Fri, 3 Nov 2006, Elliotte Harold wrote:
>
>> Section 9.2.2 of the current Web Apps 1.0 draft states:
>>
>>> Bytes or sequences of bytes in the original byte stream that could not
>>> be converted to Unicode characters must be converted to U+FFFD
>>> REPLACEMENT CHARACTER code points.
>>
>> [This does not specify the exact number of replacement characters.]
>
> I don't really know how to define this. I'd like to say that it's up to
> the encoding specifications to define it. Any suggestions?

Unicode 5.0 remains vague on this point. (E.g., definition D92 defines well-formed and ill-formed UTF-8 byte sequences, but conformance requirement C10 only requires ill-formed sequences to be treated as an error condition, and suggests that a one-byte ill-formed sequence may be either filtered out or replaced by a U+FFFD replacement character.) More generally, character encoding specifications can hardly be expected to define proper error handling, since they are usually not terribly preoccupied with mislabelled data.

Henri Sivonen has pointed out that a strict requirement on the number of replacement characters generated may cause unnecessary incompatibilities with current browsers and extant tools. The current text may nevertheless be too liberal. Notably, it would be possible to construct an arbitrarily long Chinese text in a legacy encoding which -- according to the spec -- could be replaced by one single U+FFFD replacement character if incorrectly handled as UTF-8. This might lead the user to think that the page is completely uninteresting and therefore move on, whereas a larger number of replacement characters would have led him to try another encoding. (This is only a problem, of course, if an implementor chooses to emit the minimal number of replacement characters sanctioned by the spec.) The current upper bound (the number of bytes replaced) seems intuitive and completely harmless.
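To make the legacy-encoding scenario concrete, here is a small illustration (hypothetical example text, using GBK as the legacy encoding) of what today's decoders actually do with such input: Python's built-in UTF-8 decoder with errors='replace' emits one U+FFFD per ill-formed subsequence rather than collapsing the whole stream into a single one.

```python
# Hypothetical illustration: Chinese text in a legacy encoding (GBK here)
# misinterpreted as UTF-8. Python's decoder substitutes U+FFFD for each
# ill-formed subsequence, so legacy-encoded CJK text becomes a wall of
# replacement characters, not a single one as the spec would permit.
legacy_bytes = "汉字".encode("gbk")
as_utf8 = legacy_bytes.decode("utf-8", errors="replace")
print(as_utf8)
print(as_utf8.count("\ufffd"))  # several replacement characters
```

Exact counts vary between decoders, which is precisely the point under discussion; the demonstration only shows that real implementations emit far more than the minimal one.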
A meaningful lower bound is less obvious, at least if we want to give some leeway to different implementations. http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt details an approach for UTF-8 that basically emits a replacement character and removes the bytes read from the buffer each time a minimal malformed byte sequence has been detected. Safari, Opera and Firefox all mostly follow this approach, whereas IE7 usually emits one replacement character per replaced byte. (Interesting cases include byte sequences encoding forbidden characters like U+FFFF mod U+1,0000 or values exceeding U+10,FFFF.) It should be relatively simple to define something like this for any multi-byte encoding, but perhaps less straightforward for encodings that use escape sequences to switch between different alphabets, or for other more exotic encodings -- if we have to worry about those.

-- 
Øistein E. Andersen
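The minimal-malformed-sequence approach might be sketched as follows (a hypothetical Python decoder, not taken from any browser: it emits one U+FFFD for the bytes consumed so far whenever a sequence is truncated or interrupted, then resumes at the offending byte; noncharacters such as U+FFFF are simply passed through here, a point on which implementations differ).

```python
REPLACEMENT = "\ufffd"

# Per-lead-byte rules: (number of continuation bytes, valid range of the
# FIRST continuation byte). Later continuation bytes are always 80..BF.
# The narrowed first-continuation ranges exclude overlong forms,
# surrogates and code points above U+10FFFF, so every failure is caught
# at the earliest possible byte.
def _lead_info(b):
    if 0xC2 <= b <= 0xDF: return 1, 0x80, 0xBF
    if b == 0xE0:         return 2, 0xA0, 0xBF   # exclude overlong
    if 0xE1 <= b <= 0xEC: return 2, 0x80, 0xBF
    if b == 0xED:         return 2, 0x80, 0x9F   # exclude surrogates
    if 0xEE <= b <= 0xEF: return 2, 0x80, 0xBF
    if b == 0xF0:         return 3, 0x90, 0xBF   # exclude overlong
    if 0xF1 <= b <= 0xF3: return 3, 0x80, 0xBF
    if b == 0xF4:         return 3, 0x80, 0x8F   # cap at U+10FFFF
    return None                                   # 80..C1, F5..FF

def decode_utf8_replacing(data: bytes) -> str:
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                 # ASCII passes through
            out.append(chr(b)); i += 1; continue
        info = _lead_info(b)
        if info is None:             # byte can never start a sequence
            out.append(REPLACEMENT); i += 1; continue
        need, lo, hi = info
        cp = b & (0x3F >> need)      # payload bits of the lead byte
        j = i + 1
        while j <= i + need:
            if j >= n or not ((lo if j == i + 1 else 0x80)
                              <= data[j]
                              <= (hi if j == i + 1 else 0xBF)):
                break
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        if j == i + need + 1:        # complete, well-formed sequence
            out.append(chr(cp))
        else:                        # truncated or interrupted: one U+FFFD
            out.append(REPLACEMENT)  # for the bytes consumed so far,
        i = j                        # then resume at the offending byte
    return "".join(out)
```

On this policy a truncated three-byte sequence such as E4 B8 yields a single replacement character, while a two-byte overlong form such as C0 AF yields two (neither byte can start a sequence), matching the general shape of the UTF-8-test.txt behaviour described above.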
Received on Friday, 22 June 2007 17:27:30 UTC